Closed galerykaeser closed 3 years ago
Update of the documentation:
sacct
Possible Errors | Safety Measures |
---|---|
Job submission with `sbatch` is not successful | If the `enforce_order` flag is not set, continue with the next successor batch; else, abort the search by returning the current state |
Slurm task is in a state other than PENDING, RUNNING or COMPLETED when polled | If the `enforce_order` flag is not set, ignore failed tasks; else, only consider tasks before the first failed one |
Evaluation result file is not present (something went wrong in the execution of the evaluation script on the compute node) | After waiting for the file for at most 60 s (checking every 3 s), the corresponding task is ignored (without `enforce_order`) or the search is aborted by returning the current state (with `enforce_order`) |
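
The two polling-related safety measures above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `usable_tasks` and `wait_for_file`, their parameters, and the injectable `sleep` argument are assumptions made for the example.

```python
import os
import time

# States that the table above treats as non-failed.
OK_STATES = {"PENDING", "RUNNING", "COMPLETED"}


def usable_tasks(states, enforce_order):
    """Return the indices of tasks that may still be considered.

    Without enforce_order, failed tasks are skipped. With enforce_order,
    only the tasks before the first failed one are kept.
    """
    usable = []
    for index, state in enumerate(states):
        if state in OK_STATES:
            usable.append(index)
        elif enforce_order:
            break  # drop the first failed task and everything after it
    return usable


def wait_for_file(path, timeout=60.0, interval=3.0, sleep=time.sleep):
    """Poll for `path` every `interval` seconds, for at most `timeout` seconds.

    Returns True as soon as the file exists and False once the timeout is
    reached. `sleep` is injectable (hypothetical hook) so the loop can be
    tested without waiting.
    """
    waited = 0.0
    while not os.path.exists(path):
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True
```

A caller would then either ignore the task or abort the search when `wait_for_file` returns False, depending on `enforce_order`.
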

The bash script sets a memory limit with `ulimit` (setting the soft limit to 98 % of the product of the Slurm parameters `cpus-per-task` and `mem-per-cpu`). It then runs the evaluation script (`./script.py --evaluate /path/to/state-dump`) inside a sub-shell and redirects the stdout and stderr outputs to log files.

Possible Errors | Safety Measures |
---|---|
Space character in `/path/to/state` causes an error in the execution of the evaluation script | Check the script path for spaces at the beginning of the grid search |
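
As a worked example of the 98 % rule and the path check, here is a small sketch. It assumes that `mem-per-cpu` is given in MiB and that the limit is passed to `ulimit` in KiB; both units, as well as the helper names `memory_limit_kib` and `check_no_spaces`, are assumptions for illustration and not taken from the project.

```python
def memory_limit_kib(cpus_per_task, mem_per_cpu_mib, percent=98):
    """Soft memory limit in KiB: `percent` % of cpus-per-task * mem-per-cpu.

    Assumes mem-per-cpu is given in MiB. Integer arithmetic avoids
    floating-point rounding surprises.
    """
    total_kib = cpus_per_task * mem_per_cpu_mib * 1024
    return total_kib * percent // 100


def check_no_spaces(path):
    """Reject paths containing a space before the grid search starts."""
    if " " in path:
        raise ValueError(f"path must not contain spaces: {path!r}")
    return path
```

For instance, with 2 CPUs per task and 1000 MiB per CPU, the total is 2,048,000 KiB and the resulting soft limit is 2,007,040 KiB.
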
Possible Errors | Safety Measures |
---|---|
Evaluation consumes too much memory | Memory limit set in the bash script with ulimit causes the evaluation script in the sub-shell to terminate on a memory error |
Evaluation takes too much time | Run classes (which define the program executions to be evaluated) have a mandatory argument `_timelimit` that is applied to each run when its command is started as a subprocess (using `resource.setrlimit` from Python); the run's subprocess therefore always terminates when its time limit expires, at the latest |
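
A time limit of this kind can be sketched with the standard library as shown below. The source only states that `resource.setrlimit` is used; the choice of `RLIMIT_CPU` (CPU time rather than wall-clock time), the function name `run_with_time_limit`, and the use of `preexec_fn` are assumptions for this Unix-only illustration.

```python
import resource
import subprocess
import sys


def run_with_time_limit(command, time_limit_s):
    """Start `command` as a subprocess with a CPU-time limit in seconds."""

    def set_limit():
        # Executed in the child process just before the command starts.
        # Assumption: the limit is applied via RLIMIT_CPU.
        resource.setrlimit(resource.RLIMIT_CPU, (time_limit_s, time_limit_s))

    return subprocess.run(
        command, preexec_fn=set_limit, capture_output=True, text=True
    )


if __name__ == "__main__":
    # A busy loop is killed by the kernel once it has used its CPU time.
    result = run_with_time_limit([sys.executable, "-c", "while True: pass"], 1)
    print(result.returncode)  # negative: the process was ended by a signal
```

Because the limit is enforced by the operating system, the parent process does not need its own timer to stop a run that exceeds its budget.
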
./fast-downward.py domain.pddl problem.pddl --search "astar(lmcut())"
Possible Errors | Safety Measures |
---|---|
Any error that can occur during the execution of the run | Not needed, as any behavior of the program is captured via the produced outputs and returncode, which are then parsed and processed as part of the evaluation |
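
The idea that a failing run needs no special handling, because its outputs and return code are simply data for the evaluation, might look like this. The helper name `collect_run_output` and the shape of the returned dictionary are hypothetical.

```python
import subprocess
import sys


def collect_run_output(command):
    """Run `command` and return its outputs and return code as plain data.

    A non-zero return code does not raise here; it is handed to the
    evaluation like any other result.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    return {
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "returncode": completed.returncode,
    }


if __name__ == "__main__":
    demo = [sys.executable, "-c", "import sys; print('no plan'); sys.exit(3)"]
    print(collect_run_output(demo))
```

The evaluation can then parse `stdout` and `stderr` and decide what a given return code means for the run being judged.
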
`evaluate` function in main Python script | OOT, OOM | If the order of successors is not important, set the `result` variable to False and output error information to the main script's log file. Else, output error information to the main script's log file and abort the search. |
`ulimit` input, space in path | If the order of successors is not important, set the `result` variable to False and output error information to the main script's log file. Else, output error information to the main script's log file and abort the search. |