Error handling on the grid - Githubissues

aibasel / machetli

GNU General Public License v3.0


Error handling on the grid #14

Closed galerykaeser closed 3 years ago

galerykaeser commented 3 years ago

| Level | Error(s) | Approach |
| --- | --- | --- |
| `evaluate` function in main Python script | Out of time (OOT), out of memory (OOM) | If the order of successors is not important, set the `result` variable to False and write error information to the main script's log file. Otherwise, write error information to the main script's log file and abort the search. |
| SBATCH commands | Invalid data filled into the template (error during array job submission) | Write error information to the main script's log file and abort the search. |
| Bash script of array job | Invalid `ulimit` input, space in path | Write error information to the main script's log file and abort the search. |
| Main Python script interacting with Slurm | Array job killed unexpectedly or not responding after a defined amount of time | If the order of successors is not important, set the `result` variable to False and write error information to the main script's log file. Otherwise, write error information to the main script's log file and abort the search. |

galerykaeser commented 3 years ago

Update of the documentation:

Process Hierarchy and Safety Measures

Level 1: Main Search Script (Login Node)

Generates state successors
Submits Slurm array jobs to evaluate successors (default array length up to 200)
Polls array job status with `sacct`
Parses evaluation results and continues the search
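
The polling step could look roughly like the sketch below. It is an illustration only, not the code used in machetli: the helper names `parse_sacct_states` and `poll_states` are hypothetical, and it assumes `sacct` is called with `--parsable2 --noheader --format=JobID,State`, so that each line has the form `JobID|State`. The sample output is made up for demonstration.

```python
import subprocess


def parse_sacct_states(output):
    """Map array task ID -> Slurm state from lines of the form `JobID|State`."""
    states = {}
    for line in output.splitlines():
        job_id, _, state = line.partition("|")
        # Skip job steps such as "1234_0.batch" and entries without a task ID.
        if "." in job_id or "_" not in job_id:
            continue
        task = job_id.split("_", 1)[1]
        # Entries that are not a single numeric task (e.g. "1234_[5-9]")
        # are ignored in this sketch.
        if task.isdigit() and state.strip():
            # "CANCELLED by 1000" is reduced to "CANCELLED".
            states[int(task)] = state.split()[0]
    return states


def poll_states(job_id):
    """Query sacct for one array job (hypothetical helper, needs Slurm)."""
    cmd = ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
           "--format=JobID,State"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sacct_states(result.stdout)


# Illustrative sacct output, not taken from a real cluster.
sample = (
    "1234_0|COMPLETED\n"
    "1234_0.batch|COMPLETED\n"
    "1234_1|RUNNING\n"
    "1234_2|CANCELLED by 1000\n"
)
print(parse_sacct_states(sample))
```

Keeping the parsing separate from the `sacct` call makes it possible to test the state handling without access to a cluster.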

| Possible Errors | Safety Measures |
| --- | --- |
| Job submission with `sbatch` is not successful | If the `enforce_order` flag is not set, continue with the next successor batch; else, abort the search by returning the current state |
| Slurm task is in a state other than PENDING, RUNNING or COMPLETED when polled | If the `enforce_order` flag is not set, ignore failed tasks; else, only consider tasks before the first failed one |
| Evaluation result file is not present (something went wrong in the execution of the evaluation script on the compute node) | After waiting for the file for a maximum of 60 s (checking every 3 s), the corresponding task is ignored (without `enforce_order`) or the search is aborted by returning the current state (with `enforce_order`) |
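
The two safety measures for failed tasks and missing result files can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the function names `tasks_to_consider` and `wait_for_file` are hypothetical, and task states are assumed to be available as a list indexed by task ID.

```python
import time
from pathlib import Path

# States that are not treated as failures, as listed in the table above.
OK_STATES = {"PENDING", "RUNNING", "COMPLETED"}


def tasks_to_consider(states, enforce_order):
    """Return the IDs of tasks that have not failed.

    Without enforce_order, failed tasks are simply skipped. With
    enforce_order, only tasks before the first failed one are kept.
    """
    kept = []
    for task_id, state in enumerate(states):
        if state in OK_STATES:
            kept.append(task_id)
        elif enforce_order:
            break
    return kept


def wait_for_file(path, timeout=60.0, interval=3.0):
    """Check for `path` every `interval` seconds for at most `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if Path(path).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


states = ["COMPLETED", "FAILED", "COMPLETED"]
print(tasks_to_consider(states, enforce_order=False))
print(tasks_to_consider(states, enforce_order=True))
```

The defaults of `wait_for_file` mirror the 60 s and 3 s values stated above; what happens when it returns False (ignore the task or abort the search) is left to the caller.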

Level 2: Bash Script for Single Slurm Task (Compute Node)

Limits memory of child processes with `ulimit` (setting the soft limit to 98 % of the product of the Slurm parameters `cpus-per-task` and `mem-per-cpu`)
Executes the evaluation script (similar to: `./script.py --evaluate /path/to/state-dump`) inside a sub-shell and redirects its stdout and stderr to log files
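
As a worked example of the 98 % rule, the following sketch computes a soft limit value. It rests on assumptions that are not stated above: that `mem-per-cpu` is given in MiB, and that the limit is applied to virtual memory in KiB (as `ulimit -Sv` in bash expects). The function name `soft_memory_limit_kib` and the example values are hypothetical.

```python
def soft_memory_limit_kib(cpus_per_task, mem_per_cpu_mib, percent=98):
    """Return `percent` % of cpus_per_task * mem_per_cpu, expressed in KiB."""
    total_kib = cpus_per_task * mem_per_cpu_mib * 1024
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_kib * percent // 100


# Example values chosen for illustration only.
limit = soft_memory_limit_kib(cpus_per_task=2, mem_per_cpu_mib=3872)
print(f"ulimit -Sv {limit}")  # prints "ulimit -Sv 7771258"
```

Leaving 2 % of headroom presumably lets the evaluation script fail with a memory error of its own before Slurm kills the whole task.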

| Possible Errors | Safety Measures |
| --- | --- |
| Space character in `/path/to/state` causes an error when executing the evaluation script | Check the script path for spaces at the beginning of the grid search |
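
Such an up-front check could be as simple as the sketch below. The function name `check_no_spaces` is hypothetical, and raising a `ValueError` is just one possible way to abort before any job is submitted.

```python
def check_no_spaces(path):
    """Raise ValueError if the path contains a space, else return it unchanged."""
    if " " in str(path):
        raise ValueError(f"Path must not contain spaces: {path!r}")
    return path


check_no_spaces("/path/to/state-dump")  # passes silently
```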

Level 3: Evaluation Script Executed in a Sub-Shell of the Bash Script

Parses the corresponding state from its dump and runs the evaluation defined by the user (generally by executing the Run instances in the state and processing the parsed output streams in a meaningful way)

| Possible Errors | Safety Measures |
| --- | --- |
| Evaluation consumes too much memory | The memory limit set in the bash script with `ulimit` causes the evaluation script in the sub-shell to terminate with a memory error |
| Evaluation takes too much time | Run classes (which define the program executions to be evaluated) have a mandatory argument _timelimit that is set for each run when its command is started as a subprocess (using `resource.setrlimit` from Python); the run's subprocess therefore always terminates at the latest when its time limit expires |
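
The time limit mechanism can be illustrated with the following sketch. It is not the Run class itself: the function name `run_with_time_limit` is hypothetical, and limiting CPU time via `resource.RLIMIT_CPU` is an assumption, since the text above only says that `resource.setrlimit` is used. The `resource` module is only available on Unix-like systems.

```python
import resource
import subprocess
import sys


def run_with_time_limit(cmd, time_limit):
    """Run `cmd` as a subprocess with a CPU time limit in seconds."""

    def set_limit():
        # Executed in the child process just before the command starts.
        # The soft limit triggers SIGXCPU; the hard limit one second
        # later triggers SIGKILL in case SIGXCPU is ignored.
        resource.setrlimit(resource.RLIMIT_CPU, (time_limit, time_limit + 1))

    return subprocess.run(cmd, preexec_fn=set_limit,
                          capture_output=True, text=True)


# A busy loop that would never finish on its own.
proc = run_with_time_limit([sys.executable, "-c", "while True: pass"], time_limit=1)
print(proc.returncode)  # negative, because the child was terminated by a signal
```

Note that `RLIMIT_CPU` counts CPU time rather than wall-clock time, so a process that mostly sleeps would not be stopped by this limit.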

Level 4: Run Executed as Subprocess of the Evaluation

Any program whose execution the user wants to analyze, e.g., `./fast-downward.py domain.pddl problem.pddl --search "astar(lmcut())"`
Time and memory constraints are already handled by levels 2 and 3

| Possible Errors | Safety Measures |
| --- | --- |
| Any error that can occur during the execution of the run | Not needed, as any behavior of the program is captured via the produced outputs and return code, which are then parsed and processed as part of the evaluation |
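
To illustrate how outputs and return code can drive the evaluation, here is a minimal sketch. It does not use the machetli API: the function `evaluate` below is a hypothetical stand-in that takes a command and a marker string, and the "buggy" program is a dummy Python one-liner.

```python
import subprocess
import sys


def evaluate(cmd, marker):
    """Return True if the run fails and its stderr contains `marker`."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # A crash of the run is not an error of the evaluation; it is just
    # data to inspect via the return code and the captured streams.
    return proc.returncode != 0 and marker in proc.stderr


# Dummy program that writes to stderr and exits with a non-zero code.
buggy = [sys.executable, "-c",
         "import sys; sys.stderr.write('Segmentation fault'); sys.exit(3)"]
print(evaluate(buggy, "Segmentation fault"))
```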