flatironinstitute / disBatch

Tool to distribute a list of computational tasks over a pool of compute resources. The pool can grow or shrink.
Apache License 2.0

Monitor engine processes somehow #6

Closed: dylex closed this issue 7 years ago

dylex commented 7 years ago

We should monitor the health of the engine child processes. In the SSH case, we have access to each of them and can check them individually, but for Slurm we only have the overall srun as a single child. Even so, we should at least monitor what we have. (In the Slurm case, we could also switch to separate invocations.)
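
As a rough illustration of that kind of health check, here is a minimal sketch that polls a set of child processes and reports the ones that have exited. It assumes the children were started with Python's `subprocess.Popen`; the helper name `find_exited` and the `children` mapping are hypothetical and are not taken from the disBatch code.

```python
import subprocess
import sys


def find_exited(children):
    """Return {name: returncode} for every child that has already exited.

    `children` is a hypothetical mapping of engine name -> subprocess.Popen.
    Popen.poll() returns None while the process is still running.
    """
    exited = {}
    for name, proc in children.items():
        returncode = proc.poll()
        if returncode is not None:
            exited[name] = returncode
    return exited


if __name__ == "__main__":
    # Stand-ins for engine processes: one keeps running, one fails at once.
    children = {
        "engine-a": subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"]),
        "engine-b": subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"]),
    }
    children["engine-b"].wait()  # make sure the failing child has finished
    print(find_exited(children))
    children["engine-a"].kill()
    children["engine-a"].wait()
```

In the SSH case this could be called periodically on every engine; in the Slurm case the mapping would hold only the single srun child, which is still enough to notice that it has gone away.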

It was observed (on gordon) that when Slurm kills the job (e.g., for running out of memory), it kills the srun but not the sbatch. As a result, we are left with a zombie srun under disBatch, and disBatch just sits waiting for tasks that will never come back. If we caught the exit of srun, we could handle this better, at least by exiting (though ideally by reporting errors for the outstanding tasks first).

dylex commented 7 years ago

I'll try adding a SIGCHLD handler for this.
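
For reference, a minimal sketch of what a SIGCHLD-based check could look like on a POSIX system. This is not the handler that ended up in disBatch; the names `on_sigchld`, `wait_for_engine_exit`, and `state` are hypothetical. The handler only records that some child changed state, and the actual status is collected with `Popen.poll()` from normal code, so it does not interfere with `subprocess`'s own bookkeeping.

```python
import signal
import subprocess
import sys
import time

# Hypothetical shared flag, set by the handler and read by the main loop.
state = {"child_exited": False}


def on_sigchld(signum, frame):
    """SIGCHLD handler: just note that some child has changed state."""
    state["child_exited"] = True


def wait_for_engine_exit(proc, timeout=10.0):
    """Return proc's exit code once SIGCHLD has fired and proc has exited.

    Returns None if that has not happened within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if state["child_exited"]:
            returncode = proc.poll()
            if returncode is not None:
                return returncode
        time.sleep(0.05)
    return None


if __name__ == "__main__":
    # Install the handler before starting the child so the signal is not missed.
    signal.signal(signal.SIGCHLD, on_sigchld)
    # Stand-in for srun dying unexpectedly with a non-zero status.
    srun = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(7)"])
    returncode = wait_for_engine_exit(srun)
    if returncode is not None:
        # A real driver would report errors for outstanding tasks here.
        print("engine exited with status", returncode)
        sys.exit(1)
```

Keeping the handler this small avoids doing real work in signal context; the main loop decides whether to report outstanding tasks and exit.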