Monitor engine processes somehow

We should monitor the health of the engine child processes. In the SSH case, we have access to each of them and can check them individually, but for slurm, we only have the overall srun as a single child. Even so, we should at least monitor what we have. (In the slurm case, we could also switch to separate invocations, which would give us one child to watch per invocation.)
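A minimal sketch of what "monitor what we have" could look like, assuming the engines (or the single srun) are held as `subprocess.Popen` objects. The function name `check_engines` and the dict-of-handles layout are hypothetical, not taken from the disBatch code:

```python
import subprocess


def check_engines(procs):
    """Return {name: returncode} for every child process that has exited.

    `procs` maps a label (hypothetical, e.g. a host name or "srun") to a
    subprocess.Popen handle. Children that are still running are skipped.
    """
    exited = {}
    for name, proc in procs.items():
        rc = proc.poll()  # None while the child is still running
        if rc is not None:
            exited[name] = rc
    return exited
```

Calling this periodically from the main loop would work the same way in both cases: with SSH the dict holds one entry per engine, with slurm it holds only the srun.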

It was observed (on gordon) that when slurm kills the job (e.g., for running out of memory), it kills the srun but not the sbatch. As a result, we are left with a zombie srun under disBatch, and disBatch just sits there waiting for tasks that will never come back. If we caught the exit of srun, we could handle this better, at least by exiting (though ideally by reporting errors for the outstanding tasks first).
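One way to catch the srun exit, sketched under the assumption that srun is started via `subprocess.Popen`. The names `watch_child` and `fail_outstanding`, and the idea of tracking outstanding tasks as a set of ids, are hypothetical illustrations rather than existing disBatch code:

```python
import subprocess
import threading


def watch_child(proc, on_exit):
    """Call on_exit(returncode) from a background thread once proc exits."""
    def _wait():
        # wait() blocks until the child terminates and also reaps it,
        # so it does not linger as a zombie.
        on_exit(proc.wait())

    thread = threading.Thread(target=_wait, daemon=True)
    thread.start()
    return thread


def fail_outstanding(outstanding, returncode):
    """Build one error message per task that will never come back."""
    return [
        f"task {task_id}: engine exited with status {returncode}"
        for task_id in sorted(outstanding)
    ]
```

The callback could report the messages from `fail_outstanding` and then shut disBatch down, which covers both the "at least exit" and the "report errors first" behavior described above.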

