LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

python: Generalize scr watchdog #402

Open adammoody opened 3 years ago

adammoody commented 3 years ago

The current watchdog implementation is specific to SLURM. It assumes the MPI job was launched with srun, and it queries for the corresponding jobstep id. Then when it detects the job is hanging, it runs scancel on the jobstep id.

We need to generalize this to support job launchers other than SLURM. I think most MPI job launchers will tear down an MPI job if one sends a signal like SIGINT or SIGKILL to the pid of the job launcher command (mpirun, jsrun, etc.). We should implement that method as the default and test it on SLURM srun, LSF jsrun, and SLURM with MVAPICH2 mpirun_rsh.
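A minimal sketch of that default method, assuming the launcher was started with `subprocess.Popen` so its pid is available. The function name `kill_launcher` and the `grace_seconds` parameter are hypothetical, and the sleeping child is only a stand-in for a real `srun` or `jsrun` command. Whether a given launcher actually tears down its MPI ranks on SIGINT is exactly what still needs testing.

```python
import signal
import subprocess
import sys


def kill_launcher(proc, grace_seconds=5.0):
    """Ask the launcher to stop with SIGINT, then force it with SIGKILL.

    proc is the subprocess.Popen object for the launcher command.
    Returns the launcher's exit status.
    """
    if proc.poll() is not None:
        return proc.returncode  # launcher already exited
    proc.send_signal(signal.SIGINT)
    try:
        # Give the launcher a chance to clean up its MPI processes.
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for the launcher: a child process that just sleeps.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("launcher exit status:", kill_launcher(proc))
```

Trying SIGINT first and escalating to SIGKILL only after a grace period gives the launcher a chance to clean up; SIGKILL alone may leave ranks behind on some systems.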

As a bonus, we could see whether it's possible to keep the existing SLURM-specific method as a specialization on the JobLauncher class if one used srun to launch a job.

Let's move the jobstep_id and kill_jobstep methods out of the resource manager class. I think those should go in the job launcher class.
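Roughly, the layout could look like the sketch below. This is not the existing SCR code: the constructor arguments, the `launch` and `kill_cmd` helpers, and the signature of `kill_jobstep` are assumptions made for illustration, and looking up the jobstep id is left out entirely. The base class signals the launcher pid, and the srun specialization falls back to `scancel` when a jobstep id is known.

```python
import signal
import subprocess


class JobLauncher:
    """Generic launcher: tears the job down by signaling the launcher pid."""

    def __init__(self, launch_cmd):
        self.launch_cmd = list(launch_cmd)  # e.g. ["mpirun", "-n", "4"]

    def launch(self, app_argv):
        # Keep the Popen object so the watchdog can signal the launcher later.
        return subprocess.Popen(self.launch_cmd + list(app_argv))

    def kill_cmd(self, jobstep_id):
        """Return a launcher-specific kill command, or None if there is none."""
        return None

    def kill_jobstep(self, proc, jobstep_id=None):
        cmd = self.kill_cmd(jobstep_id) if jobstep_id is not None else None
        if cmd is not None:
            subprocess.run(cmd, check=False)  # launcher-specific path
        elif proc.poll() is None:
            proc.send_signal(signal.SIGINT)  # generic default


class SrunLauncher(JobLauncher):
    """SLURM specialization: cancel the jobstep with scancel."""

    def __init__(self):
        super().__init__(["srun"])

    def kill_cmd(self, jobstep_id):
        return ["scancel", str(jobstep_id)]
```

With this split, a launcher that has no notion of a jobstep id needs no extra code, and the resource manager class no longer has to know how the job was launched.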

chaseleif commented 3 years ago

We can use subprocess.Popen to wait for a return value and kill the process. I added a watchdog test with an MPI program that just sleeps for 60 seconds (originally longer). This appears to work well on LSF+lrun and SLURM+srun. The jobstep id (if available) and kill command aren't used elsewhere in the scripts. Unless these are meant to remain standalone scripts, I think these methods can be removed and we can just implement the watchdog as a method.
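As a rough sketch of the idea (not the code in the branch), the watchdog can poll `Popen.wait` with a timeout and kill the process once no progress has been seen for too long. The `last_progress` callback is an assumption: it stands for whatever hang check is used and is expected to return the timestamp of the most recent observed progress. The name `watchdog` and its parameters are likewise hypothetical.

```python
import subprocess
import sys
import time


def watchdog(proc, last_progress, timeout, poll_interval=1.0):
    """Wait on proc; kill it if last_progress() is older than timeout seconds.

    Returns a tuple (returncode, was_killed).
    """
    while True:
        try:
            # Returns as soon as the launcher exits on its own.
            return proc.wait(timeout=poll_interval), False
        except subprocess.TimeoutExpired:
            # Still running: check whether the job looks hung.
            if time.time() - last_progress() > timeout:
                proc.kill()
                return proc.wait(), True


if __name__ == "__main__":
    # Stand-in for the sleeping MPI test program; progress never advances.
    start = time.time()
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(watchdog(proc, lambda: start, timeout=3.0))
```

Because this only needs the `Popen` object, it does not depend on a jobstep id or on any launcher-specific kill command.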