python: Generalize scr watchdog

LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

The current watchdog implementation is specific to SLURM. It assumes the MPI job was launched with srun, and it queries SLURM for the corresponding jobstep id. When it detects that the job is hanging, it runs scancel on that jobstep id.

We need to generalize this to support job launchers beyond SLURM's srun. I think most MPI job launchers will tear down an MPI job if one sends a signal like SIGINT or SIGKILL to the pid of the job launcher command (mpirun, jsrun, etc.). We should implement that method as the default and test it with SLURM srun, LSF jsrun, and SLURM with MVAPICH2 mpirun_rsh.
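A minimal sketch of that default method, assuming the watchdog holds the `subprocess.Popen` object for the launcher command. The function name `kill_launcher` and the `grace_secs` parameter are hypothetical and not part of SCR. The sketch sends SIGINT first so the launcher has a chance to tear down its MPI ranks, and escalates to SIGKILL only if the launcher is still running after the grace period.

```python
import signal
import subprocess


def kill_launcher(proc, grace_secs=5.0):
    """Signal the launcher process and escalate if it does not exit.

    proc is the subprocess.Popen object for the launcher command
    (srun, jsrun, mpirun_rsh, ...). Returns the launcher's exit code.
    Hypothetical helper for illustration; not part of SCR.
    """
    # Nothing to do if the launcher has already exited.
    if proc.poll() is not None:
        return proc.returncode

    # Ask the launcher to tear down the job.
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace_secs)
    except subprocess.TimeoutExpired:
        # The launcher ignored SIGINT or is stuck: force it down.
        proc.kill()
        return proc.wait()
```

Whether a given launcher actually cleans up its remote ranks on SIGINT is the part that needs testing on each system; SIGKILL cannot be caught, so a launcher killed that way gets no chance to clean up.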

As a bonus, we could see whether it's possible to keep the existing SLURM-specific method as a specialization of the JobLauncher class for jobs launched with srun.

Let's move the jobstep_id and kill_jobstep methods out of the resource manager class. I think those belong in the job launcher class.
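One way this layout could look, as a rough sketch only: the class bodies, constructor arguments, and the `scancel_command` helper below are assumptions for illustration and do not reflect SCR's actual classes. The base class signals the launcher pid, and the srun specialization runs `scancel` on a jobstep id when one is supplied, falling back to the base behavior otherwise. Looking up the jobstep id from SLURM is left out here.

```python
import shutil
import signal
import subprocess


class JobLauncher:
    """Hypothetical base class: kills a job by signaling the launcher pid."""

    def __init__(self, launcher="mpirun"):
        self.launcher = launcher

    def jobstep_id(self, proc):
        # Default launchers need no scheduler-side id; the pid is enough.
        return None

    def kill_jobstep(self, proc, jobstep_id=None):
        # Send SIGINT to the launcher command if it is still running.
        if proc.poll() is None:
            proc.send_signal(signal.SIGINT)


class SrunLauncher(JobLauncher):
    """Hypothetical specialization that keeps the scancel-based method."""

    def __init__(self):
        super().__init__(launcher="srun")

    def scancel_command(self, jobstep_id):
        # jobstep_id is expected in SLURM's "<jobid>.<stepid>" form.
        return ["scancel", jobstep_id]

    def kill_jobstep(self, proc, jobstep_id=None):
        # Without a jobstep id or scancel on PATH, use the generic method.
        if jobstep_id is None or shutil.which("scancel") is None:
            return super().kill_jobstep(proc, jobstep_id)
        subprocess.run(self.scancel_command(jobstep_id), check=False)
```

With this split, the watchdog only calls `kill_jobstep` on whatever launcher object it was given, and the resource manager class no longer needs to know how jobs are killed.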

python: Generalize scr watchdog #402