Open adammoody opened 3 years ago
We can use subprocess.Popen to wait for a return value and kill the process. I added a watchdog test with an MPI program that just sleeps for 60 seconds (originally longer). This looks like it works well on LSF+lrun and SLURM+srun. The jobstep id (if available) and kill command aren't used elsewhere in the scripts. Unless these should remain available as standalone scripts, I think these methods can be removed and we can just do the watchdog as a method.
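A minimal sketch of that idea, assuming a plain wall-clock timeout stands in for the real hang detection. The function name `run_with_watchdog` and its parameters are hypothetical and are not taken from the scripts:

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_secs, grace_secs=5.0):
    """Run cmd and return its exit code, or None if the watchdog killed it.

    Hypothetical helper: a fixed timeout is used here in place of whatever
    hang detection the real watchdog performs.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Popen.wait raises TimeoutExpired if the process is still running.
        return proc.wait(timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite request first (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace_secs)
        except subprocess.TimeoutExpired:
            proc.kill()   # force the process down if it ignored the request
            proc.wait()   # reap the process so it does not linger as a zombie
        return None


if __name__ == "__main__":
    # Stand-in for the sleeping MPI test program: a child that sleeps 60 seconds.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_watchdog(sleeper, timeout_secs=2.0))
```

In a real run, `cmd` would be the launcher command line (for example the `srun` or `lrun` invocation), so the signal goes to the launcher process rather than to the MPI ranks directly.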
The current watchdog implementation is specific to SLURM. It assumes the MPI job was launched with `srun`, and it queries for the corresponding jobstep id. Then when it detects the job is hanging, it runs `scancel` on the jobstep id.

We need to generalize this to support job launchers other than SLURM. I think that most MPI job launchers will tear down an MPI job if one sends a signal like `SIGINT` or `SIGKILL` to the pid of the job launcher command (`mpirun`, `jsrun`, etc). We should implement that method as the default and test it on SLURM `srun`, LSF `jsrun`, and SLURM with MVAPICH2 `mpirun_rsh`.

As a bonus, we could see whether it's possible to keep the existing SLURM-specific method as a specialization on the `JobLauncher` class if one used `srun` to launch a job.

Let's move the `jobstep_id` and `kill_jobstep` methods out of the resource manager class. I think those should go in the job launcher class.
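
One way the proposal could look, as a rough sketch only. The class layout, the `launch` and `kill` method names, and the `jobstep_id` attribute are assumptions made for illustration, not the existing `JobLauncher` interface. The default signals the launcher pid, and the `srun` specialization runs `scancel` on the jobstep id when one is known:

```python
import signal
import subprocess


class JobLauncher:
    """Hypothetical base class: the default kill signals the launcher process."""

    def __init__(self):
        self.proc = None

    def launch(self, cmd):
        # cmd is the full launcher command line, e.g. ["jsrun", ...].
        self.proc = subprocess.Popen(cmd)
        return self.proc

    def kill(self, sig=signal.SIGINT, grace_secs=10.0):
        # Nothing to do if no job was launched or it has already exited.
        if self.proc is None or self.proc.poll() is not None:
            return
        # Default method (POSIX): signal the launcher pid and let the
        # launcher tear down the MPI job.
        self.proc.send_signal(sig)
        try:
            self.proc.wait(timeout=grace_secs)
        except subprocess.TimeoutExpired:
            self.proc.kill()  # escalate to SIGKILL
            self.proc.wait()


class SrunLauncher(JobLauncher):
    """Hypothetical srun specialization that keeps the scancel-based method."""

    def __init__(self, jobstep_id=None):
        super().__init__()
        # Assumed to be a "<jobid>.<stepid>" string that the caller looked up.
        self.jobstep_id = jobstep_id

    def kill(self, sig=signal.SIGINT, grace_secs=10.0):
        if self.jobstep_id is not None:
            # Cancel the job step first; assumes scancel is on PATH.
            subprocess.run(["scancel", self.jobstep_id], check=False)
        # Then fall back to the default so the launcher process is reaped.
        super().kill(sig, grace_secs)
```

With this shape the watchdog only ever calls `launcher.kill()`, and the resource manager class no longer needs to know about jobstep ids.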