erik78se / Jobbers

Jobbers is a Python package that produces so-called application "job-scripts", which are then ready to be submitted to SLURM.
GNU General Public License v3.0

Python submit scripts generated for Abaqus need signal handling #31

Open hallback opened 5 years ago

hallback commented 5 years ago

Currently, an Abaqus job started from a template generated by Jobbers spawns this process on the first node:

jhacxc 158833 0.2 0.0 154052 9320 ? S 10:57 0:00 /bin/python3 /var/spool/slurmd.spool/job00038/slurm_script

When scancel is issued for this job, Slurm sends SIGTERM to that particular process, which kills only Python. The Abaqus process does not die, and no cleanup takes place afterwards.

Suggestion: make sure the Python scripts generated by Jobbers listen for signals and try to run "abaqus terminate ", as described here:

https://www.sharcnet.ca/Software/Abaqus/6.14.2/v6.14/books/usb/default.htm?startat=pt01ch03s02abx39.html
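A minimal sketch of the signal handling, assuming the generated job script is plain Python 3 and installs the handler from the main thread (a requirement of signal.signal()). `install_sigterm_handler` is a hypothetical helper, not part of Jobbers, and `terminate_cmd` stands in for the full Abaqus terminate command line, whose arguments the generated script would have to supply.

```python
import signal
import subprocess
import sys
from typing import Sequence


def install_sigterm_handler(terminate_cmd: Sequence[str], timeout: float = 60.0) -> None:
    """Run `terminate_cmd` when SIGTERM arrives, then exit.

    `terminate_cmd` is a placeholder (hypothetical) for the Abaqus terminate
    command line; the generated job script must fill in its arguments.
    Must be called from the main thread, as signal.signal() requires.
    """

    def _handler(signum, frame):
        try:
            # Ask the application to shut itself down cleanly.
            subprocess.run(list(terminate_cmd), timeout=timeout, check=False)
        except (OSError, subprocess.TimeoutExpired) as exc:
            print(f"terminate command failed: {exc}", file=sys.stderr)
        # Exit with the shell convention for "killed by signal": 128 + signum.
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _handler)
```

The handler only asks Abaqus to stop; waiting for the solver to actually exit, and escalating if it does not, would have to be layered on top.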

If that does not succeed within 5 minutes, call Popen.send_signal() or Popen.terminate() to kill the process. There may also be leftover processes started by mpid; in that case it is better to traverse the entire process tree, or else those processes may be left orphaned.

https://docs.python.org/3/library/subprocess.html
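The escalation described above could look roughly like this with subprocess.Popen. `shut_down` is a hypothetical helper; the 300-second grace period comes from the suggestion, while the 10-second wait before SIGKILL and the parameter names are assumptions. Slurm normally follows its SIGTERM with SIGKILL once the KillWait interval from slurm.conf has passed, so a grace period longer than that may never be reached in practice. Note also that Popen.terminate() signals only the direct child.

```python
import subprocess
import time
from typing import Optional, Sequence


def shut_down(proc: subprocess.Popen,
              terminate_cmd: Optional[Sequence[str]] = None,
              grace: float = 300.0,
              kill_after: float = 10.0) -> int:
    """Stop `proc`, escalating from a polite request to SIGKILL.

    1. Run `terminate_cmd` (e.g. the Abaqus terminate command), if given.
    2. Wait until `grace` seconds have passed since the call started.
    3. Send SIGTERM with Popen.terminate() and wait `kill_after` seconds.
    4. Fall back to SIGKILL with Popen.kill().
    Returns proc.returncode (-N if the process died from signal N).
    """
    if proc.poll() is not None:
        return proc.returncode
    deadline = time.monotonic() + grace
    if terminate_cmd:
        try:
            subprocess.run(list(terminate_cmd), timeout=grace, check=False)
        except (OSError, subprocess.TimeoutExpired):
            pass  # fall through to the signal-based escalation
    try:
        return proc.wait(timeout=max(0.0, deadline - time.monotonic()))
    except subprocess.TimeoutExpired:
        pass
    proc.terminate()  # SIGTERM to the direct child only
    try:
        return proc.wait(timeout=kill_after)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()
```

Using a single deadline keeps the total wait at roughly `grace` even when the terminate command itself is slow.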

Output from pstree -u on the master node:

├─slurmstepd─┬─slurm_script(jhacxc)───python───python─┬─eliT_DriverLM───{eliT_DriverLM}
│            │                                        └─mpirun─┬─cat
│            │                                                 ├─mpid─┬─standard───12*[{standard}]
│            │                                                 │      └─5*[standard───9*[{standard}]]
│            │                                                 └─ssh
│            └─3*[{slurmstepd}]
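The tree shows why signalling only the top-level process is not enough: mpirun, mpid and the standard processes sit several levels below it. A sketch of walking the tree with the standard library only, assuming Linux and the /proc filesystem; `descendants` and `signal_tree` are hypothetical helpers. It only reaches processes on the local node, so anything started on other nodes through the ssh branch is not covered.

```python
import os
import signal
from collections import defaultdict
from typing import Dict, List


def _children_by_parent() -> Dict[int, List[int]]:
    """Map each PID to its direct children using /proc/<pid>/stat (Linux only)."""
    children: Dict[int, List[int]] = defaultdict(list)
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", "rb") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited (or is unreadable) while scanning
        # Field 2 (comm) is wrapped in parentheses and may contain spaces, so
        # split after the last ')': the remaining fields are state, ppid, ...
        ppid = int(stat.rsplit(b")", 1)[1].split()[1])
        children[ppid].append(int(entry))
    return children


def descendants(root: int) -> List[int]:
    """Return every descendant PID of `root`, each parent listed before its children."""
    tree = _children_by_parent()
    result: List[int] = []
    stack = list(tree.get(root, []))
    while stack:
        pid = stack.pop()
        result.append(pid)
        stack.extend(tree.get(pid, []))
    return result


def signal_tree(root: int, sig: int = signal.SIGTERM) -> List[int]:
    """Send `sig` to all descendants of `root` (children first), then to `root`.

    The whole tree is collected before signalling, so processes reparented
    when their parent dies are still reached. Returns the PIDs signalled.
    """
    signalled: List[int] = []
    for pid in list(reversed(descendants(root))) + [root]:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # already gone
    return signalled
```

A job script could call signal_tree(proc.pid) before or instead of Popen.terminate(), and repeat it with signal.SIGKILL for anything that ignores SIGTERM. PIDs can be reused between listing and signalling, so this is best-effort.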