NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License
1.25k stars 325 forks

Fetching PIDs of timed-out jobs for cleanup sometimes fails to kill processes #1315

Open ilya-da opened 1 week ago

ilya-da commented 1 week ago

Under some circumstances the Slurm epilog fails to clean up processes because of the way it parses the output of nvidia-smi pmon.

From /var/log/slurm/prolog-epilog

Regular output is parsed correctly, but if for some reason the output contains one more comment line before the process list, the parser picks up a line that is not a PID.

root@hpc-hostname:~# nvidia-smi pmon -c 1
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    1          -     -     -     -     -     -   -
    2          -     -     -     -     -     -   -
    3          -     -     -     -     -     -   -
    4          -     -     -     -     -     -   -
    5          -     -     -     -     -     -   -
    6          -     -     -     -     -     -   -
    7          -     -     -     -     -     -   -
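
For illustration only, here is a minimal sketch of a PID extraction that does not depend on how many header lines precede the process list. It is not the deepops epilog code and not necessarily the fix proposed for this issue; the function name extract_gpu_pids is hypothetical. It assumes the PID is the second whitespace-separated column, as in the output above, skips every line whose first field starts with "#", and keeps only values that consist entirely of digits, so the "-" placeholders of idle GPUs are dropped as well.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of deepops): read `nvidia-smi pmon -c 1`
# output on stdin and print the unique numeric PIDs, one per line.
extract_gpu_pids() {
    # $1 !~ /^#/       : ignore comment/header lines, however many there are
    # $2 ~ /^[0-9]+$/  : keep column 2 only if it is purely numeric
    #                    (idle GPUs report "-" in the pid column)
    awk '$1 !~ /^#/ && $2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Example usage on a GPU node:
#   nvidia-smi pmon -c 1 | extract_gpu_pids
```

Filtering on the content of each line, rather than skipping a fixed number of lines, means an extra comment line yields no output instead of a bogus value being passed to kill.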

ilya-da commented 1 week ago

#1316 proposed solution