Closed: jfolz closed this issue 1 year ago
Nice find. Looks like this is because Slurm does not run the `task_exit` SPANK callback in the cgroup of the job step, contrary to `task_init`. I filed https://bugs.schedmd.com/show_bug.cgi?id=17426 to ask if this is intended and whether it can be changed. As a result, the container import should not have this problem.
But even if Slurm pushes a change, it will probably take time, so pyxis might need to modify `oom_score_adj` in `task_exit` as a workaround.
@jfolz a question for you before I submit any change to pyxis: how is --container-save typically used by your users?
A lot of our use cases run `true` as the command, to act like an `enroot import`, e.g.:
$ srun --container-image ubuntu --container-save=/path/to/image.sqsh true
If this is also the case for you, we could have `--container-save` do the export in `task_init` and avoid this problem altogether.
But of course this would mean that this won't work anymore:
$ srun --container-image ubuntu --container-save=/path/to/image_with_emacs.sqsh bash -c 'apt-get update && apt-get install -y emacs-nox'
@flx42 we instruct our users to avoid creating new images. We provide commonly used images in enroot format, and additional dependencies usually take only a few seconds to install, so that can be done at job start. Sometimes that is not feasible, though, e.g. when lengthy compilation steps are involved. So our typical use case for `--container-save` involves many processing steps in between.
For image import we recommend that users call `enroot import` themselves, as we feel it's clearer what will happen when they interact with it directly.
This should be fixed now; I will try to do a new release soon.
Can confirm that we no longer encounter this issue. Thanks for the quick turnaround as always ❤️
Hello pyxis team!
We've recently started having issues with container exports hanging on OOM. After some digging, we believe this is due to Slurm behaving differently after an OS upgrade to Ubuntu 22.04, which uses cgroup v2.
What happens is that users specify an insufficient amount of memory for their job, and the memory limit is hit during container export. mksquashfs now hangs indefinitely, while the kernel constantly tries to find a process to kill but reports "Out of memory and no killable processes". Here's an example of the running processes when this occurs:
Note the `oom_score_adj` column. This value is added to the "badness" score that the OOM killer uses to determine which process to kill. It's set to -1000 for slurmstepd and its child processes, including mksquashfs. Badness is in [0, 1000], so after subtracting 1000 these processes always have a badness of 0. According to the kernel docs, that makes them unkillable, which is consistent with our earlier observation. Slurm does this to prevent slurmstepd from getting killed, which would mess up all kinds of things. Usually slurmstepd resets `oom_score_adj` to 0 for user processes, but that's not the case for processes started by plugins.
Digging through the Slurm sources, we found that this behavior can be adjusted by setting the SLURMSTEPD_OOM_ADJ env var for slurmd. Unfortunately, this simply changes `oom_score_adj` for slurmstepd itself, making it killable as well.
Right now we only see one other solution: make pyxis set `oom_score_adj` to 0 for its child processes.