NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

mksquashfs unkillable by OOM killer during container export #120

Closed jfolz closed 1 year ago

jfolz commented 1 year ago

Hello pyxis team!

We've recently started having issues with container export hanging when jobs run out of memory. After some digging, we believe this is due to Slurm behaving differently after our OS upgrade to Ubuntu 22.04, which uses cgroup v2.

What happens is that users specify an insufficient amount of memory for their job, and the memory limit is hit during container export. mksquashfs then hangs indefinitely while the kernel repeatedly tries to find a process to kill, but reports "Out of memory and no killable processes". Here's an example of the running processes when this occurs:

[  pid  ]   uid    tgid total_vm      rss pgtables_bytes swapents oom_score_adj       name
[3077195]     0 3077195   104945     2751         114688        0         -1000 slurmstepd
[3078585]  3931 3078585     2107     1209          53248        0         -1000       bash
[3078613]  3931 3078613  2128310  1570563       16453632        0         -1000 mksquashfs

Note the oom_score_adj column. This value is added to the "badness" score the OOM killer uses to decide which process to kill. It's set to -1000 for slurmstepd and its child processes, including mksquashfs. Badness is in [0, 1000], so adding -1000 clamps these processes to a badness of 0. According to the kernel docs, that makes them unkillable, which is consistent with our earlier observation. Slurm does this to prevent slurmstepd from getting killed, which would mess up all kinds of things. Usually slurmstepd resets oom_score_adj to 0 for user processes, but it doesn't do so for processes started by plugins.
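For reference, the kernel exposes these knobs under /proc, so it's easy to check whether a given process is exempt. A quick sketch (not pyxis code; the helper name is ours):

```python
from pathlib import Path

def oom_badness_inputs(pid="self"):
    """Read the OOM-killer inputs for a process from /proc.

    oom_score_adj is added to the kernel's badness score (range
    [0, 1000]); a value of -1000 clamps the result to 0, making
    the process invisible to the OOM killer.
    """
    proc = Path("/proc") / str(pid)
    adj = int((proc / "oom_score_adj").read_text())
    score = int((proc / "oom_score").read_text())
    return adj, score

adj, score = oom_badness_inputs()
print(f"oom_score_adj={adj} oom_score={score}")
```

Running this inside a job step launched by slurmstepd would show oom_score_adj=-1000, matching the table above.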

Digging through the Slurm sources, we found that this behavior can be adjusted by setting the SLURMSTEPD_OOM_ADJ env var for slurmd. Unfortunately, this simply changes oom_score_adj for slurmstepd itself, making it killable as well.

Right now we only see one other solution: make pyxis set oom_score_adj to 0 for its child processes.
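Sketched in Python for brevity (pyxis itself is a C SPANK plugin, and the helper name here is ours): raising oom_score_adj back toward 0 needs no special privilege, while lowering it would require CAP_SYS_RESOURCE.

```python
from pathlib import Path

def make_oom_killable(pid="self"):
    """Reset oom_score_adj to 0 so the OOM killer may pick this process.

    Raising the value (e.g. from -1000 toward 0) is allowed for the
    process owner; lowering it requires CAP_SYS_RESOURCE.
    """
    Path(f"/proc/{pid}/oom_score_adj").write_text("0")

# A plugin would call this in each child it spawns, e.g. right after
# fork() in the child, before exec'ing mksquashfs.
make_oom_killable()
print(Path("/proc/self/oom_score_adj").read_text().strip())
```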

flx42 commented 1 year ago

Nice find. It looks like this happens because Slurm does not run the task_exit SPANK callback inside the cgroup of the job step, unlike task_init. I filed https://bugs.schedmd.com/show_bug.cgi?id=17426 to ask whether this is intended and can be changed. For the same reason, container import should not have this problem.

But even if Slurm pushes a change, it will probably take time to land, so pyxis might need to modify oom_score_adj in task_exit as a workaround.

flx42 commented 1 year ago

@jfolz question for you before I submit any change to pyxis, how is --container-export typically used by your users?

A lot of our use cases use true as the command, to act like an enroot import, e.g.:

$ srun --container-image ubuntu --container-save=/path/to/image.sqsh true

If this is also the case for you, we could have --container-save do the export in task_init and avoid this problem altogether.

But of course, this would mean that commands like the following won't work anymore:

$ srun --container-image ubuntu --container-save=/path/to/image_with_emacs.sqsh bash -c 'apt-get update && apt-get install -y emacs-nox'

jfolz commented 1 year ago

@flx42 we instruct our users to avoid creating new images. We provide commonly used images in enroot format, and additional dependencies usually take only a few seconds to install, so that can be done at job start. Sometimes that is not feasible, though, e.g., when lengthy compilation steps are involved. So our typical use case for --container-save involves many processing steps in between. For image import, we recommend users call enroot import themselves, as we feel it's clearer what will happen when they interact with it directly.

flx42 commented 1 year ago

This should be fixed now; I will try to do a new release soon.

jfolz commented 1 year ago

Can confirm that we no longer encounter this issue. Thanks for the quick turnaround as always ❤️