NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Can't srun with pyxis #93

Closed: aboseria closed this issue 1 year ago

aboseria commented 1 year ago

I keep getting a "No space left on device" error when running srun with pyxis:

(base) [maboseri@head ~]$ srun --container-name=pytorch cat /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     enroot-nsenter: failed to create user namespace: No space left on device
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node001: task 0: Exited with exit code 1

Any recommendations on how to resolve?

3XX0 commented 1 year ago

Make sure that /proc/sys/user/max_user_namespaces is set appropriately. See https://github.com/NVIDIA/enroot/blob/master/doc/requirements.md
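
For readers hitting the same message: the kernel reports ENOSPC ("No space left on device") when a user namespace cannot be created because a namespace limit was reached, so the error is not about disk space. The following is a minimal sketch for checking the limit, not something taken from the enroot documentation. The function name `check_userns_limit` and the threshold of 1024 are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: report whether the user-namespace limit looks too low.
# check_userns_limit and the default threshold (1024) are illustrative
# assumptions, not values recommended by enroot or pyxis.
check_userns_limit() {
    local file="${1:-/proc/sys/user/max_user_namespaces}"
    local min="${2:-1024}"
    local value

    if [ ! -r "$file" ]; then
        # The file is absent or unreadable (for example, no user namespace support).
        echo "missing"
        return 2
    fi

    value="$(cat "$file")"
    if [ "$value" -lt "$min" ]; then
        echo "low: $value"
        return 1
    fi
    echo "ok: $value"
}

# Check the machine this script runs on.
check_userns_limit || true

# To raise the limit until the next reboot (requires root):
#   sysctl -w user.max_user_namespaces=65536
```

The check is only meaningful on the node where the container starts (node001 in the log above), which may be configured differently from the head node.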

aboseria commented 1 year ago

It was already set to 32 before this started happening.

flx42 commented 1 year ago

Can you try with a value higher than 32 to verify that it's the cause of the problem?

aboseria commented 1 year ago

Yup, it's now set to a higher value, but no luck. Is there a Unix command to clean up namespaces and kill defunct processes?

flx42 commented 1 year ago

You can try to use lsns and see what it reports.
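
As a rough sketch of how the lsns output can be narrowed down, the helper below counts the rows of one namespace type. It assumes the default lsns column layout (NS, TYPE, NPROCS, ...). The name `count_ns_type` is hypothetical and not part of util-linux.

```shell
#!/usr/bin/env bash
# Sketch: count namespaces of one type in `lsns` output read from stdin.
# count_ns_type is a hypothetical helper name, not part of util-linux.
count_ns_type() {
    local type="$1"
    # Column 2 of the default lsns layout is TYPE; NR > 1 skips the header row.
    awk -v t="$type" 'NR > 1 && $2 == t { n++ } END { print n + 0 }'
}

# Typical use on the affected node. Run as root to see every user's
# namespaces; an unprivileged user only sees its own processes.
if command -v lsns >/dev/null 2>&1; then
    lsns | count_ns_type user
fi
```

`lsns -t user` restricts the listing to user namespaces directly. A count that stays at 1 for a normal login session would suggest that leftover namespaces from that user are not what is exhausting the limit.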

aboseria commented 1 year ago

This is the output of lsns:

        NS TYPE   NPROCS     PID USER     COMMAND
4026531835 cgroup      6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l
4026531836 pid         6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l
4026531837 user        6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l
4026531838 uts         6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l
4026531839 ipc         6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l
4026531840 mnt         6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l
4026531992 net         6 4141500 maboseri /bin/bash /cm/shared/apps/jupyter/12.2.0/bin/jupyterhub-singleuser-gw --port=44839 --SingleUserNotebookApp.default_url=/l

aboseria commented 1 year ago

Any recommendations based on the output?

flx42 commented 1 year ago

Not really. Can you try a very high value for max_user_namespaces? I'm not sure what the logic is on Ubuntu 22.10, but I have this:

$ cat /proc/sys/user/max_user_namespaces
126604

And on another machine:

$ cat /proc/sys/user/max_user_namespaces
8254821

aboseria commented 1 year ago

Still no luck :(

flx42 commented 1 year ago

That's weird. What are your distro and kernel version?

I would also recommend testing with just enroot and outside of a Slurm job to try to simplify the situation a little bit (no pyxis, no cgroup from Slurm).
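
One way to follow this suggestion is sketched below, under stated assumptions. It prints the kernel and distro, then probes user namespace creation with unshare(1) from util-linux, which takes enroot out of the picture as well. The helper name `userns_probe` is hypothetical, and the enroot commands in the trailing comments use `ubuntu` purely as an example image.

```shell
#!/usr/bin/env bash
# Sketch: probe user namespace creation without Slurm, pyxis, or enroot.
# userns_probe is a hypothetical helper name.
userns_probe() {
    # --user creates a new user namespace; --map-root-user maps the
    # calling user to root inside it. `true` exits immediately.
    if unshare --user --map-root-user true 2>/dev/null; then
        echo "userns: ok"
    else
        echo "userns: failed"
    fi
}

uname -r                                    # kernel version
grep PRETTY_NAME /etc/os-release || true    # distro name, if the file has one
userns_probe

# Then try the container start with enroot alone, from a plain login
# shell on the compute node rather than inside a Slurm job:
#   enroot import docker://ubuntu
#   enroot create ubuntu.sqsh
#   enroot start ubuntu cat /etc/os-release
```

If the probe already fails in a plain shell on the node, the problem lies with the kernel or its sysctl settings rather than with pyxis or the Slurm cgroup.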