NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Pyxis fails with docker socket permission denied #142

Closed RamHPC closed 3 weeks ago

RamHPC commented 3 weeks ago

Pyxis fails with permission denied while running a container.

slurmstepd: error: pyxis: child 2006578 failed with error code: 1
srun: error: gpu2: task 1: Exited with exit code 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Fetching image
slurmstepd: error: pyxis: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack

The user is a member of the docker group. I verified this with "groups" and by running "docker run hello-world".

$ ll /var/run/docker.sock
srw-rw---- 1 root docker 0 Jun 11 18:22 /var/run/docker.sock

I am able to run the container only if I change the owner of the socket from "root" to "$USER" (with chown). This is annoying and causes issues for users.

flx42 commented 3 weeks ago

There are multiple things that can go wrong here, not sure I can really help you fully debug your docker configuration on your cluster.

I would start by testing with docker inside a Slurm job (without pyxis):

$ srun docker pull ubuntu:24.04

We also recommend to use docker:// instead of dockerd:// for the image name.

RamHPC commented 3 weeks ago

> There are multiple things that can go wrong here, not sure I can really help you fully debug your docker configuration on your cluster.
>
> I would start by testing with docker inside a Slurm job (without pyxis):
>
> $ srun docker pull ubuntu:24.04

srun docker pull/push works fine with the "root" user. srun docker build fails with the same permission error:

$ srun -p aiml docker build --pull -t dl385:5000/mlperf-nvidia:image_segmentation-mxnet .
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
srun: error: gpu2: task 0: Exited with exit code 1

> We also recommend to use docker:// instead of dockerd:// for the image name.

docker:// is not working, which is why I switched to dockerd://.

The error when using docker:// is:

slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] Invalid image reference: docker://dl385:5000/mlperf-nvidia:image_segmentation-mxnet
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: child 1248759 failed with error code: 1
srun: error: gpu2: task 1: Exited with exit code 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] Invalid image reference: docker://dl385:5000/mlperf-nvidia:image_segmentation-mxnet
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
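
A note on the "Invalid image reference" error above: enroot documents its registry URI as docker://[USER@][REGISTRY#]IMAGE[:TAG], with "#" rather than "/" separating the registry from the image name, so a Docker-style reference such as dl385:5000/mlperf-nvidia:... is likely rejected for that reason. The sketch below converts a Docker-style reference into that form. The function name to_enroot_uri is hypothetical (it is not part of enroot or pyxis), and the rule for recognizing a registry host is an assumption borrowed from Docker's convention.

```shell
#!/bin/sh
# Hypothetical helper: rewrite REGISTRY/IMAGE:TAG as docker://REGISTRY#IMAGE:TAG,
# the form documented by enroot. Not part of enroot or pyxis.
to_enroot_uri() {
    ref=$1
    first=${ref%%/*}    # text before the first "/", or the whole string if none
    case $ref in
        */*)
            # Assumption: the first component is a registry only if it looks
            # like a host (contains "." or ":", or is exactly "localhost").
            case $first in
                *.*|*:*|localhost)
                    echo "docker://${first}#${ref#*/}"
                    return 0
                    ;;
            esac
            ;;
    esac
    # No registry component detected: pass the reference through unchanged.
    echo "docker://${ref}"
}

to_enroot_uri "dl385:5000/mlperf-nvidia:image_segmentation-mxnet"
# prints: docker://dl385:5000#mlperf-nvidia:image_segmentation-mxnet
to_enroot_uri "ubuntu:24.04"
# prints: docker://ubuntu:24.04
```

The converted reference would then be passed to srun via --container-image. Whether the private registry in this thread is otherwise reachable from enroot (TLS, credentials) is not something the thread establishes.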

flx42 commented 3 weeks ago

So the fact that srun -p aiml docker build --pull -t dl385:5000/mlperf-nvidia:image_segmentation-mxnet . fails shows that it's not an enroot/pyxis issue, so I can't help you here.

RamHPC commented 3 weeks ago

Thank you for the hint. The node had a different GID for the docker group, and that was causing the issue. It is resolved now.
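
For anyone hitting the same symptom, a GID mismatch of this kind can be checked by comparing the numeric group that owns the socket with the numeric groups of the calling user on the node where the job runs. The following is a minimal sketch, assuming GNU coreutils stat (the -c flag). The function name socket_group_check is hypothetical, and the node name gpu2 in the comments is taken from the logs in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report whether the calling user belongs to the numeric
# group that owns the given path. Assumes GNU stat (-c '%g').
socket_group_check() {
    path=$1
    sock_gid=$(stat -c '%g' "$path") || return 2    # path missing or unreadable
    for gid in $(id -G); do
        if [ "$gid" = "$sock_gid" ]; then
            echo "ok: gid $sock_gid is one of your groups"
            return 0
        fi
    done
    echo "mismatch: $path is owned by gid $sock_gid, your gids are: $(id -G)"
    return 1
}

# Possible usage on a cluster (not run here):
#   socket_group_check /var/run/docker.sock
#   getent group docker                  # GID on the login node
#   srun -w gpu2 getent group docker     # GID on the compute node
# If the two getent lines show different numeric GIDs, group membership by name
# can look correct while access to the socket is still denied.
```

Comparing numeric IDs rather than group names matters because "groups" and "ll" print names, which can hide a difference between nodes.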