RamHPC closed 3 weeks ago
Multiple things can go wrong here, and I'm not sure I can help you fully debug the Docker configuration on your cluster.
I would start by testing with docker inside a Slurm job (without pyxis):
$ srun docker pull ubuntu:24.04
We also recommend using docker:// instead of dockerd:// for the image name.
> There are multiple things that can go wrong here, not sure I can really help you fully debug your docker configuration on your cluster.
> I would start by testing with docker inside a Slurm job (without pyxis):
> $ srun docker pull ubuntu:24.04
srun docker pull/push works fine as the "root" user. srun docker build hits the same permission issue.
$ srun -p aiml docker build --pull -t dl385:5000/mlperf-nvidia:image_segmentation-mxnet .
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
srun: error: gpu2: task 0: Exited with exit code 1
> We also recommend to use docker:// instead of dockerd:// for the image name.
docker:// is not working, and that's when I switched to dockerd://.
The error when using docker://:
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] Invalid image reference: docker://dl385:5000/mlperf-nvidia:image_segmentation-mxnet
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: child 1248759 failed with error code: 1
srun: error: gpu2: task 1: Exited with exit code 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] Invalid image reference: docker://dl385:5000/mlperf-nvidia:image_segmentation-mxnet
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
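One thing that may be worth checking for the "Invalid image reference" error: as far as I know, enroot's docker:// URIs take the form docker://[USER@][REGISTRY#]IMAGE[:TAG], with "#" rather than "/" between the registry and the image name. Whether that is the actual cause here is an assumption. The sketch below shows the rewrite; to_enroot_ref is a hypothetical helper, and its registry heuristic (first path component contains "." or ":" or equals "localhost") is also an assumption, not something taken from enroot or pyxis.

```shell
#!/bin/sh
# to_enroot_ref (hypothetical helper): rewrite a Docker-style reference
# REGISTRY/IMAGE[:TAG] into docker://REGISTRY#IMAGE[:TAG].
# References without a registry component are only given the docker:// prefix.
to_enroot_ref() {
    ref=$1
    first=${ref%%/*}      # text before the first slash
    case "$ref" in
        */*)
            case "$first" in
                # Assumed heuristic: a registry host contains "." or ":",
                # or is literally "localhost".
                *.*|*:*|localhost)
                    printf 'docker://%s#%s\n' "$first" "${ref#*/}"
                    return 0
                    ;;
            esac
            ;;
    esac
    printf 'docker://%s\n' "$ref"
}

# Possible usage with pyxis (not verified on this cluster):
#   srun -p aiml --container-image="$(to_enroot_ref dl385:5000/mlperf-nvidia:image_segmentation-mxnet)" true
```

This only changes the spelling of the reference; it does not address the Docker socket permission error.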
The fact that "srun -p aiml docker build --pull -t dl385:5000/mlperf-nvidia:image_segmentation-mxnet ." fails shows that this is not an enroot/pyxis issue, so I can't help you here.
Thank you for the hint. The node had a different "gid" for the docker group, and that was causing the issues. It is resolved now.
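For anyone hitting the same thing, a gid mismatch like this can be spotted by comparing the docker group entry on the submit host with the one on the compute node. This is a minimal sketch, assuming getent is available on both sides; group_gid and same_docker_gid are hypothetical helper names, and the partition name is the one used earlier in this thread.

```shell
#!/bin/sh
# group_gid (hypothetical helper): extract the numeric gid from a group
# database line of the form name:password:gid:members.
group_gid() {
    printf '%s\n' "$1" | cut -d: -f3
}

# same_docker_gid (hypothetical helper): succeed only if two group lines
# carry the same numeric gid.
same_docker_gid() {
    [ "$(group_gid "$1")" = "$(group_gid "$2")" ]
}

# Possible usage (assumes the "aiml" partition from this thread):
#   local_line=$(getent group docker)
#   node_line=$(srun -p aiml getent group docker)
#   same_docker_gid "$local_line" "$node_line" || echo "docker gid differs between hosts"
```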
Pyxis fails with permission denied while running a container.
slurmstepd: error: pyxis: child 2006578 failed with error code: 1
srun: error: gpu2: task 1: Exited with exit code 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [INFO] Fetching image
slurmstepd: error: pyxis: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
The user is a member of the docker group; I verified this with "groups" and by running "docker run hello-world".
$ ll /var/run/docker.sock
srw-rw---- 1 root docker 0 Jun 11 18:22 /var/run/docker.sock
I am able to run the container only if I change the owner of /var/run/docker.sock from "root" to "$USER" (with chown). This is annoying and causes issues for users.
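Since the socket is mode srw-rw---- and owned by root:docker, access depends on the numeric gid of the socket being among the groups of the process inside the job step. A rough way to check that is sketched below, assuming "id -G" and GNU "stat -c" are available on the node; has_gid is a hypothetical helper name.

```shell
#!/bin/sh
# has_gid (hypothetical helper): succeed if the gid in $2 appears in the
# space-separated gid list in $1 (the format printed by `id -G`).
has_gid() {
    for g in $1; do
        if [ "$g" = "$2" ]; then
            return 0
        fi
    done
    return 1
}

# Possible usage inside a job step (assumes the "aiml" partition from this thread):
#   srun -p aiml sh -c 'id -G; stat -c %g /var/run/docker.sock'
# and then, with the two values captured on the node:
#   has_gid "$(id -G)" "$(stat -c %g /var/run/docker.sock)" || echo "socket gid not in process groups"
```

If the check fails on the compute node but passes on the login node, the group database or the socket ownership differs between the two hosts.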