NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0
263 stars, 28 forks

Can't get image from local Docker registry #118

Status: Closed (rstober closed 11 months ago)

rstober commented 11 months ago

Hi, I'm trying to use srun to run a container image stored in a local Docker registry, but the import fails with a 400 Bad Request error:

[robert@cnode001 ~]$ srun --container-image='docker://master:5000#custom-pytorch-3-1:latest' --pty --gres=gpu:t4:1 /bin/bash
pyxis: importing docker image: docker://master:5000#custom-pytorch-3-1:latest
slurmstepd: error: pyxis: child 30165 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [INFO] Querying registry for permission grant
slurmstepd: error: pyxis:     [ERROR] URL http://master:5000/v2/custom-pytorch-3-1/manifests/latest returned error code: 400 Bad Request
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: cnode001: task 0: Exited with exit code 1

Other threads about this error suggest pulling the image with enroot directly. That fails with the same 400 error:

[root@cnode001 ~]# enroot import --output rms-enroot-test.sqsh 'docker://master:5000#custom-pytorch-3-1:latest'
[INFO] Querying registry for permission grant
[ERROR] URL http://master:5000/v2/custom-pytorch-3-1/manifests/latest returned error code: 400 Bad Request

The image does exist in the local Docker registry. I can pull or run it just fine from there using Docker:

[robert@cnode001 ~]$ docker pull master:5000/custom-pytorch-3-1:latest
latest: Pulling from custom-pytorch-3-1
7608715873ec: Pull complete
7c8937d0a90f: Pull complete
c5b9a46f3cd0: Pull complete
. . .

I've also verified with curl that the image is present in the local Docker registry:

[root@cnode001 ~]# curl -k -X GET https://master:5000/v2/_catalog
{"repositories":["custom-pytorch-3-1","nvaie/pytorch-3-1","nvidia/pytorch"]}

[root@cnode001 ~]# curl -k -X GET https://master:5000/v2/custom-pytorch-3-1/tags/list
{"name":"custom-pytorch-3-1","tags":["latest"]}

What am I doing wrong?

flx42 commented 11 months ago

I was told by @rstober that this is fixed now.
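
For anyone landing here with the same error, it may help to see how the image URI passed to srun and enroot maps to the registry URL in the log. The Python sketch below is illustrative only and is not enroot's implementation: `manifest_url` is a hypothetical helper, and it assumes the `docker://[USER@]REGISTRY#IMAGE[:TAG]` form used in the commands above. Note that the enroot log shows a request to http://master:5000, while the curl checks above used https://master:5000 with -k. The thread does not say whether that difference was the cause, so the scheme is left as a parameter for comparing both URLs.

```python
def manifest_url(image_uri: str, scheme: str = "https") -> str:
    """Build a registry v2 manifest URL from an enroot-style docker:// URI.

    Hypothetical helper for illustration; assumes the form
    docker://[USER@]REGISTRY#IMAGE[:TAG] with an explicit registry.
    """
    prefix = "docker://"
    if not image_uri.startswith(prefix):
        raise ValueError("expected a docker:// URI")
    rest = image_uri[len(prefix):]
    if "#" not in rest:
        raise ValueError("this sketch requires an explicit REGISTRY# part")

    # Everything before '#' is the registry, optionally prefixed by USER@.
    registry, image = rest.split("#", 1)
    registry = registry.rsplit("@", 1)[-1]

    # The tag follows the last ':' in the image part; default to 'latest'.
    name, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        name, tag = image, "latest"

    return f"{scheme}://{registry}/v2/{name}/manifests/{tag}"


if __name__ == "__main__":
    uri = "docker://master:5000#custom-pytorch-3-1:latest"
    # Print both candidate URLs so they can be tried by hand with curl.
    print(manifest_url(uri, scheme="http"))
    print(manifest_url(uri, scheme="https"))
```

Requesting each printed URL with curl from the compute node shows which scheme the registry actually answers on for the manifest endpoint.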