NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

401 Unauthorized error when fetching image manifest while running `srun` with a local sqsh image #67

Closed: crinavar closed this issue 2 years ago

crinavar commented 2 years ago

Hi community, we are currently having an issue running sqsh-format containers with Slurm's srun. For testing, I generated an example sqsh CUDA container from NGC, i.e.,

➜  03-GPU-single git:(main) ✗ srun --container-name=cuda:11.4.1-devel --container-save=cont.sqsh nvcc    
nvcc fatal   : No input files specified; use option --help for more information
srun: error: nodeGPU01: task 0: Exited with exit code 1

Then, when trying to use the container from the sqsh file, we get the following error:

➜  03-GPU-single git:(main) ✗ srun --container-image=cont.sqsh nvcc                           
pyxis: importing docker image ...
slurmstepd: error: pyxis: child 2610852 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     [INFO] Querying registry for permission grant
slurmstepd: error: pyxis:     [INFO] Authenticating with user: <anonymous>
slurmstepd: error: pyxis:     [INFO] Authentication succeeded
slurmstepd: error: pyxis:     [INFO] Fetching image manifest list
slurmstepd: error: pyxis:     [INFO] Fetching image manifest
slurmstepd: error: pyxis:     [ERROR] URL https://registry-1.docker.io/v2/library/cont.sqsh/manifests/latest returned error code: 401 Unauthorized
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: nodeGPU01: task 0: Exited with exit code 1

As we can see, it is not failing in the authentication phase, but in the "Fetching image manifest" part. I have already tried a few changes, with no success.

The .credentials file is configured for NGC only, and running srun with containers from NGC works without problems. Is there some additional authentication required when opening sqsh files?

Any help is really welcome

flx42 commented 2 years ago

Sorry about the confusion: you need to use ./cont.sqsh so that pyxis interprets the argument as a squashfs file instead of an image to pull from a registry.
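
For reference, a minimal sketch of the two forms (assuming cont.sqsh sits in the current working directory on the node):

# without a path prefix, the argument is treated as an image reference and pulled from a registry (hence the 401 above)
srun --container-image=cont.sqsh nvcc --version
# with the ./ prefix, pyxis interprets the argument as a local squashfs file
srun --container-image=./cont.sqsh nvcc --version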

crinavar commented 2 years ago

Many thanks, that was the issue! Sorry for opening an issue for such a small problem. I was testing just now and wanted to ask: is it normal for the sqsh CUDA container (around 4.9 GB) to take around 30 seconds to start?

➜  03-GPU-single git:(main) ✗ time srun --container-image=./cont.sqsh nvcc
srun: job 15720 queued and waiting for resources
srun: job 15720 has been allocated resources
nvcc fatal   : No input files specified; use option --help for more information
srun: error: nodeGPU01: task 0: Exited with exit code 1
srun --container-image=./cont.sqsh nvcc  0.01s user 0.02s system 0% cpu 37.178 total

Here, a group of researchers want to launch different Slurm jobs while sharing the same container through a single sqsh file. The enroot config is the following:

 ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
 ENROOT_CACHE_PATH /home/enroot-cache/group-$(id -g)
 ENROOT_DATA_PATH /home/enroot-data/user-$(id -u)
 #ENROOT_TEMP_PATH /home/enroot-tmp/user-$(id -u)
 ENROOT_TEMP_PATH /tmp
 ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
 ENROOT_MOUNT_HOME y
 ENROOT_RESTRICT_DEV y
 ENROOT_ROOTFS_WRITABLE y

Would this be a config that favors speed? Again, many thanks.

flx42 commented 2 years ago

You probably want to have ENROOT_DATA_PATH point to a local filesystem; in one of our deployments we use:

ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)

On that deployment, /tmp is a tmpfs mount, but that is only doable if you have sufficient RAM available on each node.

Also, since you are storing the image in the HOME folder, startup time will depend on how fast that filesystem is. If you have a parallel filesystem that you can use instead (e.g. Lustre, GPFS, ...), it might be a better choice.
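
As a rough illustration only (the /lustre/scratch mount point here is hypothetical; substitute whatever fast scratch space you have):

# store the shared .sqsh image on the parallel filesystem instead of NFS /home
srun --container-image=/lustre/scratch/$USER/cont.sqsh nvcc --version

Both locations matter for startup time: the image is read from wherever the .sqsh file lives, and it is then unpacked under ENROOT_DATA_PATH on the node.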

crinavar commented 2 years ago

Yes, indeed. We are using NFS, and /home is physically located on the storage node, accessed through InfiniBand, so we might be saturating that. We will have to re-think some of the design, but moving to /tmp would indeed improve speed. Thanks for all the help and the fast replies.

flx42 commented 2 years ago

Note that /tmp might not be a tmpfs; depending on your system setup, it might be part of the root partition, in which case it could be a slowish filesystem (e.g. RAID 1).

crinavar commented 2 years ago

I see. I did a quick check, and in fact /tmp is on an ext4 filesystem, not tmpfs. It is a DGX A100 node (we have just one).

➜  ~ df -T /tmp
Filesystem     Type  1K-blocks     Used  Available Use% Mounted on
/dev/md0       ext4 1844244028 27449744 1723042068   2% /