NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

`Permission denied` errors when importing Docker image #121

Closed. javrtg closed this issue 10 months ago.

javrtg commented 10 months ago

Hi,

We're encountering Permission denied errors when attempting to import Docker images with Pyxis through SLURM job submissions. Specifically, this happens when using containers from the NVIDIA catalog.

This is an example of the error logs:

```shell
pyxis: importing docker image ...
slurmstepd: error: pyxis: child 1976120 failed with error code: 5
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [INFO] Downloading 7 missing layers...
slurmstepd: error: pyxis: [INFO] Extracting image layers...
slurmstepd: error: pyxis: tar (child): /raid/enroot-cache/group-18000/0b0815e859edb53949be96f250c81f60d18bbe6fbdc9cc85f768ed7eae96b969: No se puede efectuar open: Permiso denegado
slurmstepd: error: pyxis: tar (child): Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar: Child returned status 2
slurmstepd: error: pyxis: tar: Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar (child): /raid/enroot-cache/group-18000/8abd77f1cc9e5692e7294a87d457616afa33abb8a4ec0e6b8857e41d1fb36c90: No se puede efectuar open: Permiso denegado
slurmstepd: error: pyxis: tar (child): Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar: Child returned status 2
slurmstepd: error: pyxis: tar: Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar (child): /raid/enroot-cache/group-18000/bac1f8a1c195b96b2fe3ef7cf127e12a19fee10d78eacb8111a95dd46e23b0d7: No se puede efectuar open: Permiso denegado
slurmstepd: error: pyxis: tar (child): Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar: Child returned status 2
slurmstepd: error: pyxis: tar: Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar (child): /raid/enroot-cache/group-18000/50292f59408b2e21d9b29cd44e4566156550099645d37203ff3ee2d1dc5036c5: No se puede efectuar open: Permiso denegado
slurmstepd: error: pyxis: tar (child): Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar: Child returned status 2
slurmstepd: error: pyxis: tar: Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar (child): /raid/enroot-cache/group-18000/56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027: No se puede efectuar open: Permiso denegado
slurmstepd: error: pyxis: tar (child): Error is not recoverable: exiting now
slurmstepd: error: pyxis: tar: Child returned status 2
slurmstepd: error: pyxis: tar: Error is not recoverable: exiting now
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
```

Based on some tests we have done, and in particular on error-log lines like the one below:

```shell
/raid/enroot-cache/group-18000/0b0815e859edb53949be96f250c81f60d18bbe6fbdc9cc85f768ed7eae96b969: No se puede efectuar open: Permiso denegado
# (English translation of the Spanish part: "Cannot perform open: Permission denied")
```

we believe the error may be caused by the cached container image becoming associated with the first user who imports it, which prevents subsequent users from using the same container. Could you please confirm whether this is the expected behavior?
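For completeness, a quick way to inspect who owns the cached layer files and what their permission bits are (the path below is the one from our error logs; other setups will differ) would be something like:

```shell
# Show owner, group, and mode of the shared enroot layer cache.
ls -l /raid/enroot-cache/group-18000/

# Same information in compact form: octal mode, owner:group, file name.
stat -c '%a %U:%G %n' /raid/enroot-cache/group-18000/*
```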

For context, the above error arises when submitting the batch script below with sbatch. This specific script requests the NVIDIA container cuda:12.2.0-devel-ubuntu20.04:

```shell
#!/bin/bash

#SBATCH -N 1
#SBATCH --job-name="name"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=50G
#SBATCH --gres=gpu:1

srun \
    --container-mounts=/raid/ropert/user/:/workspace/user \
    --container-workdir=/workspace/user \
    --container-image=nvcr.io#nvidia/cuda:12.2.0-devel-ubuntu20.04 \
    bash foo.sh
```
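In case it helps to isolate the problem, the same image import can presumably be reproduced outside of Slurm by calling enroot directly; a sketch is below (this assumes the enroot CLI is available on the node and that the docker:// URI form mirrors the --container-image argument):

```shell
# Sketch: import the same image with enroot directly (no pyxis/Slurm involved),
# so a cache-permission problem should reproduce on its own.
# Assumes enroot is on PATH and uses the same shared ENROOT_CACHE_PATH.
enroot import docker://nvcr.io#nvidia/cuda:12.2.0-devel-ubuntu20.04
```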

A workaround we have found consists of deleting the cached files inside the folder /raid/enroot-cache/group-18000 mentioned in the error logs. After doing so, different users seem to be able to use the same NVIDIA container. However, we are unsure whether this is the recommended solution. Could you please provide guidance or an alternative solution?
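Concretely, the workaround amounts to something like the command below, run by the owner of the cached files; it simply forces enroot to re-download the layers on the next import:

```shell
# Workaround sketch: drop the cached layer blobs for our group so the next
# import downloads them again. Destructive: the whole group loses the cache.
rm -rf /raid/enroot-cache/group-18000/*
```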

Thank you

flx42 commented 10 months ago

> we believe the error may be caused by the cached container image becoming associated with the first user who imports it, which prevents subsequent users from using the same container. Could you please confirm whether this is the expected behavior?

It should work; that's why the ENROOT_CACHE_PATH is shared with other members of your group. Perhaps it's due to special permission settings on the folder or files, or to a custom umask setting.

What are the permissions on the layer files in this directory? They should be 640 to enable sharing layers with other users.
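For existing layers, something along these lines should work (the path is the one from your logs; whether changing the umask is appropriate depends on how enroot is invoked on your cluster, so treat this as a sketch rather than a definitive fix):

```shell
# List layer files whose mode is not exactly 640 (GNU find syntax).
find /raid/enroot-cache/group-18000 -type f ! -perm 640

# Make existing layers group-readable so other group members can reuse them.
find /raid/enroot-cache/group-18000 -type f -exec chmod 640 {} +

# New files inherit their mode from the creating process's umask:
# with umask 027, files created with mode 666 end up as 640.
umask 027
```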

javrtg commented 10 months ago

Thank you for the prompt response!

Yes, that is indeed the issue: our permissions are 600. We will see if we can change them to 640 :)

Thanks again!