NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Pyxis not picking up our GPUs #110

Closed: slurmuser closed this issue 1 year ago

slurmuser commented 1 year ago

I am trying to run a container on Slurm. It works on CPUs, but when I run it on a GPU partition it cannot detect CUDA or any of the allocated GPUs.

I can get my container to start if I use --export=None, but then there is no GPU. If I set NVIDIA_DRIVER_CAPABILITIES=compute,utility, I get the error below.

This seems to indicate that we do not have nvidia-container-cli, or that something is off in its config.

pyxis: importing docker image ...
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack

This is the script I use to launch my container:

This happens no matter what image I use. Furthermore, if I don't export anything, I can enter the container, but nvidia-smi is unavailable and no GPUs are picked up, despite specifying --gpus=1 in my srun command.

flx42 commented 1 year ago

Which container image did you try? Perhaps your container image sets the environment variable NVIDIA_VISIBLE_DEVICES?

Try a simple CUDA image from DockerHub: nvidia/cuda:12.1.0-base-ubuntu22.04, without setting --export, and let me know if the situation is the same.
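As a rough sketch, the suggested test could look like the following. The helper name gpu_test_cmd and the use of nvidia-smi -L as the command to run are assumptions for illustration, not part of the suggestion; the script only prints the command line, so it can be inspected before being run on a cluster.

```shell
#!/bin/sh
# Image named in the suggestion above; with no registry prefix it is
# pulled from DockerHub.
IMAGE="nvidia/cuda:12.1.0-base-ubuntu22.04"

# Print the srun command line for a given image (hypothetical helper).
# --export is deliberately not set. nvidia-smi -L (list GPUs) is an
# assumed test command, not part of the original suggestion.
gpu_test_cmd() {
    printf 'srun --gpus=1 --container-image=%s nvidia-smi -L\n' "$1"
}

gpu_test_cmd "$IMAGE"

# To actually run it on a cluster (add -p <partition> as your site requires):
#   eval "$(gpu_test_cmd "$IMAGE")"
```

If the GPU hook works, the job should list the allocated GPU; if nvidia-container-cli is missing, the same hook error as above is expected.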

slurmuser commented 1 year ago

Hi, I tried as suggested and am getting the same issue:

`$ srun -p p_ml_a30 -c 4 --nodes=1 --gpus=1 --container-image nvcr.io#nvidia/cuda:12.1.0-base-ubuntu22.04 --pty bash -i

srun: job 56396714 queued and waiting for resources
srun: job 56396714 has been allocated resources
pyxis: importing docker image ...
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: xxxxx: task 0: Exited with exit code 1

$ printenv |grep -i nvidia

$ `

No NVIDIA environment variables are set, and I've used the suggested image.

flx42 commented 1 year ago

So I guess you really are missing libnvidia-container. You can follow the installation instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit
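One way to confirm this is to check whether nvidia-container-cli is on the PATH of the compute node. The following is a minimal sketch, not part of the original reply; check_tool is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a command is on PATH (hypothetical helper).
# Prints "found: <path>" and returns 0, or "missing: <name>" and returns 1.
check_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        echo "found: $tool_path"
    else
        echo "missing: $1"
        return 1
    fi
}

# The enroot hook in the log above fails when this command is not found.
check_tool nvidia-container-cli || true
```

Run it on the GPU node itself (for example through srun without --container-image), since the login node may have a different set of packages installed. If the tool is found, nvidia-container-cli info should print the driver version and the GPUs it can see.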

flx42 commented 1 year ago

Feel free to reopen or file a new bug if you still have issues, thanks!