NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

NVIDIA-DALI Capabilities issue #100

Open gulzainali98 opened 2 years ago

gulzainali98 commented 2 years ago

I am running an enroot container on a slurm cluster and I am getting an error. This is the whole error log: https://pastebin.com/96CYv9fs

I am trying to run the training from this repo: https://github.com/m-tassano/fastdvdnet

The error mentioned in Pastebin occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102

The code works fine on my local machine; the error occurs only on the slurm cluster. I searched a bit and came across this post, which describes a similar issue to mine: https://github.com/NVIDIA/DALI/issues/2229

After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable all the required driver capabilities. For plain docker images, it can be done using the following syntax: https://github.com/NVIDIA/nvidia-docker/issues/1128#issuecomment-557930809
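
If I understood that correctly, for plain docker it comes down to passing the capability list via the NVIDIA_DRIVER_CAPABILITIES environment variable, roughly like this (the image tag and the check command here are just examples on my side):

$ docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video nvidia/cuda:11.4.0-base ldconfig -p | grep nvcuvid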

However, I am not sure how to achieve this with our enroot containers.

flx42 commented 2 years ago

You can set the environment variable in your Dockerfile, or on the command line:

$ enroot import docker://nvidia/cuda:11.4.0-base
$ enroot create nvidia+cuda+11.4.0-base.sqsh 

$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video enroot start nvidia+cuda+11.4.0-base ldconfig -p | grep nvcuvid
        libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
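
For the Dockerfile route, it's just a single ENV line when building your image, for example (the capability list here simply mirrors the one above):

ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,video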

But if nvidia-smi doesn't work in the container, as you mentioned in https://github.com/NVIDIA/DALI/issues/3390#issuecomment-930492624, then you probably have a different problem.

gulzainali98 commented 2 years ago

I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D

nvidia-smi is also working correctly, and the basic pipeline executes fine. The error occurs only when the video pipeline is created.

joernhees commented 2 years ago

Interestingly, when running enroot via slurm's pyxis integration, the following does NOT work as one might expect:

NVIDIA_DRIVER_CAPABILITIES=all srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'
#output: 
compute,utility,video

The output here seems to be some default that depends on the image rather than our env var setting. (Using the --export=... argument of srun also doesn't work and probably confuses people, due to its default of passing everything when it is not given and the way it parses the commas in the value.)

Is it possible that the docker://nvidia/cuda:11.4.0-base image mentioned above has some ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility line in its Dockerfile, which in the case of pyxis overrides the same env var from the calling context?

Anyhow, the rather obvious workaround seems to be to just set the env var inside the container:

srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; echo $NVIDIA_DRIVER_CAPABILITIES ; ...'
#output: 
all

Another option seems to be an enroot env var config file, but that's probably overkill and more confusing if other containers need other settings...
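
(If I read the enroot configuration docs right, that would be a one-line .env file in the enroot config directory, something like the sketch below. The path and file name are my assumption, and it would then apply to every container you start:

# ~/.config/enroot/environ.d/50-nvidia-caps.env
NVIDIA_DRIVER_CAPABILITIES=all
)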

joernhees commented 2 years ago

Ah, I think I actually found it... it seems to be a scoping issue...

Observe the following 4 calls (all without the mentioned enroot env var config file):

# default
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157794 queued and waiting for resources
srun: job 157794 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
compute,utility

# inside only
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
all

# outside only
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
    libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility

# inside and outside
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
    libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

Notice how libnvcuvid is only available in the container if the outside env var was set. Also notice how the value seen inside the container does not reflect the outside env var!

Let's repeat the same with a pytorch image:

# default:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157806 queued and waiting for resources
srun: job 157806 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
    libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video

# inside only:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
    libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

# outside only:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
    libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video

# inside and outside:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
glasgow
    libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

# explicitly setting outside to compute only
$ NVIDIA_DRIVER_CAPABILITIES=compute LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
compute,utility,video

Summarizing, there seem to be two scopes, one outer and one inner, which are dangerously out of sync with each other, and that is probably what causes the confusion:

So if your base image already sets the capabilities right, magic apparently kicks in and you don't need to worry. If your base image doesn't, then things get confusing, and for now I'd suggest explicitly setting the NVIDIA_DRIVER_CAPABILITIES env var to the same value twice, once in the outer and once in the inner scope.
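
I.e. something along these lines (srun partition/resource flags and the actual command elided, placeholders in angle brackets):

NVIDIA_DRIVER_CAPABILITIES=all srun ... --container-image=<your-image>.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; <your command>'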

flx42 commented 2 years ago

Yes, sorry, it's a bit confusing. The environment of srun is passed to enroot, so it can influence how the container is started and thus whether libnvcuvid.so.1 gets mounted inside the container. However, the environment variables of srun and the environment variables of the container image are then merged, and the container's environment variables always take precedence.
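
To make the precedence explicit, here is a toy sketch of that merge rule (just an illustration, not the actual pyxis code; all names in it are made up):

#!/usr/bin/env bash
# Sketch: start from the job (srun) environment, then overlay the image's ENV,
# so a variable defined by the image wins inside the container.
declare -A job_env=(   [NVIDIA_DRIVER_CAPABILITIES]="all"             [LC_ALL]="C" )
declare -A image_env=( [NVIDIA_DRIVER_CAPABILITIES]="compute,utility" )

declare -A container_env
for k in "${!job_env[@]}";   do container_env[$k]=${job_env[$k]};   done
for k in "${!image_env[@]}"; do container_env[$k]=${image_env[$k]}; done  # image wins

for k in $(printf '%s\n' "${!container_env[@]}" | sort); do
    printf '%s=%s\n' "$k" "${container_env[$k]}"
done
# LC_ALL=C
# NVIDIA_DRIVER_CAPABILITIES=compute,utility   <- matches the "outside only" run above,
#                                                 even though libnvcuvid did get mounted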

flx42 commented 2 years ago

This was discussed in https://github.com/NVIDIA/pyxis/issues/26, but I admit that this particular case here is even more confusing than the problems I saw before.

While the mismatch of NVIDIA_DRIVER_CAPABILITIES is confusing, there is no reason to set NVIDIA_DRIVER_CAPABILITIES inside the container itself.