gulzainali98 opened this issue 3 years ago
You can set the environment variable in your Dockerfile, or on the command line:
$ enroot import docker://nvidia/cuda:11.4.0-base
$ enroot create nvidia+cuda+11.4.0-base.sqsh
$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video enroot start nvidia+cuda+11.4.0-base ldconfig -p | grep nvcuvid
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
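For the Dockerfile route, here is a minimal sketch of baking the default into your own image (the base tag and the image name my-cuda-video are placeholders for illustration, not something from this thread):
$ cat > Dockerfile <<'EOF'
FROM nvidia/cuda:11.4.0-base
# bake the capabilities into the image so every container started from it gets them by default
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
EOF
$ docker build -t my-cuda-video .
After building the image (and pushing it to a registry you can reach), you can import it with enroot import as above; the baked-in ENV then becomes the image default that the scoping discussion below runs into.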
But if nvidia-smi doesn't work in the container, as you mentioned in https://github.com/NVIDIA/DALI/issues/3390#issuecomment-930492624, then you probably have a different problem.
I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D
nvidia-smi is also working correctly, and the basic pipeline executes fine. The error occurs only when the video pipeline is created.
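One quick way to narrow this down (a suggestion, not taken from the pastebin) is to run the library check inside the same kind of job in which the training fails: if libnvcuvid.so.1 is not listed there, the video reader cannot be created no matter what nvidia-smi reports. A sketch, with placeholders for your partition and container image:
# check the GPU and the decode library inside the job's container before starting the training
$ srun -p <partition> --ntasks=1 --gpus-per-task=1 --container-image=<your-image>.sqsh \
    bash -c 'nvidia-smi -L ; ldconfig -p | grep nvcuvid'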
Interestingly, when running enroot via Slurm's pyxis integration, the following does NOT work as one might expect:
NVIDIA_DRIVER_CAPABILITIES=all srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'
#output:
compute,utility,video
The output here seems to be a default that depends on the image rather than our env var setting. (Using the --export=... argument of srun also doesn't work, and it probably confuses people because srun defaults to passing everything when the argument is absent, and because --export splits its value on , so a comma-separated capability list gets mangled.)
Is it possible that the docker://nvidia/cuda:11.4.0-base image mentioned above has some ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility line in its Dockerfile, which in the case of pyxis overrides the same env var from the current context? (One way to check this is sketched below.)
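A sketch of how one might check whether the image itself bakes in such a default, by printing the environment the container actually starts with (using the enroot image created at the top), or by inspecting the image metadata if it is still available in a local docker daemon:
$ enroot start nvidia+cuda+11.4.0-base env | grep NVIDIA_DRIVER_CAPABILITIES
# or query the image configuration directly from a local docker daemon
$ docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' nvidia/cuda:11.4.0-base | grep NVIDIA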
Anyhow, the rather obvious workaround seems to be to just set the env var inside the container:
srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; echo $NVIDIA_DRIVER_CAPABILITIES ; ...'
#output:
all
Another option seems to be an enroot env var config file (sketched below), but that's probably overkill and more confusing if other containers need different settings...
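For completeness, that config-file option would look roughly like this; a sketch assuming a user-level enroot configuration directory under ~/.config/enroot (check ENROOT_CONFIG_PATH on your system; the file name is arbitrary). Note that it applies to every container you start:
$ mkdir -p ~/.config/enroot/environ.d
$ cat > ~/.config/enroot/environ.d/nvidia-caps.env <<'EOF'
NVIDIA_DRIVER_CAPABILITIES=all
EOF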
Ah, I think I actually found it... it seems to be a scoping issue...
Observe the following 4 calls (all without the mentioned enroot env var config file):
# default
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157794 queued and waiting for resources
srun: job 157794 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
compute,utility
# inside only
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
all
# outside only
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility
# inside and outside
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvidia+cuda+11.4.0-base.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all
Notice how libnvcuvid is only available in the container if the outside env var was set. Also notice how, inside the container, $NVIDIA_DRIVER_CAPABILITIES does not reflect the outside env var!
Let's repeat the same with a pytorch image:
# default:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157806 queued and waiting for resources
srun: job 157806 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video
# inside only:
$ LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all
# outside only:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
compute,utility,video
# inside and outside:
$ NVIDIA_DRIVER_CAPABILITIES=all LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'export NVIDIA_DRIVER_CAPABILITIES=all ; hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
glasgow
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all
# explicitly setting outside to compute only
$ NVIDIA_DRIVER_CAPABILITIES=compute LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'hostname ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
...
glasgow
compute,utility,video
Summarizing, there seem to be two scopes, one outer and one inner, which are dangerously out of sync and probably cause the confusion: the outer env var decides which libraries get mounted into the container, while all you can observe afterwards is the $NVIDIA_DRIVER_CAPABILITIES var inside the container!!! So if your base image already sets the capabilities right, apparently magic kicks in and you don't need to worry. If your base image doesn't, then things get confusing, and for now I'd suggest explicitly setting the NVIDIA_DRIVER_CAPABILITIES env var twice, to the same value, in the outer and the inner scope.
Yes, sorry, it's a bit confusing. The environment of srun is passed to enroot, so it can influence how the container is started and thus whether libnvcuvid.so.1 is mounted inside the container. However, the environment variables of srun and the environment variables of the container are then merged, and the container's environment variables always take precedence.
This was discussed in https://github.com/NVIDIA/pyxis/issues/26, but I admit that this particular case here is even more confusing than the problems I saw before.
While the mismatch of NVIDIA_DRIVER_CAPABILITIES is confusing, there is no reason to set NVIDIA_DRIVER_CAPABILITIES inside the container.
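In other words, only the outer value controls the mount. A minimal sketch of such an invocation, reusing the partition and image path from the experiments above and with a placeholder for the actual workload:
$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video srun -p V100-16GB --ntasks=1 --gpus-per-task=1 \
    --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh \
    bash -c 'ldconfig -p | grep nvcuvid && <your DALI training command>'
# the grep doubles as a guard: the workload only starts if libnvcuvid is actually visible in the container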
I am running an enroot container on a Slurm cluster and I am getting the following error:
This is the whole error: https://pastebin.com/96CYv9fs I am trying to run training for this repo: https://github.com/m-tassano/fastdvdnet
The error mentioned in the Pastebin occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102
The code works fine on my local machine; the error occurs only on the Slurm cluster. I searched a bit and came across this issue, which is similar to mine: https://github.com/NVIDIA/DALI/issues/2229
After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable the required driver capabilities. With plain Docker this can be done using the syntax shown here: https://github.com/NVIDIA/nvidia-docker/issues/1128#issuecomment-557930809
However, I am not sure how to achieve this with our enroot containers.
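For reference, the Docker-side trick referenced above boils down to passing the capabilities to the container at run time; a hedged sketch (not taken from this thread) of what that looks like with the NVIDIA container toolkit:
$ docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video \
    nvidia/cuda:11.4.0-base ldconfig -p | grep nvcuvid
The enroot counterpart is exactly the NVIDIA_DRIVER_CAPABILITIES=... enroot start invocation shown at the top of this thread, and the srun variants discussed in the scoping experiments above cover the pyxis case.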