Closed crinavar closed 1 year ago
Try using NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics srun -p gpu --container-image ...
Many thanks! It worked perfectly
ENV NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics should be added to the Dockerfile then.
Hi @flx42
Sorry to bring this up again, but today I found that setting the environment variable in the srun command, as suggested, no longer takes effect. From some related posts, I understand that the container's own definitions override the values chosen at srun time.
Looking inside the CUDA container (stored with enroot), I checked /etc/environment and indeed the container was setting the variable to NVIDIA_DRIVER_CAPABILITIES=utility,compute, so I added "graphics". Then, when launching srun ... bash with the container, it prints the variable properly (echo $NVIDIA_DRIVER_CAPABILITIES).
However, when launching the OptiX program (our custom CUDA+OptiX code), it still fails with "Library not found" because the file /usr/lib/x86_64-linux-gnu/libnvoptix.so.1 is still not properly loaded.
Do you know if editing /etc/environment should have been sufficient for this file to load properly? At this point I am not sure if I need to install the driver inside the container, or if the native driver from the node is sufficient.
In this situation, you can't rely on the value of NVIDIA_DRIVER_CAPABILITIES from inside the container. It will indeed take the value from the container image over the value from your srun environment, but NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics srun ... will correctly apply to enroot and should work properly.
Perhaps you have an old version of libnvidia-container? Or perhaps libnvoptix.so.1 is not present on the host system at all?
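For anyone following along, the two checks suggested above could be run on the compute node roughly like this. This is only a sketch: find_lib is a hypothetical helper written for this example, and the two library directories are assumptions that may differ on your distribution. The only external tool used is nvidia-container-cli, which ships with libnvidia-container.

```shell
#!/bin/sh
# Hypothetical helper: print the first path where the given library file
# name exists among the listed directories; return non-zero if none has it.
find_lib() {
    lib="$1"
    shift
    for dir in "$@"; do
        if [ -e "$dir/$lib" ]; then
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    return 1
}

# Search directories are assumptions; adjust them for your host.
if path=$(find_lib libnvoptix.so.1 /usr/lib/x86_64-linux-gnu /usr/lib64); then
    echo "found: $path"
else
    echo "libnvoptix.so.1 not found in the searched directories"
fi

# Report the libnvidia-container version if its CLI is installed.
if command -v nvidia-container-cli >/dev/null 2>&1; then
    nvidia-container-cli --version
else
    echo "nvidia-container-cli not found in PATH"
fi
```

If the library is reported as missing on the host, it cannot be mounted into the container either, whatever value NVIDIA_DRIVER_CAPABILITIES has.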
Hi @flx42
I started checking the packages and indeed I was missing one, libnvidia-gl-535-server in my case. Now it works just as you mentioned (no need to modify the container files):
NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics srun ...
Many thanks again
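As a recap of the working setup, here is a minimal sketch of how the capability list could be built before calling srun, so that "graphics" is added without clobbering a value already exported in the shell. ensure_capability is a hypothetical helper written for this example, and the default of utility,compute is an assumption taken from what the container image was setting.

```shell
#!/bin/sh
# Hypothetical helper: append a capability to a comma-separated list
# unless it is already present; an empty list yields just the capability.
ensure_capability() {
    caps="$1"
    want="$2"
    case ",$caps," in
        *",$want,"*) printf '%s\n' "$caps" ;;
        ",,")        printf '%s\n' "$want" ;;
        *)           printf '%s\n' "$caps,$want" ;;
    esac
}

# Start from the current value, or from utility,compute if unset (assumption).
caps=$(ensure_capability "${NVIDIA_DRIVER_CAPABILITIES:-utility,compute}" graphics)
echo "NVIDIA_DRIVER_CAPABILITIES=$caps"

# The value is then passed on the srun command line, as in the thread:
#   NVIDIA_DRIVER_CAPABILITIES="$caps" srun ...
```

The variable has to be set in the environment of srun itself, since the value seen inside the container comes from the image.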
Dear all, although this is an OptiX error, I wonder whether it is triggered by pyxis and whether there is a solution.
We recently realized that we cannot run OptiX jobs through Pyxis. When we launch a Slurm job with a CUDA container and run the binary, we get an error of the type Optix Error: 'Library not found'. We are using a CUDA container from NVIDIA NGC, and OptiX is installed (uncompressed) locally in the user's home directory. The code compiles fine in the job's interactive session. After some searching, I found that the solution may be https://github.com/NVIDIA/nvidia-container-toolkit/issues/187 (i.e., adding /usr/lib/x86_64-linux-gnu/libnvoptix.so.1 through libnvidia-container), but I am not sure how this translates to the Pyxis environment. Any guidance is welcome.