NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Problems running OptiX job through Pyxis, what am I missing? #103

Closed · crinavar closed this 1 year ago

crinavar commented 1 year ago

Dear all, although this is an OptiX error, I wonder whether it is triggered by Pyxis and whether there is a solution for it.

Recently we realized that we cannot run OptiX jobs through Pyxis. When we launch a Slurm job with a CUDA container and run the binary, we get an error of the form Optix Error: 'Library not found'. We are using a CUDA container from NVIDIA NGC, and OptiX is installed (extracted) locally in the user's home directory. The code compiles fine in the job's interactive session.

```
➜  srun -p gpu --container-name=cuda-11.4.2 --gpus=1 --pty zsh
➜  build git:(main) ./rtxcuda 0 $((2**20)) 1 4
RTX Config.........................../home/cnavarro/temporal/RTX-CUDA-template/./src/rtx_functions.h:47 Optix Error: 'Library not found'
```
After some Google searching, I found that the solution may be the one in https://github.com/NVIDIA/nvidia-container-toolkit/issues/187 (i.e., exposing /usr/lib/x86_64-linux-gnu/libnvoptix.so.1 through libnvidia-container), but I am not sure how that translates to the Pyxis environment. Any guidance is welcome.

flx42 commented 1 year ago

Try using:

```
NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics srun -p gpu --container-image ...
```
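With the container name from your report, the full invocation would look something like this (a sketch; adjust the partition and container to your site). Per the nvidia-container-toolkit issue you linked, the graphics capability is what causes the driver's libnvoptix.so.1 to be injected:

```
# Set the variable in srun's own environment so enroot/libnvidia-container
# sees it when the container starts, and injects the graphics libraries
# (including libnvoptix.so.1) alongside the default compute/utility set.
NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics \
  srun -p gpu --container-name=cuda-11.4.2 --gpus=1 --pty zsh
```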

crinavar commented 1 year ago

Many thanks! It worked perfectly

flx42 commented 1 year ago

ENV NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics should then be added to the Dockerfile, so the image requests the capability by itself.
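For instance, a minimal Dockerfile on top of an NGC CUDA image (the base-image tag here is illustrative; use whichever CUDA image you actually build on):

```dockerfile
# Illustrative NGC base image; substitute the CUDA tag you actually use.
FROM nvcr.io/nvidia/cuda:11.4.2-devel-ubuntu20.04

# Request the graphics capability as well, so the host driver's
# libnvoptix.so.1 is injected along with the compute/utility libraries.
ENV NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics
```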

crinavar commented 8 months ago

Hi @flx42, sorry to bring this up again, but today I found that setting the environment variable on the srun command line as suggested no longer takes effect. Reading some related posts, I found that the container's own definitions override the values chosen at srun time.

Looking inside the CUDA container (stored with enroot), I checked /etc/environment and indeed the container was setting the variable to NVIDIA_DRIVER_CAPABILITIES=utility,compute, so I added "graphics". Then, when launching srun ... bash with the container, it prints the variable properly (echo $NVIDIA_DRIVER_CAPABILITIES).
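Roughly what I checked from inside the container, for reference:

```
# Inside the container: what the image itself declares, after my edit.
$ grep NVIDIA_DRIVER_CAPABILITIES /etc/environment
NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics

# What the shell actually sees.
$ echo $NVIDIA_DRIVER_CAPABILITIES
utility,compute,graphics
```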

However, when launching the OptiX program (our custom CUDA+OptiX code), it still fails with "Library not found", because the file /usr/lib/x86_64-linux-gnu/libnvoptix.so.1 is still not being loaded properly.

Do you know if editing /etc/environment should have been sufficient for this library to load properly? At this point I am not sure whether I need to install the driver inside the container, or whether the node's native driver is sufficient.

flx42 commented 8 months ago

In this situation, you can't rely on the value of NVIDIA_DRIVER_CAPABILITIES seen from inside the container. The container image's value will indeed win over the value from your srun environment, but NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics srun ... is applied by enroot when the container starts, so it should still work properly.
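A quick way to verify that the injection actually happened, independently of what the variable reads inside the container, is to look for the library itself (same path as in your earlier error):

```
# Inside the container: if the graphics capability was applied at start-up,
# the host driver's OptiX library should have been mounted here.
$ ls -l /usr/lib/x86_64-linux-gnu/libnvoptix.so.1
```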

Perhaps you have an old version of libnvidia-container? Or perhaps libnvoptix.so.1 is not present on the host system at all?
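Both are easy to check on the compute node itself (commands assume a standard libnvidia-container install):

```
# On the host: is the OptiX driver library installed at all?
$ ls -l /usr/lib/x86_64-linux-gnu/libnvoptix.so*

# On the host: which libnvidia-container version is in use?
$ nvidia-container-cli --version
```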

crinavar commented 8 months ago

Hi @flx42, I started checking the host packages and indeed one was missing: libnvidia-gl-535-server in my case. After installing it, everything works just as you said, with no need to modify any container files:

```
NVIDIA_DRIVER_CAPABILITIES=utility,compute,graphics srun ...
```
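For anyone on Ubuntu hitting the same thing, the host-side fix was a single package (the 535 matches our driver branch; pick yours accordingly):

```
# On the compute node, not in the container.
$ sudo apt-get install libnvidia-gl-535-server
```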

Many thanks again