NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

Optix 6.0 not supported in docker #990

Closed robertsulej closed 4 years ago

robertsulej commented 5 years ago

1. Issue or feature description

Since OptiX 6.0, part of its libraries has moved into the GPU driver and become inaccessible in Docker. Initialization of the OptiX 6 context fails in Docker with the error "Failed to load OptiX library", while it works correctly on the host. The same procedure works correctly on both the host and in Docker with OptiX 5.

This issue was reported on the NVIDIA developer forum and also in the nvidia-docker issues; however, people were directed to libnvidia-container support. I submitted an issue there, but I am not sure whether it fits better here.

2. Steps to reproduce the issue

I compiled an image with one of the Optix 6 SDK samples (failing) and the same with Optix 5 (running OK):

Image is configured to run Optix 6 sample:

docker run --rm --runtime=nvidia rsulej/optix-docker-test

You can run exactly the same but with Optix 5 in the interactive mode:

docker run --rm --runtime=nvidia -it rsulej/optix-docker-test sh

and inside the container:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/minimalOptixSample/optix5
/minimalOptixSample/optix5/optixDeviceQuery

The whole setup, including the Dockerfile, is available on GitHub.

Thanks for help! Robert

RenaudWasTaken commented 5 years ago

Sorry this never got answered. From what I understand (though I haven't looked into it very deeply), you will need to mount libnvoptix.so.X and libnvidia-rtcore.so.X from the host into the container.

Unfortunately, extending OptiX support to containers is a bit further down the roadmap, so it won't get tackled natively for a few months.
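The mount suggested above can be sketched as a small shell helper. This is only an illustration: the function name optix_mount_flags and the default library directory are assumptions (the directory is a common Ubuntu location; other distributions may differ), and the glob simply picks up whatever driver version is installed instead of hard-coding it.

```shell
# Hypothetical helper: print a "-v host:container" flag for every
# OptiX-related driver library found in the given host directory.
optix_mount_flags() {
    lib_dir="${1:-/usr/lib/x86_64-linux-gnu}"   # assumed default location
    for lib in "$lib_dir"/libnvoptix.so.* "$lib_dir"/libnvidia-rtcore.so.*; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$lib" ] || continue
        printf '%s %s:%s ' -v "$lib" "$lib"
    done
}

# Example use (not executed here; image name taken from the report above):
# docker run --rm --runtime=nvidia $(optix_mount_flags) rsulej/optix-docker-test
```

The unquoted command substitution in the example relies on word splitting, so it assumes the library paths contain no spaces.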

robertsulej commented 5 years ago

Well... that works!

Someone from the OptiX forum already tried copying files to docker, but missed libnvidia-rtcore.so.X.

I just mounted all the files you mention and the device query sample works fine. I need to point manually to the exact driver version, but for the moment it is perfectly enough. If I run into troubles with a more sophisticated app, I'll be back.

Thanks! Robert

qhaas commented 5 years ago

Thought the recent major changes with how nvidia-docker interacts with Docker 19.03, nvidia-container-runtime 3.1, the proprietary driver 430, etc. might have addressed this, but it is still an issue.

robertsulej commented 5 years ago

Things are moving forward. In the new OptiX 7, all the OptiX symbols (and also cuDNN for the AI denoiser) have moved into the driver. I have not yet tried whether @RenaudWasTaken's solution still works, or which driver files need to be mounted. Just letting you know there are major changes.

RenaudWasTaken commented 4 years ago

With libnvidia-container1 version 1.0.4 (or newer), I added experimental support for this.

Experimental because I really just mounted the two libraries without testing or looking into what more might be required.

Feel free to test and give me feedback :)

robertsulej commented 4 years ago

Thanks! I'll try and let you know.

chenzhekl commented 4 years ago

Hi @RenaudWasTaken, thanks for your work! I have libnvidia-container1 == 1.0.5, but libnvoptix.so and libnvidia-rtcore.so are still not mounted into the container automatically. Do I need to turn the behavior on with any flags?

RenaudWasTaken commented 4 years ago

You can try it with the environment variable NVIDIA_DRIVER_CAPABILITIES set to graphics

chenzhekl commented 4 years ago

I tried it but no luck. This was the command executed:

NVIDIA_DRIVER_CAPABILITIES=graphics sudo docker run -d -p 2222:22 --rm --gpus all --name test chenzhekl/test

and the driver version on the host:

Driver Version: 440.33.01
CUDA Version: 10.2

RenaudWasTaken commented 4 years ago

sudo docker run -e NVIDIA_DRIVER_CAPABILITIES=graphics --gpus all nvidia/cuda:10.0-base

chenzhekl commented 4 years ago

My bad.. Thanks for your help! Everything works now.

RenaudWasTaken commented 4 years ago

Closing for now as this seems to be resolved.

ventz commented 10 months ago

I am adding this for those wondering why OptiX is not working with NVIDIA_DRIVER_CAPABILITIES=graphics

To break it down, you have the "old" option -- mounting the needed drivers/libs:

# Assuming the SDK is installed in /tmp/NVIDIA and the `build` directory is within that for the `optixHello` sample

# In this case:
# Ubuntu 20.04
# Cuda 11.4
# NVIDIA 470
# This means we can only go up to Optix7.3 (due to R470)

docker run -v /tmp/NVIDIA:/tmp/NVIDIA \
-v /usr/lib/x86_64-linux-gnu/libnvoptix.so.1:/usr/lib/x86_64-linux-gnu/libnvoptix.so.1 \
-v /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.82.01:/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.82.01 \
--gpus all -it --rm nvidia/cuda:11.4.3-runtime-ubuntu20.04

If you run optixHello -- this will work.

And here is the "new" option:

docker run -v /tmp/NVIDIA:/tmp/NVIDIA \
-e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility \
--gpus all -it --rm nvidia/cuda:11.4.3-runtime-ubuntu20.04

^ At first attempt this will not work -- and most folks who complain about it not working are probably running into this -- you will see that you have both libs within the container:

/usr/lib/x86_64-linux-gnu/libnvoptix.so.470.82.01
/usr/lib/x86_64-linux-gnu/libnvoptix.so.1

However, looking more closely will show you that /usr/lib/x86_64-linux-gnu/libnvoptix.so.1 is 0 bytes.

The reason is that on the HOST running Docker, you have:

# cd /usr/lib/x86_64-linux-gnu/
# ls -alh | grep libnvoptix
lrwxrwxrwx  1 root root    23 Nov 16  2021 libnvoptix.so.1 -> libnvoptix.so.470.82.01
-rw-r--r--  1 root root  161M Oct 27  2021 libnvoptix.so.470.82.01

Inside the container, that symlink is translated into an empty (0-byte) file.
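To confirm that a container is in this state, a check along the following lines may help. It is a minimal sketch: the function name empty_optix_libs is made up for illustration, and the default directory is assumed to match the Ubuntu layout shown above.

```shell
# Hypothetical check: list every libnvoptix.so.* entry that resolves to a
# regular file of size zero, which is the broken state described above.
empty_optix_libs() {
    lib_dir="${1:-/usr/lib/x86_64-linux-gnu}"   # assumed default location
    for lib in "$lib_dir"/libnvoptix.so.*; do
        # -f follows symlinks; -s is true only when the size is above zero.
        if [ -f "$lib" ] && [ ! -s "$lib" ]; then
            echo "$lib"
        fi
    done
}

# Example use inside the container:
# empty_optix_libs
```

If the function prints nothing, either the libraries are intact or none were found in that directory, so it is worth listing the directory as well.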

The easy way to fix this within the container is:

ln -sf /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.82.01 /usr/lib/x86_64-linux-gnu/libnvoptix.so.1

As soon as you do that it works:

root@0eddf5bed7c2:/tmp/NVIDIA/build/bin# ./optixHello 
Caught exception: OPTIX_ERROR_LIBRARY_NOT_FOUND: Optix call 'optixInit()' failed: /tmp/NVIDIA/SDK/optixHello/optixHello.cpp:124)

root@0eddf5bed7c2:/tmp/NVIDIA/build/bin# ln -sf /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.82.01 /usr/lib/x86_64-linux-gnu/libnvoptix.so.1

root@0eddf5bed7c2:/tmp/NVIDIA/build/bin# ./optixHello 
[ 4][       KNOBS]: All knobs on default.

[ 4][  DISK CACHE]: Opened database: "/var/tmp/OptixCache_root/cache7.db"
[ 4][  DISK CACHE]:     Cache data size: "15.9 KiB"
[ 4][   DISKCACHE]: Cache hit for key: ptx-6766-keydeb0e13958c7dc89fbcbe36c70c7e95d-sm_80-rtc0-drv470.82.01
[ 4][COMPILE FEEDBACK]: 
[ 4][COMPILE FEEDBACK]: Info: Pipeline has 1 module(s), 1 entry function(s), 0 trace call(s), 0 continuation callable call(s), 0 direct callable call(s), 1 basic block(s) in entry functions, 79 instruction(s) in entry functions, 0 non-entry function(s), 0 basic block(s) in non-entry functions, 0 instruction(s) in non-entry functions

GLFW Error 65544: X11: The DISPLAY environment variable is missing
Caught exception: Failed to initialize GLFW
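The ln command above hard-codes driver version 470.82.01. A version-independent variant could look like the sketch below. The function name relink_optix is hypothetical, the default directory is assumed to be the Ubuntu path used above, and the logic simply picks the first non-empty versioned libnvoptix file it finds, which assumes only one driver version is present in the container.

```shell
# Hypothetical helper: point libnvoptix.so.1 at the first non-empty
# versioned libnvoptix.so.* file found in the given directory.
relink_optix() {
    lib_dir="${1:-/usr/lib/x86_64-linux-gnu}"   # assumed default location
    for lib in "$lib_dir"/libnvoptix.so.*; do
        # Skip the soname entry itself, then anything empty or missing.
        [ "$lib" = "$lib_dir/libnvoptix.so.1" ] && continue
        [ -s "$lib" ] || continue
        ln -sf "$lib" "$lib_dir/libnvoptix.so.1"
        echo "linked libnvoptix.so.1 -> $lib"
        return 0
    done
    echo "no non-empty versioned libnvoptix found in $lib_dir" >&2
    return 1
}

# Example use inside the container, before starting the OptiX application:
# relink_optix
```

Running this from an entrypoint script would avoid editing the link by hand each time the driver is updated, though that placement is a suggestion and not something tested in this thread.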

Hope this helps others.