NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

start docker nvidia fail could not select device driver "" with capabilities: [[gpu]] #1682

Closed: ywangwxd closed this issue 1 year ago

ywangwxd commented 2 years ago

I am following the official instructions to install the latest nvidia-docker2 and nvidia-container-toolkit. OS: Ubuntu 18.04.

But I cannot start a Docker container with GPU support. The error message is:

could not select device driver "" with capabilities: [[gpu]]

On the host, I have already installed the NVIDIA driver and I can see the device using the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03    Driver Version: 470.141.03    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   Off | 00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8    16W / 220W |    233MiB /  7973MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3943      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      3976      G   /usr/bin/gnome-shell               71MiB |
|    0   N/A  N/A      4160      G   /usr/lib/xorg/Xorg                112MiB |
|    0   N/A  N/A      4297      G   /usr/bin/gnome-shell               27MiB |
+-----------------------------------------------------------------------------+

I can also see the device nodes under /dev:

/dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools

I checked the debug output of nvidia-container-cli, and I can see the following warning messages:

-- WARNING, the following logs are for debugging purposes only --

I0919 09:04:42.269104 28911 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I0919 09:04:42.269187 28911 nvc.c:350] using root /
I0919 09:04:42.269210 28911 nvc.c:351] using ldcache /etc/ld.so.cache
I0919 09:04:42.269240 28911 nvc.c:352] using unprivileged user 1001:1001
I0919 09:04:42.269299 28911 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0919 09:04:42.269596 28911 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0919 09:04:42.270954 28913 nvc.c:273] failed to set inheritable capabilities
W0919 09:04:42.271053 28913 nvc.c:274] skipping kernel modules load due to failure
I0919 09:04:42.271560 28914 rpc.c:71] starting driver rpc service
I0919 09:04:42.276809 28915 rpc.c:71] starting nvcgo rpc service
I0919 09:04:42.277317 28911 nvc_info.c:766] requesting driver information with ''
I0919 09:04:42.278263 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03
I0919 09:04:42.278294 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.141.03
I0919 09:04:42.278310 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.141.03
I0919 09:04:42.278330 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I0919 09:04:42.278346 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03
I0919 09:04:42.278362 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03
I0919 09:04:42.278380 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.141.03
I0919 09:04:42.278397 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03
I0919 09:04:42.278412 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.141.03
I0919 09:04:42.278426 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.141.03
I0919 09:04:42.278440 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.141.03
I0919 09:04:42.278455 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.141.03
I0919 09:04:42.278471 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.141.03
I0919 09:04:42.278487 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03
I0919 09:04:42.278504 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.141.03
I0919 09:04:42.278522 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03
I0919 09:04:42.278544 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03
I0919 09:04:42.278563 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.141.03
I0919 09:04:42.278583 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03
I0919 09:04:42.278604 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03
I0919 09:04:42.278728 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
I0919 09:04:42.278790 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.141.03
I0919 09:04:42.278813 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.141.03
I0919 09:04:42.278833 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03
I0919 09:04:42.278854 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.141.03
I0919 09:04:42.278887 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.470.141.03
I0919 09:04:42.278905 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I0919 09:04:42.278931 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.470.141.03
I0919 09:04:42.278959 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.470.141.03
I0919 09:04:42.278979 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.470.141.03
I0919 09:04:42.279006 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ifr.so.470.141.03
I0919 09:04:42.279033 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.470.141.03
I0919 09:04:42.279053 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.470.141.03
I0919 09:04:42.279074 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.470.141.03
I0919 09:04:42.279094 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.470.141.03
I0919 09:04:42.279120 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.470.141.03
I0919 09:04:42.279145 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.470.141.03
I0919 09:04:42.279165 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.470.141.03
I0919 09:04:42.279186 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.470.141.03
I0919 09:04:42.279222 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.470.141.03
I0919 09:04:42.279255 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.470.141.03
I0919 09:04:42.279277 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.470.141.03
I0919 09:04:42.279297 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03
I0919 09:04:42.279318 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.470.141.03
W0919 09:04:42.279332 28911 nvc_info.c:399] missing library libnvidia-nscq.so
W0919 09:04:42.279337 28911 nvc_info.c:399] missing library libcudadebugger.so
W0919 09:04:42.279340 28911 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0919 09:04:42.279344 28911 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0919 09:04:42.279349 28911 nvc_info.c:399] missing library libvdpau_nvidia.so
W0919 09:04:42.279354 28911 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0919 09:04:42.279358 28911 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0919 09:04:42.279362 28911 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0919 09:04:42.279367 28911 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0919 09:04:42.279371 28911 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0919 09:04:42.279376 28911 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0919 09:04:42.279380 28911 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0919 09:04:42.279384 28911 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0919 09:04:42.279388 28911 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0919 09:04:42.279391 28911 nvc_info.c:403] missing compat32 library libnvoptix.so
W0919 09:04:42.279395 28911 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0919 09:04:42.279667 28911 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0919 09:04:42.279678 28911 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0919 09:04:42.279690 28911 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0919 09:04:42.279703 28911 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
W0919 09:04:42.279756 28911 nvc_info.c:425] missing binary nv-fabricmanager
W0919 09:04:42.279760 28911 nvc_info.c:425] missing binary nvidia-cuda-mps-server
I0919 09:04:42.279775 28911 nvc_info.c:343] listing firmware path /lib/firmware/nvidia/470.141.03/gsp.bin
I0919 09:04:42.279789 28911 nvc_info.c:529] listing device /dev/nvidiactl
I0919 09:04:42.279792 28911 nvc_info.c:529] listing device /dev/nvidia-uvm
I0919 09:04:42.279797 28911 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0919 09:04:42.279800 28911 nvc_info.c:529] listing device /dev/nvidia-modeset
I0919 09:04:42.279814 28911 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0919 09:04:42.279828 28911 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0919 09:04:42.279837 28911 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0919 09:04:42.279842 28911 nvc_info.c:822] requesting device information with ''
I0919 09:04:42.285437 28911 nvc_info.c:713] listing device /dev/nvidia0 (GPU-661838a0-fb69-bf82-164a-6c9ae0dcc7f6 at 00000000:01:00.0)
I0919 09:04:42.285446 28911 nvc.c:434] shutting down library context
I0919 09:04:42.285493 28915 rpc.c:95] terminating nvcgo rpc service
I0919 09:04:42.285765 28911 rpc.c:135] nvcgo rpc service terminated successfully
I0919 09:04:42.286026 28914 rpc.c:95] terminating driver rpc service
I0919 09:04:42.286086 28911 rpc.c:135] driver rpc service terminated successfully
NVRM version:   470.141.03
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3070
Brand:          GeForce
GPU UUID:       GPU-661838a0-fb69-bf82-164a-6c9ae0dcc7f6
Bus Location:   00000000:01:00.0
Architecture:   8.6

The strange thing is that I could successfully use Docker with the NVIDIA GPU before; it failed only after a reboot. Nothing was changed, if my memory is correct. I have also tried reinstalling nvidia-container-toolkit and nvidia-docker2.

What can I do now?

c-patrick commented 2 years ago

Sorry I can't help, but I have the exact same issue: everything was working before, then after a driver update (Driver Version: 515.65.01) and a reboot the GPU no longer works in Docker. I'm running a Quadro P400 on RHEL 8.6.

elezar commented 2 years ago

@ywangwxd / @c-patrick could you provide the docker commands that you are running?

We have seen reports of issues with the NVIDIA Container Toolkit v1.11.0, so this may indicate a regression in those components. Could you:

  1. Enable debug logging for the runtime and cli by uncommenting the #debug lines in /etc/nvidia-container-runtime/config.toml and attaching the /var/log/nvidia-container-*.log files to this issue.
  2. Downgrade to NVIDIA Container Toolkit v1.10.0 and see if this addresses your issues?
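For step 1, a minimal sketch of what that looks like, assuming the default config layout and the default debug log paths shipped with the toolkit (adjust if yours differ):

# uncomment the debug lines for both the runtime and the CLI
sudo sed -i 's/^#debug/debug/' /etc/nvidia-container-runtime/config.toml
# reproduce the failure, then collect whatever was written
ls -al /var/log/nvidia-container-*.log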

c-patrick commented 2 years ago

@elezar thanks for looking into this. The command I'm running is docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi, which returns the following error: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

I've uncommented the #debug lines in /etc/nvidia-container-runtime/config.toml, but no log files have been generated in /var/log/. I also downgraded the NVIDIA Container Toolkit to v1.10.0, but sadly the same error persists (and still no logs are generated).
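For anyone retracing this on a RHEL-family host, the downgrade step is typically the usual dnf pattern (versioned package name assumed here; the same one appears later in this thread):

sudo dnf downgrade -y nvidia-container-toolkit-1.10.0-1
sudo systemctl restart docker    # probably unnecessary, but rules out stale daemon state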

elezar commented 2 years ago

@c-patrick could you provide the output for:

ls -al /usr/bin/nvidia-container*

c-patrick commented 2 years ago

@elezar Sure, please find the output below:

$ ls -al /usr/bin/nvidia-container*
-rwxr-xr-x. 1 root root   48072 Sep  6 10:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x. 1 root root 3648696 Jun 13 11:42 /usr/bin/nvidia-container-runtime
lrwxrwxrwx. 1 root root      33 Sep 19 12:46 /usr/bin/nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
elezar commented 2 years ago

OK, I would expect a nvidia-container-toolkit binary to exist in this folder. With the v1.10.0 release we had the following:

/usr/bin/nvidia-container-toolkit
/usr/bin/nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit

In the v1.11.0 release we switched these as we want to use nvidia-container-runtime-hook as the actual executable name. This means we should have:

/usr/bin/nvidia-container-runtime-hook
/usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

However, due to the way the RPM packages are defined, the symlink is (unconditionally) removed in the post-uninstall (%postun) step.

For 1.11.0 we have:

%postun
rm -f %{_bindir}/nvidia-container-toolkit

For 1.10.0 we had:

%postun
rm -f %{_bindir}/nvidia-container-runtime-hook

What this means is that when upgrading from 1.10.0 to 1.11.0 the actual hook is deleted, and the same happens when downgrading from 1.11.0 to 1.10.0.
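A quick way to check whether a given host is in this broken state (plain shell, nothing package-specific):

# after an affected upgrade or downgrade the real hook binary is gone; only a dangling symlink may remain
test -x /usr/bin/nvidia-container-runtime-hook && echo "hook present" || echo "hook missing"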

The workaround is to remove the nvidia-container-toolkit package before installing the required version. Could you run:

sudo yum remove -y nvidia-container-toolkit
sudo yum install -y nvidia-container-toolkit-1.11.0-1

And then confirm the following:

$ ls -al /usr/bin/nvidia-container-*
-rwxr-xr-x 1 root root   47368 Sep  6 09:22 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079040 Sep  6 09:23 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root xxxxxxxx Sep  6 09:23 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root      38 Sep 19 12:10 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

@ywangwxd since you're using Ubuntu and not RHEL, I would have to check the packages there a bit more closely, but I can see a similar situation occurring there.
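For Ubuntu, the analogous workaround would presumably be the apt equivalent (a sketch only; the exact .deb version string is assumed and the packaging there still needs to be confirmed):

sudo apt-get remove -y nvidia-container-toolkit
sudo apt-get install -y nvidia-container-toolkit=1.11.0-1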

c-patrick commented 2 years ago

@elezar Thanks very much for your help. I removed and then installed the NVIDIA Container Toolkit and all is working well. Running ls -al /usr/bin/nvidia-container* produces the following result:

$ ls -al /usr/bin/nvidia-container*
-rwxr-xr-x. 1 root root   48072 Sep  6 10:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x. 1 root root 4079768 Sep  6 10:29 /usr/bin/nvidia-container-runtime
-rwxr-xr-x. 1 root root 2142816 Sep  6 10:29 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx. 1 root root      38 Sep 19 13:49 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

Now running docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi does not error out and instead returns the expected result.

Thank you very much again for your help.

ywangwxd commented 2 years ago

@ywangwxd / @c-patrick could you provide the docker commands that you are running?

We have seen reports of issues with the NVIDIA Container Toolkit v1.11.0, so this may indicate a regression in those components. Could you:

  1. Enable debug logging for the runtime and cli by uncommenting the #debug lines in /etc/nvidia-container-runtime/config.toml and attaching the /var/log/nvidia-container-*.log files to this issue.
  2. Downgrade to NVIDIA Container Toolkit v1.10.0 and see if this addresses your issues?

Thank you, although I have solved the issue in another way. I searched on Google, and another post said it was because Docker was installed as a snap package (I do not know exactly what that means) and suggested reinstalling it. Reinstalling Docker that way solved my problem.
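In case it helps others, the usual shape of that fix is to drop the snap package and use Docker's own apt packages instead (a sketch, assuming the Docker apt repository is already configured):

sudo snap remove docker
sudo apt-get install --reinstall docker-ce docker-ce-cli containerd.io
sudo systemctl restart docker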

Anyway, I will keep your response in mind. I may encounter the same problem again in the future, who knows.

gentoorax commented 2 years ago

OK, I would expect a nvidia-container-toolkit binary to exist in this folder. [...] The workaround is to remove the nvidia-container-toolkit package before installing the required version. [...]

Thanks for this; following your advice fixed the problem for me. I found countless posts about similar problems on Ubuntu, but this was just what I needed for CentOS Stream. Ta.

vasu-parspec commented 2 years ago

Hi @elezar, thanks a lot for your comments and detailed description. The solution below worked for me:

sudo yum remove -y nvidia-container-toolkit
sudo yum install -y nvidia-container-toolkit-1.11.0-1

jonasclaes commented 2 years ago

I'm experiencing this issue as well at the moment on Flatcar Linux using Docker and the nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubuntu20.04 container.

jonasclaes commented 2 years ago

Update: I just fixed this...

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html dictates that you can use an env var. Using this environment variable it does work!

Is there a bug in the --gpus all code?

elezar commented 2 years ago

Update: I just fixed this...

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html dictates that you can use an env var. Using this environment variable it does work!

Is there a bug in the --gpus all code?

The --gpus all code is part of the Docker CLI codebase and injects the NVIDIA Container Runtime hook directly. This may behave differently from the NVIDIA Container Runtime inserting the hook. The root cause, however, is that the nvidia-container-runtime-hook executable does not exist on the system after upgrading from <=v1.10.0 to v1.11.0, so I would expect using the environment variables to fail as well -- although the failure mode may be different.
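For completeness, the environment-variable path described in the user guide looks roughly like this when the NVIDIA runtime is registered with the Docker daemon (image name reused from earlier in this thread):

docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi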

elezar commented 2 years ago

Looking through this problem again, note that reinstalling the nvidia-container-toolkit-1.11.0-1 package should be sufficient to ensure that the correct files are created. Thus, if the nvidia-container-runtime-hook file is missing:

ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   48072 Sep  6 09:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep  6 09:29 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root      38 Oct  4 12:01 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

Running:

yum reinstall -y nvidia-container-toolkit-1.11.0-1

Ensures that this file is installed correctly:

ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   48072 Sep  6 09:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep  6 09:29 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2142816 Sep  6 09:29 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root      38 Oct  4 12:03 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

jonasclaes commented 2 years ago

I'm running this as the Docker container on Flatcar Linux, since you cannot install anything on Flatcar.

gengwg commented 1 year ago

I would have to check the packages there a bit more closely, but I can see a similar situation occurring there.

I can confirm that this 1.10 --> 1.11 upgrade breaks on Red Hat/RPM-based OSes too.

$ cat /etc/redhat-release
CentOS Stream release 8

Reproduce

Here are steps to reproduce:

  1. Remove the current package (for a clean install).
# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   48072 Sep  6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep  6 02:29 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root      38 Nov 16 23:35 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

# dnf remove nvidia-container-toolkit
....
Removed:
  libnvidia-container-tools-1.11.0-1.x86_64                libnvidia-container1-1.11.0-1.x86_64                nvidia-container-toolkit-1.11.0-1.x86_64                nvidia-container-toolkit-base-1.11.0-1.x86_64

Complete!

# ls -al /usr/bin/nvidia-container*
ls: cannot access '/usr/bin/nvidia-container*': No such file or directory

  2. Install 1.10.0 (to simulate the previous state).
# dnf downgrade nvidia-container-toolkit-1.10.0-1.x86_64
....
Installed:
  libnvidia-container-tools-1.11.0-1.x86_64                                     libnvidia-container1-1.11.0-1.x86_64                                     nvidia-container-toolkit-1.10.0-1.x86_64

Complete!

# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   48072 Sep  6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 3648696 Jun 13 03:42 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root      33 Nov 16 23:36 /usr/bin/nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
-rwxr-xr-x 1 root root 2138656 Jun 13 03:42 /usr/bin/nvidia-container-toolkit

  3. Upgrade to 1.11.0.
# dnf install nvidia-container-toolkit-1.11.0-1.x86_64
....
Upgraded:
  nvidia-container-toolkit-1.11.0-1.x86_64
Installed:
  nvidia-container-toolkit-base-1.11.0-1.x86_64

Complete!

  4. Verify that it's broken:
# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   48072 Sep  6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep  6 02:29 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root      38 Nov 16 23:48 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

Fix

Manual Fix

Simply reinstalling it fixed it. Confirmed on two hosts at least.

# dnf reinstall nvidia-container-toolkit

# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   48072 Sep  6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep  6 02:29 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2142816 Sep  6 02:29 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root      38 Nov 16 18:48 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

Chef Fix

Here is a Chef recipe I used to fix it, for anyone using Chef. One must do a FULL uninstall and reinstall; there is no 'reinstall' action in Chef.

Here is how I implemented it in Chef currently:

dgx_nvidia_container_runtime_packages = %w{
  nvidia-container-toolkit
}

package dgx_nvidia_container_runtime_packages do
  action :remove
  not_if { File.exist?('/usr/bin/nvidia-container-runtime-hook') }
end

package dgx_nvidia_container_runtime_packages do
  action :upgrade
end

The first Chef run removes the package:

  * dnf_package[nvidia-container-toolkit] action remove
    - remove package ["nvidia-container-toolkit"]
  * dnf_package[nvidia-container-toolkit] action upgrade
    - upgrade(allow_downgrade) package nvidia-container-toolkit from uninstalled to 0:1.11.0-1.x86_64

The second (and subsequent) Chef runs should do nothing:

  * dnf_package[nvidia-container-toolkit] action remove (skipped due to not_if)
  * dnf_package[nvidia-container-toolkit] action upgrade (up to date)

cvolz commented 1 year ago

I have come across the same issue and can confirm that it also happens on CentOS 7 here. After upgrading nvidia-container-toolkit from 1.10.0 to 1.11.0, the /usr/bin/nvidia-container-runtime-hook has disappeared.

Ideally, I'm looking for a solution where the RPM upgrade resolves this problem automatically. In the company I work for, we provide software updates by deploying RPMs to target machines, where they are updated automatically, so it is difficult for us to apply the workaround of first uninstalling 1.10.0 before updating.

May I suggest the following solution:

For testing, I added a post scriptlet to the nvidia-container-toolkit.spec that makes a temporary copy of the binary:

%post
mkdir -p %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit
cp -af %{_bindir}/nvidia-container-runtime-hook %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit

In the %posttrans scriptlet, I added a few lines that restore the file later if it was deleted by 1.10.0 during uninstall:

%posttrans
if [ ! -e %{_bindir}/nvidia-container-runtime-hook ]; then 
  # repairing the lost nvidia-container-runtime-hook file
  cp -avf %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit/nvidia-container-runtime-hook %{_bindir} 
fi 
rm -f %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit/nvidia-container-runtime-hook
ln -sf %{_bindir}/nvidia-container-runtime-hook %{_bindir}/nvidia-container-toolkit

I believe I saw that for the downgrade case (back to 1.10.0) you have already added a fix (don't remove the file if it isn't a symlink).
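Such a guard could look something like the following in the spec; this is a sketch only, not necessarily the actual upstream change:

%postun
# only remove the path if it is the compatibility symlink; never delete a real binary left by another version
if [ -L %{_bindir}/nvidia-container-toolkit ]; then
  rm -f %{_bindir}/nvidia-container-toolkit
fi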

I am not sure about the Debian/Ubuntu package, as I am not familiar with deb packaging. But if it is affected by this issue, too, then there could be a similar solution.

I think this would also be beneficial for other users, who might not be aware of this issue and the workaround. This change would fix it automatically.

elezar commented 1 year ago

@cvolz thanks for the detailed investigation. Would you be up to creating a merge request against https://gitlab.com/nvidia/container-toolkit/container-toolkit with your proposed changes so that these could be reviewed and included in the next release?

cvolz commented 1 year ago

Hi @elezar, I'm open to contributing a merge request, but the question is when I will get to it, as I am currently tied up at work. I have also not used your build environment or gitlab.com yet, so I will probably need some extra time to get set up.

When are you planning the next release?

cvolz commented 1 year ago

Hi @elezar, I have just opened the merge request for the above patch: Gitlab !263

I have succeeded in building the RPM package and testing the upgrade and downgrade from/to 1.10.0, and it seems that /usr/bin/nvidia-container-runtime-hook is now preserved.

elezar commented 1 year ago

Sorry for the delay. I had a look at the MR yesterday. One small question / comment.

The next non-RC release should go out by the end of the month.

elezar commented 1 year ago

@cvolz since your MR has been merged, I assume your issue has been resolved.

@ywangwxd was your original issue resolved? I am closing this issue in the meantime, but please reopen it if the problem persists.

visionlineNP commented 1 year ago

This problem has come up again. Ubuntu 20.04, NVIDIA driver 535.86.05. The driver works on the host.

$ nvidia-smi 
Mon Sep 18 12:08:34 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A3000 Laptop GPU    Off | 00000000:01:00.0  On |                  N/A |
| N/A   57C    P8              17W /  90W |    160MiB /  6144MiB |     26%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

I cannot get the GPU to work with Docker. I have reinstalled Docker and reinstalled nvidia-container-toolkit. No change.

$ docker run --rm  --gpus all ubuntu nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

The hooks are all in place.

$ ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root   47472 Sep  7 12:06 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 3651080 Sep  7 12:07 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2698280 Sep  7 12:07 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root      38 Sep 20  2022 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook

Everything is at the latest versions:

$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Reading package lists... Done
Building dependency tree       
Reading state information... Done
containerd.io is already the newest version (1.6.24-1).
docker-buildx-plugin is already the newest version (0.11.2-1~ubuntu.20.04~focal).
docker-ce-cli is already the newest version (5:24.0.6-1~ubuntu.20.04~focal).
docker-ce is already the newest version (5:24.0.6-1~ubuntu.20.04~focal).
docker-compose-plugin is already the newest version (2.21.0-1~ubuntu.20.04~focal).

$ sudo apt-get install nvidia-container-toolkit
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-container-toolkit is already the newest version (1.14.1-1).
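
A sanity check worth recording here: confirm that the daemon the CLI is currently talking to actually reports the nvidia runtime (standard docker command; the exact Runtimes line varies by Docker version):

docker info | grep -i runtime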

visionlineNP commented 1 year ago

And let this be a lesson in the proper use of docker context: my context was set to use a remote machine. I fixed it by running:

docker context use default

The error message could have been more helpful. Then again, if someone can set the context to something else, they can keep track of it.
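
For anyone else who lands here, the mix-up is easy to spot with standard Docker CLI commands (the entry marked with an asterisk is the context that docker run actually talks to):

docker context ls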

renaissancelab commented 8 months ago

Very good