Closed: ywangwxd closed this issue 1 year ago
Sorry I can't help, but I have the exact same issue: everything was working before, then after an update (Driver Version: 515.65.01) and a reboot the GPU no longer works in Docker. I'm running a Quadro P400 on RHEL 8.6.
@ywangwxd / @c-patrick could you provide the docker commands that you are running?
We have seen reports of issues with the NVIDIA Container Toolkit v1.11.0, so this may indicate a regression in those components. Could you:
- Enable debug logging for the runtime and the CLI by uncommenting the #debug lines in /etc/nvidia-container-runtime/config.toml and attaching the /var/log/nvidia-container-*.log files to this issue.
- Downgrade to NVIDIA Container Toolkit v1.10.0 and see if this addresses your issues?
@elezar thanks for looking into this. The command I'm running is:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
which returns the following error:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
I've uncommented the #debug lines in /etc/nvidia-container-runtime/config.toml, but no log files have been generated in /var/log/. I also downgraded NVIDIA Container Toolkit to v1.10.0, but sadly the same error persists (and still no logs are generated).
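For anyone else enabling the debug logs, a minimal sketch, assuming the stock config layout (the exact keys and log file paths can differ between toolkit versions):
# flip the commented-out debug lines to active, then re-run the failing command
sudo sed -i 's/^#debug/debug/' /etc/nvidia-container-runtime/config.toml
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
ls -al /var/log/nvidia-container-*.log
Note that if the failure happens before the hook or runtime is ever invoked -- as turned out to be the case here -- no log files are written at all, which matches the behaviour described above.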
@c-patrick could you provide the output for:
ls -al /usr/bin/nvidia-container*
@elezar Sure, please find the output below:
$ ls -al /usr/bin/nvidia-container*
-rwxr-xr-x. 1 root root 48072 Sep 6 10:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x. 1 root root 3648696 Jun 13 11:42 /usr/bin/nvidia-container-runtime
lrwxrwxrwx. 1 root root 33 Sep 19 12:46 /usr/bin/nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
OK, I would expect an nvidia-container-toolkit binary to exist in this folder. With the v1.10.0 release we had the following:
/usr/bin/nvidia-container-toolkit
/usr/bin/nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
In the v1.11.0 release we switched these, as we want to use nvidia-container-runtime-hook as the actual executable name. This means we should have:
/usr/bin/nvidia-container-runtime-hook
/usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
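A quick way to check which layout a given host ended up with (a sketch using standard tools, nothing toolkit-specific):
ls -al /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-container-runtime-hook
readlink -f /usr/bin/nvidia-container-toolkit   # on v1.11.0 this should resolve to the runtime hook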
However, due to the way the RPM packages are defined, the symlink is (unconditionally) removed in the post-uninstall step.
For 1.11.0 we have:
%postun
rm -f %{_bindir}/nvidia-container-toolkit
For 1.10.0 we had:
%postun
rm -f %{_bindir}/nvidia-container-runtime-hook
What this means is that when upgrading from 1.10.0 to 1.11.0 the actual hook is deleted, and the same happens when downgrading from 1.11.0 to 1.10.0. (During an RPM upgrade, the outgoing package's %postun scriptlet runs after the new package's files have been laid down, so its unconditional rm deletes the file the new package just installed.)
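The scriptlets carried by an installed package can be inspected with rpm, if you want to verify this on your own system:
rpm -q --scripts nvidia-container-toolkit | grep -A 2 postuninstall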
The workaround is to remove the nvidia-container-toolkit package before installing the required version. Could you run:
sudo yum remove -y nvidia-container-toolkit
sudo yum install -y nvidia-container-toolkit-1.11.0-1
And then confirm the following:
$ ls -al /usr/bin/nvidia-container-*
-rwxr-xr-x 1 root root 47368 Sep 6 09:22 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079040 Sep 6 09:23 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root xxxxxxxx Sep 6 09:23 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root 38 Sep 19 12:10 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
@ywangwxd since you're using Ubuntu and not RHEL, I would have to check the packages there a bit more closely, but I can see a similar situation occurring there.
@elezar Thanks very much for your help. I removed and then installed the NVIDIA Container Toolkit and all is working well. Running ls -al /usr/bin/nvidia-container* produces the following result:
$ ls -al /usr/bin/nvidia-container*
-rwxr-xr-x. 1 root root 48072 Sep 6 10:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x. 1 root root 4079768 Sep 6 10:29 /usr/bin/nvidia-container-runtime
-rwxr-xr-x. 1 root root 2142816 Sep 6 10:29 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx. 1 root root 38 Sep 19 13:49 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
Now running docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi does not error out and instead returns the expected result.
Thank you very much again for your help.
Thank you, although I have solved the issue in another way. I searched on Google, and another post said it was because Docker was installed as a snap package (I do not know what that actually is) and suggested reinstalling it. I found this solved my problem.
Anyway, I will keep your response in mind. I may encounter the same problem again in the future, who knows.
Thanks for this; following your advice fixed the problem for me. I found countless information about similar problems on Ubuntu, but this was just what I needed for CentOS Stream. Ta.
Hi @elezar, thanks a lot for your comments and detailed description. The solution below worked for me:
sudo yum remove -y nvidia-container-toolkit
sudo yum install -y nvidia-container-toolkit-1.11.0-1
I'm experiencing this issue as well at the moment on Flatcar Linux, using Docker and the nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubuntu20.04 container.
Update: I just fixed this...
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html says that you can use an env var. Using this environment variable, it does work!
Is there a bug in the --gpus all code?
The --gpus all code is part of the Docker CLI codebase and injects the NVIDIA Container Runtime hook directly. This may behave differently from the NVIDIA Container Runtime inserting the hook. The root cause, however, is that the nvidia-container-runtime-hook executable does not exist on the system after upgrading from <=v1.10.0 to v1.11.0, so I would expect using the environment variables to also fail -- although the failure mode may be different.
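For completeness, the environment-variable route mentioned above looks roughly like this; a sketch that assumes the nvidia runtime has been registered with the Docker daemon (e.g. in /etc/docker/daemon.json):
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi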
Looking through this problem again, note that reinstalling the nvidia-container-toolkit-1.11.0-1 package should be sufficient to ensure that the correct files are created. Thus, if the nvidia-container-runtime-hook file is missing:
ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 48072 Sep 6 09:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep 6 09:29 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root 38 Oct 4 12:01 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
Running:
yum reinstall -y nvidia-container-toolkit-1.11.0-1
Ensures that this file is installed correctly:
ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 48072 Sep 6 09:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep 6 09:29 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2142816 Sep 6 09:29 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root 38 Oct 4 12:03 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
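For Debian/Ubuntu hosts in the same situation, the equivalent would presumably be an apt reinstall (not verified in this thread):
sudo apt-get install --reinstall nvidia-container-toolkit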
I'm running this as the Docker container on Flatcar Linux, since you cannot install anything on Flatcar.
I would have to check the packages there a bit more closely, but I can see a similar situation occurring there.
I can confirm this 1.10 --> 1.11 upgrade breaks on Red Hat / RPM-based OSes too.
$ cat /etc/redhat-release
CentOS Stream release 8
Here are steps to reproduce:
# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 48072 Sep 6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep 6 02:29 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root 38 Nov 16 23:35 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
# dnf remove nvidia-container-toolkit
....
Removed:
libnvidia-container-tools-1.11.0-1.x86_64 libnvidia-container1-1.11.0-1.x86_64 nvidia-container-toolkit-1.11.0-1.x86_64 nvidia-container-toolkit-base-1.11.0-1.x86_64
Complete!
# ls -al /usr/bin/nvidia-container*
ls: cannot access '/usr/bin/nvidia-container*': No such file or directory
# dnf downgrade nvidia-container-toolkit-1.10.0-1.x86_64
....
Installed:
libnvidia-container-tools-1.11.0-1.x86_64 libnvidia-container1-1.11.0-1.x86_64 nvidia-container-toolkit-1.10.0-1.x86_64
Complete!
# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 48072 Sep 6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 3648696 Jun 13 03:42 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root 33 Nov 16 23:36 /usr/bin/nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
-rwxr-xr-x 1 root root 2138656 Jun 13 03:42 /usr/bin/nvidia-container-toolkit
# dnf install nvidia-container-toolkit-1.11.0-1.x86_64
....
Upgraded:
nvidia-container-toolkit-1.11.0-1.x86_64
Installed:
nvidia-container-toolkit-base-1.11.0-1.x86_64
Complete!
# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 48072 Sep 6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep 6 02:29 /usr/bin/nvidia-container-runtime
lrwxrwxrwx 1 root root 38 Nov 16 23:48 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
Simply reinstalling it fixed it. Confirmed on two hosts at least.
# dnf reinstall nvidia-container-toolkit
# ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 48072 Sep 6 02:26 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 4079768 Sep 6 02:29 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2142816 Sep 6 02:29 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root 38 Nov 16 18:48 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
Here is a Chef recipe I used to fix it, for anyone using Chef. One must do a FULL uninstall and reinstall, since there is no 'reinstall' action in Chef. Here is how I implemented it currently:
dgx_nvidia_container_runtime_packages = %w{
  nvidia-container-toolkit
}

# Remove the package only when the runtime hook is missing (i.e. the broken state).
package dgx_nvidia_container_runtime_packages do
  action :remove
  not_if { File.exist?('/usr/bin/nvidia-container-runtime-hook') }
end

# Then install/upgrade back to the latest available version.
package dgx_nvidia_container_runtime_packages do
  action :upgrade
end
First chef run removes the package:
* dnf_package[nvidia-container-toolkit] action remove
- remove package ["nvidia-container-toolkit"]
* dnf_package[nvidia-container-toolkit] action upgrade
- upgrade(allow_downgrade) package nvidia-container-toolkit from uninstalled to 0:1.11.0-1.x86_64
2nd (and subsequent) Chef runs should do nothing:
* dnf_package[nvidia-container-toolkit] action remove (skipped due to not_if)
* dnf_package[nvidia-container-toolkit] action upgrade (up to date)
I have come across the same issue and can confirm that it also happens here on CentOS 7. After upgrading nvidia-container-toolkit from 1.10.0 to 1.11.0, /usr/bin/nvidia-container-runtime-hook has disappeared.
Ideally, I’m looking for a solution for this issue where the RPM upgrade would resolve this problem automatically. In the company I work for we provide software updates by deploying RPMs to target machines, where they get updated automatically. It is difficult for us to apply the workaround of first uninstalling 1.10.0 before updating.
May I suggest the following solution:
For testing, I added a post scriptlet to nvidia-container-toolkit.spec that makes a temporary copy of the binary:
%post
mkdir -p %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit
cp -af %{_bindir}/nvidia-container-runtime-hook %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit
In the posttrans scriptlet, I added a few lines that restore the file later, if it got deleted by 1.10.0 during uninstall:
%posttrans
if [ ! -e %{_bindir}/nvidia-container-runtime-hook ]; then
# repairing lost file nvidia-container-runtime-hook
cp -avf %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit/nvidia-container-runtime-hook %{_bindir}
fi
rm -f %{_localstatedir}/lib/rpm-state/nvidia-container-toolkit/nvidia-container-runtime-hook
ln -sf %{_bindir}/nvidia-container-runtime-hook %{_bindir}/nvidia-container-toolkit
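A rough way to exercise such a patch, assuming patched builds of both versions are available from a local or test repository (package names mirror those used earlier in this thread):
sudo yum install -y nvidia-container-toolkit-1.10.0-1
sudo yum upgrade -y nvidia-container-toolkit-1.11.0-1
ls -al /usr/bin/nvidia-container-runtime-hook   # should survive the upgrade
sudo yum downgrade -y nvidia-container-toolkit-1.10.0-1
ls -al /usr/bin/nvidia-container-runtime-hook   # and the downgrade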
I believe I saw that for the downgrade case (back to 1.10.0) you have already added a fix (don't remove the file if it isn't a symlink).
I am not sure about the Debian/Ubuntu package, as I am not familiar with deb packaging. But if it is affected by this issue, too, then there could be a similar solution.
I think this would also be beneficial for other users, who might not be aware of this issue and the workaround. This change would fix it automatically.
@cvolz thanks for the detailed investigation. Would you be up to creating a merge request against https://gitlab.com/nvidia/container-toolkit/container-toolkit with your proposed changes so that these could be reviewed and included in the next release?
Hi @elezar, I'm open to contributing a merge request, but the question is when I will get to it, as I am currently tied up at work. And I have not used your build environment or gitlab.com yet, so I will probably need some extra time to get set up.
When are you planning the next release?
Hi @elezar, I have just opened the merge request for the above patch: GitLab !263
I have succeeded in building the RPM package and testing the upgrade and downgrade from/to 1.10.0, and it seems that /usr/bin/nvidia-container-runtime-hook is now preserved.
Sorry for the delay. I had a look at the MR yesterday. One small question / comment.
The next non-RC release should go out by the end of the month.
@cvolz since your MR has been merged, I assume your issue has been resolved.
@ywangwxd was your original issue resolved? I am closing this issue in the meantime, but please reopen if it still persists.
This problem has come up again. Ubuntu 20.04, NVIDIA driver 535.86.05. The driver works on the host.
$ nvidia-smi
Mon Sep 18 12:08:34 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A3000 Laptop GPU Off | 00000000:01:00.0 On | N/A |
| N/A 57C P8 17W / 90W | 160MiB / 6144MiB | 26% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
I cannot get the GPU to work with Docker. I have reinstalled Docker and reinstalled the nvidia-container-toolkit. No change.
$ docker run --rm --gpus all ubuntu nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
The hooks are all in place.
$ ls -al /usr/bin/nvidia-container*
-rwxr-xr-x 1 root root 47472 Sep 7 12:06 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 3651080 Sep 7 12:07 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2698280 Sep 7 12:07 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root 38 Sep 20 2022 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
Everything is latest versions:
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Reading package lists... Done
Building dependency tree
Reading state information... Done
containerd.io is already the newest version (1.6.24-1).
docker-buildx-plugin is already the newest version (0.11.2-1~ubuntu.20.04~focal).
docker-ce-cli is already the newest version (5:24.0.6-1~ubuntu.20.04~focal).
docker-ce is already the newest version (5:24.0.6-1~ubuntu.20.04~focal).
docker-compose-plugin is already the newest version (2.21.0-1~ubuntu.20.04~focal).
$ sudo apt-get install nvidia-container-toolkit
Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-container-toolkit is already the newest version (1.14.1-1).
And let this be a lesson in the proper use of docker context: my context was set to use a remote machine. Fixed it by running
docker context use default
The error message could have been more helpful. Then again, if someone can set the context to something else, they can keep track of it.
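For anyone else chasing this symptom, checking the active context is a quick first step (standard Docker CLI commands):
docker context ls    # the active context is marked with an asterisk
docker context use default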
Very good
I am following the official instructions to install the latest nvidia-docker2 and nvidia-container-toolkit. OS: Ubuntu 18.04.
But I cannot start Docker with the NVIDIA driver. The error message is:
could not select device driver "" with capabilities: [[gpu]]
On the host, I have already installed the NVIDIA driver and I can see the device using the nvidia-smi command:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03    Driver Version: 470.141.03    CUDA Version: 11.4    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8    16W / 220W |    233MiB /  7973MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|==============================================================================|
|    0   N/A  N/A      3943      G   /usr/lib/xorg/Xorg                  18MiB |
|    0   N/A  N/A      3976      G   /usr/bin/gnome-shell                71MiB |
|    0   N/A  N/A      4160      G   /usr/lib/xorg/Xorg                 112MiB |
|    0   N/A  N/A      4297      G   /usr/bin/gnome-shell                27MiB |
+-----------------------------------------------------------------------------+
I can also see the device under /dev as follows
/dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools
I checked the log of nvidia-container-cli and I can see the following warning messages:
-- WARNING, the following logs are for debugging purposes only --
I0919 09:04:42.269104 28911 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977) I0919 09:04:42.269187 28911 nvc.c:350] using root / I0919 09:04:42.269210 28911 nvc.c:351] using ldcache /etc/ld.so.cache I0919 09:04:42.269240 28911 nvc.c:352] using unprivileged user 1001:1001 I0919 09:04:42.269299 28911 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0919 09:04:42.269596 28911 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment W0919 09:04:42.270954 28913 nvc.c:273] failed to set inheritable capabilities W0919 09:04:42.271053 28913 nvc.c:274] skipping kernel modules load due to failure I0919 09:04:42.271560 28914 rpc.c:71] starting driver rpc service I0919 09:04:42.276809 28915 rpc.c:71] starting nvcgo rpc service I0919 09:04:42.277317 28911 nvc_info.c:766] requesting driver information with '' I0919 09:04:42.278263 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03 I0919 09:04:42.278294 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.141.03 I0919 09:04:42.278310 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.141.03 I0919 09:04:42.278330 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03 I0919 09:04:42.278346 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03 I0919 09:04:42.278362 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03 I0919 09:04:42.278380 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.141.03 I0919 09:04:42.278397 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03 I0919 09:04:42.278412 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.141.03 I0919 09:04:42.278426 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.141.03 I0919 09:04:42.278440 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.141.03 I0919 09:04:42.278455 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.141.03 I0919 09:04:42.278471 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.141.03 I0919 09:04:42.278487 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03 I0919 09:04:42.278504 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.141.03 I0919 09:04:42.278522 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03 I0919 09:04:42.278544 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03 I0919 09:04:42.278563 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.141.03 I0919 09:04:42.278583 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03 I0919 09:04:42.278604 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03 I0919 09:04:42.278728 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03 I0919 09:04:42.278790 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.141.03 I0919 09:04:42.278813 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.141.03 I0919 09:04:42.278833 28911 nvc_info.c:173] selecting 
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03 I0919 09:04:42.278854 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.141.03 I0919 09:04:42.278887 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.470.141.03 I0919 09:04:42.278905 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03 I0919 09:04:42.278931 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.470.141.03 I0919 09:04:42.278959 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.470.141.03 I0919 09:04:42.278979 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.470.141.03 I0919 09:04:42.279006 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ifr.so.470.141.03 I0919 09:04:42.279033 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.470.141.03 I0919 09:04:42.279053 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.470.141.03 I0919 09:04:42.279074 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.470.141.03 I0919 09:04:42.279094 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.470.141.03 I0919 09:04:42.279120 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.470.141.03 I0919 09:04:42.279145 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.470.141.03 I0919 09:04:42.279165 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.470.141.03 I0919 09:04:42.279186 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.470.141.03 I0919 09:04:42.279222 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.470.141.03 I0919 09:04:42.279255 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.470.141.03 I0919 09:04:42.279277 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.470.141.03 I0919 09:04:42.279297 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03 I0919 09:04:42.279318 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.470.141.03 W0919 09:04:42.279332 28911 nvc_info.c:399] missing library libnvidia-nscq.so W0919 09:04:42.279337 28911 nvc_info.c:399] missing library libcudadebugger.so W0919 09:04:42.279340 28911 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so W0919 09:04:42.279344 28911 nvc_info.c:399] missing library libnvidia-pkcs11.so W0919 09:04:42.279349 28911 nvc_info.c:399] missing library libvdpau_nvidia.so W0919 09:04:42.279354 28911 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W0919 09:04:42.279358 28911 nvc_info.c:403] missing compat32 library libnvidia-nscq.so W0919 09:04:42.279362 28911 nvc_info.c:403] missing compat32 library libcudadebugger.so W0919 09:04:42.279367 28911 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so W0919 09:04:42.279371 28911 nvc_info.c:403] missing compat32 library libnvidia-allocator.so W0919 09:04:42.279376 28911 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so W0919 09:04:42.279380 28911 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W0919 09:04:42.279384 28911 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so W0919 09:04:42.279388 28911 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W0919 09:04:42.279391 28911 nvc_info.c:403] missing compat32 library libnvoptix.so W0919 09:04:42.279395 28911 
nvc_info.c:403] missing compat32 library libnvidia-cbl.so I0919 09:04:42.279667 28911 nvc_info.c:299] selecting /usr/bin/nvidia-smi I0919 09:04:42.279678 28911 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump I0919 09:04:42.279690 28911 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced I0919 09:04:42.279703 28911 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control W0919 09:04:42.279756 28911 nvc_info.c:425] missing binary nv-fabricmanager W0919 09:04:42.279760 28911 nvc_info.c:425] missing binary nvidia-cuda-mps-server I0919 09:04:42.279775 28911 nvc_info.c:343] listing firmware path /lib/firmware/nvidia/470.141.03/gsp.bin I0919 09:04:42.279789 28911 nvc_info.c:529] listing device /dev/nvidiactl I0919 09:04:42.279792 28911 nvc_info.c:529] listing device /dev/nvidia-uvm I0919 09:04:42.279797 28911 nvc_info.c:529] listing device /dev/nvidia-uvm-tools I0919 09:04:42.279800 28911 nvc_info.c:529] listing device /dev/nvidia-modeset I0919 09:04:42.279814 28911 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket W0919 09:04:42.279828 28911 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket W0919 09:04:42.279837 28911 nvc_info.c:349] missing ipc path /tmp/nvidia-mps I0919 09:04:42.279842 28911 nvc_info.c:822] requesting device information with '' I0919 09:04:42.285437 28911 nvc_info.c:713] listing device /dev/nvidia0 (GPU-661838a0-fb69-bf82-164a-6c9ae0dcc7f6 at 00000000:01:00.0) I0919 09:04:42.285446 28911 nvc.c:434] shutting down library context I0919 09:04:42.285493 28915 rpc.c:95] terminating nvcgo rpc service I0919 09:04:42.285765 28911 rpc.c:135] nvcgo rpc service terminated successfully I0919 09:04:42.286026 28914 rpc.c:95] terminating driver rpc service I0919 09:04:42.286086 28911 rpc.c:135] driver rpc service terminated successfully NVRM version: 470.141.03 CUDA version: 11.4
Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3070
Brand:          GeForce
GPU UUID:       GPU-661838a0-fb69-bf82-164a-6c9ae0dcc7f6
Bus Location:   00000000:01:00.0
Architecture:   8.6
The strange thing is that I could successfully use Docker with the NVIDIA GPU before; it failed just after a reboot. Nothing has changed, if my memory is correct. I have also tried reinstalling nvidia-container-toolkit and nvidia-docker2.
What can I do now?