cheyang opened this issue 2 years ago
I also encountered this problem, which has been occurring for some time.
@klueska Could you help take a look? Thanks.
I find these logs during systemd reload:
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:254: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:255: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:1: No such file or directory
From the major and minor numbers of these devices, I can tell they are the /dev/nvidia* devices. If I manually create the symlinks as follows, the problem disappears:
cd /dev/char
ln -s ../nvidia0 195:0
ln -s ../nvidiactl 195:255
ln -s ../nvidia-uvm 237:0
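If your device numbers differ, a small loop can generate the same links generically. This is only a sketch of the manual workaround above (the loop and naming are mine, not an official tool), and it assumes the /dev/nvidia* nodes already exist:

#!/bin/bash
# Sketch: recreate /dev/char/<major>:<minor> symlinks for all NVIDIA device nodes.
# Run as root; assumes the /dev/nvidia* nodes were already created by the driver.
mkdir -p /dev/char
for dev in /dev/nvidia*; do
    [ -c "$dev" ] || continue                        # skip anything that is not a character device
    major=$(printf '%d' 0x"$(stat -c '%t' "$dev")")  # stat prints the major number in hex
    minor=$(printf '%d' 0x"$(stat -c '%T' "$dev")")  # stat prints the minor number in hex
    ln -sf "../${dev#/dev/}" "/dev/char/${major}:${minor}"
done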
Furthermore, I find that runc converts paths from /dev/nvidia* to /dev/char/*; the logic can be found here: https://github.com/opencontainers/runc/blob/release-1.0/libcontainer/cgroups/systemd/common.go#L177. So I wonder whether the NVIDIA toolkits should provide something like udev rules that trigger the kernel or systemd to create the /dev/char/* -> /dev/nvidia* links?
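If it helps the discussion, a udev rule could in principle re-run such a script whenever the driver is bound; the rule file name, the match, and the script path below are my assumptions, not something the toolkit ships:

# /etc/udev/rules.d/71-nvidia-dev-char.rules (hypothetical file name)
# Re-create the /dev/char symlinks when the nvidia driver binds; the script path
# is assumed to point at something like the sketch above.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/local/sbin/create-nvidia-char-links.sh"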
@elezar
Alternatively, is there a configuration file where we can explicitly set DeviceAllow to the /dev/nvidia* devices in a form that systemd recognizes?
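For completeness, DeviceAllow= is an ordinary systemd resource-control setting, so a drop-in could in principle whitelist the nodes by path; whether the transient scope that the runtime creates for the container would inherit it is exactly the open question, and the unit and file names below are only placeholders:

# Hypothetical drop-in, e.g. /etc/systemd/system/docker.service.d/nvidia-devices.conf
[Service]
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw
DeviceAllow=/dev/nvidia0 rw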
Hey, I have been experiencing this issue for a long time. I solved it by adding --privileged to the containers that need the graphics card. Hope this helps.
Thanks for the response, but I'm not able to set privileged because I'm using this in Kubernetes, and it would let users see all the GPUs.
I fixed this issue in our env (CentOS 8, systemd 239) perfectly with cgroup v2, for both Docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.
I'm using cgroups v2 myself so I would be interested in hearing what you did @gengwg
Sure, here are the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your env.
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
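For anyone who only wants the gist of it, my understanding is that the switch boils down to a kernel command-line change plus a reboot; the commands below are a rough summary assuming a CentOS/RHEL-style system with grubby (other distros edit GRUB_CMDLINE_LINUX instead):

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
sudo reboot
# after the reboot, verify that the unified (v2) hierarchy is mounted:
stat -fc %T /sys/fs/cgroup/    # should print "cgroup2fs"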
In that case, whatever trigger you're seeing apparently isn't the same as mine, since all your instructions do is switch from cgroups v1 to v2. I'm already on cgroups v2 here on Debian 11 (bullseye), and I know that just having cgroups v2 enabled doesn't fix anything for me.
# systemctl --version
systemd 247 (247.3-7+deb11u1)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.18.8
libseccomp: 2.5.1
# containerd --version
containerd containerd.io 1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
# uname -a
Linux athena 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
Yeah, I do see some people still reporting it on v2, for example this.
Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, NVIDIA runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused this issue. In our case, moving from v1 to v2 seems to have fixed it so far, for a week or so. I will keep monitoring in case it comes back.
It has been over a week. Did you see the error again?
How do I get these logs to find the device numbers for my use case?
@matifali You can simply use ls -l /dev/nvidia* to find the device numbers. For example:

ls -l /dev/vcsa3
crw-rw---- 1 root tty 7, 131 Jul 13 19:40 /dev/vcsa3

Here, 7, 131 are the major and minor device numbers for this device.
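Another way to cross-check the major numbers, even when the device nodes are missing, is the kernel's own registry of character devices; the exact driver names and the uvm major vary by driver version, so the output below is just an example:

grep -i nvidia /proc/devices
# 195 nvidia-frontend
# 237 nvidia-uvm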
I've just fixed the same issue on Ubuntu 22.04 by changing my docker-compose file. Keep cgroup v2 by commenting out the no-cgroups = false line in /etc/nvidia-container-runtime/config.toml, then change your docker-compose file as follows: mount /dev into the container at /dev, set privileged: true, and specify the runtime with runtime: nvidia.
Your final docker-compose file then looks like this:

version: '3'
services:
  nvidia:
    image:
    privileged: true
    runtime: nvidia
    volumes:
      - /dev:/dev

And the magic just happened! Before these changes, when I called systemctl daemon-reload, nvidia-smi worked on the host, but running nvidia-smi inside the container gave Failed to initialize NVML: Unknown Error. Now systemctl daemon-reload no longer affects NVML initialization in the container.
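A quick way to check whether the change actually holds (the service name nvidia comes from the compose file above; the compose v2 CLI is assumed):

docker compose up -d
docker compose exec nvidia nvidia-smi   # should succeed right after startup
sudo systemctl daemon-reload            # the reload that used to break NVML
docker compose exec nvidia nvidia-smi   # should still succeed if the workaround holds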
And what if we are not using docker-compose, @RezaImany? I am using Terraform to provision with the gpus="all" flag.
Exposing all devices to the container isn't a good approach, and neither is privileged=true.
The root cause of this error is that the device cgroup controller does not allow the container to reconnect to NVML until it is restarted; you have to modify the cgroup configuration to bypass some of those limitations. The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do.
For my use case, multiple people are using the same machine, and setting privileged=true is not a good idea because the isolation between users is gone. Is there any other way?
Hello,
What is the status of this problem? I still have the same issue on cgroup v2...
# systemctl --version
systemd 249 (249.11-0ubuntu3.11)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.14.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.14.3-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev
go: go1.20.8
libseccomp: 2.5.3
# containerd --version
containerd containerd.io 1.6.24 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
# uname -a
Linux toor 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# docker info
...
Cgroup Driver: systemd
Cgroup Version: 2
...
@slapshin Have you followed this approach? https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
I can't set privileged: true because of requirements. Also, I'm already on cgroups v2...
https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1740502744 - it is working for me
1. Issue or feature description
Failed to initialize NVML: Unknown Error does not occur when the NVIDIA docker container is first created, but it happens after calling systemctl daemon-reload.
It works fine with kernel 4.19.91 and systemd 219, but it does not work with kernel 5.10.23 and systemd 239.
I tried to monitor it with bpftrace. During container startup, I can see the corresponding event, and I can see the devices.list in the container. But after running systemctl daemon-reload, I find another event, and in the container's devices.list the GPU device is no longer rw.
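For reference, this is how the device allow-list can be read from inside the container under cgroup v1; the entries below are only an illustration:

cat /sys/fs/cgroup/devices/devices.list
# before daemon-reload the NVIDIA entries look something like:
# c 195:0 rw
# c 195:255 rw
# c 237:0 rw
# after daemon-reload those entries are gone, so the container can no longer open the GPU devices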
Currently I'm not able to use cgroup v2. Any suggestions? Thanks very much.
2. Steps to reproduce the issue
Run container
Check nvidia-smi
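A minimal sketch of those two steps (the CUDA image tag is an assumption; any GPU-enabled image should do):

docker run -d --gpus all --name nvml-test nvidia/cuda:11.8.0-base-ubuntu22.04 sleep infinity
docker exec nvml-test nvidia-smi   # works right after the container starts
sudo systemctl daemon-reload       # reload systemd unit files on the host
docker exec nvml-test nvidia-smi   # now fails with "Failed to initialize NVML: Unknown Error"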
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
nvidia-container-cli -V