cheyang opened this issue 2 years ago
I also encountered this problem, which has been occurring for some time.
@klueska Could you help take a look? Thanks.
I find these logs during systemd reload:
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:254: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/195:255: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:0: No such file or directory
Jun 30 20:38:51 iZ2zeixjfsr9m8l4nbfzo9Z systemd[1]: Couldn't stat device /dev/char/237:1: No such file or directory
From the major and minor numbers of these devices, I can tell they are the /dev/nvidia* devices. If I manually create the symlinks as follows, the problem disappears:
cd /dev/char
ln -s ../nvidia0 195:0
ln -s ../nvidiactl 195:255
ln -s ../nvidia-uvm 237:0
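If your device numbers differ, a small loop can generate the same links generically. This is only a sketch of the manual workaround above (the loop and naming are mine, not an official tool), and it assumes the /dev/nvidia* nodes already exist:

#!/bin/bash
# Sketch: recreate /dev/char/<major>:<minor> symlinks for all NVIDIA device nodes.
# Run as root; assumes the /dev/nvidia* nodes were already created by the driver.
mkdir -p /dev/char
for dev in /dev/nvidia*; do
    [ -c "$dev" ] || continue                        # skip anything that is not a character device
    major=$(printf '%d' 0x"$(stat -c '%t' "$dev")")  # stat prints the major number in hex
    minor=$(printf '%d' 0x"$(stat -c '%T' "$dev")")  # stat prints the minor number in hex
    ln -sf "../${dev#/dev/}" "/dev/char/${major}:${minor}"
done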
Furthermore, I find that runc converts paths from /dev/nvidia* to /dev/char/*; the logic can be found here: https://github.com/opencontainers/runc/blob/release-1.0/libcontainer/cgroups/systemd/common.go#L177. So I wonder whether the NVIDIA toolkits should provide something like udev rules that trigger the kernel or systemd to create the /dev/char/* -> /dev/nvidia* links?
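If it helps the discussion, a udev rule could in principle re-run such a script whenever the driver is bound; the rule file name, the match, and the script path below are my assumptions, not something the toolkit ships:

# /etc/udev/rules.d/71-nvidia-dev-char.rules (hypothetical file name)
# Re-create the /dev/char symlinks when the nvidia driver binds; the script path
# is assumed to point at something like the sketch above.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/local/sbin/create-nvidia-char-links.sh"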
@elezar
Alternatively, is there a configuration file where we can explicitly set DeviceAllow to the /dev/nvidia* devices in a form that systemd recognizes?
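For completeness, DeviceAllow= is an ordinary systemd resource-control setting, so a drop-in could in principle whitelist the nodes by path; whether the transient scope that the runtime creates for the container would inherit it is exactly the open question, and the unit and file names below are only placeholders:

# Hypothetical drop-in, e.g. /etc/systemd/system/docker.service.d/nvidia-devices.conf
[Service]
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw
DeviceAllow=/dev/nvidia0 rw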
Hey, I have been experiencing this issue for a long time. I solved it by adding --privileged to the containers that need the graphics card. Hope this helps.
Thanks for the response, but I'm not able to set privileged because I'm using this in Kubernetes, and it would let users see all the GPUs.
I fixed this issue in our env (CentOS 8, systemd 239) perfectly with cgroup v2, for both Docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.
I'm using cgroups v2 myself so I would be interested in hearing what you did @gengwg
Sure, here are the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your env.
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
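For anyone who only wants the gist of it, my understanding is that the switch boils down to a kernel command-line change plus a reboot; the commands below are a rough summary assuming a CentOS/RHEL-style system with grubby (other distros edit GRUB_CMDLINE_LINUX instead):

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
sudo reboot
# after the reboot, verify that the unified (v2) hierarchy is mounted:
stat -fc %T /sys/fs/cgroup/    # should print "cgroup2fs"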
In that case, whatever trigger you're seeing apparently isn't the same as mine, since all your instructions do is switch from cgroups v1 to v2. I'm already on cgroups v2 here on Debian 11 (bullseye), and I know that just having cgroups v2 enabled doesn't fix anything for me.
# systemctl --version
systemd 247 (247.3-7+deb11u1)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.18.8
libseccomp: 2.5.1
# containerd --version
containerd containerd.io 1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
# uname -a
Linux athena 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
Yeah, I do see some people still reporting it on v2, for example this.
Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, NVIDIA runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused this issue. In our case, moving from v1 to v2 seems to have fixed it so far, for a week or so. I will keep monitoring in case it comes back.
It has been over a week. Did you see the error again?
How do I get these logs to find the device numbers for my use case?
@matifali You can simply use ls -l /dev/nvidia* to find the device numbers. For example:

ls -l /dev/vcsa3
crw-rw---- 1 root tty 7, 131 Jul 13 19:40 /dev/vcsa3

Here, 7, 131 are the major and minor device numbers for this device.
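Another way to cross-check the major numbers, even when the device nodes are missing, is the kernel's own registry of character devices; the exact driver names and the uvm major vary by driver version, so the output below is just an example:

grep -i nvidia /proc/devices
# 195 nvidia-frontend
# 237 nvidia-uvm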
I've just fixed the same issue on Ubuntu 22.04 by changing my docker-compose file. Keep cgroup v2 by commenting out the no-cgroups = false line in /etc/nvidia-container-runtime/config.toml, then change your docker-compose file as follows: mount /dev into the container at /dev, set privileged: true, and specify the runtime with runtime: nvidia.
Your final docker-compose file then looks like this:

version: '3'
services:
  nvidia:
    image:
    privileged: true
    runtime: nvidia
    volumes:
      - /dev:/dev

And the magic just happened! Before these changes, when I called systemctl daemon-reload, nvidia-smi worked on the host, but running nvidia-smi inside the container gave Failed to initialize NVML: Unknown Error. Now systemctl daemon-reload no longer affects NVML initialization in the container.
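A quick way to check whether the change actually holds (the service name nvidia comes from the compose file above; the compose v2 CLI is assumed):

docker compose up -d
docker compose exec nvidia nvidia-smi   # should succeed right after startup
sudo systemctl daemon-reload            # the reload that used to break NVML
docker compose exec nvidia nvidia-smi   # should still succeed if the workaround holds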
And what if we are not using docker-compose, @RezaImany? I am using Terraform to provision with the gpus="all" flag.
Exposing all devices to the container isn't a good approach, and neither is privileged=true.
The root cause of this error is that the device cgroup controller does not allow the container to reconnect to NVML until it is restarted; you have to modify the cgroup configuration to bypass some of those limitations. The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do.
For my use case, multiple people are using the same machine, and setting privileged=true is not a good idea because the isolation between users is gone. Is there any other way?
Hello,
What is the status of this problem? I still have the same issue on cgroup v2...
# systemctl --version
systemd 249 (249.11-0ubuntu3.11)
# dpkg -l | grep libnvidia-container
ii libnvidia-container-tools 1.14.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.14.3-1 amd64 NVIDIA container runtime library
# runc --version
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev
go: go1.20.8
libseccomp: 2.5.3
# containerd --version
containerd containerd.io 1.6.24 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
# uname -a
Linux toor 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# docker info
...
Cgroup Driver: systemd
Cgroup Version: 2
...
@slapshin Have you followed this approach? https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
I can't set privileged: true because of requirements. Also, I'm already on cgroups v2...
https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1740502744 - it is working for me
1. Issue or feature description
Failed to initialize NVML: Unknown Error does not occur when the NVIDIA docker container is first created, but it happens after calling systemctl daemon-reload.
It works fine with kernel 4.19.91 and systemd 219, but it does not work with kernel 5.10.23 and systemd 239.
I tried to monitor it with bpftrace. During container startup, I can see the corresponding event, and I can see the devices.list in the container. But after running systemctl daemon-reload, I find another event, and in the container's devices.list the GPU device is no longer rw.
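For reference, this is how the device allow-list can be read from inside the container under cgroup v1; the entries below are only an illustration:

cat /sys/fs/cgroup/devices/devices.list
# before daemon-reload the NVIDIA entries look something like:
# c 195:0 rw
# c 195:255 rw
# c 237:0 rw
# after daemon-reload those entries are gone, so the container can no longer open the GPU devices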
Currently I'm not able to use cgroup v2. Any suggestions? Thanks very much.
2. Steps to reproduce the issue
Run container
Check nvidia-smi
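A minimal sketch of those two steps (the CUDA image tag is an assumption; any GPU-enabled image should do):

docker run -d --gpus all --name nvml-test nvidia/cuda:11.8.0-base-ubuntu22.04 sleep infinity
docker exec nvml-test nvidia-smi   # works right after the container starts
sudo systemctl daemon-reload       # reload systemd unit files on the host
docker exec nvml-test nvidia-smi   # now fails with "Failed to initialize NVML: Unknown Error"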
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
nvidia-container-cli -V