I have already confirmed that `libnvidia-ml.so.1` is correctly installed on the worker nodes:
```
~# ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
lrwxrwxrwx 1 root root 26 Oct 12 14:27 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.525.105.17
~# ldconfig -p | grep "libnvidia-ml.so.1"
        libnvidia-ml.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so.1
```
I researched this today. It seems that `default_runtime_name` on line 79 needs to be changed to `nvidia`.
I tried this, and nvidia-device-plugin can now detect the GPU devices (after recreating the daemonset pods):
```bash
sudo containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
sudo sed -i 's/default_runtime_name = "runc"/default_runtime_name = "nvidia"/' /etc/containerd/config.toml
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet
```
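(For reference, one way to double-check that the plugin now advertises GPUs; the label selector below is an assumption and may differ depending on how the device plugin was deployed:)

```bash
# Check that the nvidia-device-plugin pods are running (label selector is an assumption)
kubectl -n kube-system get pods -l app.kubernetes.io/name=nvidia-device-plugin
# Check that the node now advertises the nvidia.com/gpu resource
kubectl describe node <worker-node-name> | grep -i "nvidia.com/gpu"
```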
Another option would be to create a RuntimeClass associated with the nvidia runtime. This would have to be added to the pod spec for the device plugin.
Note that the `--set-as-default` flag could also be used to set the default when running `nvidia-ctk runtime configure`.
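A minimal sketch of the RuntimeClass approach, assuming the runtime is registered in containerd under the name `nvidia` as above:

```bash
# Create a RuntimeClass whose handler matches the containerd runtime name
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
# Pods that should use it (e.g. the device plugin daemonset) would then set
# "runtimeClassName: nvidia" in their pod spec.
```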
@elezar Thanks. So is this the correct way to configure the nvidia runtime in config.toml for containerd?
```bash
sudo containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
```
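(As a quick sanity check, not an official step, the result can be inspected afterwards:)

```bash
grep -E 'default_runtime_name|SystemdCgroup' /etc/containerd/config.toml
```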
I confirmed that if /etc/containerd/config.toml doesn't exist, the `nvidia-ctk runtime configure` command only generates a file containing the differences, and the containerd service fails to start with it.
@wanghm if the file doesn't exist, the NVIDIA Container Runtime will create the file with only the NVIDIA Container Runtime-specific configuration changes. As far as I am aware, containerd should generate a default config in memory and then apply the changes from any files loaded.
@elezar Thanks for your help.
Yes, if the file doesn't exist, `sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default` creates the file with only the NVIDIA Container Runtime configuration. It's the same as:
https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
I tested it again; the containerd and kubelet services seem to work. But after restarting the node, some of the pods (CNI and CSI related) become unstable, with their status alternating between CrashLoopBackOff and Running.
I tried adding `SystemdCgroup = true` as the last line, and now it's working.
Can you please confirm that this is the correct config file we should use?
```toml
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true
```
@wanghm yes, this looks correct. In cases where the file exists, we infer settings such as `SystemdCgroup = true` from the `runc` settings defined in the config.
I will create an internal ticket to track handling this correctly.
I believe older versions of containerd had `SystemdCgroup = false` by default, and newer ones have flipped it to `SystemdCgroup = true` by default.
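(One way to check which default a given containerd build emits locally:)

```bash
containerd config default | grep SystemdCgroup
```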
Maybe the right thing to do if no config.toml file exists is to just run `containerd config default` into an in-memory buffer and then do exactly what we normally would have done if we had read the config.toml file from disk.
Thank you.
For now, I'll delete /etc/containerd/config.toml first, run `sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default`, and then add `SystemdCgroup = true` manually.
It would be helpful if the command could generate that as well.
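For reference, that manual workaround as a sketch; appending at the end of the file only works because, in the generated config shown above, the nvidia runtime options table is the last section:

```bash
sudo rm -f /etc/containerd/config.toml
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
# Append SystemdCgroup = true to the end of the file; it lands inside the
# [...runtimes.nvidia.options] table because that is the last table generated.
echo '            SystemdCgroup = true' | sudo tee -a /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl restart kubelet
```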
Closing this issue.
Issue
After using the config.toml generated automatically according to the new installation guide (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.14.1/install-guide.html), the vGPU (CUDA) test fails. It may also be affecting the normal operation of the worker nodes: sometimes they falsely report overload, causing nodes to become NotReady and making it impossible to deploy Pods.
Can anyone confirm the config files and tell us how they should be configured?
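(The failing test pod spec isn't reproduced here; purely as an illustration, a minimal GPU test pod typically looks like the following, where the image tag and resource request are assumptions:)

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    # Example image; any CUDA sample that requests nvidia.com/gpu will do
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-test
```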
Environment info
OS: Ubuntu 22.04.3 LTS (kernel 5.15.0-79-generic)
containerd: 1.6.22
Kubernetes: 1.26.0
```
~# nvidia-smi
NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0
~# nvidia-container-cli -V
cli-version: 1.14.2
lib-version: 1.14.2
```
Containerd Configuration Files:
config.toml_20230828_default_runc.toml https://gist.github.com/wanghm/873a14cf6ac637d4507d9d5ee797b970
config.toml_nvidia_202308_manual_created.toml https://gist.github.com/wanghm/996002594502f354df53201728e1b946
config.toml_nvidia_20231012_auto_generated_by_ctk1.14.2.toml https://gist.github.com/wanghm/42f2fdee9fa25fa74237e6a631e0f67b
Error Messages:
From test pod:
From nvidia-device-plugin: