NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Failed to initialize NVML: Unknown Error when changing runtime from docker to containerd #322

Open zvier opened 2 years ago

zvier commented 2 years ago

1. Issue or feature description

After changing the k8s container runtime from docker to containerd, executing nvidia-smi in a k8s GPU pod returns Failed to initialize NVML: Unknown Error and the pod cannot work properly.

2. Steps to reproduce the issue

I configured containerd following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2. The containerd config diff is:

--- config.toml 2020-12-17 19:13:03.242630735 +0000
+++ /etc/containerd/config.toml 2020-12-17 19:27:02.019027793 +0000
@@ -70,7 +70,7 @@
   ignore_image_defined_volumes = false
   [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
-      default_runtime_name = "runc"
+      default_runtime_name = "nvidia"
      no_pivot = false
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
@@ -94,6 +94,15 @@
         privileged_without_host_devices = false
         base_runtime_spec = ""
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+            SystemdCgroup = true
+       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+          privileged_without_host_devices = false
+          runtime_engine = ""
+          runtime_root = ""
+          runtime_type = "io.containerd.runc.v1"
+          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+            BinaryName = "/usr/bin/nvidia-container-runtime"
+            SystemdCgroup = true
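
After applying this diff, containerd needs to be restarted so that the new default runtime takes effect. A minimal check (assuming containerd is managed by systemd; the grep is only illustrative):

sudo systemctl restart containerd
containerd config dump | grep default_runtime_name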

Then I ran the basic test case with the ctr command; it passed and returned the expected output.

ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04  
ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

When the GPU pod was created from k8s, the pod also started running, but executing nvidia-smi in the pod returned Failed to initialize NVML: Unknown Error. The test pod yaml is:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
    command:
      - sleep
      - "3600"
    resources:
      limits:
         nvidia.com/gpu: 1
  nodeName: test-node
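
For reference, a minimal way to create the pod and reproduce the error (the file name gpu-operator-test.yaml is just an example):

kubectl apply -f gpu-operator-test.yaml
kubectl exec gpu-operator-test -- nvidia-smi
# returns: Failed to initialize NVML: Unknown Error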

3. Information to attach (optional if deemed irrelevant)

I think the nvidia config on my host is correct. The only change is the container runtime: we use containerd directly instead of docker. If we use docker as the runtime, it works well.


elezar commented 2 years ago

Note that the following command doesn't use the same code path for injecting GPUs as what K8s does.

ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

Would it be possible to test this with nerdctl instead, or to ensure that the runtime is set instead of using the --gpus 0 flag?

Also, could you provide the version of the device plugin you are using, the driver version, and the version of the NVIDIA Container Toolkit?

zvier commented 2 years ago

Two test cases for the above suggestions:

1. Use nerdctl instead of ctr

nerdctl run --network=host --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

No devices were found

2. Use --runtime io.containerd.runc.v1 instead of --gpus 0

ctr run --runtime io.containerd.runc.v1 --rm -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

which nvidia-smi
/bin/nvidia-smi


Device plugin:

nvidia-k8s-device-plugin:1.0.0-beta6


NVIDIA packages version:

rpm -qa 'nvidia'
libnvidia-container-tools-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
libnvidia-container1-1.3.1-1.x86_64
nvidia-docker2-2.5.0-1.noarch
nvidia-container-toolkit-1.4.0-2.x86_64

NVIDIA container library version: 

nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

zvier commented 2 years ago

We straced the nvidia-smi process in the container and found that access to the /dev/nvidiactl device is not permitted.


elezar commented 2 years ago

@zvier those are very old versions for all the packages and the device plugin.

Would you be able to try with the latest versions:

  • nvidia-container-toolkit, libnvidia-container-tools, and libnvidia-container1: v1.10.0
  • device-plugin: v0.12.0

zvier commented 2 years ago

Those versions also do not work. But it can work if I add a securityContext field to my pod yaml like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
    command:
      - sleep
      - "36000"
    resources:
      limits:
         nvidia.com/gpu: 1
    securityContext:
      privileged: true
  nodeName: test-node-1

elezar commented 2 years ago

So to summarise: if you update the versions to the latest AND run the test pod as privileged, then you're able to run nvidia-smi in the container.

This is expected since this would mount all of /dev/nv* into the container regardless and would then avoid the permission errors on /dev/nvidiactl.
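A quick way to confirm this in the privileged pod (the pod name is the test pod from the yaml above):

kubectl exec gpu-operator-test -- ls -l /dev/nvidia*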

Could you enable debug output for the nvidia-container-cli by uncommenting the #debug = lines in /etc/nvidia-container-runtime/config.toml and then including the output from /var/log/nvidia-container-toolkit.log here?
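
For reference, after uncommenting, the relevant entries in that file look like this (these log paths are the commented-out defaults):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"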

You should also be able to use ctr directly in this case by running something like:

sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi

(note how the runc-binary is set to the nvidia-container-runtime).

zvier commented 2 years ago

ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi

If I test the pod with privileged set, updating the nvidia package versions is not needed.

After uncommenting the #debug = lines in /etc/nvidia-container-runtime/config.toml and running the ctr run command, it prints OK. The output of /var/log/nvidia-container-toolkit.log is:

{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/default/cuda-11.0.3-base-ubuntu20.04/config.json","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:49+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:49+08:00"}

If my container runtime is containerd, the /etc/nvidia-container-runtime/config.toml is:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

If my container runtime is dockerd, the /etc/nvidia-container-runtime/config.toml is:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

yangfeiyu20102011 commented 2 years ago

@elezar Hi, I have encountered a similar problem. The permissions of /dev/nvidia* are 'rw', but nvidia-smi fails.


I find that the permissions in devices.list are not right.


I tried, as root, to run:

echo "c 195:* rwm" > /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf9413023_9640_4bd8_b76f_b1b629642012.slice/cri-containerd-c33389a1c755d1d6fe2de531890db4bc5e821e41646ac6d2ff7aa83662f00c9e.scope/devices.allow

and the devices.list file in the same cgroup scope changed as expected.


But after a moment, devices.list was restored. Maybe that is the problem: kubelet and containerd may update the cgroup device rules at regular intervals. How can this be solved? Thanks!

klueska commented 2 years ago

Are you running the plugin with the --pass-device-specs option? This flag was designed to avoid this exact issue: https://github.com/NVIDIA/k8s-device-plugin#as-command-line-flags-or-envvars
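
For reference, with a static DaemonSet deployment this can be enabled through the PASS_DEVICE_SPECS environment variable (or the --pass-device-specs flag); a minimal sketch of the plugin container spec, where the image tag and container name are only illustrative:

      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
        env:
        - name: PASS_DEVICE_SPECS
          value: "true"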

yangfeiyu20102011 commented 2 years ago

I find that a runc update may also change devices.list. setUnitProperties(m.dbus, unitName, properties...) changes devices.list through systemd. The properties are built by genV1ResourcesProperties, and the device properties will include entry.Path = fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor), so the rule passed to systemd has the form DeviceAllow=/dev/char/195:255 rw, but /dev/nvidiactl cannot be found via /dev/char/195:255.

I want to make a PR to runc along these lines:

// "n:m" rules are just a path in /dev/{block,char}/.
switch rule.Type {
case devices.BlockDevice:
    entry.Path = fmt.Sprintf("/dev/block/%d:%d", rule.Major, rule.Minor)
case devices.CharDevice:
    entry.Path = getCharEntryPath(rule)
}

// isNVIDIADevice reports whether the rule refers to an NVIDIA device
// (NVIDIA devices have major 195 and 507).
func isNVIDIADevice(rule *devices.Rule) bool {
    if rule.Major == 195 || rule.Major == 507 {
        return true
    }
    return false
}

// getNVIDIAEntryPath maps NVIDIA major/minor numbers to the actual device
// node names instead of the /dev/char/<major>:<minor> form.
func getNVIDIAEntryPath(rule *devices.Rule) string {
    str := "/dev/"
    switch rule.Major {
    case 195:
        switch rule.Minor {
        case 254:
            str = str + "nvidia-modeset"
        case 255:
            str = str + "nvidiactl"
        default:
            str = str + "nvidia" + strconv.Itoa(int(rule.Minor))
        }
    case 507:
        switch rule.Minor {
        case 0:
            str = str + "nvidia-uvm"
        case 1:
            str = str + "nvidia-uvm-tools"
        }
    }
    return str
}

func getCharEntryPath(rule *devices.Rule) string {
    if isNVIDIADevice(rule) {
        return getNVIDIAEntryPath(rule)
    }
    return fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
}

Have you encountered the same problem? Thank you! @klueska

gwgrisk commented 1 year ago

@klueska Hi, I have encountered the same problem. I used the command cat /var/lib/kubelet/cpu_manager_state and got the following output:

{"policyName":"none","defaultCpuSet":"","checksum":1353318690}

Does this mean that the issue with the cpuset does not exist, and therefore it is not necessary to pass the PASS_DEVICE_SPECS parameter when starting?

zvier commented 1 year ago

This PR has fixed this problem.

elezar commented 1 year ago

Thanks for the confirmation @zvier.

@gwgrisk Note that with newer versions of systemd and systemd cgroup management, it is also required to specify the PASS_DEVICE_SPECS option. It is thus no longer limited to interactions with the CPU Manager, since any systemd reload will cause a container to lose access to the underlying device nodes in this case.
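
For example, a minimal reproduction sketch based on the statement above, assuming the plugin is running without PASS_DEVICE_SPECS and reusing the test pod name from earlier in this thread:

kubectl exec gpu-operator-test -- nvidia-smi   # works right after the pod starts
sudo systemctl daemon-reload                   # any systemd reload on the node
kubectl exec gpu-operator-test -- nvidia-smi   # now fails with: Failed to initialize NVML: Unknown Error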

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.