NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

RKE2: [pre-installed drivers+container-toolkit] error creating symlinks #569

Open DevKyleS opened 1 year ago

DevKyleS commented 1 year ago

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html

1. Quick Debug Information

2. Issue or feature description

Deploying gpu-operator on RKE2 with pre-installed drivers and the container toolkit, the nvidia-operator-validator container fails with Error: error validating driver installation: error creating symlinks.

level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

3. Steps to reproduce the issue

$ sudo apt-get install -y nvidia-driver-535-server nvidia-container-toolkit
$ sudo shutdown -r now
$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
    --set psp.enabled=true
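
After the install, the failure surfaces in the driver-validation init container; a quick way to pull it up (label and container names are taken from the daemonset description below):

$ kubectl get pods -n gpu-operator
$ kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c driver-validation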

4. Information to attach (optional if deemed irrelevant)

spectrum@spectrum:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
spectrum@spectrum:~$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 all [installed,automatic]
libnvidia-compute-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-container1/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-decode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-encode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-extra-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-fbc1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-gl-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-compute-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-container-toolkit-base/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,bionic,now 1.13.5-1 amd64 [installed]
nvidia-dkms-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-driver-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed]
nvidia-firmware-535-server-535.54.03/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-source-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
spectrum@spectrum:~$ cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
spectrum@spectrum:~$ nvidia-container-cli info
NVRM version:   535.54.03
CUDA version:   12.2

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce GTX 1650
Brand:          GeForce
GPU UUID:       GPU-648ac414-633e-cf39-d315-eabd271dfad1
Bus Location:   00000000:01:00.0
Architecture:   7.5
spectrum@spectrum:~$ kubectl logs -n gpu-operator -p nvidia-operator-validator-j8kvt --all-containers=true
time="2023-08-17T05:24:02Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Wed Aug 16 23:24:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P0              11W /  75W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
time="2023-08-17T05:24:03Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidiactl already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-modeset already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm-tools already exists"
time="2023-08-17T05:24:03Z" level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""
Error from server (BadRequest): previous terminated container "toolkit-validation" in pod "nvidia-operator-validator-j8kvt" not found

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

# tree /run/nvidia
/run/nvidia
├── driver
└── validations

2 directories, 0 files
spectrum@spectrum:/tmp/nvidia-gpu-operator_20230816_2329 $ cat gpu_operand_ds_nvidia-operator-validator.descr
Name:           nvidia-operator-validator
Selector:       app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator
Node-Selector:  nvidia.com/gpu.deploy.operator-validator=true
Labels:         app=nvidia-operator-validator
                app.kubernetes.io/managed-by=gpu-operator
                app.kubernetes.io/part-of=gpu-operator
                helm.sh/chart=gpu-operator-v23.6.0
Annotations:    deprecated.daemonset.template.generation: 1
                nvidia.com/last-applied-hash: fa2bb82bef132a9a
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=nvidia-operator-validator
                    app.kubernetes.io/managed-by=gpu-operator
                    app.kubernetes.io/part-of=gpu-operator
                    helm.sh/chart=gpu-operator-v23.6.0
  Service Account:  nvidia-operator-validator
  Init Containers:
   driver-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
   toolkit-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
   cuda-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:            (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
   plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:            (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
  Containers:
   nvidia-operator-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    Environment:  <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
  Volumes:
   run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
   driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
   host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
   host-dev-char:
    Type:               HostPath (bare host directory volume)
    Path:               /dev/char
    HostPathType:
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  11m   daemonset-controller  Created pod: nvidia-operator-validator-j8kvt
spectrum@spectrum:~$ sudo nvidia-ctk system create-dev-char-symlinks
INFO[0000] Creating link /dev/char/195:254 => /dev/nvidia-modeset
WARN[0000] Could not create symlink: symlink /dev/nvidia-modeset /dev/char/195:254: file exists
INFO[0000] Creating link /dev/char/507:0 => /dev/nvidia-uvm
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm /dev/char/507:0: file exists
INFO[0000] Creating link /dev/char/507:1 => /dev/nvidia-uvm-tools
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm-tools /dev/char/507:1: file exists
INFO[0000] Creating link /dev/char/195:0 => /dev/nvidia0
WARN[0000] Could not create symlink: symlink /dev/nvidia0 /dev/char/195:0: file exists
INFO[0000] Creating link /dev/char/195:255 => /dev/nvidiactl
WARN[0000] Could not create symlink: symlink /dev/nvidiactl /dev/char/195:255: file exists
INFO[0000] Creating link /dev/char/511:1 => /dev/nvidia-caps/nvidia-cap1
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap1 /dev/char/511:1: file exists
INFO[0000] Creating link /dev/char/511:2 => /dev/nvidia-caps/nvidia-cap2
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap2 /dev/char/511:2: file exists
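
For reference, the PCI function named in the error can be inspected directly on the host. On consumer cards, function .1 at the GPU's slot is usually the HDMI audio controller, which would explain the unrecognized device id (slot address taken from the error above):

$ lspci -nn -s 01:00.1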
elezar commented 1 year ago

Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:

      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"

to the validator.driver.env.
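
For an operator that is already deployed, the same change can be applied to the ClusterPolicy in place. A minimal sketch, assuming the resource keeps the chart's default name cluster-policy and that validator.driver.env is otherwise empty (a merge patch replaces the whole list):

$ kubectl patch clusterpolicy cluster-policy --type merge \
    -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'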

cc @cdesiniotis

DevKyleS commented 1 year ago

I finally got it working after removing /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl (by renaming it). I'll dig more into the differences later.

$ mv config.toml.tmpl config.toml.tmpl-nvidia
$ sudo service containerd restart
$ sudo service rke2-server restart
$ helm uninstall gpu-operator -n gpu-operator
$ helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
    --set psp.enabled=true \
    --set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
    --set-string validator.driver.env[0].value=true
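
To double-check that the variable actually landed in the deployed ClusterPolicy (the resource name below is the chart default; adjust if yours differs):

$ kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.validator.driver.env}'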

Looks like the original rke2 containerd config differs from the tmpl file I had created per the guidance at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html#bare-metal-passthrough-with-pre-installed-drivers-and-nvidia-container-toolkit

But it still works...

$ sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 \
    cuda-22.2.0-base-ubuntu22.04 nvidia-smi
Thu Aug 17 17:36:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| 34%   36C    P8               7W /  75W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@spectrum:/var/lib/rancher/rke2/agent/etc/containerd# cat config.toml

# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "index.docker.io/rancher/pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
DevKyleS commented 1 year ago

Still investigating... It looks like I somehow have multiple versions of nvidia-container-runtime installed. This still isn't fully working, but the node/containers can start now (they couldn't before).
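
Both binaries show up in the configs above; listing them side by side makes the duplication visible (paths taken from the two containerd configs in this thread, version flag assumed to be supported by both builds):

$ ls -l /usr/bin/nvidia-container-runtime /usr/local/nvidia/toolkit/nvidia-container-runtime
$ /usr/bin/nvidia-container-runtime --version
$ /usr/local/nvidia/toolkit/nvidia-container-runtime --version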

DevKyleS commented 1 year ago

After upgrading to v23.6.1, I'm no longer able to reproduce this issue.

After reading https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.1/release-notes.html#fixed-issues I attempted this with the new version. My issue has been resolved.

@elezar I think this can be closed as it appears the v23.6.1 release has fixed this problem.

cmontemuino commented 10 months ago

I still can reproduce this issue with version v23.9.0.

In our case the NVIDIA drivers come pre-installed, and I can see devices /dev/nvidia*

armaneshaghi commented 9 months ago

Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:

      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"

to the validator.driver.env.

cc @cdesiniotis

I tried this on my cluster policy and restarted the cluster but still get the same error. I am using version 23.9.0

armaneshaghi commented 9 months ago

I still can reproduce this issue with version v23.9.0.

In our case the NVIDIA drivers come pre-installed, and I can see devices /dev/nvidia*

Did you find a workaround?

cmontemuino commented 9 months ago

      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"

This is what we have in our values.yaml:

validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
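
One way to roll this out against an existing release (release and namespace names as used earlier in the thread; --reuse-values keeps the previously set chart values):

$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values -f values.yaml
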
tinklern commented 9 months ago

I hit this today on v23.9.1

Adding DISABLE_DEV_CHAR_SYMLINK_CREATION resolved it in my case.

That said - the release notes say this should have been fixed in 23.6.1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.9.1/release-notes.html#id8

And I am definitely still seeing it in 23.9.1

Mostly consumer GPUs (RTX2080s) on my nodes.

danieljkemp commented 4 months ago

Just encountered this with a Tesla P4 on v23.9.1/rke2 v1.29

CoderTH commented 3 months ago

We also encountered the same problem in v23.9.0. I manually modified the DISABLE_DEV_CHAR_SYMLINK_CREATION parameter as prompted, and the container-toolkit works normally. However, the toolkit-validation check of nvidia-operator-validator fails, and the following error message is displayed:

[screenshot: toolkit-validation error]

gpu-operator version: [screenshot]

libnvidia-ml.so host path: [screenshot]

exec validator container:

/usr/lib64/libnvidia-ml.so
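
A rough way to narrow this down is to check, on the host, whether the driver's NVML library is present and registered with the dynamic linker (the /usr/lib64 path is assumed from the output above; it varies by distro):

$ ldconfig -p | grep libnvidia-ml
$ ls -l /usr/lib64/libnvidia-ml.so*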

riteshsonawane1372 commented 2 months ago

Same issue with v23.9.1 on an RTX 3090. @elezar

riteshsonawane1372 commented 2 months ago

Also, setting DISABLE_DEV_CHAR_SYMLINK_CREATION to "true" doesn't work for me.