Open hansesm opened 8 months ago
I had the same issue recently. Thanks for the workaround, @hansesm !
Hi @hansesm @pappacena, thanks for the extended bug report and the documented steps. How are the GPU drivers installed/built on the systems in question?
The gpu-operator will attempt to install the driver at /run/nvidia/driver if no driver is already loaded. The steps above look like an installation where the gpu-operator installed the driver, but you then switched to using the drivers from the host instead. The linked issue seems to describe the same problem.
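To see which of those two situations applies on a given node, a quick check on the host can help (a diagnostic sketch only, not specific to this bug):
# is a kernel driver already loaded on the host?
lsmod | grep -i nvidia
# did the gpu-operator install its own driver tree?
ls -la /run/nvidia/driver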
An easier approach, ensuring that the host driver is used (if available), would be to enable the addon like this, depending on your scenario:
# make sure that host drivers are used
microk8s enable nvidia --gpu-operator-driver=host
# make sure that the operator builds and installs the nvidia drivers
microk8s enable nvidia --gpu-operator-driver=operator
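To confirm which mode the operator actually ended up in, the driver setting can be read from the ClusterPolicy (a sketch; the jsonpath assumes the standard gpu-operator ClusterPolicy schema):
# "true" = the operator builds/installs the driver, "false" = host driver is used
microk8s kubectl get clusterpolicies -o jsonpath='{.items[0].spec.driver.enabled}'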
Hope this helps! Can you try this on a clean system and report back? Thanks!
Summary
The default GPU (NVIDIA) addon does not find the correct drivers, so containers crash with the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
What Should Happen Instead?
Everything should work after enabling the GPU addon with microk8s enable nvidia.
Reproduction Steps
microk8s enable nvidia
microk8s kubectl get pods --namespace gpu-operator-resources
microk8s kubectl describe pod nvidia-operator-validator-hxfbf -n gpu-operator-resources
nvidia-smi
ls -la /run/nvidia/driver
cat /etc/docker/daemon.json
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
microk8s inspect
microk8s kubectl describe clusterpolicies --all-namespaces
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
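The missing library can also be checked for directly on the host (a quick sanity check; libnvidia-ml.so.1 is normally shipped by the driver package):
# the NVML library should be registered with the dynamic linker
ldconfig -p | grep libnvidia-ml
# nvidia-smi links against it, so this should resolve as well
ldd "$(which nvidia-smi)" | grep libnvidia-ml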
Can you suggest a fix?
Changed values in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml:
root = "/run/nvidia/driver" to root = "/"
Added the nvidia runtime entry pointing to /usr/local/nvidia/toolkit/nvidia-container-runtime:
"runtimes": {
  "nvidia": {
    "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
    "runtimeArgs": []
  }
}
Added symlink:
ln -s /sbin /run/nvidia/driver/sbin
Restart MicroK8s:
microk8s stop
microk8s start
Then all containers start up correctly!
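For reference, the manual steps above can be scripted roughly like this (a sketch only; the sed pattern assumes the default config.toml layout, so double-check before running):
# point the container CLI at the host driver root instead of /run/nvidia/driver
sudo sed -i 's|root = "/run/nvidia/driver"|root = "/"|' \
  /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# make host binaries reachable under the driver root
sudo ln -s /sbin /run/nvidia/driver/sbin
# restart MicroK8s to pick up the changes
microk8s stop
microk8s start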
Best regards!
EDIT: Found the following upstream issue describing the same problem: https://github.com/NVIDIA/gpu-operator/issues/511