NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #463

Open captainsk7 opened 1 year ago

captainsk7 commented 1 year ago

1. Issue or feature description

I have created a multi-node k0s Kubernetes cluster by following this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu. I'm getting the error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.
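For context, this message generally means the pod requests the RuntimeClass handler "nvidia" while the node's containerd has no runtime registered under that name. A couple of hedged checks, assuming kubectl access to the cluster (the pod name below is a placeholder):

# RuntimeClasses known to the cluster; the GPU operator normally creates one named "nvidia"
kubectl get runtimeclass

# Events of the failing pod show which sandbox runtime it is asking for
kubectl -n gpu-operator describe pod <failing-pod-name>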

2. Steps to reproduce the issue

I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu

Download k0s binary

curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
chmod +x /tmp/k0s

Download k0sctl binary

curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
chmod +x /usr/local/bin/k0sctl
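As a quick sanity check, both binaries can be run before continuing (the versions are the ones downloaded above):

/tmp/k0s version       # should print v1.24.4+k0s.0
k0sctl version         # should print 0.13.2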

Then you need to create a k0sctl.yaml config file for the multi-node Kubernetes cluster.

k0sctl.yaml file

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  hosts:
    - role: controller
      localhost:
        enabled: true
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.134
        user: user
        keyPath: .ssh/id_rsa
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.133
        user: user
        keyPath: .ssh/id_rsa
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
  k0s:
    version: 1.24.4+k0s.0
    config:
      spec:
        network:
          provider: calico

/tmp/containerd.toml file (the src referenced above, uploaded to /etc/k0s/ on each host)

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

Then run the command: k0sctl apply --config /path/to/k0sctl.yaml
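After the apply completes, the kubeconfig can be exported so that kubectl and Helm can reach the cluster (a sketch; the output path is arbitrary):

k0sctl kubeconfig --config /path/to/k0sctl.yaml > ~/.kube/k0s-config
export KUBECONFIG=~/.kube/k0s-config
kubectl get nodes -o wide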

Deploy NVIDIA GPU Operator

values.yaml file

operator:
  defaultRuntime: containerd

toolkit:
  version: v1.10.0-ubuntu20.04
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

driver:
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.4.0
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
  repoConfig:
    configMapName: repo-config
  version: "495.29.05"

validator:
  version: "v1.11.0"

Install Helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Now, add the NVIDIA Helm repository:

helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update
helm install --wait --generate-name \
     nvidia/gpu-operator
helm upgrade --install \
   --namespace=gpu-operator \
   --create-namespace \
   --wait \
   --values=values.yaml \
   gpu-operator \
   nvidia/gpu-operator
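Once the chart is installed, a few hedged checks (assuming kubectl access, and crictl available on the worker nodes) help confirm whether the "nvidia" runtime was ever registered with the containerd instance that k0s runs:

# Operator and operand pods
kubectl -n gpu-operator get pods -o wide

# RuntimeClass created by the chart (its handler should be "nvidia")
kubectl get runtimeclass nvidia -o yaml

# On a GPU worker: ask the k0s containerd instance which runtimes it knows about,
# using the socket path from values.yaml above
sudo crictl --runtime-endpoint unix:///run/k0s/containerd.sock info | grep -A 6 '"nvidia"'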

3. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?

nvidia-smi output from the worker nodes, followed by the logs of the nvidia-container-toolkit pod:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

time="2023-01-10T11:51:53Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:51:53Z" level=info msg="Config version: 2"
time="2023-01-10T11:51:53Z" level=info msg="Updating config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully updated config"
time="2023-01-10T11:51:53Z" level=info msg="Flushing config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:51:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:51:53Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:51:53Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:51:53Z" level=info msg="Waiting for signal"

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1736      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+

nekwar commented 3 weeks ago

Hi @captainsk7! Have you managed to solve this issue?