NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

no runtime for "nvidia" is configured #730

Open yanis-incepto opened 3 weeks ago

yanis-incepto commented 3 weeks ago

1. Quick Debug Information

2. Issue or feature description

kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
.... 
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

It looks like the runtime isn't found, even though the RuntimeClass exists:

kubectl get runtimeclasses.node.k8s.io                                               
NAME     HANDLER   AGE
nvidia   nvidia    7d1h 
kubectl describe runtimeclasses.node.k8s.io nvidia                                   
Name:         nvidia
Namespace:    
Labels:       app.kubernetes.io/component=gpu-operator
Annotations:  <none>
API Version:  node.k8s.io/v1
Handler:      nvidia
Kind:         RuntimeClass
Metadata:
  Creation Timestamp:  2024-05-27T08:53:18Z
  Owner References:
    API Version:           nvidia.com/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  ClusterPolicy
    Name:                  cluster-policy
    UID:                   2c237c3d-07eb-4856-8316-046489793e3d
  Resource Version:        265073642
  UID:                     26fd5054-7344-4e6d-9029-a610ae0df560
Events:                    <none>
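
The error itself comes from containerd on the node, not from the RuntimeClass object: the handler name nvidia has to match a runtime entry in the node's containerd configuration. A minimal check from the GPU node, assuming a standard containerd install, would be:

# Does containerd's resolved configuration define an "nvidia" runtime handler?
sudo containerd config dump | grep -A 3 'runtimes.nvidia'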

3. Steps to reproduce the issue

I installed the chart with helmfile

4. Information to attach (optional if deemed irrelevant)

kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE

 kubectl get pods -n gpu-operator                                                  
NAME                                                         READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-spbbk                                  0/1     Init:0/1   0          41s
gpu-operator-d97f85598-j7qt4                                 1/1     Running    0          7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8        1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj   1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-worker-jfdsh             1/1     Running    0          85s
nvidia-container-toolkit-daemonset-vb6qn                     0/1     Init:0/1   0          41s
nvidia-dcgm-exporter-9xmbm                                   0/1     Init:0/1   0          41s
nvidia-device-plugin-daemonset-w72xb                         0/1     Init:0/1   0          41s
nvidia-driver-daemonset-v4n96                                0/1     Running    0          73s
nvidia-operator-validator-vbq6v                              0/1     Init:0/4   0          41s

kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE

 kubectl get ds -n gpu-operator                                                       
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       7d
gpu-operator-node-feature-discovery-worker   1         1         1       1            1           instance-type=gpu                                                      6d20h
nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                           7d
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               7d
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                               7d
nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   7d
nvidia-driver-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                                      7d
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 7d
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                          7d

If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
.... 
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

kubectl exec nvidia-driver-daemonset-v4n96 -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
Mon Jun  3 10:01:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
elezar commented 3 weeks ago

@yanis-incepto the nvidia-container-toolkit-daemonset-vb6qn is stuck in the init state and has not yet configured the nvidia runtime in containerd. Could you provide the logs for the containers in this daemonset?
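
For reference, the pod has an init container (driver-validation) and a main container (nvidia-container-toolkit-ctr); both sets of logs can be pulled with:

# init container
kubectl logs nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator -c driver-validation
# main container
kubectl logs nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator -c nvidia-container-toolkit-ctr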

yanis-incepto commented 3 weeks ago

nvidia-container-toolkit is finally running after some time, but the other pods still show the error (and it never goes away; I tried leaving everything running for a few hours):

kubectl get pods -n gpu-operator                                                     
NAME                                                         READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-t4bv8                                  0/1     Init:0/1   0          10m
gpu-operator-d97f85598-j7qt4                                 1/1     Running    0          7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8        1/1     Running    0          6d21h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj   1/1     Running    0          6d21h
gpu-operator-node-feature-discovery-worker-fcwh7             1/1     Running    0          10m
nvidia-container-toolkit-daemonset-gn495                     1/1     Running    0          10m
nvidia-dcgm-exporter-wnhss                                   0/1     Init:0/1   0          10m
nvidia-device-plugin-daemonset-dwwqr                         0/1     Init:0/1   0          10m
nvidia-driver-daemonset-p47wp                                1/1     Running    0          10m
nvidia-operator-validator-zk4mv                              0/1     Init:0/4   0          10m

As for its logs: it looks like it's waiting for a signal:

kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-gn495                
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
time="2024-06-03T10:46:29Z" level=info msg="Parsing arguments"
time="2024-06-03T10:46:29Z" level=info msg="Starting nvidia-toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Verifying Flags"
time="2024-06-03T10:46:29Z" level=info msg=Initializing
time="2024-06-03T10:46:29Z" level=info msg="Installing toolkit"
time="2024-06-03T10:46:29Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2024-06-03T10:46:29Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2024-06-03T10:46:29Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
time="2024-06-03T10:46:29Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
time="2024-06-03T10:46:29Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.15.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.15.0'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2024-06-03T10:46:29Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2024-06-03T10:46:29Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2024-06-03T10:46:29Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2024-06-03T10:46:29Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2024-06-03T10:46:29Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc", "crun"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.cdi]
      annotation-prefixes = ["cdi.k8s.io/"]
      default-kind = "management.nvidia.com/gpu"
      spec-dirs = ["/etc/cdi", "/var/run/cdi"]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
  path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
  skip-mode-detection = true

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2024-06-03T10:46:29Z" level=info msg="Setting up runtime"
time="2024-06-03T10:46:29Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2024-06-03T10:46:29Z" level=info msg="Successfully parsed arguments"
time="2024-06-03T10:46:29Z" level=info msg="Starting 'setup' for containerd"
time="2024-06-03T10:46:29Z" level=info msg="Config file does not exist; using empty config"
time="2024-06-03T10:46:29Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2024-06-03T10:46:29Z" level=info msg="Sending SIGHUP signal to containerd"
time="2024-06-03T10:46:29Z" level=info msg="Successfully signaled containerd"
time="2024-06-03T10:46:29Z" level=info msg="Completed 'setup' for containerd"
time="2024-06-03T10:46:29Z" level=info msg="Waiting for signal"
elezar commented 3 weeks ago

Please restart the nvidia-operator-validator-zk4mv pod to start with. If this proceeds, then restart the other pods too.
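
For example, deleting it is enough; the DaemonSet will recreate the pod:

kubectl delete pod nvidia-operator-validator-zk4mv -n gpu-operator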

yanis-incepto commented 3 weeks ago

I just recreated the pod, still the same issue.

yanis-incepto commented 3 weeks ago

Is there a compatibility table for gpu-operator? Maybe the latest version is not compatible with Kubernetes 1.24.14?

Li357 commented 3 weeks ago

I had this issue when my /etc/containerd/config.toml was incorrect (it was missing the default runc entry). This is what it looks like now on each node:

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = "/usr/bin/runc"
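
If the containerd config is edited by hand like this, containerd only picks the change up on a restart; a minimal sequence, assuming containerd is managed by systemd on the node:

# apply the edited /etc/containerd/config.toml
sudo systemctl restart containerd
# then check whether the stuck operator pods recover
kubectl get pods -n gpu-operator
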
yanis-incepto commented 2 weeks ago

Hello, thanks for your help, but unfortunately I just tried this and it didn't work.