NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #549

Open · BartoszZawadzki opened this issue 1 year ago

BartoszZawadzki commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

1. Issue or feature description

I'm deploying gpu-operator from the Helm chart using ArgoCD in my Kubernetes cluster (1.23.17), which is built with kops on AWS infrastructure (not EKS).

I've been struggling with this for a while. I've used both Docker and containerd as the container runtime in this cluster; I'm currently running containerd v1.6.21.

After deploying the gpu-operator this is what is happening in the gpu-operator namespace:

NAME                                                         READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-jtgll                                  0/1     Init:0/1   0          11m
gpu-feature-discovery-m82hx                                  0/1     Init:0/1   0          11m
gpu-feature-discovery-rzkzj                                  0/1     Init:0/1   0          11m
gpu-operator-6489b6d9-d5smv                                  1/1     Running    0          11m
gpu-operator-node-feature-discovery-master-86dd7c646-6jvns   1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-5r7g6             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-5v7bn             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-6lzkk             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-7z6zw             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-8t9hk             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-b7k2t             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-fz7f2             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-hdp28             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-j9f45             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-rqx4l             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-svk5h             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-v6rx9             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-wd7h7             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-wqsp5             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-xf7m6             1/1     Running    0          11m
nvidia-container-toolkit-daemonset-26djz                     1/1     Running    0          11m
nvidia-container-toolkit-daemonset-72mvg                     1/1     Running    0          11m
nvidia-container-toolkit-daemonset-trk6f                     1/1     Running    0          11m
nvidia-dcgm-exporter-bpvks                                   0/1     Init:0/1   0          11m
nvidia-dcgm-exporter-cchvm                                   0/1     Init:0/1   0          11m
nvidia-dcgm-exporter-fd98x                                   0/1     Init:0/1   0          11m
nvidia-device-plugin-daemonset-fwwgr                         0/1     Init:0/1   0          11m
nvidia-device-plugin-daemonset-kblb6                         0/1     Init:0/1   0          11m
nvidia-device-plugin-daemonset-zlgdm                         0/1     Init:0/1   0          11m
nvidia-driver-daemonset-mg5g8                                1/1     Running    0          11m
nvidia-driver-daemonset-tschz                                1/1     Running    0          11m
nvidia-driver-daemonset-x285r                                1/1     Running    0          11m
nvidia-operator-validator-qjgsb                              0/1     Init:0/4   0          11m
nvidia-operator-validator-trlfn                              0/1     Init:0/4   0          11m
nvidia-operator-validator-vtkdz                              0/1     Init:0/4   0          11m

Getting into more detail on the pods that are stuck in the Init state: kubectl -n gpu-operator describe po gpu-feature-discovery-jtgll

Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               23m                   default-scheduler  Successfully assigned gpu-operator/gpu-feature-discovery-jtgll to ip-172-20-99-192.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  3m46s (x93 over 23m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

kubectl -n gpu-operator describe po nvidia-dcgm-exporter-bpvks

Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               24m                   default-scheduler  Successfully assigned gpu-operator/nvidia-dcgm-exporter-bpvks to ip-172-20-45-35.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  22m                   kubelet            Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
  Warning  FailedCreatePodSandBox  4m43s (x93 over 24m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

kubectl -n gpu-operator describe po nvidia-device-plugin-daemonset-fwwgr

Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               25m                  default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-fwwgr to ip-172-20-99-192.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  23m                  kubelet            Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
  Warning  FailedCreatePodSandBox  31s (x117 over 25m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

kubectl -n gpu-operator describe po nvidia-operator-validator-qjgsb

Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               26m                  default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-qjgsb to ip-172-20-99-192.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  80s (x117 over 26m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

And finally my ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  annotations:
    helm.sh/resource-policy: keep
  creationTimestamp: "2023-07-12T14:42:17Z"
  generation: 1
  labels:
    app.kubernetes.io/component: gpu-operator
    app.kubernetes.io/instance: gpu-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gpu-operator
    app.kubernetes.io/version: v23.3.2
    argocd.argoproj.io/instance: gpu-operator
    helm.sh/chart: gpu-operator-v23.3.2
  name: cluster-policy
  resourceVersion: "223035606"
  uid: 961e3b87-a5ff-47d9-944d-f9cca9e72fa9
spec:
  cdi:
    default: false
    enabled: false
  daemonsets:
    labels:
      app.kubernetes.io/managed-by: gpu-operator
      helm.sh/chart: gpu-operator-v23.3.2
    priorityClassName: system-node-critical
    rollingUpdate:
      maxUnavailable: "1"
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    updateStrategy: RollingUpdate
  dcgm:
    enabled: false
    hostPort: 5555
    image: dcgm
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: 3.1.7-1-ubuntu20.04
  dcgmExporter:
    enabled: true
    env:
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcp-metrics-included.csv
    image: dcgm-exporter
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/k8s
    serviceMonitor:
      additionalLabels: {}
      enabled: false
      honorLabels: false
      interval: 15s
    version: 3.1.7-3.1.4-ubuntu20.04
  devicePlugin:
    enabled: true
    env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v0.14.0-ubi8
  driver:
    certConfig:
      name: ""
    enabled: true
    image: driver
    imagePullPolicy: IfNotPresent
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: false
    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: 0s
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    rdma:
      enabled: false
      useHostMofed: false
    repoConfig:
      configMapName: ""
    repository: nvcr.io/nvidia
    startupProbe:
      failureThreshold: 120
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 60
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    usePrecompiled: false
    version: 525.105.17
    virtualTopology:
      config: ""
  gfd:
    enabled: true
    env:
    - name: GFD_SLEEP_INTERVAL
      value: 60s
    - name: GFD_FAIL_ON_INIT_ERROR
      value: "true"
    image: gpu-feature-discovery
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v0.8.0-ubi8
  mig:
    strategy: single
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    gpuClientsConfig:
      name: ""
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.5.2-ubuntu20.04
  nodeStatusExporter:
    enabled: false
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v23.3.2
  operator:
    defaultRuntime: containerd
    initContainer:
      image: cuda
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia
      version: 12.1.1-base-ubi8
    runtimeClass: nvidia
  psp:
    enabled: false
  sandboxDevicePlugin:
    enabled: true
    image: kubevirt-gpu-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v1.2.1
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    image: container-toolkit
    imagePullPolicy: IfNotPresent
    installDir: /usr/local/nvidia
    repository: nvcr.io/nvidia/k8s
    version: v1.13.0-ubuntu20.04
  validator:
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "true"
    repository: nvcr.io/nvidia/cloud-native
    version: v23.3.2
  vfioManager:
    driverManager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    enabled: true
    image: cuda
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: 12.1.1-base-ubi8
  vgpuDeviceManager:
    config:
      default: default
      name: ""
    enabled: true
    image: vgpu-device-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.2.1
  vgpuManager:
    driverManager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    enabled: false
    image: vgpu-manager
    imagePullPolicy: IfNotPresent
status:
  namespace: gpu-operator
  state: notReady

2. Steps to reproduce the issue

Deploy gpu-operator using Helm chart (23.3.2)
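For reference, the reporter deploys the chart through ArgoCD; a minimal sketch of the equivalent direct Helm install would be roughly the following (repo alias and release name are placeholders, the chart version is the one from this report):

# Add the NVIDIA Helm repository and install the chart version used here
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.3.2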

3. Information to attach (optional if deemed irrelevant)

BartoszZawadzki commented 1 year ago

Additional info: apart from switching the container runtime from Docker to containerd, I have also tried different gpu-operator settings (values), with CDI enabled/disabled, RDMA enabled/disabled, and others, all to no avail.

acesir commented 1 year ago

Did you ever figure this out, @BartoszZawadzki? I'm dealing with the same issue on EKS and Ubuntu.

BartoszZawadzki commented 1 year ago

No, but since I'm using kops I tried https://kops.sigs.k8s.io/gpu/ instead, and it worked out of the box.

sunhailin-Leo commented 1 year ago

I'm also running into this problem. How can it be solved?

shivamerla commented 1 year ago

failed to get sandbox runtime: no runtime for "nvidia" is a very generic error that happens when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.
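For example, checks along these lines (the -c container names are assumptions based on typical gpu-operator deployments and may differ between versions):

# Overall pod status, then driver and toolkit logs
kubectl -n gpu-operator get pods -o wide
kubectl -n gpu-operator logs ds/nvidia-driver-daemonset -c nvidia-driver-ctr
kubectl -n gpu-operator logs ds/nvidia-container-toolkit-daemonset -c nvidia-container-toolkit-ctr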

sunhailin-Leo commented 1 year ago

@shivamerla

shivamerla commented 1 year ago

No, Rocky Linux is not supported currently.

BartoszZawadzki commented 1 year ago

failed to get sandbox runtime: no runtime for "nvidia" is a very generic error that happens when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.

I have attached logs from all containers deployed via the gpu-operator Helm chart in the initial issue.

cwrau commented 1 year ago

We're running into the same problem: the gpu-feature-discovery, nvidia-operator-validator, nvidia-dcgm-exporter, and nvidia-device-plugin-daemonset pods are all failing to start with Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

nvidia-gpu-operator-node-feature-discovery-worker log

nvidia-driver log

nvidia-container-toolkit-daemonset log


EDIT: Our problem is this issue in containerd, which makes it impossible to additively use imports to configure containerd plugins. In our case we're configuring registry mirrors, which in turn completely overrides NVIDIA's runtime configuration. We're probably going to have to go the same route as NVIDIA, meaning we'd have to parse config.toml ourselves, add our config, and write it back.
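For anyone hitting the same thing, a quick way to check whether the toolkit's runtime entry actually survived in the merged containerd config is something like this (paths are assumptions based on the containerd and toolkit installDir defaults shown earlier in this thread):

# On a GPU node, look for the runtime entry the toolkit is supposed to write
grep -A4 'runtimes.nvidia' /etc/containerd/config.toml
# Expected shape, roughly:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
# If your own imports or mirror config replaces this section, kubelet reports
# exactly the "no runtime for \"nvidia\" is configured" error from this issue.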

waterfeeds commented 11 months ago

Hi, I ran into the same error once; here is my case for reference. A week ago I installed the NVIDIA driver, container toolkit, and device plugin manually to test GPU workloads. I run containerd as the runtime for kubelet on Ubuntu 22.04, and the CUDA tests worked. A few days ago I tried installing gpu-operator; before doing so I uninstalled the NVIDIA driver, toolkit, and device plugin and reverted /etc/containerd/config.toml. Then I got the same error as you. I read many old issues about this error and found a gpu-operator committer recommending lsmod | grep nvidia, which showed NVIDIA driver modules still loaded by the Ubuntu kernel, meaning the uninstall was incomplete. So I rebooted the host, after which lsmod | grep nvidia returned nothing. After that everything was fine and all the NVIDIA pods became Running. Hope this is useful to you!
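In other words, the check that helped here was simply (run on the GPU node itself):

# If this still lists nvidia modules after removing an old manual driver install,
# the uninstall was incomplete; reboot (or unload the modules) before installing gpu-operator
lsmod | grep nvidia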

ordinaryparksee commented 5 months ago

This problem may be caused by a failed symlink creation. I don't think it's the best approach, but you can work around the issue by disabling symlink creation.

First, check whether your problem comes from this situation: kubectl logs -f nvidia-container-toolkit-daemonset-j8wcf -n gpu-operator-resources -c driver-validation

If so, you will see an error message like the one below:

time="2024-06-19T07:21:42Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node

Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
The existence of these symlinks is required to address the following bug:

    https://github.com/NVIDIA/gpu-operator/issues/430

This bug impacts container runtimes configured with systemd cgroup management enabled.
To disable the symlink creation, set the following envvar in ClusterPolicy:

    validator:
      driver:
        env:
        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
          value: "true""

Now just follow that message.

To summarize:

  1. Open the ClusterPolicy with kubectl edit clusterpolicies.nvidia.com
  2. Find the validator: section and add the driver: block.

The result is:

  validator:
    driver:
      env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "false"
    repository: nvcr.io/nvidia/cloud-native
    version: v23.9.1
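If you prefer not to edit interactively, the same change can be applied with a patch along these lines (a sketch; the field path matches the YAML above and the ClusterPolicy name from the initial report):

# Merge-patch the validator.driver env var into the existing ClusterPolicy
kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'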

choucavalier commented 5 months ago

Hey guys, I have the exact same error as the one mentioned by @ordinaryparksee.

What's going on with these symlinks? I don't understand :/