NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #463

Open captainsk7 opened 1 year ago

captainsk7 commented 1 year ago

1. Issue or feature description

I have created a multi-node k0s Kubernetes cluster by following this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu. I'm getting the error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.
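For context, this message generally means the pod requests the RuntimeClass handler "nvidia" while the node's containerd has no runtime registered under that name. A couple of hedged checks, assuming kubectl access to the cluster (the pod name below is a placeholder):

# RuntimeClasses known to the cluster; the GPU operator normally creates one named "nvidia"
kubectl get runtimeclass

# Events of the failing pod show which sandbox runtime it is asking for
kubectl -n gpu-operator describe pod <failing-pod-name>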

2. Steps to reproduce the issue

I have followed this blog https://www.padok.fr/en/blog/k0s-kubernetes-gpu

Download k0s binary

curl -L "https://github.com/k0sproject/k0s/releases/download/v1.24.4%2Bk0s.0/k0s-v1.24.4+k0s.0-amd64" -o /tmp/k0s
chmod +x /tmp/k0s

Download k0sctl binary

curl -L "https://github.com/k0sproject/k0sctl/releases/download/v0.13.2/k0sctl-linux-x64" -o /usr/local/bin/k0sctl
chmod +x /usr/local/bin/k0sctl
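As a quick sanity check, both binaries can be run before continuing (the versions are the ones downloaded above):

/tmp/k0s version       # should print v1.24.4+k0s.0
k0sctl version         # should print 0.13.2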

Then you need to create a k0sctl.yaml config file for the multi-node Kubernetes cluster.

k0sctl.yaml file

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  hosts:
    - role: controller
      localhost:
        enabled: true
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.134
        user: user
        keyPath: .ssh/id_rsa
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
    - role: worker
      ssh:
        address: 43.88.62.133
        user: user
        keyPath: .ssh/id_rsa
      files:
      - name: containerd-config
        src: /tmp/containerd.toml
        dstDir: /etc/k0s/
        perm: "0755"
        dirPerm: null
  k0s:
    version: 1.24.4+k0s.0
    config:
      spec:
        network:
          provider: calico

/tmp/containerd.toml file (the src referenced above, uploaded to /etc/k0s/ on each host)

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

Then run the command: k0sctl apply --config /path/to/k0sctl.yaml
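After the apply completes, the kubeconfig can be exported so that kubectl and Helm can reach the cluster (a sketch; the output path is arbitrary):

k0sctl kubeconfig --config /path/to/k0sctl.yaml > ~/.kube/k0s-config
export KUBECONFIG=~/.kube/k0s-config
kubectl get nodes -o wide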

Deploy NVIDIA GPU Operator

values.yaml file

operator:
  defaultRuntime: containerd

toolkit:
  version: v1.10.0-ubuntu20.04
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

driver:
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.4.0
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
  repoConfig:
    configMapName: repo-config
  version: "495.29.05"

validator:
  version: "v1.11.0"

Install Helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Now, add the NVIDIA Helm repository:

helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update
helm install --wait --generate-name \
     nvidia/gpu-operator
helm upgrade --install \
   --namespace=gpu-operator \
   --create-namespace \
   --wait \
   --values=values.yaml \
   gpu-operator \
   nvidia/gpu-operator
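Once the chart is installed, a few hedged checks (assuming kubectl access, and crictl available on the worker nodes) help confirm whether the "nvidia" runtime was ever registered with the containerd instance that k0s runs:

# Operator and operand pods
kubectl -n gpu-operator get pods -o wide

# RuntimeClass created by the chart (its handler should be "nvidia")
kubectl get runtimeclass nvidia -o yaml

# On a GPU worker: ask the k0s containerd instance which runtimes it knows about,
# using the socket path from values.yaml above
sudo crictl --runtime-endpoint unix:///run/k0s/containerd.sock info | grep -A 6 '"nvidia"'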

3. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?

nvidia-smi output from the worker nodes, followed by the logs of the nvidia-container-toolkit pod:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

time="2023-01-10T11:51:53Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:51:53Z" level=info msg="Config version: 2"
time="2023-01-10T11:51:53Z" level=info msg="Updating config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully updated config"
time="2023-01-10T11:51:53Z" level=info msg="Flushing config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:51:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:51:53Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:51:53Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:51:53Z" level=info msg="Waiting for signal"

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1736      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+

nekwar commented 3 weeks ago

Hi @captainsk7! Have you managed to solve this issue?