NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

no runtime for "nvidia" is configured #662

Open joshuacox opened 7 months ago

joshuacox commented 7 months ago

1. Issue or feature description

When following the quickstart I end up with this error in k describe po -n gpu-operator gpu-feature-discovery-6tk4h

Warning FailedCreatePodSandBox 0s (x5 over 49s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
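
For reference, the error comes from containerd inside the kind node, so one quick check is whether that node's containerd config has an "nvidia" runtime registered at all. A minimal sketch, assuming kind's default <cluster-name>-control-plane node naming:

# list the node containers kind created for this cluster
docker ps --filter name=bionic-gpt-cluster

# look for a registered "nvidia" runtime inside the node; no match is
# consistent with the FailedCreatePodSandBox error above
docker exec -it bionic-gpt-cluster-control-plane \
  grep -A3 'runtimes.nvidia' /etc/containerd/config.toml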

2. Steps to reproduce the issue

#!/bin/bash
kind delete cluster --name bionic-gpt-cluster
kind create cluster --name bionic-gpt-cluster --config=kind-config.yaml
kind export kubeconfig --name bionic-gpt-cluster
# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false
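
After the install, a quick sanity check is whether the operator pods come up and whether the node advertises any GPU resource (a sketch; nvidia.com/gpu is the resource name the operator exposes by default):

kubectl get pods -n gpu-operator
kubectl describe node | grep -i 'nvidia.com/gpu'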

3. Information to attach (optional if deemed irrelevant)

with my kind-config.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # If we don't do this, then we can't connect on linux
  apiServerAddress: "0.0.0.0"
kubeadmConfigPatchesJSON6902:
- group: kubeadm.k8s.io
  version: v1beta3
  kind: ClusterConfiguration
  patch: |
    - op: add
      path: /apiServer/certSANs/-
      value: host.docker.internal
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"

Common error checking:

and docker run --rm nvidia/cuda:12.3.1-devel-centos7 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Sun Jan 21 20:24:49 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8     8W / 220W |    100MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

and /etc/docker/daemon.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

and /etc/containerd/config.toml

disabled_plugins = ["cri"]
version = 1

[plugins]

  [plugins.cri]

    [plugins.cri.containerd]
      default_runtime_name = "nvidia"

      [plugins.cri.containerd.runtimes]

        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "/usr/bin/nvidia-container-runtime"

and the device plugin logs:

I0121 20:28:50.870066       1 main.go:154] Starting FS watcher.
I0121 20:28:50.870195       1 main.go:161] Starting OS watcher.
I0121 20:28:50.870674       1 main.go:176] Starting Plugins.
I0121 20:28:50.870703       1 main.go:234] Loading configuration.
I0121 20:28:50.870918       1 main.go:242] Updating config with default resource matching patterns.
I0121 20:28:50.871290       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0121 20:28:50.871307       1 main.go:256] Retreiving plugins.
W0121 20:28:50.871782       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0121 20:28:50.871846       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0121 20:28:50.871896       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0121 20:28:50.871903       1 factory.go:115] Incompatible platform detected
E0121 20:28:50.871909       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0121 20:28:50.871914       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0121 20:28:50.871920       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0121 20:28:50.871925       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0121 20:28:50.871934       1 main.go:287] No devices found. Waiting indefinitely.
sudo journalctl -r -u kubelet
-- No entries --

Additional information that might help better understand your environment and reproduce the bug:

docker version
Client: Docker Engine - Community
 Version:           25.0.0
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        e758fe5
 Built:             Thu Jan 18 17:09:59 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.0
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       615dfdf
  Built:            Thu Jan 18 17:09:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 nvidia:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml

and the helm below fails as well:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false

uname -a
Linux saruman 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

none that I see?

sudo dmesg |grep -i nvidia
[    2.829492] nvidia: loading out-of-tree module taints kernel.
[    2.829501] nvidia: module license 'NVIDIA' taints kernel.
[    2.846803] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    2.961803] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[    2.962598] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.011519] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
[    3.017901] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input8
[    3.139762] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
[    3.246519] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    3.246521] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[    3.288796] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input9
[    3.288989] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input10
[    3.328821] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[    4.018783] audit: type=1400 audit(1705866938.070:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=774 comm="apparmor_parser"
[    4.019493] audit: type=1400 audit(1705866938.070:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=774 comm="apparmor_parser"
[ 1754.666104] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1754.677753] nvidia-uvm: Loaded the UVM driver, major device number 237.
dpkg -l |grep -i nvidia 
ii  firmware-nvidia-gsp                     525.147.05-4~deb12u1                    amd64        NVIDIA GSP firmware
ii  glx-alternative-nvidia                  1.2.2                                   amd64        allows the selection of NVIDIA as GLX provider
ii  libcuda1:amd64                          525.147.05-4~deb12u1                    amd64        NVIDIA CUDA Driver Library
ii  libegl-nvidia0:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA binary EGL library
ii  libgl1-nvidia-glvnd-glx:amd64           525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
ii  libgles-nvidia1:amd64                   525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                   525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL|ES 2.x library
ii  libglx-nvidia0:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA binary GLX library
ii  libnvcuvid1:amd64                       525.147.05-4~deb12u1                    amd64        NVIDIA CUDA Video Decoder runtime library
ii  libnvidia-allocator1:amd64              525.147.05-4~deb12u1                    amd64        NVIDIA allocator runtime library
ii  libnvidia-cfg1:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-container-tools               1.14.3-1                                amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64              1.14.3-1                                amd64        NVIDIA container runtime library
ii  libnvidia-egl-gbm1:amd64                1.1.0-2                                 amd64        GBM EGL external platform library for NVIDIA
ii  libnvidia-egl-wayland1:amd64            1:1.1.10-1                              amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-eglcore:amd64                 525.147.05-4~deb12u1                    amd64        NVIDIA binary EGL core libraries
ii  libnvidia-encode1:amd64                 525.147.05-4~deb12u1                    amd64        NVENC Video Encoding runtime library
ii  libnvidia-glcore:amd64                  525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL/GLX core libraries
ii  libnvidia-glvkspirv:amd64               525.147.05-4~deb12u1                    amd64        NVIDIA binary Vulkan Spir-V compiler library
ii  libnvidia-ml1:amd64                     525.147.05-4~deb12u1                    amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64         525.147.05-4~deb12u1                    amd64        NVIDIA PTX JIT Compiler library
ii  libnvidia-rtcore:amd64                  525.147.05-4~deb12u1                    amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
ii  nvidia-alternative                      525.147.05-4~deb12u1                    amd64        allows the selection of NVIDIA as GLX provider
ii  nvidia-container-toolkit                1.14.3-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base           1.14.3-1                                amd64        NVIDIA Container Toolkit Base
ii  nvidia-driver                           525.147.05-4~deb12u1                    amd64        NVIDIA metapackage
ii  nvidia-driver-bin                       525.147.05-4~deb12u1                    amd64        NVIDIA driver support binaries
ii  nvidia-driver-libs:amd64                525.147.05-4~deb12u1                    amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii  nvidia-egl-common                       525.147.05-4~deb12u1                    amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA EGL installable client driver (ICD)
ii  nvidia-installer-cleanup                20220217+3~deb12u1                      amd64        cleanup after driver installation with the nvidia-installer
ii  nvidia-kernel-common                    20220217+3~deb12u1                      amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                      525.147.05-4~deb12u1                    amd64        NVIDIA binary kernel module DKMS source
ii  nvidia-kernel-support                   525.147.05-4~deb12u1                    amd64        NVIDIA binary kernel module support files
ii  nvidia-legacy-check                     525.147.05-4~deb12u1                    amd64        check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-modprobe                         535.54.03-1~deb12u1                     amd64        utility to load NVIDIA kernel modules and create device nodes
ii  nvidia-persistenced                     525.85.05-1                             amd64        daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-settings                         525.125.06-1~deb12u1                    amd64        tool for configuring the NVIDIA graphics driver
ii  nvidia-smi                              525.147.05-4~deb12u1                    amd64        NVIDIA System Management Interface
ii  nvidia-support                          20220217+3~deb12u1                      amd64        NVIDIA binary graphics driver support files
ii  nvidia-vdpau-driver:amd64               525.147.05-4~deb12u1                    amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common                    525.147.05-4~deb12u1                    amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                 525.147.05-4~deb12u1                    amd64        NVIDIA Vulkan installable client driver (ICD)
ii  xserver-xorg-video-nvidia               525.147.05-4~deb12u1                    amd64        NVIDIA binary Xorg driver

nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

the above page no longer exists.

sudo journalctl -u nvidia-container-toolkit
-- No entries --

joshuacox commented 7 months ago

Of note, I have also tried without KinD and instead using k0s with the exact same result.

elezar commented 7 months ago

Could you confirm that you're able to run nvidia-smi in the Kind worker node?

joshuacox commented 7 months ago

I can confirm that it does not run inside kind:

on the bare metal:

nvidia-smi
Tue Jan 23 17:10:33 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   50C    P8    12W / 220W |    260MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3883      G   /usr/lib/xorg/Xorg                146MiB |
|    0   N/A  N/A      4041      G   /usr/bin/gnome-shell               67MiB |
|    0   N/A  N/A      6091      G   /usr/bin/nautilus                  16MiB |
|    0   N/A  N/A     78264      G   ...b/firefox-esr/firefox-esr       10MiB |
|    0   N/A  N/A    702357      G   vlc                                 6MiB |
+-----------------------------------------------------------------------------+

from a container inside of k0s:

k logs nv-5dc699dbc6-xwhwt

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found

and from inside kind:

k logs nv-5df8456f86-9gkwf

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found

with this as my deployment:

cat nv-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: ./kompose convert -f docker-compose.yml
    kompose.version: 1.22.0 (955b78124)
  labels:
    io.kompose.service: nv
  name: nv
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: nv
  template:
    metadata:
      annotations:
        kompose.cmd: ./kompose convert -f docker-compose.yml
        kompose.version: 1.22.0 (955b78124)
      labels:
        io.kompose.network/noworky-default: "true"
        io.kompose.service: nv
    spec:
      containers:
        - args:
            - nvidia-smi
          image: nvidia/cuda:12.3.1-devel-centos7
          name: nv
      restartPolicy: Always

klueska commented 7 months ago

What are you doing to inject GPU support into the docker container that kind starts to represent the k8s node?

Something like this is necessary: https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275

Example: https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/clusters/kind/scripts/kind-cluster-config.yaml#L52

joshuacox commented 7 months ago

Using the example config you supplied I get the same results:

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: nvidia-smi: not found

I forgot to include that config file:

/etc/nvidia-container-runtime/config.toml

accept-nvidia-visible-devices-as-volume-mounts = true
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

joshuacox commented 7 months ago

I even gave that create-cluster.sh script a try:

+++ local 'value=VERSION  ?= v0.1.0'
+++ echo v0.1.0
++ DRIVER_IMAGE_VERSION=v0.1.0
++ : k8s-dra-driver
++ : ubuntu20.04
++ : v0.1.0
++ : nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0
++ : v1.27.1
++ : k8s-dra-driver-cluster
++ : /home/thoth/k8s-dra-driver/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : v20230515-01914134-containerd_v1.7.1
++ : gcr.io/k8s-staging-kind/base:v20230515-01914134-containerd_v1.7.1
++ : kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1
+ kind create cluster --retain --name k8s-dra-driver-cluster --image kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 --config /home/thoth/k8s-dra-driver/demo/clusters/kind/scripts/kind-cluster-config.yaml
Creating cluster "k8s-dra-driver-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1) 🖼
 ✓ Preparing nodes 📦 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining worker nodes 🚜 
Set kubectl context to "kind-k8s-dra-driver-cluster"
You can now use your cluster with:

kubectl cluster-info --context kind-k8s-dra-driver-cluster

Thanks for using kind! 😊
+ docker exec -it k8s-dra-driver-cluster-worker umount -R /proc/driver/nvidia
++ docker images --filter reference=nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0 -q
+ EXISTING_IMAGE_ID=
+ '[' '' '!=' '' ']'
+ set +x
Cluster creation complete: k8s-dra-driver-cluster

Same results though.

joshuacox commented 7 months ago

appears to be the same issue here https://github.com/NVIDIA/k8s-device-plugin/issues/478

klueska commented 7 months ago

Backing up… what about running with GPUs under Docker in general (i.e. without kind)?

docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi

If things are not configured properly to have that work, then kind will not work either.

klueska commented 7 months ago

To be clear, that will work so long as accept-nvidia-visible-devices-as-volume-mounts = false

Once that is configured to true you would need to run:

docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi

joshuacox commented 7 months ago

Both seem to work:

docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi                                                                                        24-01-23 - 22:08:54
Wed Jan 24 04:09:07 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P8    12W / 220W |    156MiB /  8192MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
(base) 
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi                                                                   24-01-23 - 22:09:08
Wed Jan 24 04:09:15 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P8    12W / 220W |    156MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
(base) 
grep accept-nvidia-visible-devices-as /etc/nvidia-container-runtime/config.toml                                                    
accept-nvidia-visible-devices-as-volume-mounts = true

klueska commented 7 months ago

OK. That’s encouraging.

So you're saying that even with that configured properly, if you run the cluster-create.sh script from the k8s-dra-driver repo, docker exec into the worker node created by kind, and run nvidia-smi, it doesn't work?

joshuacox commented 7 months ago

well at the moment ./create-cluster.sh ends with this error:

+ kind load docker-image --name k8s-dra-driver-cluster nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0
Image: "nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0" with ID "sha256:9c74ea73db6f97a5e7287e11888757504b1e5ecfde4d2e5aa8396a25749ae046" not yet present on node "k8s-dra-driver-cluster-control-plane", loading...
Image: "nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0" with ID "sha256:9c74ea73db6f97a5e7287e11888757504b1e5ecfde4d2e5aa8396a25749ae046" not yet present on node "k8s-dra-driver-cluster-worker", loading...
ERROR: failed to load image: command "docker exec --privileged -i k8s-dra-driver-cluster-control-plane ctr --namespace=k8s.io images import --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: unpacking nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0 (sha256:e9df1b5622ca4f042dcff02f580a0a18ecad4b740fe639df2349c55067ef35b7)...time="2024-01-24T04:21:59Z" level=info msg="apply failure, attempting cleanup" error="wrong diff id calculated on extraction \"sha256:f344b08ff6c5121d786112e0f588c627da349e4289e409d1fde1b3ad8845fa66\"" key="extract-191866144-_8aF sha256:6c3e7df31590f02f10cb71fc4eb27653e9b428df2e6e5421a455b062bd2e39f9"
ctr: wrong diff id calculated on extraction "sha256:f344b08ff6c5121d786112e0f588c627da349e4289e409d1fde1b3ad8845fa66"

and ./install-dra-driver.sh now fails with:

+ kubectl label node k8s-dra-driver-cluster-control-plane --overwrite nvidia.com/dra.controller=true
node/k8s-dra-driver-cluster-control-plane labeled
+ helm upgrade -i --create-namespace --namespace nvidia-dra-driver nvidia /home/thoth/k8s-dra-driver/deployments/helm/k8s-dra-driver --wait
Release "nvidia" does not exist. Installing it now.
Error: client rate limiter Wait returned an error: context deadline exceeded

the build is successful from: ./build-dra-driver.sh

so I'm kind of confused about what is wrong.

I tried doing an equivalent ctr run with:

sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi

but it is just hanging here with no output.

joshuacox commented 7 months ago

I figured out the equivalent ctr command (I was missing the container ID, nvidiacontainer, above):

sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidiacontainer nvidia-smi
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/bin/nvidia-smi": stat /usr/bin/nvidia-smi: no such file or directory: unknown

in comparison to the docker:

docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
Wed Jan 24 05:23:32 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   50C    P8    12W / 220W |    117MiB /  8192MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

though I'm not sure why that file exists in the docker run case but not in the ctr one:

docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 which nvidia-smi
/usr/bin/nvidia-smi
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi500  which nvidia-smi
sudo ctr run --env NVIDIA_VISIBLE_DEVICES=all docker.io/library/ubuntu:22.04 nvidia-smi501 ls /usr/bin/nvidia-smi
ls: cannot access '/usr/bin/nvidia-smi': No such file or directory

probably some magic I'm unaware of.

klueska commented 7 months ago

ctr does not use the nvidia-container-runtime even if you have configured the CRI plugin in the containerd config to use it. The ctr command does not use CRI, so it would need to be configured separately to use the nvidia runtime (though that wouldn't help with your current problem of getting k8s to work anyway, since k8s does communicate with containerd over CRI).
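
For completeness, a sketch of how ctr could be pointed at the NVIDIA runtime explicitly, bypassing the CRI config entirely (the container name nvidia-smi-test is arbitrary, and the toolkit's default install path is assumed):

sudo ctr run --rm \
  --runc-binary /usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/library/ubuntu:22.04 nvidia-smi-test nvidia-smi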

elezar commented 7 months ago

Since I don't have k0s experience, let's start out assuming that your goal is to install the GPU Operator in a Kind cluster with GPU support. This involves two stages:

  1. Starting a kind cluster with GPUs and the driver injected
  2. Installing the GPU Operator in this cluster.

I've tried to provide more details for each of the stages below. In order to get to the bottom of this issue we would need to identify which of these is not working as expected. Once we've run through the steps for kind it may be possible to map the steps to something like k0s.

Note that as prerequisites:

  1. the CUDA driver needs to be installed on the host. Since you're able to run nvidia-smi there, that seems to already be the case.
  2. The NVIDIA Container Toolkit needs to be installed on the host. The latest release (v1.14.4) is recommended.
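
A quick way to verify both prerequisites on the host is sketched below (the grep is only there to confirm Docker has the nvidia runtime registered):

nvidia-smi -L                  # driver installed and GPUs visible
nvidia-ctk --version           # container toolkit installed (v1.14.4 recommended above)
docker info | grep -i 'runtimes\|default runtime'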

Starting a kind cluster with GPUs and drivers injected.

This needs to be set up as described in https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275

This means that we need to do the following:

In order to verify that the nodes have the GPU devices and Driver installed correctly one can exec into the Kind worker node and run nvidia-smi:

docker exec -ti <node-cluster> nvidia-smi -L

This should give the same output as on the host. I noted in your example that you are starting a single node Kind cluster. This should not affect the behaviour, but is a difference between our cluster definitions and the ones that you use.

Installing the GPU Operator on the Kind cluster

At this point, the Kind cluster represents a k8s cluster with only the GPU Driver installed. Even though the NVIDIA Container Toolkit is installed on the host, it has not been injected into the nodes.

This means that we should do one of the following:

For the Kind demo included in this repo, we don't use the GPU operator and as such we install the container toolkit when creating the cluster: https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L38-L47
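
In spirit, that step boils down to something like the following, run inside the kind worker (a sketch based on the commands used later in this thread; the node name and the use of the experimental package list are assumptions):

docker exec -it k8s-dra-driver-cluster-worker bash -c '
  apt-get update && apt-get install -y gpg curl
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -sL https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list \
    | sed "s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g" \
    > /etc/apt/sources.list.d/nvidia-container-toolkit.list
  apt-get update && apt-get install -y nvidia-container-toolkit
  nvidia-ctk runtime configure --runtime=containerd --set-as-default
  systemctl restart containerd
'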

Note that the Kind nodes themselves are effectively Debian nodes and are not officially supported. Most of this might be due to driver container limitations and may not be applicable in this case, since we are dealing with a preinstalled driver.

joshuacox commented 7 months ago

on the host:

 nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.14.4
commit: d167812ce3a55ec04ae2582eff1654ec812f42e1

cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

 cat /etc/nvidia-container-runtime/config.toml 
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

docker exec -it 3251f /bin/bash                                                                                                                                                                                        ✭
root@k8s-dra-driver-cluster-worker:/# nvidia-smi
Wed Jan 24 15:10:56 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   47C    P8    12W / 220W |    169MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

root@k8s-dra-driver-cluster-worker:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-b83f1b66-74d7-a38e-932e-ef815cb45105)

However I seem to be stuck on the install inside the worker:

root@k8s-dra-driver-cluster-worker:/# apt-get install -y nvidia-container-toolkit     
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
nvidia-container-toolkit is already the newest version (1.15.0~rc.1-1).
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 nvidia-container-toolkit : Depends: nvidia-container-toolkit-base (= 1.15.0~rc.1-1) but it is not going to be installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
root@k8s-dra-driver-cluster-worker:/# apt-get install -y nvidia-container-toolkit-base
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  nvidia-container-toolkit-base
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
3 not fully installed or removed.
Need to get 2361 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Get:1 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64  nvidia-container-toolkit-base 1.15.0~rc.1-1 [2361 kB]
Fetched 2361 kB in 0s (10.6 MB/s)                  
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 11315 files and directories currently installed.)
Preparing to unpack .../nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.15.0~rc.1-1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Errors were encountered while processing:
 /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
root@k8s-dra-driver-cluster-worker:/# apt --fix-broken install
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Correcting dependencies... Done
The following additional packages will be installed:
  nvidia-container-toolkit-base
The following NEW packages will be installed:
  nvidia-container-toolkit-base
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
3 not fully installed or removed.
Need to get 2361 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64  nvidia-container-toolkit-base 1.15.0~rc.1-1 [2361 kB]
Fetched 2361 kB in 0s (11.9 MB/s)                  
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 11315 files and directories currently installed.)
Preparing to unpack .../nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.15.0~rc.1-1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link
Errors were encountered while processing:
 /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

joshuacox commented 7 months ago

of note, I am using the kind cluster config from this repo:

https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/clusters/kind/scripts/kind-cluster-config.yaml#L52

so no longer single-node.

elezar commented 7 months ago

For "reasons" we were injecting the /usr/bin/nvidia-ctk binary from the host into the container for the k8s-dra-driver. This is what is causing:

dpkg: error processing archive /var/cache/apt/archives/nvidia-container-toolkit-base_1.15.0~rc.1-1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-ctk' before installing new version: Invalid cross-device link

Remove the lines here in the kind cluster config. (Or unmount /usr/bin/nvidia-ctk before trying to install the toolkit).
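
In other words, something like the following inside the worker node, before the apt-get install (worker name assumed from the dra-driver scripts):

docker exec -it k8s-dra-driver-cluster-worker umount /usr/bin/nvidia-ctk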

I have an open action item to improve the installation of the toolkit in the DRA driver repo, but have not gotten around to it.

joshuacox commented 7 months ago

So unmounting /usr/bin/nvidia-ctk fixed the apt issues, and I can install nvidia-container-toolkit just fine, but that doesn't solve the problem: the nvidia-device-plugin-daemonset still seems unable to see the GPU.

k logs -n kube-system nvidia-device-plugin-daemonset-d82pg
I0125 03:54:44.043725       1 main.go:154] Starting FS watcher.
I0125 03:54:44.043771       1 main.go:161] Starting OS watcher.
I0125 03:54:44.043840       1 main.go:176] Starting Plugins.
I0125 03:54:44.043849       1 main.go:234] Loading configuration.
I0125 03:54:44.043895       1 main.go:242] Updating config with default resource matching patterns.
I0125 03:54:44.043975       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0125 03:54:44.043979       1 main.go:256] Retreiving plugins.
W0125 03:54:44.044136       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0125 03:54:44.044156       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0125 03:54:44.044172       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0125 03:54:44.044174       1 factory.go:115] Incompatible platform detected
E0125 03:54:44.044176       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0125 03:54:44.044178       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0125 03:54:44.044179       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0125 03:54:44.044181       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0125 03:54:44.044185       1 main.go:287] No devices found. Waiting indefinitely.

elezar commented 7 months ago

@joshuacox is containerd in the Kind node configured to use the nvidia runtime? In addition, if you don't set it to be the default, you will have to add a RuntimeClass and specify it when installing the plugin.

See https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L50-L55 where we do this for the device plugin.

If you're installing the GPU Operator with --set toolkit.enabled=true this should be taken care of for you.
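
For the manual route, a minimal sketch (assuming the toolkit is already installed inside the kind worker, and that the RuntimeClass handler is named nvidia, which is the conventional name):

# register the nvidia runtime with containerd inside the node and make it the default
docker exec -it k8s-dra-driver-cluster-worker \
  nvidia-ctk runtime configure --runtime=containerd --set-as-default
docker exec -it k8s-dra-driver-cluster-worker systemctl restart containerd

# if nvidia is not made the default runtime, expose it via a RuntimeClass instead
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF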

joshuacox commented 7 months ago

I am just fine with setting toolkit.enabled=true or any other flags; I just want it to work.

Seems to be getting closer; do I need to umount another symlink here?

k logs -ngpu-operator nvidia-operator-validator-j6hfp -c driver-validation        
time="2024-01-25T10:46:28Z" level=info msg="version: 8072420d"
time="2024-01-25T10:46:28Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Thu Jan 25 10:46:28 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P8    11W / 220W |    152MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
time="2024-01-25T10:46:28Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-01-25T10:46:28Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.1.0-17-amd64\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

that was from ./create-cluster.sh (in /k8s-dra-driver/demo/clusters/kind)

with this afterwards:

#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster

docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "umount /usr/bin/nvidia-ctk && apt-get update && apt-get install -y gpg && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit && nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && systemctl restart containerd" 

helm install \
     --wait \
     --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=true \
      --set toolkit.enabled=true

elezar commented 7 months ago

This issue is probably due to the symlink creation not working under kind. Please update the environment for the validator in the ClusterPolicy to disable the creation of symlinks as described in the error message.

See also https://github.com/NVIDIA/gpu-operator/issues/567
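
If the operator is already installed, one option is to patch the ClusterPolicy in place rather than reinstalling (a sketch; the resource is conventionally named cluster-policy, which matches the ownerReferences shown later in this thread):

kubectl patch clusterpolicy/cluster-policy --type merge -p '
spec:
  validator:
    driver:
      env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
'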

joshuacox commented 7 months ago

Environment for the validator in ClusterPolicy?

I have a tiny section of the daemonset that references a ClusterPolicy:

k get daemonset -n gpu-operator nvidia-operator-validator -o yaml|grep -C10 -i clusterpolicy 
    manager: kube-controller-manager
    operation: Update
    subresource: status
    time: "2024-01-25T15:25:42Z"
  name: nvidia-operator-validator
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: 1c2e2c3d-b21e-4767-8dd7-18c1535552de
  resourceVersion: "23601"
  uid: 30f847a6-654e-4136-b362-f912eb344d4c
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-operator-validator
      app.kubernetes.io/part-of: gpu-operator

All of this seems way beyond the documentation. @elezar is this because, as you said, Debian nodes "are not officially supported"? If so, what nodes are supported? On this page:

https://nvidia.github.io/libnvidia-container/stable/deb/

it says:

ubuntu18.04, ubuntu20.04, ubuntu22.04, debian10, debian11

So is this all because my host OS is Debian 12?

klueska commented 7 months ago

It just means when you start the operator, additionally pass:

--set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION"
--set validator.driver.env[0].value="true"

joshuacox commented 7 months ago

Error: INSTALLATION FAILED: 1 error occurred:
        * ClusterPolicy.nvidia.com "cluster-policy" is invalid: spec.validator.driver.env[0].value: Invalid value: "boolean": spec.validator.driver.env[0].value in body must be of type string: "boolean"

I also tried removing the quotes around true to match my other set lines, and got the exact same results.

#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster

docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "umount /usr/bin/nvidia-ctk && apt-get update && apt-get install -y gpg && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit && nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && systemctl restart containerd"

helm install \
     --wait \
     --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=true \
     --set toolkit.enabled=true \
     --set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
     --set validator.driver.env[0].value=true

I am also not seeing a validator section in the values.yaml:

https://github.com/NVIDIA/k8s-device-plugin/blob/v0.14.3/deployments/helm/nvidia-device-plugin/values.yaml

Am I looking in the wrong place?

klueska commented 7 months ago

use --set-string

not all possible values are shown in the top-level values.yaml

joshuacox commented 7 months ago

omg @klueska that one works!

kgp -n gpu-operator                                                                                                                                                                      ✭
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-jkfwl                                       1/1     Running     0          3m59s
gpu-operator-1706209589-node-feature-discovery-gc-7ccd95f7qcvpg   1/1     Running     0          4m12s
gpu-operator-1706209589-node-feature-discovery-master-7cdfmh5zt   1/1     Running     0          4m12s
gpu-operator-1706209589-node-feature-discovery-worker-wcwsp       1/1     Running     0          4m12s
gpu-operator-1706209589-node-feature-discovery-worker-xdcxd       1/1     Running     0          4m12s
gpu-operator-c4fd7b4b7-rv28r                                      1/1     Running     0          4m12s
nvidia-container-toolkit-daemonset-n994z                          1/1     Running     0          3m59s
nvidia-cuda-validator-76zm5                                       0/1     Completed   0          3m42s
nvidia-dcgm-exporter-b6cs5                                        1/1     Running     0          3m59s
nvidia-device-plugin-daemonset-4mbb2                              1/1     Running     0          3m59s
nvidia-operator-validator-z26kp                                   1/1     Running     0          3m59s
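
As an extra sanity check, one way to confirm that the node now advertises an allocatable GPU is:

# The GPU column should show the number of GPUs the node exposes:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"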

And to be clear, for any of you stumbling in from the internet, here are my complete additional steps beyond ./create-cluster.sh:

#!/bin/bash
#
export KIND_CLUSTER_NAME=k8s-dra-driver-cluster

docker exec -it "${KIND_CLUSTER_NAME}-worker" bash -c "\
  umount /usr/bin/nvidia-ctk && \
  apt-get update && apt-get install -y gpg && \
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
  curl -s -L https://nvidia.github.io/libnvidia-container/experimental/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \
  apt-get update && apt-get install -y nvidia-container-toolkit && \
  nvidia-ctk config --set nvidia-container-runtime.modes.cdi.annotation-prefixes=nvidia.cdi.k8s.io/ && \
  nvidia-ctk runtime configure --runtime=containerd --cdi.enabled && \
  systemctl restart containerd"

helm install \
     --wait \
     --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=true \
     --set toolkit.enabled=true \
     --set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
     --set-string validator.driver.env[0].value="true"

Now then, why did I have to do all this extra work over and above the documentation? Is it just because I'm on Debian 12? (I started on Arch Linux; before opening this issue I decided Debian might be more stable.) If this is the expected behavior I'll gladly make a PR documenting all of this, but somehow I feel that is not the case. I am installing Ubuntu 22.04 (jammy) to a partition to test some more.

klueska commented 7 months ago

You're probably the first to run the operator under kind.

joshuacox commented 7 months ago

Hmmm, now I am going to have to give this another shot using another method. As I said, I tried k0s above and will give that a second try now that I have a working sanity check. I am familiar with bootstrapping a cluster using both kubeadm and kubespray; I even scripted it all out with another project, kubash.

Are there any other setups that anyone has tried? What is 'supported'?

klueska commented 7 months ago

I've transferred this issue to the gpu-operator repo (since that's what the issue was really related to). I'll let the operator devs answer your last question.

elezar commented 7 months ago

@joshuacox just for reference. The compatibility with Debian that is an issue here is not that of the NVIDIA Container Toolkit (or even the device plugin), but that of the GPU Operator. For the official support matrix see: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms

Note that it is my understanding that this is largely due to the driver container, but there may be some subtle issues that arise from not having qualified the stack on the target operating system.

For what it's worth, we are starting to look at using kind for basic internal tests, and as we address some rough edges there, those fixes should make it into released versions -- although the question of official platform support is not something I can speak to at present.

joshuacox commented 7 months ago

@elezar and @klueska thank you guys for helping so much! And thanks for the transfer to this repo; this is probably where I should've submitted the issue in the first place.

@elezar how can I help facilitate building these internal tests? I am looking around this repo and I don't see a demo directory like the one we were dealing with above; is that the sort of thing we might want to build here? I'd certainly be interested in helping with any part of this process that I can.

elezar commented 7 months ago

@elezar how can I help facilitate building these internal tests? I am looking around this repo and I don't see a demo directory like the one we were dealing with above; is that the sort of thing we might want to build here? I'd certainly be interested in helping with any part of this process that I can.

Although @shivamerla and @cdesiniotis should also chime in here, I think creating a PR adding a demo folder including a basic README.md that runs through getting the GPU Operator installed on kind -- mirroring what we have for the k8s-dra-driver and the k8s-device-plugin -- would be a good start.

cdesiniotis commented 7 months ago

I think creating a PR adding a demo folder including a basic README.md that runs through getting the GPU Operator installed on kind

This is fine by me. @joshuacox, contributions are welcome!

@joshuacox there is one minor detail I would like to point out. In your helm install command, you explicitly set driver.enabled=true, which is actually not necessary in this case. The kind node already has access to the driver installation from the host, so the GPU Operator does not need to install the driver. In fact, you won't see a pod named nvidia-driver in your pod list because the operator detected that the NVIDIA driver was already installed and disabled the containerized driver deployment for you.

elezar commented 7 months ago

@joshuacox there is one minor detail I would like to point out. In your helm install command, you explicitly set driver.enabled=true, which is actually not necessary in this case. The kind node already has access to the driver installation from the host, so the GPU Operator does not need to install the driver. In fact, you won't see a pod named nvidia-driver in your pod list because the operator detected that the NVIDIA driver was already installed and disabled the containerized driver deployment for you.

To clarify: driver.enabled=true is the default anyway, and the GPU Operator correctly identifies a preinstalled driver and skips the deployment of the driver container. It may be better to leave out the flag, or explicitly set it to false, to avoid confusion.
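
Putting that together, a minimal install for this kind setup would look something like the following (a sketch based on the commands earlier in this thread, with driver.enabled simply left at its default):

helm install \
     --wait \
     --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set toolkit.enabled=true \
     --set validator.driver.env[0].name="DISABLE_DEV_CHAR_SYMLINK_CREATION" \
     --set-string validator.driver.env[0].value="true"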

joshuacox commented 7 months ago

@elezar @cdesiniotis I have set it to false for now; I have a WIP branch here.

I'm not seeing any nvidia-driver pods, but with the release chart I definitely get a lot more pods and, more importantly, an allocatable GPU. At the moment, if I install the release chart nvidia/gpu-operator, I get something like this:

kubectl get po -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-mbmdj                                       1/1     Running     0          2m57s
gpu-operator-657b8ffcc-h4wsh                                      1/1     Running     0          3m17s
nvidia-container-toolkit-daemonset-gknj7                          1/1     Running     0          2m58s
nvidia-cuda-validator-f949w                                       0/1     Completed   0          2m39s
nvidia-dcgm-exporter-gpc7b                                        1/1     Running     0          2m58s
nvidia-device-plugin-daemonset-mpm6r                              1/1     Running     0          2m58s
nvidia-gpu-operator-node-feature-discovery-gc-64bc8485cd-4w7bw    1/1     Running     0          3m17s
nvidia-gpu-operator-node-feature-discovery-master-7fb4d54954j9c   1/1     Running     0          3m17s
nvidia-gpu-operator-node-feature-discovery-worker-gf9dr           1/1     Running     0          3m17s
nvidia-gpu-operator-node-feature-discovery-worker-wzhq4           1/1     Running     0          3m17s
nvidia-operator-validator-7rqbz                                   1/1     Running     0          2m58s

yet I only get these pods when I use the local chart:

kubectl get po -n gpu-operator
NAME                                                              READY   STATUS    RESTARTS   AGE
gpu-operator-55df7d9cdd-m5xbm                                     1/1     Running   0          3m57s
nvidia-gpu-operator-node-feature-discovery-gc-64bc8485cd-knqvz    1/1     Running   0          3m57s
nvidia-gpu-operator-node-feature-discovery-master-7fb4d549zsx6l   1/1     Running   0          3m57s
nvidia-gpu-operator-node-feature-discovery-worker-6kh69           1/1     Running   0          3m57s
nvidia-gpu-operator-node-feature-discovery-worker-ktnnh           1/1     Running   0          3m57s

with the only difference between the two scripts being:

diff install-operator.sh install-release-operator.sh
35c35
<   ${PROJECT_DIR}/deployments/gpu-operator
---
>   nvidia/gpu-operator

I am running the full delete-cluster, create-cluster, and install-operator cycle with demo.sh, e.g.

for the local chart:

./demo.sh local

for the release chart:

./demo.sh release

What exactly differs still eludes me at the moment; I'll do some diffing around to investigate. I'll go ahead and prep a PR soon, but it's still a WIP for now.

joshuacox commented 7 months ago

@elezar and @klueska, the only real difference I can see is the gdrcopy section in the local chart's driver config; am I missing something else?
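
The two /tmp trees compared below can be generated along these lines (a sketch; the release name and output directories are placeholders, and the local chart may need a helm dependency update first):

helm template demo nvidia/gpu-operator --output-dir /tmp/gpu-operator-release
helm template demo ${PROJECT_DIR}/deployments/gpu-operator --output-dir /tmp/gpu-operator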

 diff -r /tmp/gpu-operator-release/gpu-operator /tmp/gpu-operator/gpu-operator
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/clusterpolicy.yaml /tmp/gpu-operator/gpu-operator/templates/clusterpolicy.yaml
9c9
<     helm.sh/chart: gpu-operator-v23.9.1
---
>     helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
<     app.kubernetes.io/version: "v23.9.1"
---
>     app.kubernetes.io/version: "devel-ubi8"
25c25
<       helm.sh/chart: gpu-operator-v23.9.1
---
>       helm.sh/chart: gpu-operator-v1.0.0-devel
38c38
<     version: "v23.9.1"
---
>     version: "devel-ubi8"
268c268,274
<     version: "v23.9.1"
---
>     version: "devel-ubi8"
>     imagePullPolicy: IfNotPresent
>   gdrcopy:
>     enabled: false
>     repository: nvcr.io/nvidia/cloud-native
>     image: gdrdrv
>     version: "v2.4.1"
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/operator.yaml /tmp/gpu-operator/gpu-operator/templates/operator.yaml
9c9
<     helm.sh/chart: gpu-operator-v23.9.1
---
>     helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
<     app.kubernetes.io/version: "v23.9.1"
---
>     app.kubernetes.io/version: "devel-ubi8"
25c25
<         helm.sh/chart: gpu-operator-v23.9.1
---
>         helm.sh/chart: gpu-operator-v1.0.0-devel
27c27
<         app.kubernetes.io/version: "v23.9.1"
---
>         app.kubernetes.io/version: "devel-ubi8"
39c39
<         image: nvcr.io/nvidia/gpu-operator:v23.9.1
---
>         image: nvcr.io/nvidia/gpu-operator:devel-ubi8
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/rolebinding.yaml /tmp/gpu-operator/gpu-operator/templates/rolebinding.yaml
9c9
<     helm.sh/chart: gpu-operator-v23.9.1
---
>     helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
<     app.kubernetes.io/version: "v23.9.1"
---
>     app.kubernetes.io/version: "devel-ubi8"
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/role.yaml /tmp/gpu-operator/gpu-operator/templates/role.yaml
9c9
<     helm.sh/chart: gpu-operator-v23.9.1
---
>     helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
<     app.kubernetes.io/version: "v23.9.1"
---
>     app.kubernetes.io/version: "devel-ubi8"
diff --color -r /tmp/gpu-operator-release/gpu-operator/templates/serviceaccount.yaml /tmp/gpu-operator/gpu-operator/templates/serviceaccount.yaml
9c9
<     helm.sh/chart: gpu-operator-v23.9.1
---
>     helm.sh/chart: gpu-operator-v1.0.0-devel
11c11
<     app.kubernetes.io/version: "v23.9.1"
---
>     app.kubernetes.io/version: "devel-ubi8"

I have a draft PR open here

joshuacox commented 7 months ago

@klueska @elezar @cdesiniotis the PR is open and ready if only the release chart is considered. I am still having issues with the local chart in the deployments directory; I have added details of the issue in the PR, and I've streamlined the scripts to illustrate the problem.

In short, the release chart works great, e.g.

./demo.sh release

However, the local install, with gdrcopy both enabled and disabled, is falling a bit short, e.g.

./demo.sh gdrcopy
./demo.sh local

I'm failing to see what the real difference is in the actual chart, though.