NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

`/demo/clusters/kind/create-cluster.sh` fails with `umount: /proc/driver/nvidia: not mounted` #811

Open mbana opened 1 month ago

mbana commented 1 month ago

1. Quick Debug Information

$ cat /etc/os-release                                     
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a       
Linux mbana-1 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-container-runtime --version
NVIDIA Container Runtime version 1.15.0
commit: ddeeca392c7bd8b33d0a66400b77af7a97e16cef
spec: 1.2.0

runc version 1.1.12
commit: v1.1.12-0-g51d5e94
spec: 1.0.2-dev
go: go1.21.11
libseccomp: 2.5.3
$ docker version                                       
Client: Docker Engine - Community
 Version:           26.1.4
 API version:       1.45
 Go version:        go1.21.11
 Git commit:        5650f9b
 Built:             Wed Jun  5 11:28:57 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.4
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       de5c9cf
  Built:            Wed Jun  5 11:28:57 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.33
  GitCommit:        d2d58213f83a351ca8f528a95fbd145f5654e957
 nvidia:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ nvidia-container-cli -V
cli-version: 1.15.0
lib-version: 1.15.0
build date: 2024-04-15T13:36+00:00
build revision: 6c8f1df7fd32cea3280cf2a2c6e931c9b3132465
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ kind version                                  
kind v0.23.0 go1.22.4 linux/amd64

2. Issue or feature description

Running `./demo/clusters/kind/create-cluster.sh` fails with the following output:

$ ./demo/clusters/kind/create-cluster.sh
+ set -o pipefail
+ source /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
+++ pwd
++ SCRIPTS_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../../..
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../..
+++ pwd
++ PROJECT_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin
+++ from_versions_mk DRIVER_NAME
+++ local makevar=DRIVER_NAME
++++ grep -E '^\s*DRIVER_NAME\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=DRIVER_NAME := k8s-device-plugin'
+++ echo k8s-device-plugin
++ DRIVER_NAME=k8s-device-plugin
+++ from_versions_mk REGISTRY
+++ local makevar=REGISTRY
++++ grep -E '^\s*REGISTRY\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=REGISTRY ?= nvcr.io/nvidia'
+++ echo nvcr.io/nvidia
++ DRIVER_IMAGE_REGISTRY=nvcr.io/nvidia
+++ from_versions_mk VERSION
+++ local makevar=VERSION
++++ grep -E '^\s*VERSION\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=VERSION ?= v0.16.0-rc.1'
+++ echo v0.16.0-rc.1
++ DRIVER_IMAGE_VERSION=v0.16.0-rc.1
++ : k8s-device-plugin
++ : ubuntu22.04
++ : v0.16.0-rc.1
++ : nvcr.io/nvidia/k8s-device-plugin:v0.16.0-rc.1
++ : v1.29.1
++ : k8s-device-plugin-cluster
++ : /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : kindest/node:v1.29.1
+ /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/build-kind-image.sh
+ set -o pipefail
+ source /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
+++ pwd
++ SCRIPTS_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../../..
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../..
+++ pwd
++ PROJECT_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin
+++ from_versions_mk DRIVER_NAME
+++ local makevar=DRIVER_NAME
++++ grep -E '^\s*DRIVER_NAME\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=DRIVER_NAME := k8s-device-plugin'
+++ echo k8s-device-plugin
++ DRIVER_NAME=k8s-device-plugin
+++ from_versions_mk REGISTRY
+++ local makevar=REGISTRY
++++ grep -E '^\s*REGISTRY\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=REGISTRY ?= nvcr.io/nvidia'
+++ echo nvcr.io/nvidia
++ DRIVER_IMAGE_REGISTRY=nvcr.io/nvidia
+++ from_versions_mk VERSION
+++ local makevar=VERSION
++++ grep -E '^\s*VERSION\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=VERSION ?= v0.16.0-rc.1'
+++ echo v0.16.0-rc.1
++ DRIVER_IMAGE_VERSION=v0.16.0-rc.1
++ : k8s-device-plugin
++ : ubuntu22.04
++ : v0.16.0-rc.1
++ : nvcr.io/nvidia/k8s-device-plugin:v0.16.0-rc.1
++ : v1.29.1
++ : k8s-device-plugin-cluster
++ : /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : kindest/node:v1.29.1
++ docker images --filter reference=kindest/node:v1.29.1 -q
+ EXISTING_IMAGE_ID=171ed79cf912
+ '[' 171ed79cf912 '!=' '' ']'
+ exit 0
+ /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/create-kind-cluster.sh
+ set -o pipefail
+ source /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
+++ pwd
++ SCRIPTS_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../../..
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../..
+++ pwd
++ PROJECT_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin
+++ from_versions_mk DRIVER_NAME
+++ local makevar=DRIVER_NAME
++++ grep -E '^\s*DRIVER_NAME\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=DRIVER_NAME := k8s-device-plugin'
+++ echo k8s-device-plugin
++ DRIVER_NAME=k8s-device-plugin
+++ from_versions_mk REGISTRY
+++ local makevar=REGISTRY
++++ grep -E '^\s*REGISTRY\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=REGISTRY ?= nvcr.io/nvidia'
+++ echo nvcr.io/nvidia
++ DRIVER_IMAGE_REGISTRY=nvcr.io/nvidia
+++ from_versions_mk VERSION
+++ local makevar=VERSION
++++ grep -E '^\s*VERSION\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=VERSION ?= v0.16.0-rc.1'
+++ echo v0.16.0-rc.1
++ DRIVER_IMAGE_VERSION=v0.16.0-rc.1
++ : k8s-device-plugin
++ : ubuntu22.04
++ : v0.16.0-rc.1
++ : nvcr.io/nvidia/k8s-device-plugin:v0.16.0-rc.1
++ : v1.29.1
++ : k8s-device-plugin-cluster
++ : /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : kindest/node:v1.29.1
+ kind create cluster --retain --name k8s-device-plugin-cluster --image kindest/node:v1.29.1 --config /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
Creating cluster "k8s-device-plugin-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.29.1) đŸ–ŧ
 ✓ Preparing nodes đŸ“Ļ đŸ“Ļ  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹ī¸ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining worker nodes 🚜 
Set kubectl context to "kind-k8s-device-plugin-cluster"
You can now use your cluster with:

kubectl cluster-info --context kind-k8s-device-plugin-cluster

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/
+ docker exec -it k8s-device-plugin-cluster-worker umount -R /proc/driver/nvidia
umount: /proc/driver/nvidia: not mounted
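
The failing step is the script's attempt to unmount `/proc/driver/nvidia` inside the worker node so the host driver files can be injected fresh; when nothing is mounted there, `umount` exits non-zero and aborts the script. A minimal workaround sketch, assuming the node name from the log above (`tolerant_umount` is a hypothetical helper, not part of the repo's scripts):

```shell
#!/usr/bin/env bash

# Hypothetical helper: run the demo's unmount step, but treat an
# already-unmounted /proc/driver/nvidia as a no-op rather than a failure.
tolerant_umount() {
  local node="$1"
  docker exec "$node" umount -R /proc/driver/nvidia || true
}

# Node name taken from the kind cluster created in the log above.
if command -v docker >/dev/null 2>&1; then
  tolerant_umount k8s-device-plugin-cluster-worker
fi
```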

3. Information to attach (optional if deemed irrelevant)

Common error checking:

$  nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Tue Jul  9 11:21:27 2024
Driver Version                            : 550.90.07
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:05:00.0
    Product Name                          : Quadro RTX 4000
    Product Brand                         : Quadro RTX
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1324121092960
    GPU UUID                              : GPU-fddff5e2-b0d9-3d1e-544a-bc5450cc1092
    Minor Number                          : 0
    VBIOS Version                         : 90.04.87.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x500
    Board Part Number                     : 900-5G160-2550-000
    GPU Part Number                       : 1EB1-850-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G160.0500.00.01
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x05
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x1EB110DE
        Bus Id                            : 00000000:05:00.0
        Sub System Id                     : 0x12A010DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 8192 MiB
        Reserved                          : 225 MiB
        Used                              : 1 MiB
        Free                              : 7967 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 26 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 97 C
        GPU Slowdown Temp                 : 94 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 8.75 W
        Current Power Limit               : 125.00 W
        Requested Power Limit             : 125.00 W
        Default Power Limit               : 125.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 125.00 W
    GPU Memory Power Readings 
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 1005 MHz
        Memory                            : 6501 MHz
    Default Applications Clocks
        Graphics                          : 1005 MHz
        Memory                            : 6501 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 6501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "bip": "192.168.99.1/24",
    "default-shm-size": "1G",
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "1"
    },
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    }
}
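
With `"default-runtime": "nvidia"` set, every container the daemon starts, including the kind nodes, should run under the NVIDIA runtime. One way to sanity-check that the daemon actually picked the setting up (a sketch; `default_runtime` is just an illustrative wrapper):

```shell
#!/usr/bin/env bash

# Illustrative wrapper: report the daemon's default OCI runtime.
default_runtime() {
  docker info --format '{{.DefaultRuntime}}'
}

# Should print "nvidia" once the daemon has been restarted
# after editing /etc/docker/daemon.json.
if command -v docker >/dev/null 2>&1; then
  default_runtime
fi
```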

Additional information that might help better understand your environment and reproduce the bug:

$ env PAGER=cat dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version            Architecture Description
+++-====================================-==================-============-=========================================================
un  libgldispatch0-nvidia                <none>             <none>       (no description available)
ii  libnvidia-cfg1-550:amd64             550.90.07-0ubuntu1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                   <none>             <none>       (no description available)
un  libnvidia-common                     <none>             <none>       (no description available)
ii  libnvidia-common-550                 550.90.07-0ubuntu1 all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                    <none>             <none>       (no description available)
ii  libnvidia-compute-550:amd64          550.90.07-0ubuntu1 amd64        NVIDIA libcompute package
ii  libnvidia-container-tools            1.15.0-1           amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.15.0-1           amd64        NVIDIA container runtime library
un  libnvidia-decode                     <none>             <none>       (no description available)
ii  libnvidia-decode-550:amd64           550.90.07-0ubuntu1 amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                     <none>             <none>       (no description available)
ii  libnvidia-encode-550:amd64           550.90.07-0ubuntu1 amd64        NVENC Video Encoding runtime library
un  libnvidia-extra                      <none>             <none>       (no description available)
ii  libnvidia-extra-550:amd64            550.90.07-0ubuntu1 amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                       <none>             <none>       (no description available)
ii  libnvidia-fbc1-550:amd64             550.90.07-0ubuntu1 amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                         <none>             <none>       (no description available)
ii  libnvidia-gl-550:amd64               550.90.07-0ubuntu1 amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ml.so.1                    <none>             <none>       (no description available)
un  nvidia-384                           <none>             <none>       (no description available)
un  nvidia-390                           <none>             <none>       (no description available)
un  nvidia-common                        <none>             <none>       (no description available)
un  nvidia-compute-utils                 <none>             <none>       (no description available)
ii  nvidia-compute-utils-550             550.90.07-0ubuntu1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime             <none>             <none>       (no description available)
un  nvidia-container-runtime-hook        <none>             <none>       (no description available)
ii  nvidia-container-toolkit             1.15.0-1           amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base        1.15.0-1           amd64        NVIDIA Container Toolkit Base
ii  nvidia-dkms-550                      550.90.07-0ubuntu1 amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                   <none>             <none>       (no description available)
un  nvidia-docker                        <none>             <none>       (no description available)
ii  nvidia-docker2                       2.13.0-1           all          nvidia-docker CLI wrapper
ii  nvidia-driver-550                    550.90.07-0ubuntu1 amd64        NVIDIA driver metapackage
un  nvidia-driver-550-open               <none>             <none>       (no description available)
un  nvidia-driver-550-server             <none>             <none>       (no description available)
un  nvidia-driver-550-server-open        <none>             <none>       (no description available)
un  nvidia-driver-binary                 <none>             <none>       (no description available)
ii  nvidia-firmware-550-550.54.15        550.54.15-0ubuntu1 amd64        Firmware files used by the kernel module
ii  nvidia-firmware-550-550.90.07        550.90.07-0ubuntu1 amd64        Firmware files used by the kernel module
un  nvidia-firmware-550-server-550.54.15 <none>             <none>       (no description available)
un  nvidia-firmware-550-server-550.90.07 <none>             <none>       (no description available)
un  nvidia-kernel-common                 <none>             <none>       (no description available)
ii  nvidia-kernel-common-550             550.90.07-0ubuntu1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source                 <none>             <none>       (no description available)
ii  nvidia-kernel-source-550             550.90.07-0ubuntu1 amd64        NVIDIA kernel source package
un  nvidia-opencl-icd                    <none>             <none>       (no description available)
un  nvidia-persistenced                  <none>             <none>       (no description available)
ii  nvidia-prime                         0.8.17.1           all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                      555.42.02-0ubuntu1 amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary               <none>             <none>       (no description available)
un  nvidia-smi                           <none>             <none>       (no description available)
un  nvidia-utils                         <none>             <none>       (no description available)
ii  nvidia-utils-550                     550.90.07-0ubuntu1 amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-550        550.90.07-0ubuntu1 amd64        NVIDIA binary Xorg driver

I am not sure what the script is trying to do, but when I exec into the worker, it reports the correct GPU information:

$ docker exec -it k8s-device-plugin-cluster-worker bash
root@k8s-device-plugin-cluster-worker:/# ls -lah /proc/driver/nvidia
total 0
dr-xr-xr-x 11 root root 0 Jul  9 10:16 .
dr-xr-xr-x  8 root root 0 Jul  9 10:16 ..
dr-xr-xr-x  5 root root 0 Jul  9 10:24 capabilities
dr-xr-xr-x  3 root root 0 Jul  9 10:24 gpus
-r--r--r--  1 root root 0 Jul  9 10:24 params
dr-xr-xr-x  3 root root 0 Jul  9 10:24 patches
-rw-r--r--  1 root root 0 Jul  9 10:24 registry
-rw-r--r--  1 root root 0 Jul  9 10:24 suspend
-rw-r--r--  1 root root 0 Jul  9 10:24 suspend_depth
-r--r--r--  1 root root 0 Jul  9 10:24 version
dr-xr-xr-x  3 root root 0 Jul  9 10:24 warnings
root@k8s-device-plugin-cluster-worker:/# ls -lah /proc/driver/nvidia/gpus
total 0
dr-xr-xr-x  3 root root 0 Jul  9 10:24 .
dr-xr-xr-x 11 root root 0 Jul  9 10:16 ..
dr-xr-xr-x  5 root root 0 Jul  9 10:25 0000:05:00.0
root@k8s-device-plugin-cluster-worker:/# cat /proc/driver/nvidia/gpus/0000\:05\:00.0/information
Model:       Quadro RTX 4000
IRQ:         64
GPU UUID:    GPU-fddff5e2-b0d9-3d1e-544a-bc5450cc1092
Video BIOS:      90.04.87.00.01
Bus Type:    PCIe
DMA Size:    47 bits
DMA Mask:    0x7fffffffffff
Bus Location:    0000:05:00.0
Device Minor:    0
GPU Excluded:    No
mbana commented 1 month ago

This fixed the startup issue for me:

$ cat /etc/nvidia-container-runtime/config.toml  
# We inject all NVIDIA GPUs using the nvidia-container-runtime.
# This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
# in `/etc/nvidia-container-runtime/config.toml`
accept-nvidia-visible-devices-as-volume-mounts = true
...
$ sudo systemctl restart docker && sudo systemctl restart containerd
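
Before recreating the cluster, the key can be verified with a quick grep (a sketch; `has_volume_mounts_setting` is an illustrative name, and the path is the one from the listing above):

```shell
#!/usr/bin/env bash

# Illustrative check: is volume-mount device injection enabled in the
# NVIDIA container runtime configuration?
has_volume_mounts_setting() {
  grep -Eq '^\s*accept-nvidia-visible-devices-as-volume-mounts\s*=\s*true\s*$' "$1"
}

CONFIG=/etc/nvidia-container-runtime/config.toml
if [ -r "$CONFIG" ]; then
  if has_volume_mounts_setting "$CONFIG"; then
    echo "volume-mount device injection is enabled"
  else
    echo "setting missing or false in $CONFIG"
  fi
fi
```

Recent toolkit releases can also write the key for you, e.g. `sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true` (verify the flags against your installed `nvidia-ctk` version); remember to restart `docker`/`containerd` afterwards, as above.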

However, when I attempt to run a GPU workload, I get the following error:

$ ./demo/clusters/kind/create-cluster.sh
$ ./demo/clusters/kind/install-plugin.sh
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
$ kubectl logs gpu-pod            
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
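
This "CUDA driver version is insufficient" message inside a kind pod usually indicates that the pod never received the driver's user-space libraries (`libcuda.so`), rather than the host driver being too old: driver 550.90.07 supports CUDA 12.4, far newer than the CUDA 10.2 sample requires. A debugging sketch, assuming the pod name from the spec above (`debug_gpu_pod` is a hypothetical helper):

```shell
#!/usr/bin/env bash

# Hypothetical helper: check whether the driver user-space libraries
# were injected into a pod at all.
debug_gpu_pod() {
  local pod="$1"
  # Does the pod see the GPU and the driver at all?
  kubectl exec "$pod" -- nvidia-smi \
    || echo "nvidia-smi failed in $pod"
  # Is libcuda visible where the CUDA runtime will look for it?
  kubectl exec "$pod" -- sh -c 'ldconfig -p | grep libcuda' \
    || echo "libcuda not visible in $pod"
}

if command -v kubectl >/dev/null 2>&1; then
  debug_gpu_pod gpu-pod
fi
```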