NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0
2.68k stars 606 forks

How to mount a containerPath to a hostPath to discover NVIDIA libraries w/o CDI spec #632

Closed Dragoncell closed 2 weeks ago

Dragoncell commented 5 months ago

Hello,

During E2E testing of the GPU Operator changes to support COS (https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1061), I found that discovering the NVIDIA libraries requires setting specific PATH/LD_LIBRARY_PATH values in the pod spec:

After the operator pods are running:

$ kubectl get pods -n gpu-operator
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rr2x2                                1/1     Running     0          4h16m
gpu-operator-66575c8958-sslch                              1/1     Running     0          4h16m
noperator-node-feature-discovery-gc-6968c7c64-g7w7r        1/1     Running     0          4h16m
noperator-node-feature-discovery-master-749679f664-dvs48   1/1     Running     0          4h16m
noperator-node-feature-discovery-worker-glhxw              1/1     Running     0          4h16m
nvidia-container-toolkit-daemonset-wvpvx                   1/1     Running     0          4h16m
nvidia-cuda-validator-z84ks                                0/1     Completed   0          4h15m
nvidia-dcgm-exporter-9r87v                                 1/1     Running     0          4h16m
nvidia-device-plugin-daemonset-fp7hm                       1/1     Running     0          4h16m
nvidia-operator-validator-hstkb                            1/1     Running     0          4h16m

I deployed a GPU workload:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args: 
    - |-
      # export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits: 
        nvidia.com/gpu: "1"

I looked at the container's OCI spec; the PATH looks like PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

In the GKE device plugin's case, we expect the NVIDIA binaries to be under /usr/local (https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/145797868c0f6bd6a0f37c0295f06dfe5fa94265/cmd/nvidia_gpu/nvidia_gpu.go#L42).

Is there something similar we can configure in the k8s device plugin so that the container path /usr/local can be mounted from the NVIDIA binary directory on the host, which is /home/kubernetes/bin/nvidia? Thanks

Dragoncell commented 5 months ago

/cc @cdesiniotis @elezar @bobbypage

Dragoncell commented 5 months ago

Looking at the CDI spec the device plugin generated, it mounts the host path /home/kubernetes/bin/nvidia/bin (seen as /host/home/kubernetes/bin/nvidia/bin from inside the plugin container) to the same container path, /home/kubernetes/bin/nvidia/bin (https://github.com/NVIDIA/k8s-device-plugin/blob/bf58cc405af03d864b1502f147815d4c2271ab9a/cmd/nvidia-device-plugin/plugin-manager.go#L50)

In this case, given the code (https://github.com/NVIDIA/k8s-device-plugin/blame/bf58cc405af03d864b1502f147815d4c2271ab9a/internal/cdi/cdi.go#L155), what is the suggested change?

$ kubectl logs nvidia-device-plugin-daemonset-fp7hm -n gpu-operator
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, toolkit-validation (init)
NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/home/kubernetes/bin/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/kubernetes/bin/nvidia/bin
Starting nvidia-device-plugin
I0404 19:00:20.786406       1 main.go:154] Starting FS watcher.
I0404 19:00:20.786557       1 main.go:161] Starting OS watcher.
I0404 19:00:20.786976       1 main.go:176] Starting Plugins.
I0404 19:00:20.786994       1 main.go:234] Loading configuration.
I0404 19:00:20.787155       1 main.go:242] Updating config with default resource matching patterns.
I0404 19:00:20.787381       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar",
        "cdi-annotations"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "nvidia.cdi.k8s.io/",
      "nvidiaCTKPath": "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0404 19:00:20.787399       1 main.go:256] Retreiving plugins.
time="2024-04-04T19:00:20Z" level=info msg="Auto-detected mode as \"nvml\""
I0404 19:00:20.789106       1 factory.go:107] Detected NVML platform: found NVML library
I0404 19:00:20.789136       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
time="2024-04-04T19:00:20Z" level=info msg="Generating CDI spec for resource: k8s.device-plugin.nvidia.com/gpu"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia0 as /dev/nvidia0"
time="2024-04-04T19:00:20Z" level=warning msg="Failed to evaluate symlink /host/dev/dri/by-path/pci-0000:00:03.0-card; ignoring"
time="2024-04-04T19:00:20Z" level=warning msg="Failed to evaluate symlink /host/dev/dri/by-path/pci-0000:00:03.0-render; ignoring"
time="2024-04-04T19:00:20Z" level=info msg="Using driver version 535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia-modeset as /dev/nvidia-modeset"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia-uvm as /dev/nvidia-uvm"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidiactl as /dev/nvidiactl"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0 as /home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate glvnd/egl_vendor.d/10_nvidia.json: pattern glvnd/egl_vendor.d/10_nvidia.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate vulkan/implicit_layer.d/nvidia_layers.json: pattern vulkan/implicit_layer.d/nvidia_layers.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/nvoptix.bin: pattern nvidia/nvoptix.bin not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.129.03"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_ga10x.bin as /home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_ga10x.bin"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_tu10x.bin as /home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_tu10x.bin"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-smi as /home/kubernetes/bin/nvidia/bin/nvidia-smi"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-debugdump as /home/kubernetes/bin/nvidia/bin/nvidia-debugdump"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-persistenced as /home/kubernetes/bin/nvidia/bin/nvidia-persistenced"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control as /home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server as /home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
I0404 19:00:20.835426       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0404 19:00:20.836469       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0404 19:00:20.839254       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Dragoncell commented 4 months ago

With the change below: https://github.com/NVIDIA/k8s-device-plugin/pull/666

I tested it out locally:

helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set operator.runtimeClass=nvidia-cdi \
  --set hostRoot=/ \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set devRoot=/ \
  --set operator.repository=gcr.io/jiamingxu-gke-dev \
  --set operator.version=v0422_05 \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set toolkit.repository=gcr.io/jiamingxu-gke-dev \
  --set toolkit.version=v4 \
  --set validator.repository=gcr.io/jiamingxu-gke-dev \
  --set validator.version=v0417_1 \
  --set devicePlugin.version=v0422_6 \
  --set devicePlugin.repository=gcr.io/jiamingxu-gke-dev

with the following k8s device plugin configuration:

NVIDIA_DRIVER_ROOT=/home/kubernetes/bin/nvidia
CONTAINER_DRIVER_ROOT=/host/home/kubernetes/bin/nvidia
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

The generated CDI spec looks good:

{
  "cdiVersion": "v0.5.0",
  "kind": "k8s.device-plugin.nvidia.com/gpu",
  "devices": [
    {
      "name": "GPU-13f2a0cd-9ac8-a110-68c4-b0e9bd769db1",
      "containerEdits": {
        "deviceNodes": [
          {
            "path": "/dev/nvidia0",
            "hostPath": "/dev/nvidia0"
          }
        ]
      }
    }
  ],
  "containerEdits": {
    "env": [
      "NVIDIA_VISIBLE_DEVICES=void"
    ],
    "deviceNodes": [
      {
        "path": "/dev/nvidia-modeset",
        "hostPath": "/dev/nvidia-modeset"
      },
      {
        "path": "/dev/nvidia-uvm",
        "hostPath": "/dev/nvidia-uvm"
      },
      {
        "path": "/dev/nvidia-uvm-tools",
        "hostPath": "/dev/nvidia-uvm-tools"
      },
      {
        "path": "/dev/nvidiactl",
        "hostPath": "/dev/nvidiactl"
      }
    ],
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "update-ldcache",
          "--folder",
          "/lib64"
        ]
      }
    ],
    "mounts": [
      {
        "hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control",
        "containerPath": "/bin/nvidia-cuda-mps-control",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server",
        "containerPath": "/bin/nvidia-cuda-mps-server",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-debugdump",
        "containerPath": "/bin/nvidia-debugdump",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-persistenced",
        "containerPath": "/bin/nvidia-persistenced",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-smi",
        "containerPath": "/bin/nvidia-smi",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.129.03",
        "containerPath": "/lib64/libEGL_nvidia.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.129.03",
        "containerPath": "/lib64/libGLESv1_CM_nvidia.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.129.03",
        "containerPath": "/lib64/libGLESv2_nvidia.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.129.03",
        "containerPath": "/lib64/libGLX_nvidia.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03",
        "containerPath": "/lib64/libcuda.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.129.03",
        "containerPath": "/lib64/libcudadebugger.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.129.03",
        "containerPath": "/lib64/libnvcuvid.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.129.03",
        "containerPath": "/lib64/libnvidia-allocator.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.129.03",
        "containerPath": "/lib64/libnvidia-cfg.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0",
        "containerPath": "/lib64/libnvidia-egl-gbm.so.1.1.0",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.129.03",
        "containerPath": "/lib64/libnvidia-eglcore.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.129.03",
        "containerPath": "/lib64/libnvidia-encode.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.129.03",
        "containerPath": "/lib64/libnvidia-fbc.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.129.03",
        "containerPath": "/lib64/libnvidia-glcore.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.129.03",
        "containerPath": "/lib64/libnvidia-glsi.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.129.03",
        "containerPath": "/lib64/libnvidia-glvkspirv.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.129.03",
        "containerPath": "/lib64/libnvidia-gtk2.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.129.03",
        "containerPath": "/lib64/libnvidia-gtk3.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.129.03",
        "containerPath": "/lib64/libnvidia-ml.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.129.03",
        "containerPath": "/lib64/libnvidia-ngx.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.129.03",
        "containerPath": "/lib64/libnvidia-nvvm.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.129.03",
        "containerPath": "/lib64/libnvidia-opencl.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.129.03",
        "containerPath": "/lib64/libnvidia-opticalflow.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.129.03",
        "containerPath": "/lib64/libnvidia-pkcs11-openssl3.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.129.03",
        "containerPath": "/lib64/libnvidia-pkcs11.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.129.03",
        "containerPath": "/lib64/libnvidia-ptxjitcompiler.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.129.03",
        "containerPath": "/lib64/libnvidia-rtcore.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.129.03",
        "containerPath": "/lib64/libnvidia-tls.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.129.03",
        "containerPath": "/lib64/libnvidia-vulkan-producer.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.129.03",
        "containerPath": "/lib64/libnvidia-wayland-client.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.129.03",
        "containerPath": "/lib64/libnvoptix.so.535.129.03",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      }
    ]
  }
}
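The mounts in the spec above follow one rule: each host path under driverRoot (/home/kubernetes/bin/nvidia here) is exposed in the container with that prefix stripped, so libraries land at /lib64 and binaries at /bin. A small sketch of the mapping, using a path from the spec:

```shell
# Prefix-stripping rule behind the CDI mounts:
# hostPath minus driverRoot gives containerPath.
driver_root=/home/kubernetes/bin/nvidia
host_path="$driver_root/lib64/libcuda.so.535.129.03"
container_path="${host_path#"$driver_root"}"
echo "$container_path"   # -> /lib64/libcuda.so.535.129.03
```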

However, for a workload without PATH/LD_LIBRARY_PATH set,

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args: 
    - |-
      nvidia-smi;
      sleep 10000;
    resources:
      limits: 
        nvidia.com/gpu: "1"

creation failed with an error like:

$ kubectl get pods
NAME         READY   STATUS                 RESTARTS   AGE
my-gpu-pod   0/1     CreateContainerError   0          56s

$ kubectl describe pod my-gpu-pod
....
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  13s                default-scheduler  Successfully assigned default/my-gpu-pod to gke-cluster-cos-custom-d-default-pool-b11d602e-ampq
  Normal   Pulled     12s (x2 over 12s)  kubelet            Container image "nvidia/cuda:11.0.3-base-ubuntu20.04" already present on machine
  Warning  Failed     12s                kubelet            Error: failed to generate container "0cfd3543fda1813f204a7154f8ef1e933183b40d72f223d0e6b6ede6c904ec77" spec: failed to generate spec: lstat /home/kubernetes/bin/nvidia/dev/nvidiactl: no such file or directory
  Warning  Failed     12s                kubelet            Error: failed to generate container "03a136525925fe8777b772d80234fdf98a715795340a6cc615cb18e0c1116f3a" spec: failed to generate spec: lstat /home/kubernetes/bin/nvidia/dev/nvidiactl: no such file or directory
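My reading of the lstat error above (an assumption on my part, not the plugin's actual code): the device node path was being resolved relative to the driver root instead of the host's /dev:

```shell
# The failing path looks like driverRoot joined with the device path:
driver_root=/home/kubernetes/bin/nvidia
wrong="${driver_root}/dev/nvidiactl"   # path the kubelet tried; absent on COS
right="/dev/nvidiactl"                 # actual device node location when devRoot is /
echo "$wrong"   # -> /home/kubernetes/bin/nvidia/dev/nvidiactl
echo "$right"   # -> /dev/nvidiactl
```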

elezar commented 4 months ago

I have updated #666 to include a fix for this. An additional hostDevRoot helm value has been added that can be explicitly set to / on systems where the host's /dev lives under / rather than under nvidiaDriverRoot.

Dragoncell commented 4 months ago

Thanks for the update

With the latest change in https://github.com/NVIDIA/k8s-device-plugin/pull/666 and the config below:

NVIDIA_DRIVER_ROOT=/home/kubernetes/bin/nvidia
CONTAINER_DRIVER_ROOT=/host/home/kubernetes/bin/nvidia
NVIDIA_DEV_ROOT=/
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk

The test pod running nvidia-smi works as expected!

$ kubectl apply -f test-pod-smi.yaml
pod/my-gpu-pod created

$ kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
my-gpu-pod   1/1     Running   0          5s

$ kubectl logs my-gpu-pod
Tue Apr 23 19:33:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   36C    P8              16W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 2 weeks ago

This issue was automatically closed due to inactivity.