Closed Dragoncell closed 2 weeks ago
/cc @cdesiniotis @elezar @bobbypage
Looking at the CDI spec generated by the device plugin, it mounts container path /host/home/kubernetes/bin/nvidia/bin
to host path /home/kubernetes/bin/nvidia/bin
(https://github.com/NVIDIA/k8s-device-plugin/blob/bf58cc405af03d864b1502f147815d4c2271ab9a/cmd/nvidia-device-plugin/plugin-manager.go#L50)
In this case, given the code at https://github.com/NVIDIA/k8s-device-plugin/blame/bf58cc405af03d864b1502f147815d4c2271ab9a/internal/cdi/cdi.go#L155, what is the suggested change?
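For context, a quick way to dump the host-to-container mount pairs from a CDI spec is jq. The sample spec below is trimmed down to reproduce the mis-prefixed mount described above; on a real node you would point jq at the file the plugin actually writes (a location under /var/run/cdi/ is an assumption here):

```shell
# Write a trimmed sample CDI spec that reproduces the reported mount entry.
cat > /tmp/sample-cdi.json <<'EOF'
{
  "containerEdits": {
    "mounts": [
      {
        "hostPath": "/home/kubernetes/bin/nvidia/bin",
        "containerPath": "/host/home/kubernetes/bin/nvidia/bin"
      }
    ]
  }
}
EOF

# Print each mount as "hostPath -> containerPath"; a stray /host prefix on
# the container side stands out immediately.
jq -r '.containerEdits.mounts[] | "\(.hostPath) -> \(.containerPath)"' /tmp/sample-cdi.json
```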
$ kubectl logs nvidia-device-plugin-daemonset-fp7hm -n gpu-operator
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, toolkit-validation (init)
NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/home/kubernetes/bin/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/kubernetes/bin/nvidia/bin
Starting nvidia-device-plugin
I0404 19:00:20.786406 1 main.go:154] Starting FS watcher.
I0404 19:00:20.786557 1 main.go:161] Starting OS watcher.
I0404 19:00:20.786976 1 main.go:176] Starting Plugins.
I0404 19:00:20.786994 1 main.go:234] Loading configuration.
I0404 19:00:20.787155 1 main.go:242] Updating config with default resource matching patterns.
I0404 19:00:20.787381 1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar",
        "cdi-annotations"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "nvidia.cdi.k8s.io/",
      "nvidiaCTKPath": "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0404 19:00:20.787399 1 main.go:256] Retreiving plugins.
time="2024-04-04T19:00:20Z" level=info msg="Auto-detected mode as \"nvml\""
I0404 19:00:20.789106 1 factory.go:107] Detected NVML platform: found NVML library
I0404 19:00:20.789136 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
time="2024-04-04T19:00:20Z" level=info msg="Generating CDI spec for resource: k8s.device-plugin.nvidia.com/gpu"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia0 as /dev/nvidia0"
time="2024-04-04T19:00:20Z" level=warning msg="Failed to evaluate symlink /host/dev/dri/by-path/pci-0000:00:03.0-card; ignoring"
time="2024-04-04T19:00:20Z" level=warning msg="Failed to evaluate symlink /host/dev/dri/by-path/pci-0000:00:03.0-render; ignoring"
time="2024-04-04T19:00:20Z" level=info msg="Using driver version 535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia-modeset as /dev/nvidia-modeset"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidia-uvm as /dev/nvidia-uvm"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/dev/nvidiactl as /dev/nvidiactl"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0 as /home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate glvnd/egl_vendor.d/10_nvidia.json: pattern glvnd/egl_vendor.d/10_nvidia.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate vulkan/implicit_layer.d/nvidia_layers.json: pattern vulkan/implicit_layer.d/nvidia_layers.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/nvoptix.bin: pattern nvidia/nvoptix.bin not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.129.03"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.129.03 as /home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.129.03"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_ga10x.bin as /home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_ga10x.bin"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_tu10x.bin as /home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp_tu10x.bin"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-smi as /home/kubernetes/bin/nvidia/bin/nvidia-smi"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-debugdump as /home/kubernetes/bin/nvidia/bin/nvidia-debugdump"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-persistenced as /home/kubernetes/bin/nvidia/bin/nvidia-persistenced"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control as /home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control"
time="2024-04-04T19:00:20Z" level=info msg="Selecting /host/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server as /home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-04T19:00:20Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
I0404 19:00:20.835426 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0404 19:00:20.836469 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0404 19:00:20.839254 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
With the change below: https://github.com/NVIDIA/k8s-device-plugin/pull/666
Tested it out locally:
helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set operator.runtimeClass=nvidia-cdi \
  --set hostRoot=/ \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set devRoot=/ \
  --set operator.repository=gcr.io/jiamingxu-gke-dev \
  --set operator.version=v0422_05 \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set toolkit.repository=gcr.io/jiamingxu-gke-dev \
  --set toolkit.version=v4 \
  --set validator.repository=gcr.io/jiamingxu-gke-dev \
  --set validator.version=v0417_1 \
  --set devicePlugin.version=v0422_6 \
  --set devicePlugin.repository=gcr.io/jiamingxu-gke-dev
with the following k8s device plugin config:
NVIDIA_DRIVER_ROOT=/home/kubernetes/bin/nvidia
CONTAINER_DRIVER_ROOT=/host/home/kubernetes/bin/nvidia
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
the CDI spec looks good
{
"cdiVersion": "v0.5.0",
"kind": "k8s.device-plugin.nvidia.com/gpu",
"devices": [
{
"name": "GPU-13f2a0cd-9ac8-a110-68c4-b0e9bd769db1",
"containerEdits": {
"deviceNodes": [
{
"path": "/dev/nvidia0",
"hostPath": "/dev/nvidia0"
}
]
}
}
],
"containerEdits": {
"env": [
"NVIDIA_VISIBLE_DEVICES=void"
],
"deviceNodes": [
{
"path": "/dev/nvidia-modeset",
"hostPath": "/dev/nvidia-modeset"
},
{
"path": "/dev/nvidia-uvm",
"hostPath": "/dev/nvidia-uvm"
},
{
"path": "/dev/nvidia-uvm-tools",
"hostPath": "/dev/nvidia-uvm-tools"
},
{
"path": "/dev/nvidiactl",
"hostPath": "/dev/nvidiactl"
}
],
"hooks": [
{
"hookName": "createContainer",
"path": "/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk",
"args": [
"nvidia-ctk",
"hook",
"update-ldcache",
"--folder",
"/lib64"
]
}
],
"mounts": [
{
"hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-control",
"containerPath": "/bin/nvidia-cuda-mps-control",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-cuda-mps-server",
"containerPath": "/bin/nvidia-cuda-mps-server",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-debugdump",
"containerPath": "/bin/nvidia-debugdump",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-persistenced",
"containerPath": "/bin/nvidia-persistenced",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-smi",
"containerPath": "/bin/nvidia-smi",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libEGL_nvidia.so.535.129.03",
"containerPath": "/lib64/libEGL_nvidia.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libGLESv1_CM_nvidia.so.535.129.03",
"containerPath": "/lib64/libGLESv1_CM_nvidia.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libGLESv2_nvidia.so.535.129.03",
"containerPath": "/lib64/libGLESv2_nvidia.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libGLX_nvidia.so.535.129.03",
"containerPath": "/lib64/libGLX_nvidia.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03",
"containerPath": "/lib64/libcuda.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libcudadebugger.so.535.129.03",
"containerPath": "/lib64/libcudadebugger.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvcuvid.so.535.129.03",
"containerPath": "/lib64/libnvcuvid.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-allocator.so.535.129.03",
"containerPath": "/lib64/libnvidia-allocator.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-cfg.so.535.129.03",
"containerPath": "/lib64/libnvidia-cfg.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-egl-gbm.so.1.1.0",
"containerPath": "/lib64/libnvidia-egl-gbm.so.1.1.0",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-eglcore.so.535.129.03",
"containerPath": "/lib64/libnvidia-eglcore.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-encode.so.535.129.03",
"containerPath": "/lib64/libnvidia-encode.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-fbc.so.535.129.03",
"containerPath": "/lib64/libnvidia-fbc.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-glcore.so.535.129.03",
"containerPath": "/lib64/libnvidia-glcore.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-glsi.so.535.129.03",
"containerPath": "/lib64/libnvidia-glsi.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-glvkspirv.so.535.129.03",
"containerPath": "/lib64/libnvidia-glvkspirv.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk2.so.535.129.03",
"containerPath": "/lib64/libnvidia-gtk2.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-gtk3.so.535.129.03",
"containerPath": "/lib64/libnvidia-gtk3.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-ml.so.535.129.03",
"containerPath": "/lib64/libnvidia-ml.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-ngx.so.535.129.03",
"containerPath": "/lib64/libnvidia-ngx.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-nvvm.so.535.129.03",
"containerPath": "/lib64/libnvidia-nvvm.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-opencl.so.535.129.03",
"containerPath": "/lib64/libnvidia-opencl.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-opticalflow.so.535.129.03",
"containerPath": "/lib64/libnvidia-opticalflow.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.129.03",
"containerPath": "/lib64/libnvidia-pkcs11-openssl3.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-pkcs11.so.535.129.03",
"containerPath": "/lib64/libnvidia-pkcs11.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.129.03",
"containerPath": "/lib64/libnvidia-ptxjitcompiler.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-rtcore.so.535.129.03",
"containerPath": "/lib64/libnvidia-rtcore.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-tls.so.535.129.03",
"containerPath": "/lib64/libnvidia-tls.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-vulkan-producer.so.535.129.03",
"containerPath": "/lib64/libnvidia-vulkan-producer.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvidia-wayland-client.so.535.129.03",
"containerPath": "/lib64/libnvidia-wayland-client.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
},
{
"hostPath": "/home/kubernetes/bin/nvidia/lib64/libnvoptix.so.535.129.03",
"containerPath": "/lib64/libnvoptix.so.535.129.03",
"options": [
"ro",
"nosuid",
"nodev",
"bind"
]
}
]
}
}
For a workload without PATH/LD_LIBRARY_PATH set:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args:
    - |-
      nvidia-smi;
      sleep 10000;
    resources:
      limits:
        nvidia.com/gpu: "1"
Creation failed with an error like:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-gpu-pod 0/1 CreateContainerError 0 56s
$ kubectl describe pod my-gpu-pod
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13s default-scheduler Successfully assigned default/my-gpu-pod to gke-cluster-cos-custom-d-default-pool-b11d602e-ampq
Normal Pulled 12s (x2 over 12s) kubelet Container image "nvidia/cuda:11.0.3-base-ubuntu20.04" already present on machine
Warning Failed 12s kubelet Error: failed to generate container "0cfd3543fda1813f204a7154f8ef1e933183b40d72f223d0e6b6ede6c904ec77" spec: failed to generate spec: lstat /home/kubernetes/bin/nvidia/dev/nvidiactl: no such file or directory
Warning Failed 12s kubelet Error: failed to generate container "03a136525925fe8777b772d80234fdf98a715795340a6cc615cb18e0c1116f3a" spec: failed to generate spec: lstat /home/kubernetes/bin/nvidia/dev/nvidiactl: no such file or directory
I have updated #666 to include a fix for this. An additional hostDevRoot helm value is added that can be explicitly set to / on systems where the root of /dev on the host is / and not equal to nvidiaDriverRoot.
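As a sketch (the value name is taken from the PR description above; the exact spelling and defaults may differ in the merged change), the earlier helm invocation would then add the new value alongside driverRoot:

```shell
# hostDevRoot tells the plugin that device nodes live under / on the host,
# even though the driver root is /home/kubernetes/bin/nvidia.
helm upgrade -i --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set hostDevRoot=/
```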
Thanks for the update
With the latest change https://github.com/NVIDIA/k8s-device-plugin/pull/666, and below config
NVIDIA_DRIVER_ROOT=/home/kubernetes/bin/nvidia
CONTAINER_DRIVER_ROOT=/host/home/kubernetes/bin/nvidia
NVIDIA_DEV_ROOT=/
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
Tested a pod running nvidia-smi, and it works as expected!
$ kubectl apply -f test-pod-smi.yaml
pod/my-gpu-pod created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
my-gpu-pod 1/1 Running 0 5s
$ kubectl logs my-gpu-pod
Tue Apr 23 19:33:55 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 36C P8 16W / 72W | 4MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
Hello,
During the E2E test of changes in GPU Operator to support COS (https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1061), I found that discovering the nvidia libraries requires a specific PATH/LD_LIBRARY_PATH in the pod spec:
after the pod is running
and the GPU workload is deployed,
I looked at the OCI spec of the container, and the PATH looks like PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
In GKE's device plugin case, we expect the nvidia bin under /usr/local (https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/145797868c0f6bd6a0f37c0295f06dfe5fa94265/cmd/nvidia_gpu/nvidia_gpu.go#L42). Is there something similar we can configure in the k8s device plugin as well, so that the container path /usr/local could be mounted from the nvidia bin dir on the host, which is /home/kubernetes/bin/nvidia? Thanks
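Illustratively (this is purely hypothetical; the plugin does not emit such entries today), the behavior asked for here would amount to CDI mount entries that remap the container path rather than mirroring the host layout, e.g.:

```json
{
  "hostPath": "/home/kubernetes/bin/nvidia/bin/nvidia-smi",
  "containerPath": "/usr/local/nvidia/bin/nvidia-smi",
  "options": ["ro", "nosuid", "nodev", "bind"]
}
```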