Closed EKami closed 2 months ago
Note that your default containerd runtime is set to:
default_runtime_name = "runc"
This means that unless you explicitly configure the device plugin containers to use the nvidia runtime, they will not have the required access to the devices and drivers.
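To confirm which default runtime a given containerd config file declares, a small helper along these lines may be useful. It is only a sketch: the function name `default_runtime` is made up for illustration, and it reads just the one file passed to it (for k0s, the issue mentions /etc/k0s/containerd.toml).

```shell
# Print the value of default_runtime_name from a containerd config file.
# Hypothetical helper, not part of containerd, k0s, or the device plugin.
default_runtime() {
    # Match lines such as:   default_runtime_name = "runc"
    # and print only the quoted value; stop after the first match.
    sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# Usage (path taken from the issue description):
#   default_runtime /etc/k0s/containerd.toml
```

If this prints runc, pods only get the nvidia runtime when they request it explicitly, for example through a runtime class.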
The simplest way to address this (if you don't want to change the default runtime to nvidia) is to use a runtime class. First create one on your cluster:
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  name: nvidia
EOF
Then instruct the device plugin to use it:
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.15.1 \
--set runtimeClassName=nvidia \
--namespace nvidia-device-plugin \
--create-namespace \
--set-file config.map.config=/tmp/nvidia-config.yaml
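Once the upgrade has rolled out, a quick check like the following can show whether the runtime class was picked up and whether the node advertises GPUs. This is a sketch under assumptions: the function name `check_device_plugin` is hypothetical, and the DaemonSet name `nvdp-nvidia-device-plugin` is inferred from the pod name in the issue rather than confirmed.

```shell
# Report the runtime class used by the device plugin DaemonSet and the
# number of GPUs the node advertises. Hypothetical helper for illustration.
check_device_plugin() {
    node="${1:?usage: check_device_plugin NODE}"
    ns="nvidia-device-plugin"        # namespace from the helm command
    ds="nvdp-nvidia-device-plugin"   # assumed DaemonSet name (inferred from the pod name)

    # Runtime class set on the DaemonSet's pod template.
    printf 'runtimeClassName: %s\n' \
        "$(kubectl -n "$ns" get daemonset "$ds" -o jsonpath='{.spec.template.spec.runtimeClassName}')"

    # GPUs reported as allocatable on the node (empty if none are advertised).
    printf 'allocatable nvidia.com/gpu: %s\n' \
        "$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
}

# Usage (node name taken from the issue):
#   check_device_plugin ip-172-31-10-219
```

An empty GPU count would suggest the plugin still cannot see the driver libraries.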
Note that workloads also need to use the nvidia runtime in this case.
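For a workload, that would mean setting runtimeClassName in the pod spec. A minimal sketch follows, assuming the RuntimeClass is named nvidia as above. The function name `gpu_test_manifest`, the pod name, and the image are placeholders. The image must be able to run nvidia-smi. The toleration matches the nvidia.com/gpu=true:NoSchedule taint shown on the node in the issue.

```shell
# Print a minimal GPU test pod manifest that opts in to the nvidia RuntimeClass.
# $1: container image to run (placeholder, supply your own).
gpu_test_manifest() {
    image="${1:?usage: gpu_test_manifest IMAGE}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: gpu-test
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage (hypothetical image name):
#   gpu_test_manifest my-registry/cuda-base:latest | kubectl apply -f -
```

Printing the manifest instead of applying it directly makes it easy to inspect before it reaches the cluster.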
Thanks a bunch, using nvidia as the default runtime indeed solved my problem :) .
1. Quick Debug Information
2. Issue or feature description
I'm having issues installing the plugin on my k0s node and would appreciate any guidance as to what I might be doing wrong. It boils down to this error:
Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
from my nvdp-nvidia-device-plugin pod in my cluster. Here are the steps I followed to install the plugin on my GPU node:
containerd.toml:
With this configuration (located in /etc/k0s/containerd.toml for k0s) I've started my node with this:
Then I've installed the plugin with:
where the content of /tmp/nvidia-config.yaml is:
After installing with helm, my pod errored out:
After getting into more details, I get:
I've been trying for days to solve this issue. I think the embedded containerd in k0s is properly configured as I'm able to run a GPU job from it:
The only part that I'm missing is this libnvidia-ml.so.1. I'm not sure how to solve this one.
3. Information to attach (optional if deemed irrelevant)
Common error checking:
nvidia-smi -a on your host:
==============NVSMI LOG==============
Timestamp                 : Tue Jul 16 09:57:31 2024
Driver Version            : 555.52.04
CUDA Version              : 12.5
Attached GPUs             : 1
GPU 00000000:00:1E.0
    Product Name          : Tesla T4
    Product Brand         : NVIDIA
    Product Architecture  : Turing
    Display Mode          : Enabled
    Display Active        : Disabled
    Persistence Mode      : Disabled
    Addressing Mode       : None
    ...
$ dpkg -l 'nvidia' (no description available)
ii libnvidia-cfg1-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any (no description available)
un libnvidia-common (no description available)
ii libnvidia-common-555 555.52.04-0ubuntu0~gpu24.04.1 all Shared files used by the NVIDIA libraries
un libnvidia-compute (no description available)
ii libnvidia-compute-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.16.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.16.0-1 amd64 NVIDIA container runtime library
un libnvidia-decode (no description available)
ii libnvidia-decode-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA Video Decoding runtime libraries
un libnvidia-encode (no description available)
ii libnvidia-encode-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVENC Video Encoding runtime library
un libnvidia-extra (no description available)
ii libnvidia-extra-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 Extra libraries for the NVIDIA driver
un libnvidia-fbc1 (no description available)
ii libnvidia-fbc1-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl (no description available)
ii libnvidia-gl-555:amd64 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-ml.so.1 (no description available)
un nvidia-384 (no description available)
un nvidia-390 (no description available)
un nvidia-compute-utils (no description available)
ii nvidia-compute-utils-555 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA compute utilities
un nvidia-container-runtime (no description available)
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.16.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.16.0-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-dkms-555 555.52.04-0ubuntu0~gpu24.04.1 amd64 NVIDIA DKMS package
_or_
rpm -qa 'nvidia'
or: command not found
dpkg-query: no packages found matching nvidiarpm
dpkg-query: no packages found matching -qa
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================================-=============================-============-=========================================================
un libgldispatch0-nvidia
$ nvidia-container-cli -V
cli-version: 1.16.0
lib-version: 1.16.0
build date: 2024-07-15T13:41+00:00
build revision: 4c2494f16573b585788a42e9c7bee76ecd48c73d
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
❯ k describe node ip-172-31-10-219
Name: ip-172-31-10-219
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
gpu-memory-MiB=15360
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-31-10-219
kubernetes.io/os=linux
nvidia.com/gpu.present=true
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 16 Jul 2024 18:00:09 +0900
Taints: nvidia.com/gpu=true:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-172-31-10-219
AcquireTime:
RenewTime: Tue, 16 Jul 2024 19:05:57 +0900
Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  MemoryPressure  False   Tue, 16 Jul 2024 19:01:30 +0900  Tue, 16 Jul 2024 18:00:09 +0900  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 16 Jul 2024 19:01:30 +0900  Tue, 16 Jul 2024 18:00:09 +0900  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Tue, 16 Jul 2024 19:01:30 +0900  Tue, 16 Jul 2024 18:00:09 +0900  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           True    Tue, 16 Jul 2024 19:01:30 +0900  Tue, 16 Jul 2024 18:00:19 +0900  KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP: 172.31.10.219
  Hostname: ip-172-31-10-219
Capacity:
  cpu: 4
  ephemeral-storage: 100476656Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 16167172Ki
  pods: 110
Allocatable:
  cpu: 4
  ephemeral-storage: 92599286017
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 16064772Ki
  pods: 110
System Info:
  Machine ID: ec234d677095858fadc478a73216d0cd
  System UUID: ec234d67-7095-858f-adc4-78a73216d0cd
  Boot ID: 8a0f75e9-7ed9-4d37-87c6-c1732b461aac
  Kernel Version: 6.8.0-1010-aws
  OS Image: Ubuntu 24.04 LTS
  Operating System: linux
  Architecture: amd64
  Container Runtime Version: containerd://1.7.18
  Kubelet Version: v1.30.2+k0s
  Kube-Proxy Version: v1.30.2+k0s
PodCIDR: 10.244.3.0/24
PodCIDRs: 10.244.3.0/24
Non-terminated Pods: (4 in total)
  Namespace             Name                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  kube-system           konnectivity-agent-bthh5         0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system           kube-proxy-wcbz6                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system           kube-router-vt5f9                250m (6%)     0 (0%)      16Mi (0%)        0 (0%)         65m
  nvidia-device-plugin  nvdp-nvidia-device-plugin-rkq99  0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  cpu                250m (6%)  0 (0%)
  memory             16Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events: