NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes

Error in nvidia-device-plugin pod. #655

Open premalathak12 opened 1 month ago

premalathak12 commented 1 month ago

1. Quick Debug Information

2. Issue or feature description

Getting this error in the nvidia-device-plugin pod:

1 factory.go:31] No valid resources detected, creating a null CDI handler
I0417 07:12:59.758922 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0417 07:12:59.758977 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0417 07:12:59.758986 1 factory.go:115] Incompatible platform detected
E0417 07:12:59.758992 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0417 07:12:59.758997 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0417 07:12:59.759003 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0417 07:12:59.759008 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0417 07:12:59.773152 1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
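As a side note on the last log line: the toleration/nodeSelector hint only matters when the plugin DaemonSet also lands on nodes without GPUs. A minimal sketch of constraining the DaemonSet pod template is shown below, assuming GPU nodes carry a label such as `nvidia.com/gpu.present: "true"` (the exact label is an assumption; use whatever label marks your GPU nodes):

```yaml
# Fragment of the device plugin DaemonSet pod template spec (not a full manifest).
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # assumed GPU-node label; replace with your own
      tolerations:
      - key: nvidia.com/gpu              # tolerate a GPU taint, if your GPU nodes are tainted
        operator: Exists
        effect: NoSchedule
```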

3. Information to attach (optional if deemed irrelevant)

Common error checking: able to get nvidia-smi output on the worker node, which is the GPU host.

Additional information that might help better understand your environment and reproduce the bug:

NVIDIA Container Toolkit version:
cli-version: 1.15.0
lib-version: 1.15.0
build date: 2024-04-15T13:36+00:00
build revision: 6c8f1df7fd32cea3280cf2a2c6e931c9b3132465
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64

elezar commented 1 month ago

@premalathak12 is the nvidia runtime configured as the default runtime in crio? If not, a runtime class must be created and associated with the nvidia runtime, and the runtime class specified when redeploying the plugin.
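For reference, a minimal sketch of such a RuntimeClass, assuming the runtime is registered in CRI-O under the name `nvidia` (as it is in the crio.conf output below):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia     # name pods reference via runtimeClassName
handler: nvidia    # must match the runtime name in /etc/crio/crio.conf, i.e. [crio.runtime.runtimes.nvidia]
```

When deploying the plugin via the Helm chart, recent chart versions also expose a runtimeClassName value (e.g. `--set runtimeClassName=nvidia`) so that the plugin pods themselves run with the nvidia runtime; check the values.yaml of the chart version you use.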

premalathak12 commented 1 month ago

@elezar Thanks for the quick reply. Output of the runtime endpoint check on both the master and the GPU worker node:

crictl config --get runtime-endpoint
unix:///var/run/crio/crio.sock

root@ami:/home/ami# grep runtime /etc/crio/crio.conf
[crio.runtime]
[crio.runtime.runtimes]
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"

I ran these two commands, following the "Configuring CRI-O" steps:

Configure the container runtime by using the nvidia-ctk command:

sudo nvidia-ctk runtime configure --runtime=crio

The nvidia-ctk command modifies the /etc/crio/crio.conf file on the host. The file is updated so that CRI-O can use the NVIDIA Container Runtime.

Restart the CRI-O daemon:

sudo systemctl restart crio

This section of the docs has no content: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-sample-workloads-with-containerd-or-cri-o. Can you suggest a sample workload to verify the runtime?
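For reference, a minimal GPU test pod in the spirit of the CUDA vectorAdd sample commonly used to verify the runtime; the image tag and the nvidia RuntimeClass name are assumptions to adapt to your cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # needed when nvidia is not the default CRI-O runtime (assumes the RuntimeClass above exists)
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04   # example tag; pick an available cuda-sample tag from NGC
    resources:
      limits:
        nvidia.com/gpu: 1
```

If the runtime is wired up correctly, `kubectl logs cuda-vectoradd` should end with "Test PASSED". Note that the nvidia.com/gpu resource only becomes allocatable on the node once the device plugin pod itself starts cleanly, so the plugin error above has to be resolved first.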