NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

cannot generate nvidia.com/gpu.xxx labels on node #733

Closed double12gzh closed 1 week ago

double12gzh commented 7 months ago

Description

With node-feature-discovery, gpu-feature-discovery, and nvidia-device-plugin deployed, labels such as nvidia.com/gpu.product, nvidia.com/gpu.replicas, etc. are expected to appear on the node. However, no such labels are present on the node.

How I found the root cause

  1. Exec'd into the gpu-feature-discovery pod and found that there is no content in /etc/kubernetes/node-feature-discovery/features.d/gfd.
  2. Inside the gpu-feature-discovery pod, the NVML library and the nvidia-smi command cannot be found (screenshot attached; the checks are sketched below).
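
For reference, a rough sketch of the checks described above, assuming the GFD pod is named gpu-feature-discovery and runs in the namespace written as xxx elsewhere in this thread (the library path is also an assumption, not confirmed by the thread):

```sh
# Inspect the feature file GFD is supposed to write (it was empty in this case).
kubectl exec -n xxx gpu-feature-discovery -- \
  cat /etc/kubernetes/node-feature-discovery/features.d/gfd

# Check whether the NVML library and nvidia-smi are visible inside the container
# (the library path below is a common location, not confirmed in the thread).
kubectl exec -n xxx gpu-feature-discovery -- sh -c \
  'ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* 2>/dev/null; command -v nvidia-smi && nvidia-smi -L'
```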

Others

If I delete the gpu-feature-discovery pod and it is recreated, the labels are correctly added to the node.

elezar commented 7 months ago

@double12gzh how is k8s configured to make use of GPUs? If you're using containerd or cri-o, you will have to configure the NVIDIA Container Runtime for each of these. In addition, if this is not the default runtime in either case, you will have to create a RuntimeClass in k8s and deploy the device plugin and GFD using this runtime class.
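
As an illustration (not part of the original comment), a RuntimeClass referencing a containerd runtime named "nvidia" could look roughly like this; the handler name must match the runtime entry in the containerd config, and how the runtime class is attached to the pods depends on the chart or manifests in use:

```sh
# Create a RuntimeClass whose handler matches the "nvidia" runtime configured in containerd.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# The device plugin and GFD pods then need `runtimeClassName: nvidia` in their pod spec
# (many Helm charts expose this as a runtimeClassName value).
```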

If I delete the gpu-feature-discovery pod and it is recreated, the labels are correctly added to the node.

Do you mean that if GFD is restarted the labels are generated? Note that GFD should trigger a regeneration of labels after a certain amount of time. This should retrigger the NVML initialization, but if the driver was not available when the container was started, the container would still not detect the devices.
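
A small sketch for checking how the running GFD pod is configured with respect to this periodic regeneration (pod name and namespace are the placeholders used elsewhere in the thread; the relevant option names depend on the GFD version and deployment method):

```sh
# Inspect the args and environment of the GFD container to see how the label
# regeneration interval is configured for this deployment.
kubectl get pod -n xxx gpu-feature-discovery \
  -o jsonpath='{.spec.containers[0].args}{"\n"}{.spec.containers[0].env}{"\n"}'
```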

double12gzh commented 7 months ago

Many thanks for your kind reply.

Yes, we are using containerd and have configured "nvidia" as the default runtime.

[screenshot: containerd config showing "nvidia" as the default runtime]
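
Since the screenshot is not reproduced here, the relevant portion of /etc/containerd/config.toml for such a setup typically looks roughly like the following (a sketch, not the reporter's actual config; the binary path and exact section layout depend on the containerd version):

```sh
# Rough sketch of the CRI plugin section of /etc/containerd/config.toml with
# "nvidia" as the default runtime.
cat /etc/containerd/config.toml
# [plugins."io.containerd.grpc.v1.cri".containerd]
#   default_runtime_name = "nvidia"
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#   BinaryName = "/usr/bin/nvidia-container-runtime"
```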

Actually, I run kubectl delete pod -n xxx gpu-feature-discovery and wait until the pod is successfully recreated. Then I find that NVML is initialized correctly and the labels are successfully applied to the node.

If I don't delete the GFD pod, NVML will never be successfully initialized, even though the detection runs periodically.
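
For completeness, a sketch of the restart-and-verify steps described above (node name is a placeholder; jq is assumed to be available):

```sh
# Recreate the GFD pod (pod name and namespace as used earlier in the thread);
# the replacement pod's name may differ if GFD is managed by a DaemonSet.
kubectl delete pod -n xxx gpu-feature-discovery
kubectl get pods -n xxx -w    # wait until the replacement pod is Running/Ready

# Check whether the nvidia.com/gpu.* labels now appear on the node.
kubectl get node <node-name> -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/")))'
```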

elezar commented 7 months ago

@double12gzh how is the driver installed on your system? Is it preinstalled and available at the point in time when the pods are started for the first time?

double12gzh commented 7 months ago

Yes, it is preinstalled and available before the pods are started.

double12gzh commented 7 months ago

The nvidia-smi command works on the host, but when the pod was first created, nvidia-smi did not work inside the pod. After I deleted the GFD pod and waited for it to be recreated, I ran kubectl exec into the pod and ran nvidia-smi, and it worked as expected.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 1 week ago

This issue was automatically closed due to inactivity.