Closed double12gzh closed 1 week ago
@double12gzh how is k8s configured to make use of GPUs? If you're using containerd or cri-o, you will have to configure the NVIDIA Container Runtime for whichever of these you use. In addition, if it is not the default runtime, you will have to create a RuntimeClass in k8s and deploy the device plugin and GFD using this runtime class.
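For reference, a minimal sketch of the RuntimeClass mentioned above, written as a small Python helper that prints the manifest as JSON. The name and handler value "nvidia" are assumptions; the handler has to match whatever runtime name is configured in containerd or cri-o on your nodes. The helper functions are hypothetical and only illustrate the shape of the objects.

```python
import json


def nvidia_runtime_class(name: str = "nvidia", handler: str = "nvidia") -> dict:
    """Build a RuntimeClass manifest.

    Both defaults are assumptions: the handler must match the runtime
    name configured in the container runtime on the node.
    """
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def with_runtime_class(pod_spec: dict, name: str = "nvidia") -> dict:
    """Return a copy of a pod spec that requests the given RuntimeClass."""
    updated = dict(pod_spec)
    updated["runtimeClassName"] = name
    return updated


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(nvidia_runtime_class(), indent=2))
```

If this is saved as, say, runtime_class.py, the output could be piped into kubectl apply -f - to create the object. The device plugin and GFD pod specs would then need runtimeClassName set to the same name.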
If I delete the gpu-feature-discovery pod and it is recreated, the labels are correctly added to the node.
Do you mean that if GFD is restarted the labels are generated? Note that GFD should trigger a regeneration of labels after a certain amount of time. This should retrigger the NVML initialization, but if the driver was not available when the container was started, the container would still not detect the devices.
Many thanks for your kind reply.
Yes, we are using containerd and have configured "nvidia" as the default runtime.
Actually, I ran the command kubectl delete pod -n xxx gpu-feature-discovery and waited until the pod was successfully recreated. After that, NVML was initialized correctly and the labels were successfully added to the node.
If I don't delete the GFD pod, NVML is never successfully initialized, even though the detection runs periodically.
@double12gzh how is the driver installed on your system? Is it preinstalled and available at the point in time where the pods are started for the first time?
Yes, it is preinstalled and available before the pods are started.
The nvidia-smi command works on the host, but when the pod was first created, nvidia-smi did not work inside the pod.
After I deleted the GFD pod and waited for it to be recreated, I used kubectl exec to get into the new pod and ran nvidia-smi, and it worked well.
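The steps described above (delete the GFD pod, then run nvidia-smi in the recreated one) can be sketched as follows. This is only an illustration: the function names are hypothetical, the namespace and pod names are placeholders supplied by the caller, and it assumes kubectl is on the PATH and that the recreated pod's name has been looked up separately, for example with kubectl get pods.

```python
import subprocess
from typing import List


def delete_pod_cmd(namespace: str, pod: str) -> List[str]:
    """kubectl command that deletes the given pod."""
    return ["kubectl", "delete", "pod", "-n", namespace, pod]


def nvidia_smi_cmd(namespace: str, pod: str) -> List[str]:
    """kubectl command that runs nvidia-smi inside the given pod."""
    return ["kubectl", "exec", "-n", namespace, pod, "--", "nvidia-smi"]


def run(cmd: List[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Placeholder values; replace with the real namespace and pod names.
    namespace, old_pod, new_pod = "xxx", "old-gfd-pod", "new-gfd-pod"
    run(delete_pod_cmd(namespace, old_pod))
    # A non-zero exit code here means nvidia-smi failed inside the pod.
    print("nvidia-smi exit code:", run(nvidia_smi_cmd(namespace, new_pod)))
```

Running the nvidia-smi check against the original pod before deleting it gives a direct before/after comparison of whether the driver is visible inside the container.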
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
Description
With node-feature-discovery, gpu-feature-discovery and nvidia-device-plugin deployed, some labels, such as nvidia.com/gpu.product and nvidia.com/gpu.replicas, are expected to appear on the node. However, there are no such labels on the node.
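A quick way to confirm which of the expected labels are missing is to inspect the node object. The sketch below reads the JSON produced by kubectl get node <node-name> -o json from standard input and lists the missing labels. The function name is hypothetical, and the label list only contains the two labels named in this description (written as nvidia.com/gpu.replicas, which is how GFD publishes it); extend it as needed.

```python
import json
import sys
from typing import Iterable, List

# Labels named in the description above; not an exhaustive list.
EXPECTED_LABELS = ("nvidia.com/gpu.product", "nvidia.com/gpu.replicas")


def missing_gpu_labels(node: dict, expected: Iterable[str] = EXPECTED_LABELS) -> List[str]:
    """Return the expected label keys that are absent from a node object."""
    labels = (node.get("metadata") or {}).get("labels") or {}
    return [key for key in expected if key not in labels]


if __name__ == "__main__":
    node = json.load(sys.stdin)
    missing = missing_gpu_labels(node)
    if missing:
        print("missing labels:", ", ".join(missing))
    else:
        print("all expected GPU labels are present")
```

Saved as, for example, check_labels.py, it could be used as kubectl get node <node-name> -o json | python check_labels.py, once before and once after the GFD pod is recreated.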
How to find the root cause
Others
If I delete the gpu-feature-discovery pod and it is recreated, the labels are correctly added to the node.