NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected #231

Open · ArthurMelin opened this issue 3 years ago

ArthurMelin commented 3 years ago

We maintain a k8s cluster with multiple nodes, each with 4 NVIDIA GPUs. Occasionally, one of the GPUs crashes. While that's unfortunate, the main issue is that a single GPU crashing causes the 3 other GPUs to become unallocatable. All pods scheduled on the node fail to start with the following error:

Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected
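The error message suggests that the allocation path enumerates every device and aborts on the first NVML failure, so one lost GPU poisons allocation for the whole node. A simplified Go sketch of that difference (illustrative stand-ins only, not the actual plugin code or the real go-nvml API):

```go
package main

import "fmt"

// nvmlReturn is a stand-in for NVML return codes (illustrative, not the real go-nvml API).
type nvmlReturn int

const (
	nvmlSuccess nvmlReturn = iota
	nvmlErrorGPUIsLost
)

type device struct {
	index int
	state nvmlReturn
}

// enumerateAbortOnError mirrors the reported behavior: the first lost GPU
// fails the whole enumeration, making every GPU on the node unallocatable.
func enumerateAbortOnError(devs []device) ([]int, error) {
	var healthy []int
	for _, d := range devs {
		if d.state != nvmlSuccess {
			return nil, fmt.Errorf("error creating nvml.Device %d: nvml: GPU is lost", d.index)
		}
		healthy = append(healthy, d.index)
	}
	return healthy, nil
}

// enumerateSkipLost shows the behavior being asked for: skip the lost
// device and keep the remaining GPUs allocatable.
func enumerateSkipLost(devs []device) []int {
	var healthy []int
	for _, d := range devs {
		if d.state != nvmlSuccess {
			continue // treat as unhealthy instead of failing the node
		}
		healthy = append(healthy, d.index)
	}
	return healthy
}

func main() {
	devs := []device{{0, nvmlSuccess}, {1, nvmlSuccess}, {2, nvmlSuccess}, {3, nvmlErrorGPUIsLost}}
	if _, err := enumerateAbortOnError(devs); err != nil {
		fmt.Println("abort-on-error:", err)
	}
	fmt.Println("skip-lost healthy devices:", enumerateSkipLost(devs))
}
```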

Also, our application that uses those GPUs is managed by a Deployment. When a GPU crashes, the Deployment keeps recreating Pods without removing the previous Failed ones, so they accumulate (we saw up to 12k Pods) and slow down the entire cluster.
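As a stopgap for the accumulating Failed Pods (separate from the plugin bug itself), they can be garbage-collected manually with a field selector, and the kube-controller-manager has a threshold for automatic cleanup of terminated Pods. A sketch of both (cluster-side commands, shown for reference rather than as a fix):

```shell
# Delete all Pods stuck in the Failed phase, across namespaces.
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces

# kube-controller-manager flag controlling automatic GC of terminated Pods
# (set in the controller-manager's own manifest; the value is illustrative):
#   --terminated-pod-gc-threshold=500
```

Lowering the GC threshold bounds how many Failed Pods can pile up, but it does not stop the scheduler from repeatedly placing Pods on the node with the lost GPU.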

In the DaemonSet config, we have already set --fail-on-init-error=false.
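For reference, that flag is passed as a container argument in the plugin DaemonSet. A minimal fragment of the container spec (image tag illustrative):

```yaml
# Fragment of the nvidia-device-plugin DaemonSet spec (container section only).
containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0  # version illustrative
    args:
      - --fail-on-init-error=false
```

Note that --fail-on-init-error only governs plugin startup; it does not change how a GPU lost at runtime is handled.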


seaurching commented 1 year ago

I have the same error.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

fighterhit commented 4 months ago

Hi @elezar, I have the same error. It occurs while the device plugin is running; the other GPUs are normal. It seems that any pod assigned this bad GPU hits the error (device plugin or application pod). However, the device plugin does not detect the failure, so it never reduces the number of available GPUs on the node. As a result, pods keep getting scheduled to this node and assigned the bad GPU. I want to confirm whether the device plugin can detect this error, exclude the bad GPU, and reduce the number of available GPUs.
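The kubelet device plugin protocol does allow exactly this: each device advertised over ListAndWatch carries a Health field, and a plugin can re-send the device list with a device marked Unhealthy so the kubelet shrinks the node's allocatable count. A simplified Go model of that mechanism (local types standing in for the k8s.io/kubelet device-plugin API; illustrative, not the plugin's actual health checker):

```go
package main

import "fmt"

// Stand-ins for the device plugin API's health constants.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device models the Device message sent over ListAndWatch.
type Device struct {
	ID     string
	Health string
}

// markLost flips a device to Unhealthy, e.g. after an NVML "GPU is lost"
// error or a critical XID event; the plugin would then re-send the list.
func markLost(devs []Device, id string) []Device {
	for i := range devs {
		if devs[i].ID == id {
			devs[i].Health = Unhealthy
		}
	}
	return devs
}

// allocatable counts devices the kubelet would treat as schedulable.
func allocatable(devs []Device) int {
	n := 0
	for _, d := range devs {
		if d.Health == Healthy {
			n++
		}
	}
	return n
}

func main() {
	devs := []Device{{"GPU-0", Healthy}, {"GPU-1", Healthy}, {"GPU-2", Healthy}, {"GPU-3", Healthy}}
	devs = markLost(devs, "GPU-3")
	fmt.Println("allocatable GPUs:", allocatable(devs))
}
```

Under this model, once the lost GPU is reported Unhealthy the scheduler stops placing GPU pods beyond the remaining capacity; whether the plugin's NVML health loop actually catches a "GPU is lost" condition at runtime is the open question of this issue.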
