Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected

ArthurMelin commented 3 years ago

We maintain a k8s cluster with multiple nodes that each have 4 NVIDIA GPUs. Occasionally, one of the GPUs crashes. While that's unfortunate, the main issue is that a single GPU crashing causes the 3 other GPUs to become unallocatable. Every pod scheduled on the node fails to start with the following error:

Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected

Also, our application that uses those GPUs is managed by a Deployment. When a GPU crashes, the Deployment keeps creating new Pods without removing the previous Failed Pods, which accumulate (we saw up to 12k Pods) and slow down the entire cluster.

In the daemon set config, we already set --fail-on-init-error=false.

Common error checking:

The output of nvidia-smi -a on your host: Unable to determine the device handle for GPU 0000:C1:00.0: GPU is lost. Reboot the system to recover this GPU

Additional information that might help better understand your environment and reproduce the bug:

Docker version from docker version: 20.10.2
Kernel version from uname -a: Linux node-11 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*': 450.102.04-0ubuntu0.20.04.1
NVIDIA container library version from nvidia-container-cli -V: 1.3.1

seaurching commented 1 year ago

I have the same error.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

fighterhit commented 4 months ago

Hi @elezar, I have the same error. It occurs while the device plugin is running. The other GPUs are normal. It seems that any pod that is given this bad GPU hits the error (the device plugin or an application pod). However, the device plugin does not detect the error and reduce the number of available GPUs on the node, so pods keep being scheduled to this node and assigned the bad GPU. I want to confirm whether the device plugin can detect this error, exclude the GPU, and reduce the number of available GPUs.
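To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is not the plugin's actual code and it does not call NVML. `Device`, `probeFunc`, `healthyDevices`, and `ErrGPULost` are all hypothetical names invented for illustration, and the UUIDs are placeholders. The idea is simply that a failure to open one device gets recorded and skipped rather than aborting the whole enumeration:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrGPULost is a hypothetical sentinel standing in for the
// "GPU is lost" error reported by NVML. It is not a real NVML symbol.
var ErrGPULost = errors.New("nvml: GPU is lost")

// Device is a hypothetical, minimal description of one GPU.
type Device struct {
	Index int
	UUID  string
}

// probeFunc is a hypothetical hook that tries to open the device at index i.
// In a real plugin this is where the NVML call would go.
type probeFunc func(i int) (Device, error)

// healthyDevices probes every index. Devices that open successfully are
// returned as healthy; indices that fail are collected in lost instead of
// aborting the enumeration on the first error.
func healthyDevices(count int, probe probeFunc) (healthy []Device, lost []int) {
	for i := 0; i < count; i++ {
		d, err := probe(i)
		if err != nil {
			lost = append(lost, i)
			continue
		}
		healthy = append(healthy, d)
	}
	return healthy, lost
}

func main() {
	// Simulate a 4-GPU node where device 3 has fallen off the bus.
	probe := func(i int) (Device, error) {
		if i == 3 {
			return Device{}, fmt.Errorf("error creating nvml.Device %d: %w", i, ErrGPULost)
		}
		return Device{Index: i, UUID: fmt.Sprintf("GPU-%d", i)}, nil // placeholder UUID
	}

	healthy, lost := healthyDevices(4, probe)
	fmt.Printf("allocatable=%d lost=%v\n", len(healthy), lost)
}
```

With this shape, the node in the original report would advertise 3 allocatable GPUs instead of failing every allocation. Whether the plugin can do this safely (for example, while pods are already using the remaining GPUs) is exactly the open question.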

NVIDIA / k8s-device-plugin

Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected #231