Open · ArthurMelin opened this issue 3 years ago
I have the same error.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
Hi @elezar, I have the same error. It occurs while the device plugin is running; the other GPUs are fine. It seems that any pod assigned this bad GPU hits the error (whether it is the device plugin pod or an application pod). However, the device plugin does not detect the error and reduce the number of available GPUs on the node, so pods keep getting scheduled onto this node and assigned the bad GPU. I want to confirm whether the device plugin can detect this error, exclude that GPU, and reduce the number of available GPUs.
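For reference, the kubelet device plugin API does support this kind of exclusion: every device advertised over ListAndWatch carries a Health field, and resending the list with a device marked Unhealthy removes it from the node's allocatable nvidia.com/gpu count. Below is a minimal sketch of that mechanism, not the plugin's actual code; the `checkGPULost` helper and the 30-second polling interval are assumptions.

```go
package plugin

import (
	"time"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// devicePlugin is a hypothetical skeleton; the real plugin keeps more state.
type devicePlugin struct {
	deviceIDs []string
}

// checkGPULost is a hypothetical probe (e.g. backed by NVML) that reports
// whether the GPU with the given ID is in the "GPU is lost" state.
func checkGPULost(id string) bool {
	return false // placeholder
}

// ListAndWatch streams the device list to the kubelet and re-sends it with an
// updated Health field whenever a GPU goes bad, so the node's allocatable
// nvidia.com/gpu count shrinks instead of pods being assigned the lost GPU.
func (p *devicePlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	for {
		devs := make([]*pluginapi.Device, 0, len(p.deviceIDs))
		for _, id := range p.deviceIDs {
			health := pluginapi.Healthy
			if checkGPULost(id) {
				health = pluginapi.Unhealthy
			}
			devs = append(devs, &pluginapi.Device{ID: id, Health: health})
		}
		if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
			return err
		}
		time.Sleep(30 * time.Second) // hypothetical polling interval
	}
}
```

With something like this, a node with one lost GPU would advertise 3 allocatable GPUs instead of letting new pods land on the bad one.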
We maintain a k8s cluster with multiple nodes that each have 4 Nvidia GPUs. Occasionally, one of the GPUs crashes. While that's unfortunate, the main issue is that a single GPU crashing causes the 3 other GPUs to become unallocatable. All pods scheduled on the node fail to start because of the following error:
Also, our application that uses those GPUs is managed by a Deployment. When a GPU crashes, the Deployment attempts to recreate the Pod without removing the previous Failed Pod, so Failed Pods accumulate (we saw up to 12k), slowing down the entire cluster.
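Until that is handled better, the pile of Failed Pods can be cleared out of band. A minimal client-go sketch is below; it is only a workaround idea, and the in-cluster config and the hard-coded "default" namespace are assumptions.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster with RBAC that allows deleting pods.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Delete every pod stuck in the Failed phase in the (assumed) "default"
	// namespace, so they stop piling up on the node with the lost GPU.
	err = clientset.CoreV1().Pods("default").DeleteCollection(
		context.TODO(),
		metav1.DeleteOptions{},
		metav1.ListOptions{FieldSelector: "status.phase=Failed"},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```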
In the DaemonSet config, we already set `--fail-on-init-error=false`.
Common error checking:
- `nvidia-smi -a` on your host:
  `Unable to determine the device handle for GPU 0000:C1:00.0: GPU is lost. Reboot the system to recover this GPU`
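That "GPU is lost" state can also be probed programmatically. Here is a minimal sketch using the go-nvml bindings; treating an ERROR_GPU_IS_LOST return from DeviceGetHandleByIndex as the signal is an assumption based on the nvidia-smi message above, not something the device plugin currently does.

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to count devices: %v", nvml.ErrorString(ret))
	}

	for i := 0; i < count; i++ {
		// A lost GPU typically surfaces here as ERROR_GPU_IS_LOST,
		// matching the "GPU is lost" message from nvidia-smi.
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret == nvml.ERROR_GPU_IS_LOST {
			fmt.Printf("GPU %d is lost; it should be excluded from the allocatable devices\n", i)
			continue
		}
		if ret != nvml.SUCCESS {
			fmt.Printf("GPU %d: unexpected NVML error: %v\n", i, nvml.ErrorString(ret))
			continue
		}
		uuid, _ := dev.GetUUID()
		fmt.Printf("GPU %d (%s) is healthy\n", i, uuid)
	}
}
```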
Additional information that might help better understand your environment and reproduce the bug:
- `docker version`: 20.10.2
- `uname -a`: Linux node-11 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`: 450.102.04-0ubuntu0.20.04.1
- `nvidia-container-cli -V`: 1.3.1