jslouisyou opened this issue 1 month ago
I filed the same issue in https://github.com/NVIDIA/k8s-device-plugin as well:

> I agree that it seems like Xid 94 is essentially an application error and should not disable the device. But as a workaround you can tell it to ignore this by setting the device plugin's environment variable `DP_DISABLE_HEALTHCHECKS` to `94`.
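For anyone else hitting this, a minimal sketch of how that workaround could be applied when the device plugin is managed by gpu-operator. The ClusterPolicy name `cluster-policy` and the `devicePlugin.env` field are assumptions from my cluster, so please verify them against your CRD version:

```sh
# Sketch: pass DP_DISABLE_HEALTHCHECKS=94 to the device plugin via the
# gpu-operator ClusterPolicy (name and field path assumed; verify on your cluster).
kubectl patch clusterpolicy cluster-policy --type merge -p \
  '{"spec":{"devicePlugin":{"env":[{"name":"DP_DISABLE_HEALTHCHECKS","value":"94"}]}}}'
```

Note that this only tells the health check to ignore XID 94; it is a workaround, not a fix for the reporting problem described below.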
Hello, NVIDIA team.

I recently faced an issue where the number of GPU resources (`nvidia.com/gpu`) reported by `kubelet` is not recovered (e.g. 7 -> 8) even after the XID error that caused it is resolved. The `nvidia-device-plugin-daemonset` comes from `gpu-operator`, and I'm using gpu-operator v23.9.2.

Here are more details.

I found that only 7 GPU cards were shown from Kubernetes, even though the H100 node has 8 GPU cards.
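For reference, this is roughly how I compare what `kubelet` reports with what the node actually has (the node name is a placeholder):

```sh
# Capacity vs. allocatable nvidia.com/gpu as reported by kubelet for the node.
kubectl get node <h100-node-name> -o \
  jsonpath='capacity={.status.capacity.nvidia\.com/gpu} allocatable={.status.allocatable.nvidia\.com/gpu}{"\n"}'

# On the node itself, the driver still enumerates all 8 GPUs.
nvidia-smi --query-gpu=index,name --format=csv,noheader
```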
`nvidia-device-plugin-daemonset` reports that an XID 94 error has occurred on one of the GPU cards. After some time, though, the XID error appears to be resolved (I think the offending application was restarted or removed), and I can no longer find any XID error via `nvidia-smi`.
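These are the kinds of checks I ran on the node to confirm the error had cleared (commands sketched from memory):

```sh
# XID events are logged by the driver to the kernel log.
sudo dmesg -T | grep -i xid

# XID 94 is a contained ECC error, so I also checked the per-GPU ECC status.
nvidia-smi -q -d ECC
```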
But even though the XID error is resolved, `nvidia-device-plugin-daemonset` does not fetch the new status of the GPU cards and report it to `kubelet`, so `kubelet` still thinks that only some of the GPU cards can be used.

After I restarted the `nvidia-device-plugin-daemonset` pod, it reported to `kubelet` that all 8 GPU cards can be used (the number of `nvidia.com/gpu` in `Allocatable` changed back to 8).
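The restart was nothing special, just deleting the pod so the DaemonSet recreates it; the namespace and label below are from my cluster and may differ on yours:

```sh
# Recreating the device plugin pod makes it re-enumerate the GPUs and
# re-register all 8 of them with kubelet.
kubectl -n gpu-operator delete pod -l app=nvidia-device-plugin-daemonset
```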
I think `nvidia-device-plugin-daemonset` should fetch the GPU status correctly on its own and report it to `kubelet`. Could you please take a look at this issue?

Thanks.