NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html

GPU resources are not recovered even after XID error is resolved #1065

Open jslouisyou opened 1 month ago

jslouisyou commented 1 month ago

Hello, NVIDIA team.

I recently ran into an issue where the GPU resources (nvidia.com/gpu) reported by the kubelet are not recovered (e.g. 7 -> 8) even after the XID error is resolved.

The nvidia-device-plugin-daemonset is deployed by gpu-operator, and I'm using gpu-operator v23.9.2.

Here's more details:

I found that only 7 GPU cards were shown in Kubernetes, even though there are 8 GPU cards in this H100 node:

Capacity:
  cpu:                        128
  ephemeral-storage:          7441183616Ki
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2113276288Ki
  nvidia.com/gpu:             8
  pods:                       110
Allocatable:
  cpu:                        128
  ephemeral-storage:          6857794809152
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2062682496Ki
  nvidia.com/gpu:             7       <=========== here
  pods:                       110
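
For reference, a quick way to check just the allocatable GPU count, without the full node description (<node-name> is a placeholder for the H100 node's name):

$ kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'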

nvidia-device-plugin-daemonset reports that an XID 94 error occurred on one of the GPU cards:

I1025 02:19:08.002792       1 health.go:151] Skipping non-nvmlEventTypeXidCriticalError event: {Device:{Handle:0x7f0dcf40bdf8} EventType:2 EventData:0 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048144       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048185       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.048239       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.049436       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.049451       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.049483       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.059938       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.059948       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.059980       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.074343       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.074366       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.074389       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
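
(For anyone trying to reproduce this: the lines above come from the device plugin pod running on the affected node. Something like the following pulls them; the gpu-operator namespace and the pod name placeholder below match my installation and may differ in yours.)

$ kubectl logs -n gpu-operator <device-plugin-pod-on-that-node> | grep -i xid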

But after some time elapsed, the XID error appears to have been resolved (I think the application was restarted or removed). I can't find the XID error from nvidia-smi:

$ nvidia-smi
Fri Oct 25 11:35:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1A:00.0 Off |                    2 |
| N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:40:00.0 Off |                    0 |
| N/A   31C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:53:00.0 Off |                    0 |
| N/A   31C    P0              74W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:66:00.0 Off |                    0 |
| N/A   33C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9C:00.0 Off |                    0 |
| N/A   35C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   32C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D2:00.0 Off |                    0 |
| N/A   34C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:E4:00.0 Off |                    0 |
| N/A   31C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
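
XID events can also be cross-checked against the kernel log on the node, e.g.:

$ dmesg -T | grep -i xid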

But even though the XID error is resolved, nvidia-device-plugin-daemonset does not re-fetch the status of the GPU cards and report it to the kubelet, so the kubelet still thinks that only some of the GPU cards can be used.

After I restarted the nvidia-device-plugin-daemonset pod, it reported to the kubelet that 8 GPU cards can be used (the number of nvidia.com/gpu in Allocatable changed):

Capacity:
  cpu:                        128
  ephemeral-storage:          7441183616Ki
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2113276288Ki
  nvidia.com/gpu:             8
  pods:                       110
Allocatable:
  cpu:                        128
  ephemeral-storage:          6857794809152
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2062682496Ki
  nvidia.com/gpu:             8       <=========== here is changed
  pods:                       110
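
(The "restart" was simply deleting the plugin pod so the DaemonSet recreates it. The namespace and label below match a default gpu-operator deployment and may need adjusting for other setups.)

$ kubectl delete pod -n gpu-operator -l app=nvidia-device-plugin-daemonset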

I think nvidia-device-plugin-daemonset should re-fetch the GPU status correctly and report it to the kubelet. Could you please take a look at this issue?

Thanks.

jslouisyou commented 1 month ago

I filed the same issue in https://github.com/NVIDIA/k8s-device-plugin as well:

https://github.com/NVIDIA/k8s-device-plugin/issues/1014

nwilliams-bdai commented 2 weeks ago

I agree that Xid 94 is essentially an application error and should not disable the device. But as a workaround, you can tell the device plugin to ignore it by setting its DP_DISABLE_HEALTHCHECKS environment variable to 94.
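
With gpu-operator, one way to plumb that through is the ClusterPolicy's devicePlugin.env passthrough (a sketch; "cluster-policy" is the default ClusterPolicy name, so adjust if yours differs):

$ kubectl patch clusterpolicy cluster-policy --type merge \
    -p '{"spec": {"devicePlugin": {"env": [{"name": "DP_DISABLE_HEALTHCHECKS", "value": "94"}]}}}'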