NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

How to trigger a GPU failure so that the GPU count in the node's allocatable field dynamically decreases #502

Open yizhouv5 opened 8 months ago

yizhouv5 commented 8 months ago

1. Quick Debug Information

2. Issue or feature description

Normally, the node has two H800 GPU cards:

nvidia-smi

Tue Feb  6 00:10:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H800 PCIe               Off | 00000000:0E:00.0 Off |                    0 |
| N/A   42C    P0             83W / 350W  | 13653MiB / 81559MiB  |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H800 PCIe               Off | 00000000:16:00.0 Off |                    0 |
| N/A   44C    P0             88W / 350W  |   729MiB / 81559MiB  |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     850967      C   python3                                  13640MiB |
|    1   N/A  N/A     838449      C   python3                                    716MiB |
+---------------------------------------------------------------------------------------+

Node status allocatable field value:

Capacity:
  cpu:                240
  ephemeral-storage:  3747419676Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056406464Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                239800m
  ephemeral-storage:  3453621967684
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1054833600Ki
  nvidia.com/gpu:     2
  pods:               110

I manually disabled one GPU using nvidia-smi drain. To make the reduced GPU count take effect in the node's allocatable field, I have to restart the nvidia-device-plugin or the kubelet. For example:

nvidia-smi --id 0000:16:00.0 --persistence-mode 0
nvidia-smi drain --pciid 0000:16:00.0 --modify 1

kubectl describe node nodename

Allocatable:
  cpu:                239800m
  ephemeral-storage:  3453621967684
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1054833600Ki
  nvidia.com/gpu:     1
  pods:               110
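For reference, restarting the plugin can be done by restarting its DaemonSet rather than the kubelet. The namespace and DaemonSet name below are assumptions based on a default GPU Operator install and may differ in other deployments:

# Restart the device plugin so it re-enumerates GPUs and re-registers with the kubelet
# (DaemonSet name/namespace assume a default GPU Operator deployment).
kubectl -n gpu-operator rollout restart daemonset nvidia-device-plugin-daemonset
kubectl -n gpu-operator rollout status daemonset nvidia-device-plugin-daemonset

# Verify the updated count afterwards.
kubectl describe node nodename | grep nvidia.com/gpu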

My first question is: which GPU faults does the nvidia-device-plugin detect and expose to the kubelet, so that the node's allocatable nvidia.com/gpu value is updated dynamically? My second question is whether there is a way to simulate a GPU failure (hardware, driver, etc.).

Please provide some technical suggestions. Thanks!

elezar commented 8 months ago

@yizhouv5 the device plugin reacts to a subset of NVML events that are associated with critical Xid errors: https://github.com/NVIDIA/k8s-device-plugin/blob/7e6e3765be7414717b8a8e3972cd936cccc9384a/internal/rm/health.go#L94

We also have a filter that skips a list of errors that are considered application errors rather than device errors: https://github.com/NVIDIA/k8s-device-plugin/blob/7e6e3765be7414717b8a8e3972cd936cccc9384a/internal/rm/health.go#L65-L71

There are, however, errors that do not trigger these NVML events and that we are known not to handle; Xid 119, for example. There is work on our backlog to improve the plugin's error handling, but we don't have a clear timeline for when this will be addressed.
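For reference, the Xid errors that the driver has actually raised on a node (whether or not the plugin handles them) are also logged by the kernel, so they can be inspected directly on the host; this is not the plugin's mechanism, just a quick way to see what Xids have occurred:

# Xid errors are reported by the NVIDIA kernel driver as "NVRM: Xid" messages (may require root).
dmesg -T | grep -i xid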

With regards to "simulating" failures or marking GPUs as explicitly unhealty, we don't have a mechanism to do this at present. It seems as if what you actually want to do is explicitly drain a device -- marking it as unhealthy so that no jobs can use this resource. As a matter of interest, does the nvidia-smi drain command trigger an event?

yizhouv5 commented 8 months ago

@elezar Hi, if I use nvidia-smi drain to manually disable one GPU, no events appear in the nvidia-device-plugin log. The node's nvidia.com/gpu.count label is updated by gpu-operator-node-feature-discovery, and the contents of /etc/kubernetes/node-feature-discovery/features.d/gfd are refreshed every minute to reflect the number of GPUs available on the node (see the output below). Could manually disabling a GPU also be treated by the nvidia-device-plugin as a fault scenario, so that the allocatable nvidia.com/gpu value is updated dynamically without restarting the device plugin or the kubelet?

1. kubectl describe node test-175-master-5woadnkp | grep 'nvidia.com/gpu.count'
   nvidia.com/gpu.count=1

2. kubectl logs -f -n gpu-operator gpu-operator-node-feature-discovery-worker-6wz9b
   I0206 02:20:46.955268 1 nfd-worker.go:561] starting feature discovery...
   I0206 02:20:46.955709 1 nfd-worker.go:573] feature discovery completed
   I0206 02:20:47.076735 1 nfd-worker.go:726] updating NodeFeature object "test-175-master-5woadnkp"

3. cat /etc/kubernetes/node-feature-discovery/features.d/gfd
   nvidia.com/cuda.runtime.major=12
   nvidia.com/gpu.replicas=1
   nvidia.com/gpu.product=NVIDIA-H800-PCIe
   nvidia.com/gpu.memory=81559
   nvidia.com/cuda.driver.rev=03
   nvidia.com/gpu.count=1
   nvidia.com/gpu.compute.minor=0
   nvidia.com/cuda.driver.major=535
   nvidia.com/gpu.compute.major=9
   nvidia.com/cuda.driver.minor=54
   nvidia.com/cuda.runtime.minor=2
   nvidia.com/gfd.timestamp=1704275476
   nvidia.com/gpu.family=hopper
   nvidia.com/mig.capable=true
   nvidia.com/mig.strategy=mixed
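To make the mismatch concrete: the GFD label changes, while the extended resource in the node status keeps its old value until the device plugin re-registers with the kubelet. For example (same node name as above; the jsonpath bracket form is used because the resource name contains dots and a slash):

# GFD label (updated by the periodic GFD loop):
kubectl get node test-175-master-5woadnkp -L nvidia.com/gpu.count

# Extended resource advertised by the device plugin (only changes after the plugin re-registers):
kubectl get node test-175-master-5woadnkp -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"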

elezar commented 8 months ago

If I use nvidia-smi drain to manually disable one GPU, no events appear in the nvidia-device-plugin log. The node's nvidia.com/gpu.count label is updated by gpu-operator-node-feature-discovery, and the contents of /etc/kubernetes/node-feature-discovery/features.d/gfd are refreshed every minute to reflect the number of GPUs available on the node.

It is unlikely that an associated event -- if one were triggered -- would show up in the device plugin logs: we only register for a subset of NVML events, and anything outside that set is most likely ignored. My question was whether there are other NVML events (possibly visible under nvidia-smi) that we could be watching for too.

With regards to the GFD behaviour, this works the way it does because GFD has a loop that iterates over the devices and updates the labels accordingly.
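A simple way to observe that loop on the node is to watch the labels file from your previous comment being rewritten (the roughly one-minute refresh you mentioned):

# Highlight changes each time GFD rewrites its output file (path taken from the comment above).
watch -n 60 -d cat /etc/kubernetes/node-feature-discovery/features.d/gfd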

Could manually disabling a GPU also be treated by the nvidia-device-plugin as a fault scenario, so that the allocatable nvidia.com/gpu value is updated dynamically without restarting the device plugin or the kubelet?

This does sound like a valid feature request. We will look at what is required for this and provide a more concrete answer.

cc @klueska

yizhouv5 commented 8 months ago

@elezar @klueska Hi, if I unbind the GPU card from the driver, can this be considered a fault scenario? That way, the device plugin could detect it and dynamically update the node's Capacity and Allocatable nvidia.com/gpu values. If I want to use the card again, I can bind it back manually. Does the GPU support this? For example:

echo "0000:0e:00.0" > /sys/bus/pci/drivers/nvidia/unbind
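For completeness, a sketch of checking the binding and binding the card back, assuming the same PCI address as above (writing to these sysfs files requires root):

# Show which driver (if any) the device is currently bound to.
ls -l /sys/bus/pci/devices/0000:0e:00.0/driver

# Re-bind the device to the nvidia driver.
echo "0000:0e:00.0" > /sys/bus/pci/drivers/nvidia/bind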