NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

GPU health status exposure and remediation methods #519

Open aidan-canva opened 8 months ago

aidan-canva commented 8 months ago

This is more of a general question about nvidia/k8s-device-plugin: its current 'responsibilities', and its plans and scope for the future, specifically around GPU health detection and remediation. I've been down a bit of a rabbit hole and figured that summarizing my findings and asking here would be the best way to get clarity.

GPUs on nodes can start as unhealthy (upon boot) or become unhealthy at a later stage. Nodes can have more than one GPU. Recent versions of k8s-device-plugin have started to implement some health checking, mostly by checking for EventTypeXidCriticalError events and marking the device as unhealthy, as the k8s device plugin spec allows.

From what I can find:

A device being marked as unhealthy does not have any impact on a Pod (or Pods) that already has the device attached. It only affects the ability to schedule NEW Pods onto the device. Some external process monitoring the device status (or a Pod health check, or a general Pod crash) would have to handle this situation. Is that correct?
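To make the question concrete, here is a toy Python model of that understanding. It is only a sketch of the behaviour as described above, not the kubelet's or the plugin's actual data structures: the `Node` class and its method names are hypothetical, and only the `"Healthy"` / `"Unhealthy"` strings come from the device plugin API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Health strings used by the Kubernetes device plugin API.
HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"


@dataclass
class Node:
    """Hypothetical, simplified view of one node's GPU bookkeeping."""

    health: Dict[str, str]                                      # device ID -> health
    assignments: Dict[str, str] = field(default_factory=dict)   # device ID -> Pod name

    def mark_unhealthy(self, device_id: str) -> None:
        # Only the health flag changes; existing assignments are left alone.
        self.health[device_id] = UNHEALTHY

    def schedulable(self) -> List[str]:
        # New Pods can only land on devices that are healthy and unassigned.
        return sorted(
            dev for dev, state in self.health.items()
            if state == HEALTHY and dev not in self.assignments
        )

    def pods_on_unhealthy_devices(self) -> Dict[str, str]:
        # The gap: something external has to notice and act on these.
        return {
            dev: pod for dev, pod in self.assignments.items()
            if self.health.get(dev) == UNHEALTHY
        }
```

In this model, marking an in-use device unhealthy shrinks `schedulable()` but leaves the Pod in `assignments`, which is exactly the situation an external process would have to detect.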

This device plugin (from what I can see) doesn't currently expose the device health status externally (i.e. via a /metrics endpoint), so it is difficult to have an external process monitor a device's health status and act on it. For the same reason, it is also difficult to tie a device to the Pod(s) that are using it.

dcgm-exporter recently started exposing EventTypeXidCriticalError counters. This opens up the opportunity to create a process that operates on nodes with 'unhealthy' devices (i.e. cordon and drain), but it doesn't address the situation where a node has multiple GPUs and you only want to evict the impacted Pod(s).
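The difference between draining a whole node and evicting only the impacted Pod(s) can be sketched as a small planning function. This is a minimal Python sketch under stated assumptions: `plan_remediation`, `Plan`, and the shape of `gpu_to_pods` are hypothetical, and how the GPU-to-Pod mapping is obtained is deliberately left open, since that is part of the gap described above.

```python
from typing import Dict, Iterable, List, NamedTuple


class Plan(NamedTuple):
    evict: List[str]   # Pods to evict, identified here as "namespace/name"
    cordon: bool       # whether the whole node should be cordoned


def plan_remediation(gpu_to_pods: Dict[str, List[str]],
                     unhealthy: Iterable[str]) -> Plan:
    """Pick the smallest set of Pods to evict for the given unhealthy GPUs.

    gpu_to_pods lists every GPU on the node (hypothetical input), with an
    empty list for GPUs that are currently idle.
    """
    all_gpus = set(gpu_to_pods)
    # Ignore device IDs that do not belong to this node.
    bad = set(unhealthy) & all_gpus
    # Only Pods attached to an unhealthy GPU are affected.
    pods = sorted({pod for gpu in bad for pod in gpu_to_pods[gpu]})
    # Cordon only when no healthy GPU is left on the node.
    cordon = bool(all_gpus) and bad == all_gpus
    return Plan(evict=pods, cordon=cordon)
```

With one bad GPU out of several, the plan evicts just the Pods on that GPU and leaves the node schedulable; cordoning is reserved for the case where every GPU on the node is unhealthy. That threshold is an arbitrary choice for illustration.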

The other gap (from my understanding) is in extended health checking. There are other (perhaps more application-specific) circumstances in which a GPU should be marked as unhealthy (e.g. DCGM_FI_DEV_RETIRED_PENDING > 0, indicating that a restart is required). Is it within the future scope of k8s-device-plugin to support these kinds of bespoke health checks? Or should the device plugin expose the ability to mark a device as unhealthy from an external process?
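As an illustration of what such a bespoke check might look like from the outside, here is a minimal Python sketch that scans Prometheus text output for GPUs where a metric exceeds a threshold. It assumes dcgm-exporter is configured to export DCGM_FI_DEV_RETIRED_PENDING and that each sample carries a `UUID` label; the function name and the sample input are hypothetical, and the parser is deliberately simplified (it does not handle escaped quotes or `}` inside label values).

```python
import re
from typing import List

# Matches lines of the form: metric_name{label="value",...} <number>
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpus_over_threshold(metrics_text: str,
                        metric: str,
                        threshold: float = 0.0,
                        id_label: str = "UUID") -> List[str]:
    """Return the sorted id_label values of samples whose value exceeds threshold."""
    flagged = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # unparseable sample value
        if value > threshold and id_label in labels:
            flagged.append(labels[id_label])
    return sorted(flagged)
```

A caller would fetch the exporter's /metrics text and pass it in with `metric="DCGM_FI_DEV_RETIRED_PENDING"`. What to do with the flagged GPUs is the open question, since the device plugin currently offers no way to feed that result back.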

Apologies if I have made some incorrect assumptions or if all of this is already explained somewhere (I've trawled through historical issues across various repos) - happy to be pointed at other docs/repos/solutions that address some of these questions.

aidan-canva commented 7 months ago

@elezar @klueska any thoughts?