[Feature] GPU Node Health and Remediation

We currently have a problem with the Kubelet crashing on GPU nodes at startup (TrackingID#2402010050000797). If the node is starting because of unschedulable pods, another GPU node starts up once the first Kubelet crashes, which is annoying and drives up cost. To get around that, we use labels to force pods onto the GPU nodes without using the GPU resource block. That way another node isn't started, and the start of the pod is "just" delayed by half a minute.

Then, in our software, we wait up to 10 minutes for the GPU driver to come up, and sometimes it never does.
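For illustration, the wait described above could look roughly like the sketch below. It is not our actual code: the function names, the 10-second polling interval, and the use of `nvidia-smi` as the readiness probe are assumptions made for the example (it also assumes an NVIDIA driver, which may not match every setup).

```python
import subprocess
import time
from typing import Callable


def nvidia_smi_ok() -> bool:
    """Hypothetical probe: True if `nvidia-smi` runs and exits with status 0."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Binary missing or hung: treat the driver as not ready yet.
        return False
    return result.returncode == 0


def wait_for_gpu_driver(
    probe: Callable[[], bool] = nvidia_smi_ok,
    timeout_s: float = 600.0,   # "up to 10 minutes"
    interval_s: float = 10.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A job would call `wait_for_gpu_driver()` before starting GPU work and fail itself when it returns `False`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.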

Our application is a proprietary data pipeline built around KEDA ScaledJobs. When the GPU driver does not come up, we have no option but to fail the job, wait for the GPU node to scale down, and then reschedule the job. This is a real hassle, as we haven't automated that bit yet. We haven't filed an issue about this yet because we hope it is a secondary effect of the first problem.

What I would like is for the node to be marked as unhealthy in some way when the driver does not come up, so that another node can be scaled up and new pods for my jobs can be scheduled on it.
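To make the request concrete, here is a rough sketch of what a user-side stopgap might look like: tainting the node with `kubectl taint` so that no further pods are scheduled onto it. The taint key `example.com/gpu-driver-failed` and the function names are hypothetical, and the sketch assumes the caller has `kubectl` available and permission to modify nodes. Whether this leads the autoscaler to add a replacement node depends on the cluster setup; the point of this issue is that the platform should handle this itself.

```python
import subprocess
from typing import Callable, List


def taint_command(
    node_name: str,
    key: str = "example.com/gpu-driver-failed",  # hypothetical taint key
    value: str = "true",
    effect: str = "NoSchedule",
) -> List[str]:
    """Build the `kubectl taint` invocation that marks a node as unusable."""
    return [
        "kubectl", "taint", "nodes", node_name,
        f"{key}={value}:{effect}",
        "--overwrite",  # replace the taint if it is already present
    ]


def mark_node_unhealthy(node_name: str, run: Callable = subprocess.run):
    """Apply the taint; raises CalledProcessError if kubectl fails."""
    return run(taint_command(node_name), check=True)
```

For example, `taint_command("gpu-node-1")` yields the argument list for `kubectl taint nodes gpu-node-1 example.com/gpu-driver-failed=true:NoSchedule --overwrite`. The `run` parameter is injectable only so the function can be tried without a cluster.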

Azure / AKS

[Feature] GPU Node Health and Remediation #4256