futurwasfree opened this issue 3 months ago
GPU pods get evicted and re-scheduled. After re-scheduling they end up in a `CrashLoopBackOff` state.
Which pods end up in `CrashLoopBackOff`? Application pods, or pods managed by the GPU Operator, like `device-plugin`, `gpu-feature-discovery`, etc.?
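For example, something along these lines would show that (the operator namespace name is an assumption):

```shell
# Pods managed by the GPU Operator, e.g. device-plugin, gpu-feature-discovery
# (namespace name is an assumption; adjust to your install).
kubectl get pods -n gpu-operator

# Any pod in the cluster currently stuck in CrashLoopBackOff.
kubectl get pods -A | grep -i crashloopbackoff
```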
@cdesiniotis It's application pods that exit with a non-zero code when they fail to communicate with CUDA (no driver mounted into the container).
It sounds a bit like a race condition, in the sense that everything related to the upgrade may or may not have finished by the time the pod starts, so the nvidia driver may or may not be available. But most of the time it's an application pod without any driver mounted (after the eviction and upgrade procedure).
A manual kill signal sent to the pod fixes it (every time).
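Roughly, the manual fix is just deleting the crash-looping pod so its controller recreates it (pod and namespace names below are placeholders):

```shell
# Delete the crash-looping GPU application pod; its Deployment/ReplicaSet recreates it,
# and the fresh pod starts with the now-available NVIDIA driver mounts.
# POD_NAME and APP_NAMESPACE are placeholders.
kubectl delete pod POD_NAME -n APP_NAMESPACE
```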
1. Quick Debug Information
2. Issue or feature description
GPU pods end up in an endless `CrashLoopBackOff` state due to a missing driver (and `nvidia-smi`), and manual pod termination (kill) is required. After pod termination everything runs just fine.

3. Steps to reproduce the issue
Configuration

In this setup we pass `--set driver.upgradePolicy.autoUpgrade=false` and let `k8s-driver-manager` handle the update. `NVIDIADriver` has the following configuration associated with it:
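Roughly, the flag is passed at chart install/upgrade time like this (release name, chart reference, and namespace here are assumptions, not the exact values from this setup):

```shell
# Disable the operator's automatic driver upgrade flow so that k8s-driver-manager
# handles pod eviction and driver updates instead.
# Release name, chart, and namespace are assumptions.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.upgradePolicy.autoUpgrade=false
```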
Repro

Kill the `nvidia-driver-daemonset` pod and trigger the driver reinstall on a node with GPU-enabled pods already running. GPU pods get evicted and re-scheduled. After re-scheduling they end up in `CrashLoopBackOff` state.
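A rough sketch of that repro step (the label selector and namespace are assumptions and may differ per deployment):

```shell
# Delete the driver daemonset pod on a node that already runs GPU application pods,
# which makes k8s-driver-manager evict GPU pods and reinstall the driver.
# Label selector and namespace are assumptions; adjust to match your deployment.
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset

# Watch the evicted GPU application pods get re-scheduled (namespace is a placeholder).
kubectl get pods -n APP_NAMESPACE -w
```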
4. Information to attach (optional if deemed irrelevant)

kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com