Querying the 8443/status endpoint reports all 8 GPUs as available, even though at least one of them has been marked unhealthy.
In rare cases you can recover from this error by bouncing the nvdp-nvidia-device-plugin pod on the node where the GPU was marked unhealthy (a quick way to check and recover is sketched after the events below).
But the point is that inventory-operator should ideally detect this, as otherwise GPU deployments will be stuck in "Pending" until all 8 GPUs become available again:
```
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m40s  default-scheduler  0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m38s  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 7 Insufficient nvidia.com/gpu. preemption: 0/8 nodes are available: 1 Preemption is not helpful for scheduling, 7 No preemption victims found for incoming pod..
```
Logs: https://gist.github.com/andy108369/cac9f968f1c6a3eb7c6e92135b8afd42
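For reference, a minimal way to spot and work around this from the CLI, assuming the device plugin runs in the nvdp namespace as suggested by the pod name above (node and pod names below are placeholders): when the device plugin marks a GPU unhealthy, kubelet drops it from the node's allocatable nvidia.com/gpu count while capacity stays unchanged, so comparing the two should expose the affected node.

```sh
# Compare GPU capacity vs. allocatable per node; a node whose allocatable
# count is lower than its capacity has GPU(s) marked unhealthy.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'

# Find and bounce the device-plugin pod on the affected node
# (namespace "nvdp" assumed; adjust to your install):
kubectl -n nvdp get pods -o wide --field-selector spec.nodeName=<node-name>
kubectl -n nvdp delete pod <nvdp-nvidia-device-plugin-pod-name>
```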