akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

inventory-operator: doesn't detect when `nvdp-nvidia-device-plugin` marks GPU as unhealthy #249

Open andy108369 opened 2 months ago

andy108369 commented 2 months ago

Logs: https://gist.github.com/andy108369/cac9f968f1c6a3eb7c6e92135b8afd42

Querying the 8443/status endpoint reports all 8 GPUs as available, even though at least one of them was marked as unhealthy.

In rare cases you can recover from this error by bouncing the nvdp-nvidia-device-plugin pod on the node where the GPU was marked unhealthy. The point, though, is that inventory-operator should ideally detect this condition; otherwise GPU deployments stay stuck in "Pending" until all 8 GPUs become available again:

```
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m40s  default-scheduler  0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m38s  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 7 Insufficient nvidia.com/gpu. preemption: 0/8 nodes are available: 1 Preemption is not helpful for scheduling, 7 No preemption victims found for incoming pod..
```
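For illustration only, here is a minimal sketch of one way such a condition could be spotted from the outside. It is not how inventory-operator works today. It assumes the kubelet keeps an unhealthy device in the node's `capacity` but drops it from `allocatable`, so a node whose `nvidia.com/gpu` allocatable count is below its capacity has at least one GPU that the scheduler will not hand out. The function name `find_degraded_gpu_nodes` is hypothetical; the input is the parsed JSON of a node list, such as the output of `kubectl get nodes -o json`.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def find_degraded_gpu_nodes(node_list):
    """Return {node_name: (capacity, allocatable)} for every node that
    advertises fewer allocatable GPUs than its GPU capacity.

    `node_list` is a dict shaped like the JSON of a Kubernetes NodeList.
    Assumption: unhealthy GPUs remain in capacity but not in allocatable.
    """
    degraded = {}
    for node in node_list.get("items", []):
        status = node.get("status", {})
        # Resource quantities are serialized as strings, e.g. "8".
        capacity = int(status.get("capacity", {}).get(GPU_RESOURCE, 0))
        allocatable = int(status.get("allocatable", {}).get(GPU_RESOURCE, 0))
        if allocatable < capacity:
            degraded[node["metadata"]["name"]] = (capacity, allocatable)
    return degraded


# Made-up sample data: node-a has lost one GPU, node-b is healthy.
sample = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {
                "capacity": {GPU_RESOURCE: "8"},
                "allocatable": {GPU_RESOURCE: "7"},
            },
        },
        {
            "metadata": {"name": "node-b"},
            "status": {
                "capacity": {GPU_RESOURCE: "8"},
                "allocatable": {GPU_RESOURCE: "8"},
            },
        },
    ]
}

print(find_degraded_gpu_nodes(sample))
```

A check like this only flags the mismatch. It does not say which GPU is unhealthy or why, so the device plugin logs are still needed for that.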
andy108369 commented 2 months ago

related: https://github.com/akash-network/support/issues/244 https://github.com/akash-network/support/issues/240 https://github.com/akash-network/support/issues/207