IBM / autopilot

A tool to detect infrastructure issues on cloud native AI systems
Apache License 2.0

New node label for reserving nodes #41

Closed: cmisale closed this issue 1 month ago

cmisale commented 2 months ago

To ease integration with queuing systems like Kueue, we want autopilot to add a temporary label to a node while it runs an invasive health check on that node.

The suggested label is autopilot.ibm.com/gpuhealth=TESTING

This way, any queue-managed workload that doesn't have a toleration for that label cannot occupy the node.
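As a rough illustration of the proposal, the sketch below builds the JSON merge patch that would set or clear the suggested label on a node, and simulates it on a plain dict of labels. Only the label key and value come from this issue. The helper names (`label_patch`, `apply_label_patch`, `is_reserved`) and the node name are hypothetical, and the node-affinity term is one possible way for a workload to steer away from labeled nodes; the issue itself does not say which mechanism autopilot or Kueue would use.

```python
import json

# Label key and value as suggested in this issue.
LABEL_KEY = "autopilot.ibm.com/gpuhealth"
LABEL_TESTING = "TESTING"


def label_patch(value):
    """Build a JSON merge patch body for a Node object.

    A string value sets the label; None (JSON null) removes it.
    """
    return {"metadata": {"labels": {LABEL_KEY: value}}}


def apply_label_patch(labels, patch):
    """Apply the patch to a local dict of labels (simulation, no cluster call)."""
    updated = dict(labels)
    for key, value in patch["metadata"]["labels"].items():
        if value is None:
            updated.pop(key, None)  # null in a merge patch deletes the key
        else:
            updated[key] = value
    return updated


def is_reserved(labels):
    """True while the node carries the temporary TESTING label."""
    return labels.get(LABEL_KEY) == LABEL_TESTING


# Assumption: a node-affinity match expression a workload could carry so the
# scheduler skips nodes that are currently under test.
AVOID_TESTING_NODES = {
    "matchExpressions": [
        {"key": LABEL_KEY, "operator": "NotIn", "values": [LABEL_TESTING]}
    ]
}

if __name__ == "__main__":
    node_labels = {"kubernetes.io/hostname": "gpu-node-1"}  # hypothetical node

    # Before the invasive check: reserve the node.
    node_labels = apply_label_patch(node_labels, label_patch(LABEL_TESTING))
    print(json.dumps(label_patch(LABEL_TESTING)), is_reserved(node_labels))

    # After the check: release the node again.
    node_labels = apply_label_patch(node_labels, label_patch(None))
    print(json.dumps(label_patch(None)), is_reserved(node_labels))
```

The same label can be set by hand with `kubectl label node <node> autopilot.ibm.com/gpuhealth=TESTING` and removed with `kubectl label node <node> autopilot.ibm.com/gpuhealth-`.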

jimcadden commented 2 months ago

Should be a quick