Open nerscchris opened 3 weeks ago
I'd like to add several clarifications/distinctions.
Understanding the health of a compute node is valuable to multiple systems. They don't all need the same information or operate at the same frequency. One common example of using health information is to determine if a node is ready/able to start new work. Another is to detect and remediate hardware failures. The scheduler is unlikely to get involved with hardware issues and really only needs to know that a node is unavailable. The system administrators responsible for the reliability of the system will need far more detail. Which piece of hardware has failed? What remediation actions have already been attempted?
A single system that addresses both of these use cases, and many others, may not be well optimized for the most common ones. We should consider the scope of health checks as we pursue this discussion.
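To make the scoping question concrete, here is a minimal sketch of one health record exposing two views: a coarse one for the scheduler and a detailed one for administrators. This is only an illustration of the distinction described above, not a proposed OpenCHAMI API. Every name in it (`NodeHealth`, `ComponentFault`, `scheduler_view`, `admin_view`, the node and component identifiers) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Availability(Enum):
    """Coarse state, which is all a scheduler is assumed to need."""
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"


@dataclass
class ComponentFault:
    """Detail aimed at administrators (all fields are hypothetical)."""
    component: str                      # which piece of hardware failed
    description: str                    # what was observed
    remediation_attempts: list[str] = field(default_factory=list)


@dataclass
class NodeHealth:
    node: str
    faults: list[ComponentFault] = field(default_factory=list)

    def scheduler_view(self) -> Availability:
        # The scheduler only learns whether the node can start new work.
        if self.faults:
            return Availability.UNAVAILABLE
        return Availability.AVAILABLE

    def admin_view(self) -> list[dict]:
        # Administrators see which component failed and what was already tried.
        return [
            {
                "node": self.node,
                "component": f.component,
                "description": f.description,
                "remediation_attempts": list(f.remediation_attempts),
            }
            for f in self.faults
        ]


if __name__ == "__main__":
    record = NodeHealth(
        node="nid0002",  # hypothetical node name
        faults=[ComponentFault("dimm3", "uncorrectable ECC errors", ["reseat"])],
    )
    print(record.scheduler_view().value)
    print(record.admin_view())
```

Whether both views should come from one service, or from separate systems running at different frequencies, is exactly the scope question raised here. The sketch only shows that the two consumers need very different amounts of detail.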
Strategy for health checks
What needs to change:
In the OpenCHAMI meeting today there was a fair bit of discussion around health checks and how best to integrate them. The main points brought up (to my recollection; please add any I missed) were:
What do you propose?
So, to summarise this as an ask, I would suggest:
What alternatives/examples exist?
There are things like:
fmn-check-fabric script - will run a set of tests across Slingshot switches, can also run individual tests
All of these can inform what is planned here.
Other Considerations?