Strategy for health checks

What needs to change:

In today's OpenCHAMI meeting there was a fair bit of discussion about health checks and how best to integrate them. The main points raised (to my recollection - please add any I missed) were:

Some sites rely entirely on Slurm for health checks for compute nodes
Some sites reported that they need broader coverage than that, and HPE noted that some sites may not use Slurm or PBS at all
NERSC reported that they have thought about (but have had no time to implement) moving health checks to GOSS and then checking their status from the Slurm node health check. This would allow health checks to be used more widely, for example on DVS servers, which do not run Slurm.
NERSC's health check system has internal dependencies. For instance, we check dmesg for reported NVIDIA issues, and if we find any we drain the node and skip all further GPU health checks to prevent possible hangs.

What do you propose?

So to summarise this as an ask I would suggest:

OpenCHAMI should provide a framework that lets sites compose an ordered series of health checks from wider-community and site-specific checks. These checks should be tailored at run time to the node type they are running on, and should allow dependencies to be specified so that checks can be included or excluded based on previous results.

They should also be callable individually - without dependency checks - or as an orchestrated whole.
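As a rough illustration of the ask above, here is a minimal sketch of such a runner. Everything in it is hypothetical (the `Check` class, `run_one`, `run_all`, and the check names are invented for illustration, and the stub lambdas stand in for real checks); it is not an existing OpenCHAMI interface. It assumes that a check whose prerequisite failed, was skipped, or never ran on this node type is itself skipped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Check:
    """Hypothetical description of one health check."""
    name: str
    run: Callable[[], bool]                            # returns True when the check passes
    node_types: Set[str] = field(default_factory=set)  # empty set means "all node types"
    requires: List[str] = field(default_factory=list)  # checks that must have passed first


def run_one(check: Check) -> str:
    """Run a single check on its own, ignoring dependencies."""
    return "pass" if check.run() else "fail"


def run_all(checks: List[Check], node_type: str) -> Dict[str, str]:
    """Run checks in list order, skipping any whose prerequisites did not pass."""
    results: Dict[str, str] = {}
    for check in checks:
        if check.node_types and node_type not in check.node_types:
            continue  # not applicable to this node type, so not reported at all
        if any(results.get(dep) != "pass" for dep in check.requires):
            results[check.name] = "skipped"
            continue
        results[check.name] = run_one(check)
    return results


# Stub checks modelled on the dmesg/GPU example from the meeting notes.
checks = [
    Check("dmesg_gpu_errors", lambda: False, node_types={"gpu"}),
    Check("gpu_memory", lambda: True, node_types={"gpu"}, requires=["dmesg_gpu_errors"]),
    Check("root_fs_space", lambda: True),
]

gpu_results = run_all(checks, "gpu")
dvs_results = run_all(checks, "dvs")
```

With these stubs, the failed dmesg check causes `gpu_memory` to be skipped on a GPU node, while a DVS server only runs `root_fs_space`. Calling `run_one(checks[1])` runs the GPU memory check directly, without consulting its dependencies.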

What alternatives/examples exist?

There are things like:

NHC (Michael Jennings) and the (unpublished) NERSC health check system - both are targeting nodes in a batch system
GOSS - used by HPE's CSM for some testing
The CSM health checks - used in CSM for NCN and k8s checks and also for PostgreSQL cluster checks
The Slingshot fmn-check-fabric script - runs a set of tests across Slingshot switches and can also run individual tests

All of these can inform what is planned here.

Other Considerations?

Should not discriminate on whether or not a batch system is used
Should allow a site to pick and choose based on their needs, as well as easily add/modify checks to suit their requirements
Should be in a form which allows sites to run individual tests if needed, or as an orchestrated whole
Output should be in a form that sites can easily view and consume, and in a user-requestable format: human readable by default, with YAML or JSON available for tooling that needs it
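The output requirement in the last point could look something like the sketch below. The `render` function and the shape of the results mapping (check name to status string) are assumptions made for illustration. Only the standard library is used, so YAML is left out here; it would need a third-party library such as PyYAML.

```python
import json
from typing import Dict


def render(results: Dict[str, str], fmt: str = "text") -> str:
    """Format check results; human-readable text unless another format is requested."""
    if fmt == "json":
        return json.dumps(results, indent=2, sort_keys=True)
    if fmt == "text":
        # Pad names to a common width so the status column lines up.
        width = max((len(name) for name in results), default=0)
        return "\n".join(
            f"{name.ljust(width)}  {status.upper()}" for name, status in results.items()
        )
    raise ValueError(f"unsupported format: {fmt}")


example = {"dmesg_gpu_errors": "fail", "root_fs_space": "pass"}
human = render(example)           # default: aligned plain text
machine = render(example, "json")  # for tooling
```

Keeping the default human readable means an administrator running a check by hand gets something legible, while tooling opts in to a structured format explicitly.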

I'd like to add several clarifications/distinctions.

Understanding the health of a compute node is valuable to multiple systems. They don't all need the same information or operate at the same frequency. One common example of using health information is to determine if a node is ready/able to start new work. Another is to detect and remediate hardware failures. The scheduler is unlikely to get involved with hardware issues and really only needs to know that a node is unavailable. The system administrators responsible for the reliability of the system will need far more detail. Which piece of hardware has failed? What remediation actions have already been attempted?
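To make the distinction concrete, here is a small sketch of how differently the two consumers read the same results. The `CheckResult` fields and both function names are hypothetical, chosen only to mirror the questions above; whether both views should come from a single system is exactly the scope question being raised.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckResult:
    """Hypothetical record of one health check outcome."""
    name: str
    passed: bool
    component: str = "unknown"                            # which piece of hardware is involved
    remediation: List[str] = field(default_factory=list)  # actions already attempted


def scheduler_view(results: List[CheckResult]) -> bool:
    """The scheduler only needs one bit: can this node start new work?"""
    return all(r.passed for r in results)


def admin_view(results: List[CheckResult]) -> List[str]:
    """Administrators need the failing component and what has been tried so far."""
    lines = []
    for r in results:
        if not r.passed:
            tried = ", ".join(r.remediation) or "none"
            lines.append(f"{r.name}: component={r.component}, remediation attempted: {tried}")
    return lines


results = [
    CheckResult("gpu_ecc", False, component="gpu2", remediation=["driver reload"]),
    CheckResult("root_fs_space", True),
]
available = scheduler_view(results)
report = admin_view(results)
```

Here the scheduler sees only that the node is unavailable, while the administrator's report names the failing component and the remediation already attempted.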

A single system built to address both of these use cases, and many others, may not be well optimized for the most common ones. We should consider the scope for health checks as we pursue this discussion.

OpenCHAMI / roadmap

[RFD] Strategy for health checks #38
