OpenCHAMI / roadmap

Public Roadmap Project for Ochami
MIT License

[RFD] Strategy for health checks #38

Open nerscchris opened 3 weeks ago

nerscchris commented 3 weeks ago

Strategy for health checks

What needs to change:

In the OpenCHAMI meeting today there was a fair bit of discussion around health checks and how best to integrate them. The main points brought up (to my recollection - please add any I missed) were:

What do you propose?

So to summarise this as an ask I would suggest:

OpenCHAMI should provide a framework that allows sites to compose an ordered series of health checks from the wider community and site-specific checks. These checks should be tailored at run time for the node type they are running on and should allow dependencies to be specified to include/exclude checks based on previous results.

They should also be callable individually - without dependency checks - or as an orchestrated whole.
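
To make the ask a little more concrete, here is a minimal Python sketch of what such a framework could look like. Nothing here is an existing OpenCHAMI API: the names `Check`, `run_one`, `run_all`, the `"pass"`/`"fail"`/`"skipped"` strings, and the example checks are all hypothetical, and the checks are stubbed with lambdas rather than real probes. It assumes that a check whose dependency did not pass (or did not apply to this node type) is reported as skipped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Check:
    """One health check (hypothetical structure, for illustration only)."""
    name: str
    run: Callable[[], bool]                 # returns True when healthy
    node_types: Optional[Set[str]] = None   # None means "applies to every node type"
    requires: List[str] = field(default_factory=list)  # checks that must have passed first


def run_one(check: Check) -> str:
    """Call a single check directly, ignoring node type and dependencies."""
    return "pass" if check.run() else "fail"


def run_all(checks: List[Check], node_type: str) -> Dict[str, str]:
    """Run the checks in list order, tailored to node_type at run time."""
    results: Dict[str, str] = {}
    for check in checks:
        # Leave out checks that do not apply to this node type.
        if check.node_types is not None and node_type not in check.node_types:
            continue
        # Skip a check when any dependency did not pass (or never ran).
        if any(results.get(dep) != "pass" for dep in check.requires):
            results[check.name] = "skipped"
            continue
        results[check.name] = run_one(check)
    return results


# Stubbed example checks; real ones would probe the node.
checks = [
    Check("network", lambda: True),
    Check("gpu_present", lambda: False, node_types={"gpu"}),
    Check("gpu_memory", lambda: True, node_types={"gpu"}, requires=["gpu_present"]),
    Check("scratch_mounted", lambda: True, requires=["network"]),
]

gpu_results = run_all(checks, "gpu")
cpu_results = run_all(checks, "cpu")
single_result = run_one(checks[2])
```

With these stubs, the orchestrated run on a `gpu` node reports `gpu_memory` as skipped because `gpu_present` failed, the run on a `cpu` node omits both GPU checks, and calling `run_one` on `gpu_memory` directly runs it regardless of its dependency.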

What alternatives/examples exist?

There are things like:

All of these can inform what is planned here.

Other Considerations?

alexlovelltroy commented 3 weeks ago

I'd like to add several clarifications/distinctions.

Understanding the health of a compute node is valuable to multiple systems. They don't all need the same information or operate at the same frequency. One common example of using health information is to determine if a node is ready/able to start new work. Another is to detect and remediate hardware failures. The scheduler is unlikely to get involved with hardware issues and really only needs to know that a node is unavailable. The system administrators responsible for the reliability of the system will need far more detail. Which piece of hardware has failed? What remediation actions have already been attempted?

A single system that addresses both of these use cases and many others may not be well optimized for the most common ones. We should consider the scope for health checks as we pursue this discussion.
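
To illustrate the two consumers described above, here is a small Python sketch of the different views they might want from the same results. It is only an illustration of the distinction, not a proposal that one system should serve both. The names `CheckResult`, `scheduler_view`, `admin_view`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CheckResult:
    """Outcome of one check (hypothetical record, for illustration only)."""
    name: str
    passed: bool
    component: Optional[str] = None   # which piece of hardware failed, if known
    remediation_attempted: List[str] = field(default_factory=list)


def scheduler_view(results: List[CheckResult]) -> bool:
    """The scheduler only needs to know whether the node can take new work."""
    return all(r.passed for r in results)


def admin_view(results: List[CheckResult]) -> List[Dict[str, object]]:
    """Administrators need the detail behind every failure."""
    return [
        {
            "check": r.name,
            "component": r.component,
            "remediation_attempted": r.remediation_attempted,
        }
        for r in results
        if not r.passed
    ]


# Made-up sample data for one node.
results = [
    CheckResult("network", passed=True),
    CheckResult(
        "memory",
        passed=False,
        component="DIMM 3",
        remediation_attempted=["reboot"],
    ),
]

node_available = scheduler_view(results)
failure_details = admin_view(results)
```

Here the scheduler sees a single boolean (the node is unavailable), while the administrator view lists the failed component and the remediation already attempted.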