Taj and I spent a chunk of time today trying to figure out why a new OpenShift node was having problems, and eventually tracked it down to hardware problems on the node causing periodic reboots. This, of course, required access to the BMC to identify the problem, which got me thinking that it would be nice to provide a mechanism for people using ESI-managed machines to:

- self-diagnose this sort of problem
- avoid being assigned machines with hardware problems
This would also be beneficial for administrators:
- We would know about hardware problems before they become a user-facing problem
I would love to see a hardware monitoring service that would watch BMC event logs (through Redfish polling, SNMP traps, syslog messages, etc.) and take some sort of action based on this information; a rough sketch follows the list below:

- For any node reporting problems, set a property on the node object indicating that the node is unhealthy. We would use this in the ESI command line or in a web UI to flag unhealthy nodes. This would make it easy to expose health information to the end user; at the moment, it is effectively impossible for a person using an ESI-managed machine to determine if there are hardware faults (particularly if the faults are preventing access to the machine).
- For nodes not currently assigned, make them unavailable (assign them to the "hwbroken" project, put them into maintenance mode, whatever). This would avoid assigning a node with hardware problems to someone.
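
To make the idea concrete, here is a rough sketch of what such a monitor might look like, assuming Redfish access to each BMC and openstacksdk access to the Ironic API. The node-to-BMC inventory, the `hardware_health` extra field, the maintenance-mode handling, and the "not currently assigned" check are all illustrative assumptions, not existing ESI behavior (and the exact Redfish log paths vary by vendor):

```python
"""Sketch of a BMC health monitor: poll Redfish event logs, flag unhealthy nodes."""

import openstack
import requests

# Hypothetical inventory mapping Ironic node UUIDs to BMC endpoints/credentials.
NODES = {
    "node-uuid-goes-here": {
        "bmc": "https://10.0.0.10",
        "auth": ("admin", "password"),
    },
}


def critical_redfish_events(bmc, auth):
    """Walk the LogServices under /redfish/v1/Systems and collect any
    entries reported with Warning or Critical severity."""
    events = []

    def get(path):
        return requests.get(bmc + path, auth=auth, verify=False, timeout=30).json()

    for member in get("/redfish/v1/Systems").get("Members", []):
        system = get(member["@odata.id"])
        services_ref = system.get("LogServices", {}).get("@odata.id")
        if not services_ref:
            continue
        for svc_member in get(services_ref).get("Members", []):
            service = get(svc_member["@odata.id"])
            entries_ref = service.get("Entries", {}).get("@odata.id")
            if not entries_ref:
                continue
            for entry in get(entries_ref).get("Members", []):
                if entry.get("Severity") in ("Warning", "Critical"):
                    events.append(entry.get("Message", "unknown BMC event"))
    return events


def flag_unhealthy(conn, node_uuid, events):
    """Mark the node as unhealthy, and if nobody is using it, put it into
    maintenance so it cannot be handed out."""
    # "hardware_health" is a placeholder for whatever property the ESI CLI
    # or web UI would actually surface to users.
    conn.baremetal.update_node(node_uuid, extra={"hardware_health": "failed"})

    node = conn.baremetal.get_node(node_uuid)
    if not node.instance_id:  # crude stand-in for "not currently assigned"
        conn.baremetal.set_node_maintenance(
            node_uuid, reason="; ".join(events)[:255]
        )


def main():
    conn = openstack.connect(cloud="esi")  # assumes a clouds.yaml entry
    for node_uuid, node_info in NODES.items():
        events = critical_redfish_events(node_info["bmc"], node_info["auth"])
        if events:
            flag_unhealthy(conn, node_uuid, events)


if __name__ == "__main__":
    main()
```

The same loop could just as easily be driven by SNMP traps or syslog messages instead of polling; the interesting part is the action taken against the node record in Ironic.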
@hakasapl mentioned that this may overlap somewhat with the MOCA Monitoring/Metrics epic, but: