CCI-MOC / esi

Elastic Secure Infrastructure project

Hardware health monitoring (As A Service!) #605

Open larsks opened 3 months ago

larsks commented 3 months ago

Taj and I spent a chunk of time today trying to figure out why a new OpenShift node was having problems, and eventually tracked it down to hardware faults on the node that were causing periodic reboots. This, of course, required access to the BMC to identify the problem, which got me thinking that it would be nice to provide a mechanism for people using ESI-managed machines to

This would also be beneficial for administrators:

I would love to see a hardware monitoring service that would monitor BMC event logs -- through Redfish polling, SNMP traps, syslog messages, etc. -- and would take some sort of action based on this information:
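To make the Redfish polling option a bit more concrete, here is a rough sketch of what one polling pass might look like. It is not an existing ESI component: the function names `new_alerts` and `fetch_entries` are made up for illustration, it assumes the BMC exposes a standard Redfish log entry collection (a JSON document with a `Members` list whose entries carry `Id` and `Severity`), and it leaves out authentication and TLS handling entirely.

```python
import json
import urllib.request

# Redfish log entries report Severity as "OK", "Warning", or "Critical".
# Which levels should trigger an action is an assumption made for this sketch.
ALERT_SEVERITIES = {"Warning", "Critical"}


def new_alerts(entries, seen_ids):
    """Return entries not seen before whose severity warrants attention.

    Every entry Id is recorded in seen_ids, so polling the same log again
    does not report the same event twice.
    """
    alerts = []
    for entry in entries:
        entry_id = entry.get("Id")
        if entry_id in seen_ids:
            continue
        seen_ids.add(entry_id)
        if entry.get("Severity") in ALERT_SEVERITIES:
            alerts.append(entry)
    return alerts


def fetch_entries(url, opener=urllib.request.urlopen):
    """Fetch a Redfish log entry collection and return its Members list.

    Authentication and certificate handling are deliberately omitted here;
    a real BMC would require both.
    """
    with opener(url) as resp:
        return json.load(resp).get("Members", [])
```

A service could call `fetch_entries` on a timer for each node's log entries URL (the exact path under `/redfish/v1/` varies by vendor) and pass the result to `new_alerts` along with a per-node `seen_ids` set, then act on whatever comes back.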

@hakasapl mentioned that this may overlap somewhat with the MOCA Monitoring/Metrics epic, but: