canonical / hardware-observer-operator

A charm to set up Prometheus exporters for IPMI, Redfish and RAID devices from different vendors.
Apache License 2.0

ECC metrics #274

Open aieri opened 6 days ago

aieri commented 6 days ago

ECC memory correction counts are useful for predicting DIMM failures. A dedicated metric would be very useful, along with a related alert (e.g. rate of correctable errors over $TIME above $THRESHOLD).
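To make the alert condition concrete, here is a minimal Python sketch of "rate of correctable errors over $TIME above $THRESHOLD". In practice this would more likely be a Prometheus alert rule. The function names, the sample format and the counter-reset handling are all assumptions for illustration, not part of the charm.

```python
def correctable_error_rate(samples, window_seconds):
    """Return the per-second increase of a counter over the trailing window.

    `samples` is a hypothetical list of (timestamp_seconds, counter_value)
    tuples sorted by time, e.g. successive scrapes of a correctable-error
    counter.
    """
    if not samples:
        return 0.0
    latest_ts = samples[-1][0]
    # Keep only the samples that fall inside the trailing window ($TIME).
    in_window = [s for s in samples if latest_ts - s[0] <= window_seconds]
    if len(in_window) < 2:
        return 0.0
    increase = 0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop in the counter is treated as a reset (e.g. after a reboot),
        # so the new value is counted from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def should_alert(samples, window_seconds, threshold_per_second):
    """True when the correctable-error rate exceeds $THRESHOLD."""
    return correctable_error_rate(samples, window_seconds) > threshold_per_second
```

For example, `should_alert(samples, 3600, 1 / 600)` would fire when more than one correctable error per ten minutes is seen, on average, over the last hour.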

It looks like this could be provided in two ways:

I don't know whether using rasdaemon would provide any benefit over reading the values directly, given that in our case the data analysis would happen on the Prometheus side.
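As a rough illustration of what "reading the values directly" could look like, the sketch below assumes the counters come from the Linux EDAC sysfs tree (`/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`). The issue does not state that this is the source. The metric name `ecc_errors_total` and both function names are hypothetical and do not describe how the charm or any existing exporter works.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} from an EDAC-style tree.

    Assumes one directory per memory controller (mc0, mc1, ...) holding
    `ce_count` (correctable) and `ue_count` (uncorrectable) files.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # EDAC not available (no ECC memory, or driver not loaded).
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                # Skip counters that are missing or unreadable.
                continue
        if entry:
            counts[mc.name] = entry
    return counts


def to_prometheus_text(counts):
    """Render the counts in the Prometheus text exposition format."""
    lines = []
    for mc, entry in sorted(counts.items()):
        for key, value in sorted(entry.items()):
            kind = "correctable" if key == "ce" else "uncorrectable"
            # Hypothetical metric name, chosen only for this example.
            lines.append(
                f'ecc_errors_total{{controller="{mc}",type="{kind}"}} {value}'
            )
    return "\n".join(lines) + ("\n" if lines else "")
```

Exposing the raw counters like this leaves rate calculation and thresholds to Prometheus, which matches the case described above where the analysis happens on the Prometheus side.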