ECC metrics

ECC memory correction counts are useful for predicting DIMM failures. Having a dedicated metric would be very useful, as well as a related alert (e.g. rate of correctable errors over $TIME above $THRESHOLD).
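As a rough illustration of the alert condition, here is a minimal sketch that computes the rate of correctable errors from a series of counter readings and compares it to a threshold. The function names, the sample format, and the per-hour threshold are assumptions made for this example. In practice the check would more likely be an alert rule evaluated on the Prometheus side; this loosely mirrors that idea (including counter-reset handling) without any extrapolation.

```python
def correctable_error_rate(samples):
    """Return correctable errors per second.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    (Hypothetical input format chosen for this sketch.)
    """
    if len(samples) < 2:
        return 0.0
    increase = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. after a reboot),
        # so the new value is counted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def should_alert(samples, threshold_per_hour):
    """True if the observed rate exceeds threshold_per_hour (an assumed unit)."""
    return correctable_error_rate(samples) * 3600 > threshold_per_hour
```

For example, `should_alert([(0, 10), (1800, 12), (3600, 16)], threshold_per_hour=5)` sees six new errors in one hour and would fire.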

It looks like this could be provided in two ways:

- export values directly from the memory controller subdirectory, e.g. /sys/devices/system/edac/mc/mc0/
- export values from the rasdaemon DB. Note:
  - rasdaemon is available in the Ubuntu universe repository
  - https://github.com/sanecz/prometheus-rasdaemon-exporter is an example of a rasdaemon-specific exporter
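The first option could look roughly like the sketch below. It reads the `ce_count` (correctable) and `ue_count` (uncorrectable) files that the kernel EDAC subsystem exposes under each `mc*` directory and renders them in the Prometheus text format. The function names and the `edac_*_errors_total` metric names are made up for this example and are not taken from any existing exporter.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...}."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this platform
        if entry:
            counts[mc_dir.name] = entry
    return counts


def to_prometheus_text(counts):
    """Render the counts as Prometheus exposition lines (hypothetical metric names)."""
    lines = []
    for mc, entry in sorted(counts.items()):
        for name, value in sorted(entry.items()):
            kind = "correctable" if name == "ce_count" else "uncorrectable"
            lines.append(f'edac_{kind}_errors_total{{controller="{mc}"}} {value}')
    return "".join(line + "\n" for line in lines)
```

Calling `print(to_prometheus_text(read_edac_counts()), end="")` on a host with EDAC support would print one line per controller and counter. On hosts without the EDAC directory it prints nothing.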

I don't know whether using rasdaemon would provide any benefit over reading the values directly, given that in our case the data analysis would happen on the Prometheus side.

canonical / hardware-observer-operator

ECC metrics #274