ECC memory correction counts are useful for predicting DIMM failures. Having a dedicated metric would be very useful, as well as a related alert (e.g. rate of correctable errors over $TIME above $THRESHOLD).
It looks like this could be provided in two ways:
- export the values in the memory controller subdirectory /sys/devices/system/edac/mc/mc0/ directly
- read them from the rasdaemon DB
Note: I don't know if using rasdaemon would provide benefits over reading the values directly, given that in our case the data analysis would happen on the Prometheus side.
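To make the first option concrete, here is a minimal sketch of reading the counters straight from sysfs. It assumes the standard Linux EDAC layout, where each memory controller directory (mc0, mc1, ...) exposes `ce_count` (corrected errors) and `ue_count` (uncorrected errors). The function names and the `edac_*_total` metric names are hypothetical, chosen only for illustration, and are not an existing exporter API.

```python
from pathlib import Path

# EDAC location named above; the counter file names assume the standard
# Linux EDAC sysfs layout.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")
COUNTERS = ("ce_count", "ue_count")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {counter: value}} for every mc* directory under root."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux host
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        values = {}
        for name in COUNTERS:
            try:
                values[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable; skip it
        if values:
            counts[mc_dir.name] = values
    return counts


def render_metrics(counts):
    """Format counts as Prometheus text exposition lines (hypothetical metric names)."""
    lines = []
    for controller, values in sorted(counts.items()):
        for name, value in sorted(values.items()):
            lines.append(f'edac_{name}_total{{controller="{controller}"}} {value}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_metrics(read_edac_counts()))
```

Because the values are exported as monotonically increasing counters, the "rate over $TIME above $THRESHOLD" check could then be left entirely to the Prometheus side, which is the case for reading sysfs directly rather than going through rasdaemon.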