canonical / hardware-observer-operator

A charm to set up Prometheus exporters for IPMI, Redfish and RAID devices from different vendors.
Apache License 2.0

Alert on missing metrics for enabled exporters #85

Open aieri opened 10 months ago

aieri commented 10 months ago

If we autodetect the presence of a specific piece of hardware and the corresponding collector is enabled, we should expect that collector's metrics to be present. If they are not (due to a bug, or to malfunctioning hardware that is no longer detected by the kernel), we should produce an alert.
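
A minimal sketch of how such alerts could be generated, assuming the check is built on Prometheus's `absent()` function. The collector names, metric names, `for` duration and severity below are hypothetical placeholders, not the charm's actual values.

```python
# Hypothetical mapping: enabled collector -> one metric it should always export.
# These metric names are placeholders, not taken from the real exporter.
EXPECTED_METRICS = {
    "ipmi_sensor": "ipmi_sensor_state",
    "redfish": "redfish_service_available",
    "raid": "raid_controller_info",
}


def missing_metric_rules(enabled_collectors):
    """Build one Prometheus-style alert rule per enabled collector.

    Each rule fires when the collector's expected metric has no series at all.
    """
    rules = []
    for name in sorted(enabled_collectors):
        metric = EXPECTED_METRICS[name]
        rules.append(
            {
                "alert": f"{name}_metrics_missing",
                "expr": f"absent({metric})",
                "for": "10m",  # assumed grace period
                "labels": {"severity": "critical"},  # assumed severity
            }
        )
    return rules


if __name__ == "__main__":
    for rule in missing_metric_rules({"redfish", "raid"}):
        print(rule["alert"], "->", rule["expr"])
```

Note that a bare `absent(metric)` only fires when no target exports the metric at all, so a per-unit check would likely need label matchers inside the expression.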

aieri commented 10 months ago

We should also confirm that the autodetection routine runs only at installation time: we don't want to start disabling exporters when a RAID array vanishes because it is faulty.
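
One way to get this behaviour, sketched under the assumption that the detection result can be persisted to a file on the unit. The function name `load_or_detect`, the state-file format and the `detect` callable are all hypothetical, not the charm's implementation.

```python
import json
from pathlib import Path


def load_or_detect(state_file, detect):
    """Return the set of detected hardware, running `detect` at most once.

    `detect` is a hypothetical callable returning an iterable of hardware
    names. Its result is stored in `state_file`; later calls reuse the stored
    result, so hardware that disappears later does not change the set.
    """
    path = Path(state_file)
    if path.exists():
        # Detection already ran (e.g. at install time): trust the stored result.
        return set(json.loads(path.read_text()))
    detected = set(detect())
    path.write_text(json.dumps(sorted(detected)))
    return detected
```

With this shape, a RAID controller that was found at install time stays in the enabled set even if a later probe no longer sees it, which leaves the missing-metrics alert free to fire.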

gabrielcocenza commented 2 months ago

I created a PR, https://github.com/canonical/grafana-agent-operator/pull/134, on grafana-agent to add a rule that checks for physical disks that have been removed.

For NICs, we need to discuss the strategy a bit more, because right now there isn't an easy metric that tells us whether a NIC is physical or not.

pengwyn commented 2 months ago

+1 for this issue. I've just noticed that on one cloud we have completely lost Redfish monitoring due to some problem with the BMCs, and we've had no alert that there was a problem.

Although in the case I'm looking at, I wonder if the charm is interpreting an error code from the ipmi checks as "There is no IPMI here", and then reconfiguring itself. Each unit is saying something like:

INFO unit.hardware-observer/0.juju-log IPMI sensors monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-norman1-ceph1.localhost: internal IPMI error
INFO unit.hardware-observer/0.juju-log IPMI SEL monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_cmd_dcmi_get_power_reading: bad completion code
INFO unit.hardware-observer/0.juju-log IPMI DCMI monitoring is not available
WARNING unit.hardware-observer/0.update-status Get Device ID command failed: 0xc0 Node busy
ERROR unit.hardware-observer/0.juju-log unexpected error occurs when connecting to redfish: HTTPSConnectionPool(host='none', port=443): Max retries exceeded with url: /redfish/v1/ (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f6267e84b50>: Failed to resolve 'none' ([Errno -3] Temporary failure in name resolution)"))
INFO unit.hardware-observer/0.juju-log Redfish is not available
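
If that suspicion is right, one possible mitigation is to keep "the hardware is absent" separate from "the probe failed". The sketch below is an assumption about how that could look, not the charm's current logic. `Probe` and `next_state` are hypothetical names.

```python
from enum import Enum


class Probe(Enum):
    """Outcome of a hypothetical hardware probe."""

    PRESENT = "present"  # probe succeeded and found the hardware
    ABSENT = "absent"    # probe succeeded and found nothing
    ERROR = "error"      # probe itself failed (busy BMC, bad completion code, ...)


def next_state(currently_enabled, probe):
    """Return (enabled, should_alert) for an exporter after a probe.

    A failed probe never disables an exporter that was already enabled;
    it keeps the previous setting and flags the unit for alerting instead.
    """
    if probe is Probe.PRESENT:
        return True, False
    if probe is Probe.ABSENT:
        return False, False
    return currently_enabled, True
```

Under this scheme, messages such as "internal IPMI error" or "Node busy" would map to `Probe.ERROR`, so the exporter would stay enabled and its missing metrics could raise an alert.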