Open aieri opened 10 months ago
We should also confirm that the autodetection routine runs only at installation: we don't want to start disabling exporters when a RAID array vanishes because it is faulty.
Created a PR https://github.com/canonical/grafana-agent-operator/pull/134 on grafana-agent to add a rule that checks for physical disks that have been removed.
For NICs we need to discuss the strategy a little more, because right now there is no easy metric that tells us whether a NIC is physical or not.
+1 for this issue. I've just noticed on one cloud we have completely lost monitoring on redfish due to some problem with the BMCs and we've had no alert that there was a problem.
Although in the case I'm looking at, I wonder if the charm is interpreting an error code from the ipmi checks as "There is no IPMI here", and then reconfiguring itself. Each unit is saying something like:
INFO unit.hardware-observer/0.juju-log IPMI sensors monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-norman1-ceph1.localhost: internal IPMI error
INFO unit.hardware-observer/0.juju-log IPMI SEL monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_cmd_dcmi_get_power_reading: bad completion code
INFO unit.hardware-observer/0.juju-log IPMI DCMI monitoring is not available
WARNING unit.hardware-observer/0.update-status Get Device ID command failed: 0xc0 Node busy
ERROR unit.hardware-observer/0.juju-log unexpected error occurs when connecting to redfish: HTTPSConnectionPool(host='none', port=443): Max retries exceeded with url: /redfish/v1/ (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f6267e84b50>: Failed to resolve 'none' ([Errno -3] Temporary failure in name resolution)"))
INFO unit.hardware-observer/0.juju-log Redfish is not available
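If that is what is happening, one way to avoid it would be to treat a failed probe as "unknown" rather than "absent", and only change the exporter configuration on a definite answer. A minimal sketch of the idea follows; `Probe` and `next_enabled` are hypothetical names for illustration and are not taken from the charm's code.

```python
from enum import Enum


class Probe(Enum):
    """Outcome of one hardware detection attempt (hypothetical type)."""

    PRESENT = "present"  # the tool ran and found the hardware
    ABSENT = "absent"    # the tool ran and found nothing
    UNKNOWN = "unknown"  # the tool failed, e.g. BMC busy or internal error


def next_enabled(previously_enabled: bool, probe: Probe) -> bool:
    """Decide whether a collector should stay enabled after a probe.

    A failed probe says nothing about whether the hardware exists,
    so the previous decision is kept instead of disabling the collector.
    """
    if probe is Probe.PRESENT:
        return True
    if probe is Probe.ABSENT:
        return False
    return previously_enabled


# A busy BMC must not turn off a collector that was enabled at install time.
print(next_enabled(True, Probe.UNKNOWN))  # True
```

How the charm would map errors such as "Node busy" or "bad completion code" onto the unknown state is left open here.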
If we autodetect the presence of a specific piece of hardware and the collector is enabled, we should expect the corresponding metrics to be present. If they are not (because of a bug, or because malfunctioning hardware is no longer detected by the kernel), we should produce an alert.
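The check itself is a set difference between the collectors enabled at detection time and the collectors that currently report metrics. A rough sketch, assuming both lists are available from somewhere; the function name and the collector names in the example are made up for illustration.

```python
from typing import Iterable, List


def missing_metric_alerts(enabled: Iterable[str], reporting: Iterable[str]) -> List[str]:
    """Return one alert message per enabled collector that reports no metrics."""
    missing = set(enabled) - set(reporting)
    # Sort so the output is stable between runs.
    return [
        f"collector '{name}' is enabled but reports no metrics"
        for name in sorted(missing)
    ]


# Hypothetical example: redfish was enabled at install but has gone silent.
alerts = missing_metric_alerts(
    enabled=["ipmi_sensor", "redfish", "mega_raid"],
    reporting=["ipmi_sensor", "mega_raid"],
)
print(alerts)  # ["collector 'redfish' is enabled but reports no metrics"]
```

In practice this would more likely be an alert rule on the monitoring side than code in the charm; the sketch only shows the comparison being asked for.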