canonical / prometheus-hardware-exporter

Prometheus Hardware Exporter is an exporter for Hardware Observer
GNU General Public License v3.0

`ipmi_sel_state` metric empty despite SEL entries being present #44

Closed. przemeklal closed this issue 1 year ago

przemeklal commented 1 year ago

In Prometheus:

Query:
ipmi_sel_command_success{juju_unit="ceph-osd/0"}

Result:
ipmi_sel_command_success{instance="openstack_c2eab20b-21b7-4f8e-813b-5ae038e9cbb4_ceph-osd_ceph-osd/0", job="hardware-observer-storage_0_default", juju_application="ceph-osd", juju_model="openstack", juju_model_uuid="c2eab20b-21b7-4f8e-813b-5ae038e9cbb4", juju_unit="ceph-osd/0"}

On the actual machine, we see:

$ sudo ipmi-sel  --output-event-state --interpret-oem-data --entity-sensor-names
ID   | Date        | Time     | Name                            | Type              | State    | Event
1    | Sep-25-2023 | 08:16:07 | Battery VBAT                    | Battery           | Critical | battery failed
2    | Sep-25-2023 | 08:17:42 | Battery VBAT                    | Battery           | Critical | battery failed

However, no `ipmi_sel_state` metrics are exported from this or any other machine.

The server model is Supermicro AS -1124US-TNRP.
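
To make the expectation concrete, here is a minimal sketch of how the pipe-delimited `ipmi-sel` output above could be turned into records that an `ipmi_sel_state` metric might be built from. This is not the exporter's actual parser; `parse_sel` and `SAMPLE` are hypothetical names used only for illustration, and the sample text is copied from the output above.

```python
from typing import Dict, List

# Sample copied from the ipmi-sel output shown in this report.
SAMPLE = """\
ID   | Date        | Time     | Name                            | Type              | State    | Event
1    | Sep-25-2023 | 08:16:07 | Battery VBAT                    | Battery           | Critical | battery failed
2    | Sep-25-2023 | 08:17:42 | Battery VBAT                    | Battery           | Critical | battery failed
"""


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited SEL output into one dict per entry (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # The first non-empty line is the header row.
    header = [col.strip() for col in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.split("|")]
        if len(cells) != len(header):
            continue  # skip lines that do not match the header layout
        rows.append(dict(zip(header, cells)))
    return rows


entries = parse_sel(SAMPLE)
for entry in entries:
    print(entry["ID"], entry["Name"], entry["State"], entry["Event"])
```

With the two entries above, both records carry the `Critical` state, so one would expect them to show up in `ipmi_sel_state` in some form.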

przemeklal commented 1 year ago

It might be a permissions problem.

chanchiwai-ray commented 1 year ago

Related to #28

przemeklal commented 1 year ago

Example user scenario affected by the behaviour introduced in #28:

  1. An uncorrectable memory error is reported (ipmi_sel_state) at 00:00 UTC.
  2. The alert is generated. 5 minutes pass.
  3. At 00:05 UTC, the metric disappears because the SEL entry is now older than 300s.
  4. The alert gets resolved without any intervention, but the hardware issue is still there.
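
The scenario can be reproduced with a small simulation. This is only a sketch of age-based filtering as described in the steps above, not the exporter's actual code; `SelEntry`, `visible_entries`, and `SEL_WINDOW_SECONDS` are hypothetical names, and the 300s value is the one mentioned in step 3.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the 300s window from the scenario above.
SEL_WINDOW_SECONDS = 300


@dataclass
class SelEntry:
    event: str
    state: str
    timestamp: float  # seconds relative to 00:00 UTC in the scenario


def visible_entries(
    entries: List[SelEntry], now: float, window: float = SEL_WINDOW_SECONDS
) -> List[SelEntry]:
    """Keep only entries that are not older than `window` seconds."""
    return [e for e in entries if now - e.timestamp <= window]


entries = [SelEntry("Uncorrectable memory error", "Critical", timestamp=0.0)]

# At 00:00 UTC the entry is exported, so the alert can fire.
at_report = visible_entries(entries, now=0.0)

# Just past 00:05 UTC the entry is older than 300s and is filtered out,
# so the metric disappears even though the SEL entry still exists.
after_window = visible_entries(entries, now=301.0)

print(len(at_report), len(after_window))
```

The entry itself never leaves the SEL in this sketch; only the time filter hides it, which is why the alert resolves while the hardware fault persists.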