canonical / hardware-observer-operator

A charm to set up a Prometheus exporter for IPMI, Redfish and RAID devices from different vendors.
Apache License 2.0

hardware-exporter service not present on 2 machines out of 13 #219

Closed przemeklal closed 2 months ago

przemeklal commented 3 months ago

Channel latest/stable, revision 59.

Steps to reproduce:

  1. juju deploy ch:hardware-observer --channel latest/stable
  2. juju add-relation hardware-observer infra-node
  3. juju add-relation hardware-observer nova-compute-kvm
  4. juju add-relation hardware-observer grafana-agent-host:cos-agent
  5. juju attach-resource hardware-observer perccli-deb=PERCCLI_7.2313.0_A14_Linux/perccli_007.2313.0000.0000_all.deb
  6. juju config hardware-observer redfish-username=redacted redfish-password=redacted

On two random nova-compute machines the hardware-exporter service is not running; its systemd unit has not even been created:

ubuntu@redacted-15:~$ systemctl status hardware-exporter
Unit hardware-exporter.service could not be found.

versus

ubuntu@redacted-7:~$ systemctl status hardware-exporter
● hardware-exporter.service - HTTP service for prometheus hardware exporter.
     Loaded: loaded (/etc/systemd/system/hardware-exporter.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-04-16 10:31:47 UTC; 5min ago

Juju status looks like this:

hardware-observer/2   active    idle   Unit is ready
hardware-observer/1*  active    idle   Unit is ready
hardware-observer/0   active    idle   Unit is ready
hardware-observer/5   active    idle   Unit is ready
hardware-observer/3   blocked   idle   Invalid config: 'redfish-username' or 'redfish-password'
hardware-observer/11  active    idle   Unit is ready
hardware-observer/4   active    idle   Unit is ready
hardware-observer/9   active    idle   Unit is ready
hardware-observer/8   blocked   idle   Invalid config: 'redfish-username' or 'redfish-password'
hardware-observer/10  active    idle   Unit is ready
hardware-observer/6   active    idle   Unit is ready
hardware-observer/7   active    idle   Unit is ready
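To pick out the affected units from a longer status listing, a small filter can help. This is a minimal sketch, assuming the unit name is the first column and the workload status is the second, as in the lines above; the function name `non_active_units` is made up for illustration.

```shell
# Print the names of Juju units whose workload status is not "active".
# Reads status lines on stdin; lines that do not start with an
# application/number unit name (headers, machines, blanks) are skipped.
non_active_units() {
    awk '$1 ~ /^[a-z][a-z0-9-]*\/[0-9]+\*?$/ && $2 != "active" {
        sub(/\*$/, "", $1)   # drop the leader marker, if any
        print $1
    }'
}
```

It can be fed directly, for example `juju status hardware-observer | non_active_units`.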

Credentials are correct on all nodes and Redfish is enabled everywhere in iDRAC.

Logs attached. unit-hw-obs.log

przemeklal commented 3 months ago

Workaround: Copy /etc/systemd/system/hardware-exporter.service from another, working unit and edit it to match the current unit name. Then run sudo systemctl daemon-reload and sudo systemctl start hardware-exporter.service.
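The edit step of the workaround could be scripted roughly as follows. This is only a sketch: it assumes the unit name shows up in the service file in the hyphenated form Juju uses for agent directories (for example `hardware-observer-7`), which should be checked against the actual file first. The helper name `retarget_unit_file` and the unit numbers in the usage comment are placeholders.

```shell
# Rewrite a copied service file so that references to the source Juju unit
# point at the local unit instead. Reads the file on stdin, writes to stdout.
# ASSUMPTION: the unit name appears hyphenated, e.g. hardware-observer-7.
retarget_unit_file() {
    src_tag=$(printf '%s' "$1" | tr '/' '-')   # hardware-observer/7 -> hardware-observer-7
    dst_tag=$(printf '%s' "$2" | tr '/' '-')
    # Require a non-digit (or end of line) after the name so that
    # hardware-observer-7 does not also match hardware-observer-71.
    sed -E "s/${src_tag}([^0-9]|\$)/${dst_tag}\1/g"
}

# Usage on the broken machine, after copying the file from a working one
# (unit numbers below are placeholders):
#   retarget_unit_file hardware-observer/7 hardware-observer/3 < copied.service \
#       | sudo tee /etc/systemd/system/hardware-exporter.service
#   sudo systemctl daemon-reload
#   sudo systemctl start hardware-exporter.service
```

Diffing the result against the copied file before starting the service is a cheap way to confirm that only the unit name changed.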

dashmage commented 2 months ago

At first glance, this looks to be caused by the early Redfish credential validation in the install event handler, which was removed in rev54. So with later releases this ideally shouldn't cause an issue, and the exporter should be installed (i.e., the systemd service file placed) correctly.

But if perccli was not installed correctly, the config-changed event handler would keep deferring, and the blocked status would be wiped too. So I'm wondering whether that could be the case here?

Edit: this could be similar to #205, where the resource fails to install correctly, which also prevents the exporter service from being installed.

dashmage commented 2 months ago

Closing for similar reasons to this PR. If new data turns up that lets us reproduce this issue, please go ahead and reopen it.