canonical / hardware-observer-operator

A charm to set up a Prometheus exporter for IPMI, Redfish and RAID devices from different vendors.
Apache License 2.0

Recreate sdr cache if it's out of date or invalid #213

Closed sudeephb closed 3 months ago

sudeephb commented 3 months ago

Recreate the SDR cache if it is out of date or invalid, so that the ipmi-sel and ipmi-sensor collectors are no longer disabled merely because the SDR cache was stale.
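For illustration, the recovery described above could be sketched roughly as follows. This is not the code from this PR: the helper names, the error-marker strings, and the retry flow are assumptions made for the example. The only external behaviour relied on is the `--flush-cache` option that the FreeIPMI tools (`ipmi-sensors`, `ipmi-sel`) provide, with the cache expected to be rebuilt on the next regular invocation.

```python
import subprocess
from typing import Callable, List, Tuple

# Assumed substrings; the exact FreeIPMI wording may differ between versions.
SDR_CACHE_ERROR_MARKERS = ("sdr cache out of date", "sdr cache invalid")

Runner = Callable[[List[str]], Tuple[int, str]]


def run_command(cmd: List[str]) -> Tuple[int, str]:
    """Run cmd and return (return code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout + proc.stderr


def sdr_cache_is_stale(output: str) -> bool:
    """Return True if the output contains one of the assumed error markers."""
    lowered = output.lower()
    return any(marker in lowered for marker in SDR_CACHE_ERROR_MARKERS)


def run_with_sdr_cache_recovery(
    cmd: List[str], runner: Runner = run_command
) -> Tuple[int, str]:
    """Run an IPMI tool; on a stale SDR cache, flush the cache and retry once."""
    code, output = runner(cmd)
    if code != 0 and sdr_cache_is_stale(output):
        # Flush the cache with the same tool, then run the original command
        # again so that the cache is recreated.
        runner([cmd[0], "--flush-cache"])
        code, output = runner(cmd)
    return code, output
```

With a helper like this, a check such as `run_with_sdr_cache_recovery(["ipmi-sensors"])` would fail only if the tool still fails after the cache has been flushed, rather than on the first stale-cache error.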

Pjack commented 3 months ago

After installing hardware-observer, the whitelist will be generated once during installation. Subsequently, the list of collectors should not be changed dynamically.

I think the code in get_redfish_conn_params is incorrect. We should not use bmc_hw_verifier to get the list again. The list may be different because of some temporary network connectivity issue.

sudeephb commented 3 months ago

> After installing hardware-observer, the whitelist will be generated once during installation. Subsequently, the list of collectors should not be changed dynamically.
>
> I think the code in get_redfish_conn_params is incorrect. We should not use bmc_hw_verifier to get the list again. The list may be different because of some temporary network connectivity issue.

If we use the whitelist generated during installation, anything that was missed during installation (because of temporary network issues, for example) will never be added, even though it exists.

Pjack commented 3 months ago

> > After installing hardware-observer, the whitelist will be generated once during installation. Subsequently, the list of collectors should not be changed dynamically. I think the code in get_redfish_conn_params is incorrect. We should not use bmc_hw_verifier to get the list again. The list may be different because of some temporary network connectivity issue.
>
> If we use the whitelist generated during installation, anything that was missed during installation (because of temporary network issues, for example) will never be added, even though it exists.

That's true, and it's by design in hardware-observer. We have another feature to regenerate the whitelist as a Juju action: #96.

If the whitelist is changed dynamically, the corresponding metrics/alert rules will be affected, potentially resulting in the loss of monitoring for broken hardware and failure to trigger alerts.

Therefore, the only source of truth is the whitelist generated during installation.
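As a rough sketch of that design (not the charm's actual implementation), the whitelist could be detected once, persisted, and changed only on an explicit request such as the Juju action from #96. The class name, the JSON file, and the `detect` callable below are all hypothetical and exist only for this example.

```python
import json
from pathlib import Path
from typing import Callable, List


class CollectorWhitelist:
    """Persist the detected collectors once; change them only on request."""

    def __init__(self, path: Path, detect: Callable[[], List[str]]):
        self._path = path      # hypothetical location of the stored whitelist
        self._detect = detect  # hypothetical hardware detection callable

    def get(self) -> List[str]:
        """Return the stored whitelist, detecting only if none exists yet."""
        if not self._path.exists():
            self.redetect()
        return json.loads(self._path.read_text())

    def redetect(self) -> List[str]:
        """Re-run detection explicitly, e.g. from an operator-triggered action."""
        collectors = sorted(self._detect())
        self._path.write_text(json.dumps(collectors))
        return collectors
```

Here `get()` never re-runs detection once a whitelist has been stored, so a temporary connectivity problem after installation cannot silently remove a collector and its alert rules; only `redetect()` changes the list.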

jneo8 commented 3 months ago

It's a trade-off in charm design.

We can choose either

One question for #202 is how often this happens, and whether users want automatic recovery or a manual refresh.

Pjack commented 3 months ago

Conclusion: