canonical / hardware-observer-operator

A charm to setup prometheus exporter for IPMI, RedFish and RAID devices from different vendors.
Apache License 2.0
7 stars 14 forks source link

fix: Remove redfish credential validation in install event handler #192

Closed dashmage closed 4 months ago

dashmage commented 4 months ago

Fixes #190

The exporter.check_active method being called in the restart_exporter method (from the update status event handler) will fail in case the exporter installation has not been done. This installation process takes place as part of the install event handler.

The error happens because the check_installed function in service.py (called as a decorator for the various exporter service helper methods) would return False when the exporter isn't installed. This would cause all the Exporter class' methods in service.py (restart, check_active, check_health etc.) to return False. The issue occurs when the on_update_status() --> restart_exporter() raises an exception when exporter.check_active consistently returns False.

This PR removes the redfish credential validation step during the install event since it's likely to not have the right configuration during the installation. Moreover, this can leave the charm in an inconsistent state where the exporter is not properly installed, leading to an error during the execution of other lifecycle events.

dashmage commented 4 months ago

Will fix unit tests after rebasing changes from #191 once it's merged.

dashmage commented 4 months ago

It doesn't seem to me that I will fix the issue.. Have you test if it can fix the issue ? Since the validation of redfish config did not remove from install or upgrade event, there's still possibility that operator deploys the charm with incorrect credential and causing the exporter not being installed.

Good thing you brought this up. So I did test this initially and the hook didn't fail for the charm and no exceptions were raised. So I thought the code was good. But upon checking once more carefully, another issue comes up. With the latest change, even after I configure the redfish creds, the status doesn't change from BlockedStatus with invalid redfish creds message. This might be because the ActiveStatus is not being reached at the end of the update status hook and it keeps looping when exporter_installed is not set. So I might have to revert that change.

^ @Pjack

dashmage commented 4 months ago

This commit just cleans up the unit test names and docstrings since many of them were incorrect.

dashmage commented 4 months ago

I've also added an extra check in test_install_redfish_enabled_with_incorrect_credential to make sure that the exporter is installed when when there's wrong redfish credentials. This ensures that the issue being fixed will be caught by the modified unit test.

dashmage commented 4 months ago

The code and logic looks good to me .

I would like to confirm where should we validate this behavior?

1. config or credential is invalid

2. User change the configuration or provide correct credential.

3. Verify if the charm will become active state again.

Maybe in hardware-indendent functional test?

It would probably be in the hardware dependent functional tests if the redfish creds are invalid and we want to test that after the correct config is provided, the charm is in active status. Currently we only have the test to check if we're able to query the redfish metrics.

If you're talking about any hardware independent config (exporter log level, port etc.), we already have functional tests for the behavior you've mentioned. But we do need to make some changes to verify that the charm is active afterwards. Currently it just verifies if the config is correctly changed in the underlying file.

It would be good to have another refactoring PR for the functional tests as well so we can follow through on these requirements.

dashmage commented 4 months ago

I'm a little bit lost with the PR title, description and the bug that points the fix.

The bug description says:

This is because the blocked status message for invalid redfish creds has been overwritten by the missing relation message while executing the update status hook.

I'm not seeing any changes in the update status hook.

What are we fixing by removing the validation of the exporter configs on the install hook?

Sorry about the confusion. We made some findings over the course the PR due to which we had to change the originally intended fix. I've made some changes to the PR description that hopefully clarifies your doubts. If not, feel free to respond with any questions and I'll try my best to address them.