dashmage closed this issue 4 months ago
I think this issue is due to using `ErrorStatus`, which should not be used, as mentioned in the description here. At the very least, the error showing up in the logs should not look like this. From my knowledge, the charm should never set it on its own.
Good catch, it's setting `ErrorStatus` as part of the `restart_exporter` method here. This should probably be replaced with `BlockedStatus`. Raising the exception directly, as mentioned in Robert's comment, might be a better idea, since if the exporter crashed, the user probably can't do much to get hardware-observer functional again.
`BlockedStatus` or raising an exception (which will end up as an error state), but not setting `ErrorStatus` directly.
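For illustration, a minimal sketch of both suggestions, assuming a `restart_exporter` method on the charm and an `exporter` helper with a `restart()` call (names taken from this thread; this is not the actual charm code):

```python
import ops


class HardwareObserverCharm(ops.CharmBase):
    def restart_exporter(self) -> None:
        try:
            self.exporter.restart()  # `exporter` helper assumed from the thread
        except Exception:
            # Either surface the failure as BlockedStatus so the user can see
            # what went wrong without the unit going into error state...
            self.unit.status = ops.BlockedStatus("Failed to restart exporter")
            # ...or re-raise instead, letting Juju itself put the unit into
            # error state:
            # raise
```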
Did you check why this exporter failed to restart? My guess is that the charm did not install the exporter at all because it did not pass the validation in the `on_install` / `on_upgrade` event. As a quick fix, we can consider removing the config validation step from the install and upgrade events, because those events are followed by a `config_changed` event, where the config validation is done as well.
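A rough sketch of that quick fix, with hypothetical handler and helper names, leaving validation to the config-changed handler only:

```python
import ops


class HardwareObserverCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.install, self._on_install)
        self.framework.observe(self.on.upgrade_charm, self._on_upgrade)
        self.framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_install(self, event: ops.InstallEvent) -> None:
        # No config validation here: install is always followed by a
        # config-changed event, which validates the config anyway.
        self._install_exporter()  # hypothetical helper

    def _on_upgrade(self, event: ops.UpgradeCharmEvent) -> None:
        self._install_exporter()  # hypothetical helper

    def _on_config_changed(self, event: ops.ConfigChangedEvent) -> None:
        if not self._validate_config():  # hypothetical helper
            self.unit.status = ops.BlockedStatus(
                "Invalid config: 'redfish-username' or 'redfish-password'"
            )
            return
```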
You also mentioned another issue: the charm status is overwritten in `on_update_status`. We should consider using `on_collect_unit_status`, and avoid calling `self._on_update_status(event)` as a callback on each event.
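For reference, a minimal sketch of the collect-status approach (the `collect_unit_status` event and `add_status` are from the ops library; the two predicate helpers are hypothetical):

```python
import ops


class HardwareObserverCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.collect_unit_status, self._on_collect_unit_status
        )

    def _on_collect_unit_status(self, event: ops.CollectStatusEvent) -> None:
        # Each concern adds a candidate status; ops picks the highest-priority
        # one at the end of the hook, so one blocked message can no longer
        # silently overwrite another.
        if not self._redfish_config_valid():  # hypothetical helper
            event.add_status(
                ops.BlockedStatus(
                    "Invalid config: 'redfish-username' or 'redfish-password'"
                )
            )
        if not self._cos_agent_related():  # hypothetical helper
            event.add_status(ops.BlockedStatus("Missing relation: [cos-agent]"))
        event.add_status(ops.ActiveStatus())
```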
> My guess is that the charm did not install the exporter at all because it did not pass the validation in the `on_install` / `on_upgrade` event. As a quick fix, we can consider removing the config validation step from the install and upgrade events, because those events are followed by a `config_changed` event, where the config validation is done as well.
Yep, I traced the logic and that's why. Since it doesn't pass validation on install, all the exporter-related functions fail (`check_health`, `check_active`, etc.) and this causes the error to be raised. The quick fix idea is good, but the logic flow is very convoluted currently. We might need to revisit the lifecycle again later to streamline it.
Edit: Got an even simpler fix which might be better. When running `exporter.check_health`, we aren't checking whether the exporter is installed or not. If we do that, then we don't end up having this problem.
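A sketch of that guard inside the exporter helper (the `installed` property and service check here are hypothetical stand-ins, not the actual implementation):

```python
class Exporter:
    """Sketch of the exporter helper with an installation guard."""

    @property
    def installed(self) -> bool:
        # hypothetical check, e.g. the exporter service/snap being present
        ...

    def check_health(self) -> bool:
        # Guard first: a not-yet-installed exporter is simply "not healthy",
        # rather than an error that crashes the hook.
        if not self.installed:
            return False
        return self._service_running()  # hypothetical helper
```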
> You also mentioned another issue: the charm status is overwritten in `on_update_status`. We should consider using `on_collect_unit_status`, and avoid calling `self._on_update_status(event)` as a callback on each event.
Coincidentally, I was going through the same idea and was reading up on the docs for it so I can raise another PR. That would definitely help.
Thanks @chanchiwai-ray for your inputs :smile:
Deploy ubuntu, hardware-observer, and grafana-agent on a machine with Redfish.
hardware-observer revision is 48
Here are the juju debug logs
The charm should ideally be in blocked status because of the invalid redfish credentials, with the message `Invalid config: 'redfish-username' or 'redfish-password'`. But the charm actually gets blocked with the message `Missing relation: [cos-agent]`. This is because the blocked status message for the invalid redfish creds has been overwritten by the missing-relation message while executing the update-status hook.

Now relate the hardware-observer and grafana-agent charms.
This does not immediately change the existing blocked status message for the missing cos-agent relation. Only once the update-status hook runs again (after 5 minutes by default) does the charm's status change to the invalid redfish credentials one.
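As a hypothetical simplification of why this happens (not the actual charm code): an update-status handler that assigns `self.unit.status` unconditionally means the last writer wins, which is exactly the overwrite the collect-status approach above avoids.

```python
import ops


class HardwareObserverCharm(ops.CharmBase):
    def _on_update_status(self, event: ops.UpdateStatusEvent) -> None:
        # Each assignment replaces the unit status outright, so the relation
        # check here clobbers the earlier invalid-config blocked message.
        if not self._cos_agent_related():  # hypothetical helper
            self.unit.status = ops.BlockedStatus("Missing relation: [cos-agent]")
            return
        self.unit.status = ops.ActiveStatus()
```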
Now set the redfish credential config options.
The config-changed hook fails and the charm goes into error state with the following logs: