canonical / hardware-observer-operator

A charm to setup prometheus exporter for IPMI, RedFish and RAID devices from different vendors.
Apache License 2.0
7 stars 15 forks source link

Exporter should be in error state instead of blocked if installation fails #203

Open dashmage opened 6 months ago

dashmage commented 6 months ago

The exporter is installed as part of the charm's install event handler. Currently, if the installation of the exporter fails, we put the charm into blocked status. But ideally, since the user cannot do anything to recover from the failure, we should let the error bubble up and have the charm be in error status.

Pjack commented 6 months ago

However, another thing showed up to me while reading the _on_install_or_upgrade. If it fails to install the charm is setting to blocked state, and this does not look right to me. Blocked means that the user should manually do something. What the user should do if the installation fails? IMO, we should raise an exception and let to be on error state.

If installation fails due to network connectivity issues, the operator is responsible for addressing the network problem, as it's not a bug within the charm itself. Therefore, setting the status to "blocked" is reasonable in this scenario. In summary, any error state indicates a potential bug.

Similar discussion here https://discourse.charmhub.io/t/its-probably-ok-for-a-unit-to-go-into-error-state/13022/24 image

My understanding come from these two documents.

error : The unit is in error, likely from a hook failure. The operator will need to file a bug blocked : The charm is stuck. Human attention/intervention is required.

Once the charm enter error state, it's hard to recover from it. I tend to avoid it if possible, especially when I don't think it's a bug.

chanchiwai-ray commented 6 months ago

However, another thing showed up to me while reading the _on_install_or_upgrade. If it fails to install the charm is setting to blocked state, and this does not look right to me. Blocked means that the user should manually do something. What the user should do if the installation fails? IMO, we should raise an exception and let to be on error state.

If installation fails due to network connectivity issues, the operator is responsible for addressing the network problem, as it's not a bug within the charm itself. Therefore, setting the status to "blocked" is reasonable in this scenario. In summary, any error state indicates a potential bug.

Yes, that's not a bug within the charm itself, so if the user file a bug, we should explain that it's not a bug and potentially offer a way to avoid it. A blocked status in this case is a dead end, install hook will not run again even when operator fixed the issue; on the other hand, an error state can be recovered: juju resolved --all will retry the failed hook

gabrielcocenza commented 6 months ago

I agree with @chanchiwai-ray . The install hook is one shot and if we block it and return, there is no way to run it again. If a similar issue was on config-change or update-status, then blocking the unit is totally fine

aieri commented 2 months ago

is it possible to defer an install hook? I agree that if the installation fails for external reasons it's not a charm bug, hence we should not produce a stack trace. On the other hand, we should retry automatically and notify the user of why installation did not complete.

Pjack commented 6 days ago

After the discussion in the OpenStack Exporter Operator project, I believe we now have clear guidelines and know how to proceed. https://github.com/canonical/openstack-exporter-operator/pull/110/files https://github.com/canonical/openstack-exporter-operator/pull/100#discussion_r1758175152

Please modify the behavior: Raise the error and let operator framework handle it. Here is the reference code https://github.com/canonical/openstack-exporter-operator/blob/main/src/charm.py#L148