canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0
247 stars 119 forks source link

Consider exposing `_collected_statuses` somehow #1371

Closed dimaqq closed 2 months ago

dimaqq commented 2 months ago

What do folks think about adding:

https://matrix.to/#/!eNCNzcteYUDDYpStsu:ubuntu.com/$KN0-vPSBxg9jqzoLkLtg_k9P2QGlxEMoDTR8tQUg_go?via=ubuntu.com&via=matrix.org&via=rory.gay

@james-garner-canonical said

A pattern for handling setting your charm's status that's mentioned in the juju sdk k8s tutorial is to do something like

class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework) -> None:
        ...
        framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _on_collect_status(self, event: ops.CollectStatusEvent) -> None:
        if "some_condition":
            event.add_status(ops.BlockedStatus("msg"))
        # if ... etc
        # unconditionally:
        event.add_status(ops.ActiveStatus())

and let this take care of setting the status to the most significant one. I think this is quite nice, but can you only call add_status when handling this event? Seems to be the intention, as it's a method of ops.CollectStatusEvent specifically. But setting up a dummy ops.CollectStatusEvent and monkey patching in your charm's framework elsewhere does seem to work fine since collected statuses are ultimately stored on the unit or app. So this seems to work fine:

class MyCharm(ops.CharmBase):
    ...

    def _add_status(self, status: ops.StatusBase) -> None:
        """Dirty hack to add status outside _on_collect_status for auto prioritisation."""
        event = ops.CollectStatusEvent(ops.Handle(parent=None, key=None, kind="whatever"))
        event.framework = self.framework
        event.add_status(status)

    def _my_observer(self, event: ops.BaseEvent) -> None:
        if "something_bad":
            self._add_status(ops.BlockedStatus("explanation"))

I'm not sure if this is more of an Ops Library Development question or a Charm Development question, but what's the recommended approach here, and/or what's the reason for implicitly limiting add_event to methods observing collect_unit_status or collect_app_status (where you'll have an ops.CollectStatusEvent)?

@dimaqq replied

I'm not sure about the best API off the top of my head, but looking at the event code, you should be able to do smth like self.app._collected_statuses.append(...)

A good API to do this, for both app and units, may very well be a nice feature request / feature that we may implement. (after a discussion ofc)

benhoyt commented 2 months ago

This might appear to work, but because it's not evaluating the status of all the pieces/components of the charm, it has the same issue that people had before collect-status was introduced, where the charm would set the status to active even though one of the components was still blocked/waiting. Copying my comment from the chat:

You could do this, but I wouldn't recommend it. add_status is only supposed to be called inside the collect-status event, because that's intended to gather the current "state of the world": in other words, everything that could affect the status of the app or unit should be evaluated at that point. This is evaluated after every Juju event, to ensure the status is complete and up to date for all the "components" that the charm is managing.

What we found before the collect-status API was introduced was that people were setting status on a charm with (say) two components, and then one component would get un-blocked and set the status to active ... but the other component of the charm was still actually blocked! And I think that sort of thing would happen with the approach you've suggested too.

You have to evaluate the status of all pieces/components of the charm, either by evaluating them at the time (the collect-status approach) or by keeping state and remembering what they were previously (an approach I believe the OpenStack charms take).

tonyandrewmeyer commented 2 months ago

It's a vastly larger project, but I think the "per relation status" (or some other form of multiple statuses per unit) idea that I've heard talked about would solve this in a better way.

james-garner-canonical commented 2 months ago

After conversation with Ben and Tony, I now see the issue with the pattern I proposed. For me this is most clearly illustrated with an example:

class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        ...
        framework.observe(self.on.event_1, self._observer_1)
        framework.observe(self.on.event_2, self._observer_2)
        framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _observer_1(self, event):
        # something is wrong!
        # a dirty hack to call add_status or unit._collected_statuses.append etc
        # with ops.BlockedStatus(), for example

    def _observer_2(self, event):
        # everything is ok, don't call anything related to status

    def _on_collect_status(self, event):
        event.add_status(ops.ActiveStatus())

So if we have an event_1 event, then our charm runs __init__, _observer_1 (adding ops.BlockedStatus() to the list), and then finally _on_collect_status, which adds the default ops.ActiveStatus() and then resolves to setting the status as blocked since that's higher priority.

Next, we have an event_2. The issue with event_1 hasn't been resolved, but now we have __init__, _observer_2 (not adding any statuses), and finally _on_collect_status, which adds the default ops.ActiveStatus() and then resolves to setting the status as active since that's the only status it knows about. This is wrong, as the status should remain blocked, since the issue with event_1 is unresolved.

If we instead used the recommended pattern of checking all the things that could be wrong in _on_collect_status, then the charm would know the issue was unresolved, and end up keeping the status as blocked after handling event_2.

benhoyt commented 2 months ago

Yep, that's exactly right, @james-garner-canonical -- thanks for the example code. Given that, I'm going to close this as not planned.