canonical / observability-libs

A collection of charm libraries curated by the Observability team.
https://charmhub.io/observability-libs
Apache License 2.0
3 stars 8 forks source link

[resource-limits] Add `last_error` public attribute, to enable charm to use collect-status #63

Closed sed-i closed 10 months ago

sed-i commented 11 months ago

Issue

Status updates are communicated via custom events.

Solution

Add last_error public attribute, to enable charm to collect-status.

In tandem with:

sed-i commented 11 months ago

Is this how you'd expect collect-status to be used in components such as libs, @benhoyt @tonyandrewmeyer?

benhoyt commented 11 months ago

@sed-i Sort of ... though it looks like in this case the charm lib is firing a custom patch_failed event and has is_ready() and last_error methods/attributes. So you probably wouldn't actually do this in the charm lib at all. Keep that nice and stand-alone, and leave the status handling to the charm itself. So the charm would do something like this:

class SomeCharm(ops.CharmBase):
    def __init__(self, *args):
        # ...
        self.resources_patch = KubernetesComputeResourcesPatch(...)
        self.framework.observe(charm.on.collect_unit_status, self._on_collect_unit_status)

    def _on_collect_unit_status(self, event):
        if self.resources_patch.is_ready():
            event.add_status(ops.ActiveStatus())
        elif self.resources_patch.last_error:
            event.add_status(ops.BlockedStatus(self.resources_patch.last_error))
        else:
            event.add_status(ops.MaintenanceStatus('waiting for patch'))

Note that because KubernetesComputeResourcesPatch has a last_error, you could just use this directly, instead of listening to patch_failed.

That would be my suggestion, at any rate.

sed-i commented 11 months ago

Thanks @benhoyt! The custom event is how we currently propagate status updates. Certainly that part won't be needed with collect-status. I tend to agree that it's probably best to have only the charm use add_status.

benhoyt commented 11 months ago

One slightly unfortunate thing to note about collect-status is that your charm has to be "all in" with it. You can't really switch half of your charm to use collect-status / add_status and leave the rest using unit.status = x, because the unit.status assignment will be overwritten with the (partially-implemented) collect-status status.

sed-i commented 11 months ago

I don't mind this approach, but it feels a bit awkward in that is_ready would return False iff a last error has been set, which means that if you get a False, then you have to go search for a last error.

Not sure what you mean. Could you point to a line in code?

PietroPasotti commented 10 months ago

I don't mind this approach, but it feels a bit awkward in that is_ready would return False iff a last error has been set, which means that if you get a False, then you have to go search for a last error.

Not sure what you mean. Could you point to a line in code?

Can't point to a line in the code because it's about the API it exposes. IIUC the current API implies that, as a user of this lib, the charm would:

lib = Lib()

if lib.is_ready():
    # go about your happy path

else:
    # retrieve the error msg
    error = lib.last_error

Which I think is a bit awkward. An alternative I was proposing is:

lib = Lib()

status: Success | Failure = lib.get_status()
if isinstance(status, Success):
    # go about your happy path

else:  # Failure
    # retrieve the error msg
    error = status.message

I realize this deviates from the nice is_ready pattern we've been gliding so far, but fact is now we're introducing a new piece of data: it's not just 'are you ready? --> YES | NO', now it's: 'are you ready, and if not, why? --> YES | NO (because)'

And if we plan to build this out to all other is_ready methods we have in our charm libs, it might be worth it to figure out a generalization-worthy pattern.

sed-i commented 10 months ago

That is a good point. What immediately comes to mind is the result package.

But I am not sure we want to couple readiness with errors:

PietroPasotti commented 10 months ago

That is a good point. What immediately comes to mind is the result package.

Ha, never heard of that. Interesting!

But I am not sure we want to couple readiness with errors:

  • Could there be a circumstance where we want is_ready to return True, while at the same time having an error (Blocked)?

At charm level, sure, something like the 'degraded' status @jnsgruk was brainstorming about a few Pragues ago, but in this case we're talking about a pattern charm libraries, which are (for the most part, at least our libs) simple enough to either be 'ready' or 'not ready', without intermediate 'ready-but-kind-of-borky' states.

  • last_error is set during operations, not necessarily during "get status", so we would need that variable anyway (perhaps prepend with an underscore, _last_error). Which means later on we could hide this behind a "get status" if we choose to.

Fair enough, I'm game.

  • Could "last error" and "is ready" go out of sync? last_error gets reset per charm reinit so if an issue is resolved during custom hooks, is_ready could be True, but last_error still holds the past.

Which makes me think, last_error may also quickly become stale as it currently stands (independently of is_ready).

if not lib.is_ready():
    err = lib.last_error  # set because not ready
    lib.do_something_to_fix_error()
    err2 = err.lib.last_error  # still set because it's only cleared on __init__
sed-i commented 10 months ago

Which makes me think, last_error may also quickly become stale as it currently stands (independently of is_ready).

I'm not sure we can do anything about it when the lib has multiple entry points (is_ready, on_config_changed -> patch).

Even if I give each error individual attention,

@dataclass
class LibErrors:
    obtain_limit: Optional[str]  # Error obtaining resource limit from user function
    validity: Optional[str]  # Resource spec is invalid
    config: Optional[str]  # ConfigError
    api: Optional[str]  # ApiError
    patch: Optional[str]  # Failed applying patch
        try:
            do_something()
            self.errors.obtain_limit = None
        except ValueError as e:
            msg = f"Failed obtaining resource limit spec: {e}"
            self.errors.obtain_limit = msg

They could still be stale for the same reason.

If we stop using custom events, then...

sed-i commented 10 months ago

I closed the tandem alertmanager PR for now. Let's try to come up with a solid collect status pattern first. Closing this too, for now.