canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0
245 stars 119 forks source link

Requesting clarification of docstring on event.defer() #713

Closed balbirthomas closed 1 year ago

balbirthomas commented 2 years ago

Issue

It will be useful if the docstrings of event.defer() explicitly emphasise

These two points essential imply that a Charm has no means to enforce any ordering on the execution of event handlers beyond what is inherent in the Juju event model. This I think is a key concept and may be worth emphasising even in our sdk docs.

Even though the doc strings may be alluding to these facts, it is easily possible to be misled by a statement such as "any deferred events get processed before the event (or action) that caused the current invocation of the charm."

balbirthomas commented 2 years ago

On second thoughts I would like to make a correction/qualification to my own statements above.

The statement I would like to qualify is "a Charm has no means to enforce any ordering on the execution of event handlers beyond what is inherent in the Juju event model." I think we should emphasise any "inherent" ordering of events by Juju is only valid if the none of the event handlers in a charm invokes a defer().

sed-i commented 2 years ago

There is another subtlety about defer() when used as an overarching pattern, that I think is worth mentioning in the docs:

[If many] startup event hooks use defer if a precondition is not met, [then] the charm may end up in stagnation until the next update-status. If the default update-status hook interval of 5min is kept by the user, then relying on update-status is not too bad, but if the user creates a controller or model with a default of, say, 60 min, then it may be too long. -- cos-config/12

rwcarlsen commented 2 years ago

Yeah - it's definitely worth mentioning that events deferred at certain app lifecycle times may not see any events other than update-status for a long time.

balbirthomas commented 2 years ago

Having a charm's lifecycle progression depending on the periodic occurrence of update status event is a bit worrying though, not in the least because the periodicity can be rather long. However I agree given the current semantics of defer() this could become a necessary evil for the following reasons.

This would be the case for charms managing any application where there is a prescribed order of events that need to happen in deploying or managing the application workloads particularly because application workloads need not respond instantaneously to requested changes in behaviour.

rwcarlsen commented 2 years ago

As a general rule - you shouldn't assume anything has happened previously inside charm code. Always check/confirm your priors. Do you think pebble is running in the workload because pebble_ready ran previously? - probably should check container.can_connect anyway, etc.

pengale commented 2 years ago

Thinking about this ...

In machine charms, I might expect the following code in an install hook:

    def _on_install(self, event):
        try:
            apt.install("some_package")
        except AptError as e:
            self.model.unit.status = BlockedStatus("Failed to install: {}".format(e))

The call to the apt library (in the operator-libs-linux) blocks while the package is being installed. Given that we are already blocking, it's not necessarily horrible practice to do something like:

    while True:
        started = self._is_my_service_started()
        if started: break
        time.sleep(0.1)

(I know that this contradicts some earlier statements on my part.)

The problem is that you have no chance to set your status to inform the operator that they need to wait, because the framework won't commit your changes until it is done executing a hook. This is somewhat okay in install, where the status is set to Maintenance anyway. But it is very much not okay in a config-changed handler, where the status might be Active, but the charm is secretly in Maintenance while it awaits the service being ready.

(It's not a great pattern in install, either. I really would prefer to call .defer, and set MaintenanceStatus("Waiting for service to start")

In sidecar charms, you can probably modify your container entry point so that it blocks pebble until the service is ready, which gives you a nice pebble-ready event you can react to. But that's not a universal solution.

We really need workload events :-/

rwcarlsen commented 2 years ago

Maybe we need a way to "commit" a status change mid-hook... Has that been considered/discussed before?

pengale commented 2 years ago

Maybe we need a way to "commit" a status change mid-hook... Has that been considered/discussed before?

I don't think that this would play nicely with the Juju agent, which loads, runs, and then commits a hook event.

I think that it might be possible to .emit a new event, rather than deferring, which would be interesting. I need to go and check my understanding of the internals, but I think that the following would work:

def _on_config_changed(self, event):

    if not self._liveness_check():
        status = MaintenanceStatus("Awaiting liveness check")
        if self.model.unit.status == status:
            # Sleep on the second and later times through.
            time.sleep(1)
        self.model.unit.status = status
        self.on.config_changed.emit()
        return

This will work if the Juju agent is processing Ops events as standalone hooks, which I think that it does (otherwise, deferring an install hook and then attempting to execute it in a relation-changed hook's context would be awkward).

rwcarlsen commented 2 years ago

Hmm... that doesn't seem like a very clean solution either. Perhaps it's worth making a separate issue to capture this idea of managing status changes generally...

pengale commented 2 years ago

The syntax of the above routine doesn't have to be awful for charm authors. I included everything inline for clarity. The final syntax might be:

if not self._liveness_check():
    event.defer_with_retry(msg="Awaiting liveness check", retry_after=1)
    return

The event object has access to framework.model, as well as its own name, so setting the status and emitting the event can all happen inside .defer_with_retry.

We could even wrap everything in the existing defer. The new function signature might look like:

def defer(
    self, 
    message: Str = "Event {event_name} deferred", 
    status: StatusBase = MaintenanceStatus, 
    retry_after: Int = 0
):

Queuing the sleep off of the status message is a little bit delicate, but in most cases, the failure mode is just a sleep gets skipped. If we want to guard against the case where two deferred events clobber each others' status while pegging the CPU, we could hijack the snapshot and restore methods to track how many times the event has fired. (That would allow us to implement exponential backoff, too, which would be nice.)

Ultimately, the "correct" solution is to have workload events. If we can do something relatively lightweight that tides folks over until that feature is implemented, it might make sense to do so, however.

rwcarlsen commented 2 years ago

What do you mean by "workload events" - is this something that has been discussed before/elsewhere?

PietroPasotti commented 1 year ago

@balbirthomas Trying to close this issue... Let's go back to the event.defer() docstring. It currently is:

       """Defer the event to the future.

        Deferring an event from a handler puts that handler into a queue, to be
        called again the next time the charm is invoked. This invocation may be
        the result of an action, or any event other than metric events. The
        queue of events will be dispatched before the new event is processed.

        From the above you may deduce, but it's important to point out:

        * ``defer()`` does not interrupt the execution of the current event
          handler. In almost all cases, a call to ``defer()`` should be followed
          by an explicit ``return`` from the handler;

        * the re-execution of the deferred event handler starts from the top of
          the handler method (not where defer was called);

        * only the handlers that actually called ``defer()`` are called again
          (that is: despite talking about “deferring an event” it is actually
          the handler/event combination that is deferred); and

        * any deferred events get processed before the event (or action) that
          caused the current invocation of the charm.

        The general desire to call ``defer()`` happens when some precondition
        isn't yet met. However, care should be exercised as to whether it is
        better to defer this event so that you see it again, or whether it is
        better to just wait for the event that indicates the precondition has
        been met.

        For example, if ``config-changed`` is fired, and you are waiting for
        different config, there is no reason to defer the event because there
        will be a *different* ``config-changed`` event when the config actually
        changes, rather than checking to see if maybe config has changed prior
        to every other event that occurs.

        Similarly, if you need 2 events to occur before you are ready to
        proceed (say event A and B). When you see event A, you could chose to
        ``defer()`` it because you haven't seen B yet. However, that leads to:

        1. event A fires, calls defer()

        2. event B fires, event A handler is called first, still hasn't seen B
           happen, so is deferred again. Then B happens, which progresses since
           it has seen A.

        3. At some future time, event C happens, which also checks if A can
           proceed.

        """

It has been rewritten a couple of times since this issue was opened. Do you think it's clear enough as it stands?

benhoyt commented 1 year ago

I think the updated docstring that Pietro copied above is thorough enough that we can mark this fixed.