Deprecate event deferral

sed-i commented 1 year ago

event.defer() sends charm authors on the wrong path (see enumeration below) and stands in the way of new necessary features. Can we deprecate it?

Why you shouldn't use `defer`

(Adapted from https://github.com/canonical/operator/issues/935#issuecomment-1565772563)

Deferral isn't a juju concept so on upgrade all deferred events are lost (all your pending logic is lost). For example, the upgrade sequence does not emit any relation events, so charm authors would need to remember to redo all the logic that is exercised by deferrable events, every upgrade.
Events are, by definition, most relevant when they are emitted. Deferring an event introduces a somewhat non-deterministic delay in processing: in the worst case it will be re-emitted on the next update status, which can be anywhere between 10sec and 1hr (I think).
If your charm is at the mercy of update-status anyway, then you could put your logic there, instead of defer(), which would be functionally equivalent.
The availability of the defer() functionality encourages charm authors to use defer() for inter-event flow control, which, as mentioned above, breaks on upgrade.
A deferred event is always the first to be run next charm re-init, so it is not possible to have any new logic run between defer() and re-emit, making deferred events not as useful.
Very quickly a defer() pattern puts a charm in error state because "two objects claiming to be X have been created".

benhoyt commented 1 year ago

Thanks for that list of reasons, @sed-i. I actually think this is a good idea, and you're right that defers cause a bunch of problems. If you're waiting for some other precondition to be true, just wait for that precondition / event, rather than defer.

That said, it might be a bit premature to deprecate it until we have custom workload events (via the upcoming Pebble Notices). Some things the charm will need to be woken up by aren't events right now, so charms have to rely on defer (and the deferred event being re-emitted during update-status).

Thoughts on deprecating @jameinel? I know you've expressed a similar sentiment in the past.

dstathis commented 1 year ago

@benhoyt I pretty much agree with you. We should get rid of defers but we need a better waiting mechanism than "just do it in update-status". Duplicate code often gets out of sync and might cause more bugs than deferring. The alternative would be just waiting inside the hook which seems like a very bad idea as it could halt the execution of all hooks for an unbounded amount of time.

I definitely think it would be premature at this point.

ca-scribner commented 1 year ago

Agreed too. Having a way to defer_until(time) (similar to how you can ask for a retry in a kubernetes controller when you raise an error) would be fantastic and solve some tricky problems in charms that don't have a good answer right now. But until that exists, deprecating defer would have too many downsides.

gnuoy commented 1 year ago

A scenario that I hit recently which I think could easily trip up others is when conditions, such as leadership, gate deferring an event. For example if an application (lets say a database) has multiple units it is reasonable for a charm author to say that they want the lead unit to manage creating users etc in the database itself.

The request for a db user via a juju relation could come at anytime during the charms lifecycle so if this happens before the db is ready the code below will defer the event for the lead unit but not for the others

def _on_db_user_requested(self, event):
    ...
    if self.unit.is_leader():
        if self.db_ready():
            self.create_user(event)
        else:
            event.defer()
    else:
        return

If the leader pod is recreated for any reason the client request is lost. In fact, if the leadership changes for any reason before the db is ready the deferred request will not be processed (unless leadership happens to flip back to the previous leader). An alternative approach would be for all units to defer the request. But on a system where leadership is stable every non-leader unit on every hook execution will have to process and re-defer these events potentially for the lifetime of the application (in a DBaaS offering this could be a lot of events).

Perhaps I have missed a trick somewhere but I hit this very situation when testing the OpenStack k8s charms (admittedly on a system under memory pressure)

phvalguima commented 10 months ago

Hi team, I'd like to make a case for keeping defer :)

The main advantage of defer is organizing the code. I will make one example: we have 3x apps - A, B and C. They connect in this way:

B -> A <- C

And we need to run whatever logic in C- events after B-.

Now, at deployment of that bundle, we will have in unit of app A, the following events happening: install, config-changed, then B-joined, C-joined, B-changed and C-changed and there is no order guarantee in the last 4x events. For a charm developer, it results in 4! = 24 different possible orders (if we have 1x unit per app).

Some time ago, we'd use @hooks.hook() decorator to point a given event to a method. My charm would look like:

def figure_out_my_state_machine():
    # Write a method that touches all relations, and checks for
    # keys and values in each databag, or unit count, or etc

    # A method with a lot of "IFs" and potentially a C901 error
    # in my linting later on.
    ...
    <RUN SOME B-relation LOGIC >
    < RUN 1st PART OF C-relation LOGIC >

@hooks.hook("C-joined")
@hooks.hook("C-changed")
def run_c_joined():
    # Can I run C? Where am I in the state machine?
    if figure_out_my_state_machine() == state.C_can_run:
        < run the 2nd part of C-relation logic >

@hooks.hook("B-joined")
def run_b_joined():
    # Now I have a problem :)
    # I may need to rerun the "run_c_joined" above, if this
    # event is happening after C-joined arrived. So, what I
    # will have to do is the following:
    state = figure_out_my_state_machine() # Now I know where am I

    <RUN B-relation LOGIC >

    state = figure_out_my_state_machine()
    if figure_out_my_state_machine() == state.C_can_run:
        # I can run C, but maybe all the -joined and -changed
        # already happened for C. I will have to trigger it again.
        # Unfortunately, I have two options now: either call the
        # run_c_joined method here, OR do some dummy update in the
        # C relation and trigger a new wave of -changed events...
        relation_set(<set some stuff in the C-relation>)  # that will trigger a new wave of events

With defer, I can actually represent that entirely differently:

class MyCharm(CharmBase):

    def __init__(self, *args):
        super().__init__(args)
        self.framework.observe(self.on.c_joined, self._on_c_event)

    def _on_c_event(self, event):
        # This method can contain everything I need to know about C-relation
        # Because if I need to stop in the middle of the process, I can
        # defer the event and pick it up later.
        checking_c_is_valid = < RUN 1st PART OF C-relation LOGIC >
        if checking_c_is_valid:
            < run the 2nd part of C-relation logic >
        else:
            event.defer()

The defer provides a very elegant interface for me to organize my code. Actually, I can even separate the C-relation logic entirely and put it in its own class, that observes C-relation and accesses the charm if it needs some other details. I can break the entire state machine of MyCharm into smaller chunks that are self-contained, then write tests for these chunks.

The first iteration of charm lib could not do it, unfortunately. Actually, on that first iteration, it was common to do exactly as mentioned in:

The availability of the defer() functionality encourages charm authors to use defer() for inter-event flow control, which, as mentioned above, breaks on upgrade.

The second iteration (reactive), we could create custom states and gate certain functions to be called at certain moments. It was very flexible and we could break the code as we wanted. However, I think it gave us just too many options. We either overused it or oversimplified and returned to a variant of hooks.hook, with simple event logic and big methods doing a bunch of stuff. It was not easy to troubleshoot and how to figure out which stage you are in your state machine.

The defer is an interesting intermediary. I still call 1x method per event, but I have the option of having 100% of the code related to that event in the same place. Besides, troubleshooting was simplified with tools such as jhack.

We may have issues with defer, such as deferred events at upgrade, deferring at stop, how many events of the same type we can have running, or how to not wait for update-status (if it comes). I'd say these challenges are steps forward actually, and an opportunity to simplify the operator.

For example, if events have all the same parameters (or are empty), I honestly think we should silently abandon a deferred event if another is already scheduled. That means less event runs, with potentially less time touching relation databags, restarting services, etc. Likewise, there are solutions for the stop issue, as proposed in the bug.

At upgrade, I do not really see a problem to break the "deferred events" and have the upgrade-charm to be in charge of bridging any intermediary state. I can actually make the same case for a "install" event that happens at day-2, i.e. we have a bunch of units and databag is already populated. There is some figuring out in "where we are at this moment" which is inevitable. In my opinion, this problem is independent of defer.

benhoyt commented 10 months ago

That's robust push-back and excellent commentary, thanks @phvalguima. I'm putting together a notes document which I'll use to guide the agenda for the meeting at the sprint.

carlcsaposs-canonical commented 10 months ago

The defer provides a very elegant interface for me to organize my code. Actually, I can even separate the C-relation logic entirely and put it in its own class, that observes C-relation and accesses the charm if it needs some other details.

@phvalguima How are you able to separate the C relation logic entirely if handling for the C relation depends on information in the B relation?

The mysql router charm handles exactly the situation you mentioned—it has two relations & it has to handle them in a specific order.

MySQL Server <-> MySQL Router <-> App that uses database

The relation with the App depends on information in the relation with Server.

Previously, the charm was written with two separate files for each relation with their own event handler & with defers. However, these weren't cleanly separated and self-contained event handlers—each event handler contained a lot of information about the other relation. This made the charm quite difficult to read (and buggy)—so we refactored the two event handlers into a single event handler and it make things much simpler & more robust.

What I'm getting at is that if you have event ordering dependencies between two relations, I believe you also have information dependencies across the two relations—and it won't be possible to have two cleanly separated handlers

Do you see a clean design with defer for the MySQL router example? Or is your example different from the MySQL router example?

I think there's a lot of use cases for defer, most importantly that many existing charms are written to rely on the defer architecture. I'm struggling to understand if this example in particular is a good use case for defer

phvalguima commented 10 months ago

For clarification: when I say B -> A, it means that B will run most of/all the relation-set requests.

@carlcsaposs-canonical, as I understand, in your case Mysql Router is in charge of receiving its own credentials from server and then, during its lifecycle, generating 1/more app credentials and passes them to the app.

You can always mix handlers: as you are doing in reconcile. But I can also see the # noqa: C901 in that method, as I've mentioned above.

So, in my view, your charms work like: MySQL Server -> MySQL Router -> App that uses database This is the flow of configuration, whereas the dataplane works in the other direction.

With defer, we could write it as follows:

class MyRouter(CharmBase):

    def __init__(self, *args):
          super().__init__(args)
          self.framework.observe(self.framework.on.database_changed, self._database_changed)
          self.framework.observe(self.framework.on.database_broken, self._database_broken)

          self.framework.observe(self.framework.on.app_changed, self._app_changed)
          self.framework.observe(self.framework.on.app_changed, self._app_broken)
          self.load_admin_credentials_from_relation()

    def _database_changed(self, event):
          pass  # Nothing really needed, as the admin password is loaded at __init__ always

    def _database_broken(self, event):
          # revoke the users above, remove them from the database directly using mysqlsh
          # this method needs a read-only way to get the list of app users

    def _changed_changed(self, event):
         if not self.admin_credentials:
            event.defer()  # we must wait until credentials are ready
            return
        # Do what is needed to create the user

    def _app_broken(self, event):
          # revoke the user corresponding to your app

sed-i commented 10 months ago

I wonder if pebble notices could obviate defer for the use cases @carlcsaposs-canonical describes.

tonyandrewmeyer commented 5 months ago

We're not going to deprecate defer(), although we are suggesting alternatives where possible. The work is outlined in #1125.

canonical / operator