canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0
245 stars 119 forks source link

Prevent the deferred events queue to contain duplicates #935

Open PietroPasotti opened 1 year ago

PietroPasotti commented 1 year ago

This is a bit of a large topic, but the broad question is: suppose my charm has deferred a foo event, should we disallow (or enable people to guard against) a situation where two or more foo deferred events are in the queue?

A typical use case for event deferrals is a 'missing preconditions' case: an event can't be processed because some preconditions are missing. If the same event is triggered again before the preconditions are met, we now have two copies of that event in the deferral queue. When at some point in the future the precondition is eventually met and the queue flushes, we'll be executing that handler multiple times quickly in succession, that may result in the worst case into bugs (or downtime) such as when a flurry of pebble service restarts hits a workload container.

Main options I see for addressing this, from more to less aggressive, are:

Do we have a 'use case' for not guarding against this, i.e. for leaving things as they are, and allow the same event to occur multiple times in a deferral queue?

jameinel commented 1 year ago

In my opinion, I'd love to guard against defer generally. If your preconditions aren't met, you'd really rather have a way to wait for those preconditions explicitly (a different event), rather than deferring this event. I realize some preconditions can't be made into events (yet). But it certainly falls into the "you shouldn't defer config-changed because of a problem with config, because you'll get another config-changed when the config actually changes". But if config-changed needed relation-data, then you get into a pretty sticky situation. Though for that example, why defer at all, vs just waiting for the other one?

I don't see any reason that you should queue up an event more than once (given a receiver and event pair). Making it an exception just means that existing charms will start going into Error state (because they won't be handling the error). Silently omitting the duplicate is a bit iffy, but might be the best option.

I'm still of the mind that calling event.defer should feel bad, as it indicates we are missing something in the model, rather than a good practice.

On Tue, May 23, 2023 at 9:16 AM PietroPasotti @.***> wrote:

This is a bit of a large topic, but the broad question is: suppose my charm has deferred a foo event, should we disallow (or enable people to guard against) a situation where two or more foo deferred events are in the queue?

A typical use case for event deferrals is a 'missing preconditions' case: an event can't be processed because some preconditions are missing. If the same event is triggered again before the preconditions are met, we now have two copies of that event in the deferral queue. When at some point in the future the precondition is eventually met and the queue flushes, we'll be executing that handler multiple times quickly in succession, that may result in the worst case into bugs (or downtime) such as when a flurry of pebble service restarts hits a workload container.

Main options I see for addressing this, from more to less aggressive, are:

  • raise an AlreadyDeferredError from Event.defer() if the same event was already deferred (on the same handler).
  • (or silently skip adding the duplicate event to the queue, but not a big fan of that idea)
  • allow the charm to inspect the deferral queue: the typical pattern would be if not self.framework.has_deferred(event): event.defer()

Do we have a 'use case' for not guarding against this, i.e. for leaving things as they are, and allow the same event to occur multiple times in a deferral queue?

— Reply to this email directly, view it on GitHub https://github.com/canonical/operator/issues/935, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABRQ7IP4WIFBAADVOR7K7TXHS2CBANCNFSM6AAAAAAYL4JF7Q . You are receiving this because you are subscribed to this thread.Message ID: @.***>

benhoyt commented 1 year ago

At first glance what John is saying sounds good, however, @jameinel, wouldn't that approach also be nudging people to an "on any event, reconcile everything" approach to charming? Because you'd have to check for all preconditions on every event. Whereas with defer you get the specific event that was interrupted.

In any case, it makes sense to me to avoid duplicates in the queue. I don't see why it would ever be a good thing to defer two of the same event (unless the charm is written in a very odd way). I think we should probably skip adding the deferred event, but log that it happened. It seems like this wouldn't be too hard to fix (though the _reemit code is not easy to follow!).

@PietroPasotti Out of interest, did you run into this issue in "real life" because a charm was queueing up multiple duplicate events and that was causing problems? Or is it more of a "this could be an issue in theory" kind of thing?

benhoyt commented 1 year ago

Per voice discussion with John, usually in cases where you'd defer, you need to check all the preconditions anyway. For example, if you need both relation data and config data to write out your workload's configuration file, you have to check for both of those things.

jameinel commented 1 year ago

...

@PietroPasotti https://github.com/PietroPasotti Out of interest, did you run into this issue in "real life" because a charm was queueing up multiple duplicate events and that was causing problems? Or is it more of a "this could be an issue in theory" kind of thing?

I know that early on people would do things like "my config isn't correct, let me defer()" and they would end up with 20 deferred config events.

PietroPasotti commented 1 year ago

@PietroPasotti https://github.com/PietroPasotti Out of interest, did you run into this issue in "real life" because a charm was queueing up multiple duplicate events and that was causing problems? Or is it more of a "this could be an issue in theory" kind of thing?

Someone came to me with this piece of code:

def _on_config_changed(self, event) -> None:
     <do something>
      if self.controller_container.can_connect():
           self._restart_controller_service()

      else:
          self.unit.status = MaintenanceStatus("Waiting for kserve-controller to be reachable")
          event.defer()

and I saw a risk that if the config were to change multiple times before pebble was ready, we might queue multiple deferred config-changed and end up with a flurry of container service restarts as soon as the container can connect.

sed-i commented 1 year ago

I agree with @jameinel. Iirc, we do not (should not) use defer() because:

  1. Deferral isn't a juju concept so on upgrade all deferred events are lost (all your pending logic is lost). For example, the upgrade sequence does not emit any relation events, so charm authors would need to remember to redo all the logic that is exercised by deferrable events, every upgrade.
  2. Events are, by definition, most relevant when they are emitted. Deferring an event introduces a somewhat non-deterministic delay in processing: in the worst case it will be re-emitted on the next update status, which can be anywhere between 10sec and 1hr (I think).
  3. If your charm is at the mercy of update-status anyway, then you could put your logic there, instead of defer(), which would be functionally equivalent.
  4. The availability of the defer() functionality encourages charm authors to use defer() for inter-event flow control, which, as mentioned above, breaks on upgrade.
  5. A deferred event is always the first to be run next charm re-init, so it is not possible to have any new logic run between defer() and re-emit, making deferred events not as useful.
  6. Very quickly a defer() pattern puts a charm in error state because "two objects claiming to be X have been created".
benhoyt commented 11 months ago

We could add checking for this and eliminate the duplicate event, but I'm not sure it's worth it, given we're planning to re-evaluate or even deprecate defer (part of next cycle's roadmap). So I'm closing this specific issue and we'll focus on #966.

benhoyt commented 3 months ago

We didn't end up deprecating defer, and this wasn't resolved in #1205, so re-opening this. This would be done internally by, when defer() is called, checking that the existing deferral queue doesn't contain that event. And the content compared would be the event key (observer "path" or whatever) as well as the data ("snapshot").

tonyandrewmeyer commented 1 day ago

The complicate part of this is that calling defer() doesn't save the event+snapshot to the queue. All events get saved to the queue as part of emit(), and then each notice is dropped if the event+handler was not deferred, and the snapshot dropped if no event+handler was deferred.

This is unchanged from the "initial import" commit. It seems like this is done for consistency, rather than to protect against the charm crashing and losing an event (Juju will re-emit the event in that case and since we don't commit the storage it's not saved anyway).

We could skip storing the notice and snapshot if there's already one in storage - but we'll need to do that when the following Juju event comes along, rather than on defer(), and it will require some adjustments (loading whatever snapshot is in the storage to compare against, looping through the notices to see whether one is already there) that will end up applying to all events, which raises performance questions.

Alternatively, we could do a more significant adjustment; for example, having a "happy path" that when there is no defer() completely skipping the notice+snapshot system, and only doing a notice+snapshot store on defer(). This seems more straightforward and would loosen some of the constraints that the framework has (if you don't defer) and should be a performance improvement (no need to write to the sqlite database for every event, for one thing), but would make defer even less consistent than it currently is and would need consideration for Scenario, which uses the notice+snapshot system to trigger events.