canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0
244 stars 119 forks source link

Should relation data be visible to charm code from within a `relation-broken` hook? #888

Open sed-i opened 1 year ago

sed-i commented 1 year ago

Currently, harness first emits relation-broken and only then invalidates the data.

This means that relation data is still accessible by charm code while inside the relation-broken hook. Is that intentional?

Based on the event page it sounds like charm code shouldn't be able to see any data when inside the broken hook.

https://github.com/canonical/operator/blob/4d38ef7ce2fe6dfc4034db5957d9e526ce0cf3d9/ops/testing.py#L697-L702

benhoyt commented 1 year ago

Juju actually allows the charm to access (stale?) relation data in relation-broken. However...

I've just tested this using two charms related to one another, each having a relation-broken hook and accessing relation data (from "this" or the "remote" app). The Harness actually gets this fairly different from real Juju in a number of cases.

Below are the results. The "setter" charm is the charm in the relation that's setting relation data via event.relation.data[self.app]["key"] = value, and the "getter" charm is the charm that's reading relation data.

charm which app Juju result Harness result
setter this {...data...} {...data...}
setter remote {} RuntimeError
getter this {} RelationDataAccessError
getter remote {...data...} RuntimeError
Show raw data from test To test this (unit tests and real Juju), I hacked my database and webapp charms that I've been using for secrets to add logging into the relation-broken hooks. The database charm is the "setter" and it sets relation data, and the webapp charm is the "getter" as it reads relation data. `self_data` means data accessed via `event.relation.data[self.app]` and `event_data` means remote data accessed via `event.relation.data[event.app]`. ``` # real Juju unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: self_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'} unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: event_data={} unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: self_data={} unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'} # database unit tests database _on_db_relation_broken: self_data={'db_password_id': 'secret:20a43c1e-41b3-49b2-ba42-6e11ec2cfffb'} database: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event # webapp unit tests webapp _on_db_relation_broken: self_data= # masks: exception when getting self data: webapp/0 is not leader and cannot read its own application databag webapp: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event ```

It seems odd that the Harness deviates so much from real Juju, which just allows reads in each case, even if the data is not useful/stale.

I presume we intentionally raise more errors than Juju in tests to try to catch problems early -- for example, the charm probably shouldn't be accessing remote relation data during relation-broken (but Juju lets you). And I'm not sure about the RelationDataAccessError for the case when the "getter" charm tries to read its own data -- that doesn't seem correct.

As to the original issue, it seems like the data is unable to be accessed (in 3 out of 4 cases!). @sed-i, can you post the actual code you were working with when you ran into this? Were you fetching relation data? Was it in the "getter" or "setter" charm? And was it via self.app (this) or event.app (remote)?

In any case, we should decide whether we want to mimic real Juju more closely. Or we should raise an error consistently if you access relation data in relation-broken in all cases. Is there ever a valid use case for doing that? @jameinel, thoughts?

sed-i commented 1 year ago

can you post the actual code you were working with when you ran into this?

Here's the utest that expects the charmlib/juju to take care of cleaning up relation data. Specifically, the utest was expecting that whatever custom events fire as a result of self.harness.remove_relation(rel_id), would not see any relation data. I think in this test it's the remote data that is expected to go away.

Is there ever a valid use case for doing that?

The pattern we were taking in o11y charms more often than not, is that rel data represents the most up to date state. After a relation-broken there is no other event that "reruns the charm" with the updated reldata. Relation-broken is the last chance to act on a change. This way, deep charm code doesn't need to know the event it's in (no need to if event is relation-broken then update everything ignoring data).

benhoyt commented 1 year ago

I'm probably missing something obvious, but I don't quite understand -- doesn't the above table show that relation-broken on real Juju still includes the previous data? So wouldn't expecting the Harness to do something different mean the unit test will behave differently in unit tests compared to under real Juju?

sed-i commented 1 year ago

I'm not sure I understand the table correctly: On relation broken, the remaining app can read the data that was set by the departed app?

benhoyt commented 1 year ago

Yeah, that's right -- that's the last row in the table. It shows the "getter" app (i.e., the other charm from the one that set the data) being able to read data that the remote app (the "setter") set. Under real Juju it can read this data, under the Harness you currently get a RuntimeError("remote-side relation data cannot be accessed during a relation-broken event"), which doesn't seem to match reality. (@jameinel any idea why the Harness tries to be different/stricter than reality here?)

You can see this from the following log line:

unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

The webapp charm is the "getter" in this case, and it was able to read that data -- during relation-broken -- that the database charm had set.

jameinel commented 1 year ago

During relation-broken you should be able to read the data from the unit that was on the other end, if we aren't doing that in Harness, we should allow it. (my only caveat is that really relation-departing is where you should be handling the information, by the time you get to 'relation-broken' its because the relation really is gone and you're supposed to be finalizing things, not doing more configuration based on the last thing that the remote unit shared with you.)

Note while you can read, you shouldn't be allowed to set data at that point.

I also feel that 'relation-broken' really shouldn't be "just another relation event like all the other ones". Deleting a file on disk is completely different to modifying it, or creating it. So the original argument that "I want a catch all behavior" doesn't really fit for me.

John =:->

On Tue, Jan 17, 2023 at 8:58 PM Ben Hoyt @.***> wrote:

Yeah, that's right -- that's the last row in the table. It shows the "getter" app (i.e., the other charm from the one that set the data) being able to read data that the remote app (the "setter") set. Under real Juju it can read this data, under the Harness you currently get a RuntimeError("remote-side relation data cannot be accessed during a relation-broken event"), which doesn't seem to match reality. @.*** https://github.com/jameinel any idea why the Harness tries to be different/stricter than reality here?)

You can see this from the following log line:

unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

The webapp charm is the "getter" in this case, and it was able to read that data -- during relation-broken -- that the database charm had set.

— Reply to this email directly, view it on GitHub https://github.com/canonical/operator/issues/888#issuecomment-1386355879, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABRQ7IB6DJU7POVV7INTODWS5E5NANCNFSM6AAAAAATZLT5HE . You are receiving this because you were mentioned.Message ID: @.***>

sed-i commented 1 year ago

Iiuc, this means that the following pattern is wrong:

def _on_relation_departed(self, _):  # or broken
    self._update_config()  # regenerate everything from current rel data

and instead we should do something like:

def _on_relation_departed(self, event):  # or broken
    self._update_config(excluding=event.relation.data)

Is that correct?

In other words, from within relation-departed/broken:

lucabello commented 1 year ago

Hi everyone, chiming in on this; is what @sed-i proposed the pattern we're supposed to follow?

sed-i commented 1 year ago

@PietroPasotti just gave me an idea: If we always defer a relation-broken event, then next hook (on update-status the latest) there will be no data left in relation data, so charm code could operate on the entire relation data, i.e. without needing to work with the delta that a relation-broken implies.

This is not a great pattern, but it conveys well our dissonance about relation-broken.

carlcsaposs-canonical commented 1 year ago

FYI in some cases (but not all) accessing the remote application data in a relation broken event causes an error

See: https://bugs.launchpad.net/juju/+bug/1960934 https://github.com/canonical/operator/blob/734e12dcfde93d7081aed5573e011128d98fd84a/ops/model.py#L1341-L1349 https://github.com/canonical/mysql-router-k8s-operator/issues/73

ca-scribner commented 1 year ago

What the Kubeflow team has seen with istio-pilot is like what @carlcsaposs-canonical reports. In live Juju when handling a relation-broken event:

  1. sometimes event.app=DEPARTING_APP (and I think relation.app is the same thing? probably an alias?)
  2. sometimes event.app=None

Up until today, we had only seen case (1) and whenever we saw it, we also knew that the departing application's data is still in the relation data bag. We handled this by popping the departing data before using the data bag.

# (simplified version - differs slightly from the link)
if isinstance(event, RelationBrokenEvent):
    relation_data.pop((event.relation, event.app))

Now that sometimes we see event.app=None, I wonder if we should instead do something like:

if isinstance(event, RelationBrokenEvent):
    try:
        relation_data.pop((event.relation, event.app))
    except KeyError:
        log_and_pass  # ?

The one question I have is whether, when event.app==None, are we guaranteed that the departing application's data has been removed from the databag? If not, that will cause trouble as we can't pop it

benhoyt commented 1 year ago

@jameinel Per the above and per https://bugs.launchpad.net/juju/+bug/1960934, it seems Juju is sometimes setting JUJU_REMOTE_APP but sometimes not setting it. Do you think that could be fixed on the Juju side? Or I guess we could change Juju to always not set it, but that might be too breaking.

sed-i commented 1 year ago

https://bugs.launchpad.net/juju/+bug/2024583

sed-i commented 1 year ago

Also, I think we usually don't want to see the stale data even on relation-departed:

  1. apps foo and bar are related.
  2. bar scaled down from 2 to 1.
  3. foo received relation-departed, but not broken.
  4. foo finishes running with the old data view which included the departing unit, and will only be able to act on the new data view (without bar/2) on update-status.

Real world example:

  1. Alertmanager is scaled from 2 to 1
  2. Prometheus needs to regenerate its yaml config file with only one target under the alertmanagers section.
sed-i commented 1 year ago

This issue is related to a frequent point of friction in charming: reconciling holistic vs deltas approaches.

benhoyt commented 1 year ago

Need to consider further what to do here. Possibly related to https://github.com/canonical/operator/issues/940 work.

carlcsaposs-canonical commented 1 year ago

Related: https://github.com/canonical/operator/issues/940#issuecomment-1623559826 (difference in usage between local and remote app data during relation-broken)

PietroPasotti commented 8 months ago

This is fixed in ops 2.10, isn't it?

carlcsaposs-canonical commented 8 months ago

This is fixed in ops 2.10, isn't it?

I believe the relation data is still accessible—and I think it should be, for the local app/unit databags

benhoyt commented 6 months ago

@tonyandrewmeyer is going to investigate this further and then we'll make a decision here.