canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0
246 stars 119 forks source link

Unable to access breaking relation data during deferred/custom ops event processed in relation broken juju event #1279

Open carlcsaposs-canonical opened 4 months ago

carlcsaposs-canonical commented 4 months ago

ops 2.10 introduced a breaking change that excludes breaking relations from Model.relations: https://github.com/canonical/operator/pull/1091

In order to access relation data for the breaking relation, we need to use event.relation—but this doesn't work in deferred ops events, custom ops events, or collect status ops events.

With ops, it is no longer possible to access the breaking relation data in those ops events during the relation-broken Juju event

Example of impact: mysql-router-k8s is currently written with no deferred events, no custom events during the relation-broken event, and no ops collect status events (because of limitations with status prioritization)—so it is not currently affected However, if mysql-router-k8s were to use deferred events in the future, or one of the charm libs it depends on were to use a deferred event or a custom event during relation-broken, it would be affected as following:

  1. Juju agent executes charm with relation-broken event
  2. ops emits deferred or custom event
  3. This block of code is executed: https://github.com/canonical/mysql-router-k8s-operator/blob/156da2da8c68c65ae262b2721cc01fb1b974104f/src/abstract_charm.py#L315-L345
            if self._unit_lifecycle.authorized_leader:
                if self._database_requires.is_relation_breaking(event):
                    if self._upgrade.in_progress:
                        logger.warning(
                            "Modifying relations during an upgrade is not supported. The charm may be in a broken, unrecoverable state. Re-deploy the charm"
                        )
                    self._database_provides.delete_all_databags()
                elif (
                    not self._upgrade.in_progress
                    and isinstance(workload_, workload.AuthenticatedWorkload)
                    and workload_.container_ready
                ):
                    self._reconcile_node_port(event=event)
                    self._database_provides.reconcile_users(
                        event=event,
                        router_read_write_endpoint=self._read_write_endpoint,
                        router_read_only_endpoint=self._read_only_endpoint,
                        exposed_read_write_endpoint=self._exposed_read_write_endpoint,
                        exposed_read_only_endpoint=self._exposed_read_only_endpoint,
                        shell=workload_.shell,
                    )
            if workload_.container_ready:
                workload_.reconcile(
                    event=event,
                    tls=self._tls_certificate_saved,
                    unit_name=self.unit.name,
                    exporter_config=self._cos_exporter_config(event),
                    key=self._tls_key,
                    certificate=self._tls_certificate,
                    certificate_authority=self._tls_certificate_authority,
                )

    if self._database_requires.is_relation_breaking(event): evaluates to if False and is skipped

                elif (
                    not self._upgrade.in_progress
                    and isinstance(workload_, workload.AuthenticatedWorkload)
                    and workload_.container_ready
                ):

    evaluates to elif False because relation data for databaserequires is missing (trace: [workload is not AuthenticatedWorkload](https://github.com/canonical/mysql-router-k8s-operator/blob/156da2da8c68c65ae262b2721cc01fb1b974104f/src/abstract_charm.py#L164) because relation is missing)

workload_.reconcile() disables MySQL Router because workload_ is not AuthenticatedWorkload because relation is missing

Current state: MySQL Router is disabled, but relation data in database_provides relations is still populated—mysql-router-k8s charm is giving consuming charms an endpoint and telling them that it should be accessible when Router is disabled & unreachable—this causes a poor UX (consuming applications will probably display incorrect status message) and could cause issues with exponential backoffs (i.e. a delay in service restoration) when MySQL Router is re-enabled

  1. ops emits relation broken event
  2. If charm encounters issue (uncaught exception in charm code, transient network failure with MySQL Server, etc.) it will fail to execute this code
                if self._database_requires.is_relation_breaking(event):
                    if self._upgrade.in_progress:
                        logger.warning(
                            "Modifying relations during an upgrade is not supported. The charm may be in a broken, unrecoverable state. Re-deploy the charm"
                        )
                    self._database_provides.delete_all_databags()

    potentially for an extended period of time—which causes the aforementioned issues with UX & exponential backoff

carlcsaposs-canonical commented 4 months ago

for comparison, before this change, the order would've been self._database_provides.delete_all_databags(), then disable the workload

(instead of disable the workload, then self._database_provides.delete_all_databags())

benhoyt commented 1 month ago

Tony is unlikely to get to this this cycle, but we still want to fix this.