canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0
245 stars 119 forks source link

thought: automagically clean up dead packages on charm upgrade to work around a juju bug #1300

Open PietroPasotti opened 1 month ago

PietroPasotti commented 1 month ago

We recently had an issue where the otel library's mechanism to discover resource detectors raised an Exception after a charm refresh, as described in https://github.com/canonical/grafana-agent-operator/issues/146. That was caused by a juju/charmcraft issue (https://bugs.launchpad.net/juju/+bug/2058335) where, when upgrading packages, Juju leaves behind package metadata from the old versions (something to do with empty package dirs). This caused otel's discovery mechanism to incorrectly use the old resource detector, leading to an exception.

The underlying issue is that on charm upgrade, juju isn't as thorough as one would hope with replacing the old venv the charm runs with, with the new one. So on a dependency change between versions during a refresh, we have an issue.

I realize this is a rather specific issue due to the fact that the lib we're using (on opentelemetry sdk lib) relies on the presence of directories (instead of fiels) to determine what modules to attempt to load. However, it is a fact that juju isn't correctly cleaning up the venv and that might cause more issues to more people.

We can patch this in each charm using the library so that, on upgrade, we do a cleanup. However, if the dependency is a pydeps-introduced one, this means we'd have to update every single charm using the charm lib (as is our case). We can patch this in the charm lib itself (this is our current workaround), to do the cleanup on load or every time the core functionality is invoked, but that's quite hacky. We can fix the root issue in juju (hopefully we will) but that won't help people stuck with older juju versions. We can fix that in ops, and refreshing the charms will bring the fix to charms running with older jujus too. Hence this thought.

The implementation could be:

ca-scribner commented 1 month ago

Happy for us to figure something out here, but we'd need to be more nuanced than checking for entry_points. Entry points are an optional thing that a package might include (advertisements to other packages about things they offer). But there's probably some other (lack of) metadata we can detect to know if something is dead

PietroPasotti commented 1 month ago

an empty package dir is also a good clue

dimaqq commented 3 weeks ago

Upstream bug has a note: Fix committed for Juju 3.5.4