Closed by nishant-dash 4 weeks ago
My first attempt to reproduce gives me an error, but a different one:

```shell
juju bootstrap lxd lxd
juju add-model ga
juju deploy grafana-agent --channel=latest/stable --revision=134
juju deploy ubuntu
juju relate grafana-agent ubuntu
# grafana-agent goes to blocked because it has no related consumers
juju refresh grafana-agent --revision=164
```

where `juju debug-log -i grafana-agent` shows:
```
unit-grafana-agent-0: 14:07:41 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
unit-grafana-agent-0: 14:10:26 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
unit-grafana-agent-0: 14:10:26 WARNING unit.grafana-agent/0.upgrade-charm Invalid type NoneType for attribute 'telemetry.sdk.version' value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
unit-grafana-agent-0: 14:10:26 INFO unit.grafana-agent/0.juju-log Running legacy hooks/upgrade-charm.
unit-grafana-agent-0: 14:10:27 WARNING unit.grafana-agent/0.upgrade-charm Invalid type NoneType for attribute 'telemetry.sdk.version' value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
unit-grafana-agent-0: 14:10:27 ERROR unit.grafana-agent/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/./src/charm.py", line 565, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/ops/main.py", line 546, in main
    manager = _Manager(charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/ops/main.py", line 429, in __init__
    self.charm = self._make_charm(self.framework, self.dispatcher)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/ops/main.py", line 432, in _make_charm
    charm = self._charm_class(framework)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 301, in wrap_init
    resource = Resource.create(
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/opentelemetry/sdk/resources/__init__.py", line 189, in create
    next(
StopIteration
unit-grafana-agent-0: 14:10:27 ERROR juju.worker.uniter.operation hook "upgrade-charm" (via hook dispatching script: dispatch) failed: exit status 1
unit-grafana-agent-0: 14:10:27 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
```
Just in case, I also tried deploying prometheus, offering a relation to prometheus, and relating grafana-agent (rev134) to prometheus before the upgrade, but I still get the same error (not the error @nishant-dash received).
Digging into @nishant-dash's error more, I see that the block of code in `context/__init__.py` that his error fails on is:

```python
return next(  # type: ignore
    iter(  # type: ignore
        entry_points(  # type: ignore
            group="opentelemetry_context",
            name=default_context,
        )
    )
).load()()
```
which is calling the same `entry_points()` as where my attempts are failing. So I think the cause is probably the same (`entry_points()` returns an empty generator), but I'm not sure why we experience it differently.
I'll expand further in a later comment, but this issue is being caused by this juju bug.
What is happening here is that, for reasons explained in that juju bug, when a charm is refreshed from revA to revB, any packages that are in revA but not in revB leave behind some metadata in the charm's venv. For example, I see this after refreshing from rev134 to rev164 here:
```
ubuntu@juju-24f360-3:/var/lib/juju/agents/unit-grafana-agent3-0/charm/venv$ ll
...
drwxr-xr-x 16 root root 4096 Jul 22 18:47 opentelemetry/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_api-1.24.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_api-1.25.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_common-1.24.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_common-1.25.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_http-1.24.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_http-1.25.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_proto-1.24.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_proto-1.25.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_sdk-1.24.0.dist-info/
drwxr-xr-x 3 root root 4096 Jul 22 18:47 opentelemetry_sdk-1.25.0.dist-info/
```
which is a result of bumping `opentelemetry-exporter-otlp-proto-http==1.24.0` to `opentelemetry-exporter-otlp-proto-http==1.25.0`. The `*1.24.0` directories do not contain fully fledged packages, but they have enough remnants that `importlib.metadata.distributions()` finds them.
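This is easy to demonstrate with a stdlib-only sketch (the package name and temp-dir setup here are made up for the demo):

```python
import sys
import tempfile
from pathlib import Path
from importlib.metadata import distributions

# Hypothetical demonstration: a leftover *.dist-info directory containing
# nothing but a METADATA file is still picked up by distributions(),
# even though no importable code for the package exists.
site = Path(tempfile.mkdtemp())
stale = site / "stalepkg-1.24.0.dist-info"  # made-up package name
stale.mkdir()
(stale / "METADATA").write_text(
    "Metadata-Version: 2.1\nName: stalepkg\nVersion: 1.24.0\n"
)

sys.path.insert(0, str(site))
names = {d.metadata["Name"] for d in distributions() if d.metadata is not None}
print("stalepkg" in names)  # the remnant is visible to metadata discovery
```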
Why this matters here is that the tracing library calls `Resource.create()` from `opentelemetry.sdk.resources`, which uses `importlib.metadata.entry_points()` to discover an `otel_experimental_resource_detectors`, and `entry_points()` calls and deduplicates the return of `importlib.metadata.distributions()`. This means the older, incomplete packages found by `distributions()` mask the newer, current packages. That leads to the error we see during refresh, where we iterate on an empty generator of resource detectors.
The fix proposed in the juju bug does not give us relief here. That fix says:

> It is important to note that this change will only ensure the proper cleanup of files for charms that are newly deployed, as charms that are already deployed have their manifests written to the manifest files on disk

so even if Juju is patched, users with currently deployed rev134 units cannot refresh.
An interesting note from the bug for anyone reproducing this:

> This issue does not seem to affect charms uploaded directly from files, as juju seems to normalize and add additional directory entries during the upload process. https://github.com/juju/charm/blob/064bbf9e5a4f72a5dd78739cc23522a6b583e9c3/charmdir.go#L314-L317

So if you run

```shell
juju download grafana-agent --revision 134; juju deploy ./local_134; juju refresh --switch grafana-agent --revision 164
```

that should not be blocked by this bug. That makes reproducing the issue locally a real pain...
But it also probably means that this upgrade path should work:

```shell
juju deploy grafana-agent --revision 134
juju download grafana-agent --revision 134
juju deploy ./local-version-of-134.charm
juju switch grafana-agent --channel latest/stable
```
Though I haven't tested it. Presumably too, now that @nishant-dash has upgraded to rev164, those duplicate directories already exist, so he may still not be able to recover via this hacked procedure. We'd need to test.
This should work now with the charm in edge. It would be good to have some confirmation that this solved the issue for you, though, before we roll the fix out to stable.
Bug Description

I deployed grafana agent and related it to k8s charms. One of the units was stuck in `waiting` with the message.
I tried to refresh the charm to the latest in `latest/stable` (which happens to be rev164 right now), putting all units in error state.

To Reproduce

Environment

juju: 3.4.3
grafana-agent: latest/stable, revision 134

Relevant log output

Additional context

Ignore the confusing charm name (not to be confused with the k8s grafana agent charm)