canonical / grafana-agent-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent
Apache License 2.0

charm refresh from revision `134` to revision `164` puts all units in error state #146

Closed nishant-dash closed 4 weeks ago

nishant-dash commented 2 months ago

Bug Description

I deployed grafana agent and related it to k8s charms. One of the units was stuck in waiting with the message:

grafana-agent-k8s/9*  waiting   idle            a.b.c.d            Waiting for TLS certificate.
lib/charms/certificate_transfer_interface/v0/certificate_transfer.py:                f"Provider relation data did not pass JSON Schema validation: "
lib/charms/tls_certificates_interface/v2/tls_certificates.py:            logger.warning("Provider relation data did not pass JSON Schema validation")

I tried to refresh the charm to the latest revision in latest/stable (which happens to be rev 164 right now), and that put all units in error state:

grafana-agent-k8s                    error           11  grafana-agent               latest/stable  164  no       hook failed: "upgrade-charm"
  grafana-agent-k8s/10           error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/0            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/1            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/2            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/9*           error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/3            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/5            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/4            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/8            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/7            error        idle                a.b.c.d                  hook failed: "upgrade-charm"
  grafana-agent-k8s/11           error        idle                a.b.c.d                  hook failed: "upgrade-charm"
cos-proxy-monitors:cos-agent                          grafana-agent-k8s:cos-agent                               cos_agent                      subordinate  
grafana-agent-k8s:grafana-dashboards-provider         grafana-dashboards:grafana-dashboard                      grafana_dashboard              regular      
grafana-agent-k8s:peers                               grafana-agent-k8s:peers                                   grafana_agent_replica          peer         
kubernetes-control-plane:cos-agent                    grafana-agent-k8s:cos-agent                               cos_agent                      subordinate  
kubernetes-worker:cos-agent                           grafana-agent-k8s:cos-agent                               cos_agent                      subordinate  
loki-logging:logging                                  grafana-agent-k8s:logging-consumer                        loki_push_api                  regular      
openstack-integrator:juju-info                        grafana-agent-k8s:juju-info                               juju-info                      subordinate  
prometheus-receive-remote-write:receive-remote-write  grafana-agent-k8s:send-remote-write                       prometheus_remote_write        regular      
vault:certificates                                    grafana-agent-k8s:certificates                            tls-certificates               regula

To Reproduce

  1. juju deploy grafana-agent --channel=latest/stable --revision=134
  2. juju integrate grafana-agent ubuntu # or something
  3. juju refresh grafana-agent # rev 164 as of opening this bug

Environment

juju 3.4.3 grafana-agent latest/stable revision 134

Relevant log output

unit-grafana-agent-k8s-9: 17:18:06 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/grafana-agent-164 from API server
unit-grafana-agent-k8s-9: 17:18:06 ERROR juju.worker.uniter resolver loop error: preparing operation "upgrade to ch:amd64/jammy/grafana-agent-164" for grafana-agent-k8s/9: failed to download charm "ch:amd64/jammy/grafana-agent-164" from API server: download request with archiveSha256 length 0 not valid
unit-grafana-agent-k8s-9: 17:18:06 INFO juju.worker.uniter unit "grafana-agent-k8s/9" shutting down: preparing operation "upgrade to ch:amd64/jammy/grafana-agent-164" for grafana-agent-k8s/9: failed to download charm "ch:amd64/jammy/grafana-agent-164" from API server: download request with archiveSha256 length 0 not valid
unit-grafana-agent-k8s-9: 17:18:06 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: preparing operation "upgrade to ch:amd64/jammy/grafana-agent-164" for grafana-agent-k8s/9: failed to download charm "ch:amd64/jammy/grafana-agent-164" from API server: download request with archiveSha256 length 0 not valid
unit-grafana-agent-k8s-9: 17:18:09 INFO juju.worker.uniter unit "grafana-agent-k8s/9" started
unit-grafana-agent-k8s-9: 17:18:09 INFO juju.worker.uniter hooks are retried true
unit-grafana-agent-k8s-9: 17:18:09 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/grafana-agent-164 from API server
unit-grafana-agent-k8s-9: 17:18:36 INFO juju.worker.uniter found queued "upgrade-charm" hook
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm Failed to load context: contextvars_context, fallback to contextvars_context
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm Traceback (most recent call last):
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/context/__init__.py", line 45, in _load_runtime_context
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     return next(  # type: ignore
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm StopIteration
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm Traceback (most recent call last):
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/context/__init__.py", line 45, in _load_runtime_context
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     return next(  # type: ignore
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm StopIteration
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm 
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm During handling of the above exception, another exception occurred:
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm 
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm Traceback (most recent call last):
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/./src/charm.py", line 17, in <module>
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     from charms.tempo_k8s.v1.charm_tracing import trace_charm
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 125, in <module>
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 25, in <module>
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     from opentelemetry.exporter.otlp.proto.common._internal import (
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/exporter/otlp/proto/common/_internal/__init__.py", line 45, in <module>
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     from opentelemetry.sdk.trace import Resource
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/sdk/trace/__init__.py", line 44, in <module>
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     from opentelemetry import context as context_api
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/context/__init__.py", line 69, in <module>
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     _RUNTIME_CONTEXT = _load_runtime_context()
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm   File "/var/lib/juju/agents/unit-grafana-agent-k8s-9/charm/venv/opentelemetry/context/__init__.py", line 59, in _load_runtime_context
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm     return next(  # type: ignore
unit-grafana-agent-k8s-9: 17:18:36 WARNING unit.grafana-agent-k8s/9.upgrade-charm StopIteration
unit-grafana-agent-k8s-9: 17:18:37 ERROR juju.worker.uniter.operation hook "upgrade-charm" (via hook dispatching script: dispatch) failed: exit status 1
unit-grafana-agent-k8s-9: 17:18:37 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
unit-grafana-agent-k8s-9: 17:18:42 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
unit-grafana-agent-k8s-9: 17:18:42 WARNING unit.grafana-agent-k8s/9.upgrade-charm Failed to load context: contextvars_context, fallback to contextvars_context

Additional context

Ignore the confusing name grafana-agent-k8s in the output above; this is not the k8s grafana agent charm.

ca-scribner commented 1 month ago

My first attempt to reproduce this gives me an error, but a different one:

juju bootstrap lxd lxd
juju add-model ga
juju deploy grafana-agent --channel=latest/stable --revision=134
juju deploy ubuntu
juju relate grafana-agent ubuntu
# grafana-agent goes to blocked because it has no related consumers
juju refresh grafana-agent --revision=164

where juju debug-log -i grafana-agent shows:

unit-grafana-agent-0: 14:07:41 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
unit-grafana-agent-0: 14:10:26 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook
unit-grafana-agent-0: 14:10:26 WARNING unit.grafana-agent/0.upgrade-charm Invalid type NoneType for attribute 'telemetry.sdk.version' value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
unit-grafana-agent-0: 14:10:26 INFO unit.grafana-agent/0.juju-log Running legacy hooks/upgrade-charm.
unit-grafana-agent-0: 14:10:27 WARNING unit.grafana-agent/0.upgrade-charm Invalid type NoneType for attribute 'telemetry.sdk.version' value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
unit-grafana-agent-0: 14:10:27 ERROR unit.grafana-agent/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/./src/charm.py", line 565, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/ops/main.py", line 546, in main
    manager = _Manager(charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/ops/main.py", line 429, in __init__
    self.charm = self._make_charm(self.framework, self.dispatcher)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/ops/main.py", line 432, in _make_charm
    charm = self._charm_class(framework)
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 301, in wrap_init
    resource = Resource.create(
  File "/var/lib/juju/agents/unit-grafana-agent-0/charm/venv/opentelemetry/sdk/resources/__init__.py", line 189, in create
    next(
StopIteration
unit-grafana-agent-0: 14:10:27 ERROR juju.worker.uniter.operation hook "upgrade-charm" (via hook dispatching script: dispatch) failed: exit status 1
unit-grafana-agent-0: 14:10:27 INFO juju.worker.uniter awaiting error resolution for "upgrade-charm" hook

ca-scribner commented 1 month ago

Just in case, I also tried deploying prometheus, offering a relation to prometheus, and relating grafana-agent (rev134) to prometheus before the upgrade, but I still get the same error (not the error @nishant-dash received).

ca-scribner commented 1 month ago

Digging into @nishant-dash's error more, I see the block of code in context/__init__.py that his error fails on is:

        return next(  # type: ignore
            iter(  # type: ignore
                entry_points(  # type: ignore
                    group="opentelemetry_context",
                    name=default_context,
                )
            )
        ).load()()

which calls the same entry_points() function that my attempts fail in. So I think the cause is probably the same (entry_points() returns an empty generator), but I'm not sure why we experience it differently.
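To make the failure mode concrete, here is a minimal sketch of the same `next(iter(entry_points(...)))` pattern run against a group that has no entries. The helper name and the group/name strings are made up for illustration and are not part of opentelemetry; the fallback branch assumes the older `importlib.metadata` API, where `entry_points()` takes no arguments.

```python
from importlib.metadata import entry_points


def load_first_entry_point(group, name):
    """Return the first matching entry point, mimicking next(iter(...))."""
    try:
        # Selection by keyword is available on Python 3.10 and later.
        matches = entry_points(group=group, name=name)
    except TypeError:
        # Older Pythons: entry_points() takes no arguments and returns a dict.
        matches = [ep for ep in entry_points().get(group, []) if ep.name == name]
    # next() on an empty iterator raises StopIteration, as in the tracebacks.
    return next(iter(matches))


try:
    # Hypothetical group and name that no installed package should provide.
    load_first_entry_point("hypothetical_group_that_does_not_exist", "missing")
except StopIteration:
    outcome = "StopIteration"
else:
    outcome = "found"

print(outcome)
```

If nothing provides the requested entry point, the bare `next()` call has no default to fall back on, which would explain why both tracebacks end in `StopIteration` rather than a more descriptive error.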

ca-scribner commented 1 month ago

I'll expand further in a later comment, but this issue is being caused by this juju bug

ca-scribner commented 1 month ago

What is happening here is that, for reasons explained in this juju bug, when a charm is refreshed from revA to revB, any packages that are in revA but not in revB will leave behind some metadata in the charm's venv. For example, I see this after refreshing from rev134 to rev164 here:

ubuntu@juju-24f360-3:/var/lib/juju/agents/unit-grafana-agent3-0/charm/venv$ ll
...
drwxr-xr-x 16 root root   4096 Jul 22 18:47 opentelemetry/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_api-1.24.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_api-1.25.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_common-1.24.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_common-1.25.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_http-1.24.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_exporter_otlp_proto_http-1.25.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_proto-1.24.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_proto-1.25.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_sdk-1.24.0.dist-info/
drwxr-xr-x  3 root root   4096 Jul 22 18:47 opentelemetry_sdk-1.25.0.dist-info/

which is a result of bumping opentelemetry-exporter-otlp-proto-http==1.24.0 to opentelemetry-exporter-otlp-proto-http==1.25.0. The *1.24.0 directories do not contain fully fledged packages, but they have enough remnants that importlib.metadata.distributions() finds them.
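For anyone who wants to check a unit for these leftovers, a rough sketch along these lines could list packages that have more than one `*.dist-info` directory in the venv. The function name is hypothetical, and it only parses directory names of the form `<name>-<version>.dist-info`; it does not inspect what is left inside them.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_dist_info(venv_dir):
    """Map package names to their versions when several *.dist-info dirs exist."""
    versions = defaultdict(list)
    for entry in Path(venv_dir).glob("*.dist-info"):
        # Directory names look like "<name>-<version>.dist-info".
        stem = entry.name[: -len(".dist-info")]
        name, _, version = stem.rpartition("-")
        if name:
            versions[name].append(version)
    # Keep only the packages that show up with more than one version.
    # Versions are sorted as plain strings, which is enough for a quick look.
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Calling it with the unit's `charm/venv` directory should return an empty dict on a healthy unit, and something like `{"opentelemetry_api": ["1.24.0", "1.25.0"], ...}` on a unit in the state listed above.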

Why this matters here is that the tracing library calls opentelemetry.sdk.resources.Resource.create(), which uses importlib.metadata.entry_points() to discover the otel_experimental_resource_detectors, and entry_points() calls importlib.metadata.distributions() and deduplicates what it returns. This means the older, incomplete packages found by distributions() mask the newer, current packages. That leads to the error we see during refresh, where we iterate on an empty generator of resource detectors.
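The masking can be illustrated with a simplified model. This is not the importlib.metadata implementation: distributions are plain dicts, the discovery order (stale remnant first) is an assumption, and the helper names are made up. The package, group, and entry point names are taken from the logs and snippet earlier in this thread.

```python
def first_seen_by_name(distributions):
    """Keep only the first record seen for each package name."""
    seen = set()
    unique = []
    for dist in distributions:
        if dist["name"] not in seen:
            seen.add(dist["name"])
            unique.append(dist)
    return unique


def discover_entry_points(distributions, group):
    """Collect entry point names in `group` from the deduplicated records."""
    return [
        name
        for dist in first_seen_by_name(distributions)
        for name in dist["entry_points"].get(group, [])
    ]


# Assumed order: the stale 1.24.0 remnant is found first and has lost its
# entry point data; the intact 1.25.0 package is found second.
stale = {"name": "opentelemetry-api", "version": "1.24.0", "entry_points": {}}
current = {
    "name": "opentelemetry-api",
    "version": "1.25.0",
    "entry_points": {"opentelemetry_context": ["contextvars_context"]},
}

masked = discover_entry_points([stale, current], "opentelemetry_context")
clean = discover_entry_points([current], "opentelemetry_context")
print(masked, clean)
```

In this model the intact package is dropped because its name was already seen, so discovery comes back empty and a bare `next(iter(masked))` would raise `StopIteration`. With only the current package present, the entry point is found as expected.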

ca-scribner commented 1 month ago

The fix proposed in the juju bug does not give us relief here. That fix says:

It is important to note that this change will only ensure the proper cleanup of files for charms that are newly deployed, as charms that are already deployed have their manifests written to the manifest files on disk

so even if Juju is patched, users who currently have rev134 deployed cannot refresh.

ca-scribner commented 1 month ago

An interesting note from the bug for anyone reproducing this:

This issue does not seem to affect charms uploaded directly from files, as juju seems to normalize and add additional directory entries during the upload process. https://github.com/juju/charm/blob/064bbf9e5a4f72a5dd78739cc23522a6b583e9c3/charmdir.go#L314-L317

So if you juju download grafana-agent --revision 134; juju deploy ./local_134; juju refresh --switch grafana-agent --revision 164 that should not be blocked by this bug. That makes reproduction of the issue locally a real pain...

But, it also probably means that this upgrade path should work:

I haven't tested it, though. Also, now that @nishant-dash has upgraded to rev164, those duplicate directories presumably already exist, so he may still be unable to recover with this hacked procedure. We'd need to test.

ca-scribner commented 4 weeks ago

This should work now with the charm in edge. It would be good to have some confirmation that this solved the issue for you, though, before we roll the fix out to stable.