canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0
21 stars 35 forks source link

External hostname is not updated if an ingress is added *after* relating a charm to Prometheus #368

Closed simskij closed 2 years ago

simskij commented 2 years ago

Bug Description

See title. If you first relate to Prometheus and then to Traefik, it all works as expected. The other way around, no cigar.

To Reproduce

-

Environment

-

Relevant log output

-

Additional context

We could have used the ingress established/revoked events, but these are unfortunately fired prematurely

sed-i commented 2 years ago

Reproduction

After relating a charm to traefik, its metrics endpoint is not updated and prometheus reports health: down because it is no longer reachable via the local ip.

  1. am, prom, trfk deployed and all in active/idle.
  2. juju relate am:self-metrics-endpoint prom
  3. juju relate am trfk
  4. juju run-action trfk/0 show-proxied-endpoints --wait

Proposal 1

Users of MetricsEndpointProvider must be instructed to always set custom refresh events

self.metrics = MetricsEndpointProvider(
    # ...
    refresh_event=[  # needed for ingress
        self.ingress.on.ready_for_unit,
        self.ingress.on.revoked_for_unit,
        self.on.update_status,
    ]

Proposal 2

MetricsEndpointProvider should always observe update-status by default.

Proposal 3

MetricsEndpointProvider should update relation data every re-init. I.e. the contructor MetricsEndpointProvider should call self._set_scrape_job_spec every instantiation, instead of registring it as an observer.

Proposal 4

Roll the responsibility to the user by introducing an update_endpoint method like we do in PrometheusRemoteWriteProvider.

Ideas? @dstathis @Abuelodelanada @rbarry82 cc: @PietroPasotti

rbarry82 commented 2 years ago

I think proposal #3 is preferable by far. It's idempotent, users don't have to do anything at all, it doesn't depend on update-status-interval or calling other events, and it can easily be removed from the library constructor when stripPrefix middleware lands in traefik, which makes this problem more or less disappear entirely (at least from an in-model/cluster perspective, as well as any external targets which have routable endpoints and don't need a path specified by any reverse proxy).

sed-i commented 2 years ago

Tested manually and the combination of:

solves the issue.

With which charm did you experience this @simskij ? You may need to update charm code:

simskij commented 2 years ago

I saw it with the loki datasource in grafana after deploying it as a bundle.

sed-i commented 2 years ago

I saw it with the loki datasource in grafana after deploying it as a bundle.

If it's a loki datasource issue then perhaps it's not related to prometheus_scrape?

Maybe we need to manually call update_source in loki? BTW, update_source seems very different from refresh_event. @dstathis @rbarry82

rbarry82 commented 2 years ago

Maybe we need to manually call update_source in loki? BTW, update_source seems very different from refresh_event. @dstathis @rbarry82

update_source is just a superset of _set_unit_details which also allows passing additional fields, and was added explicitly for consumers to say "I have an ingress now, so update out-of-band in case GrafanaSourceProvider._source_url from the constructor is out of date".

Since Loki already uses the property in the constructor, update_source would be called when an ingress is added, yes, which allows setting/updating the Grafana relation data immediately after ingress_ready rather than waiting for some other event to re-trigger the constructor. We could do the same thing in grafana_source as is done here, but it would make sense from Loki's codebase to add it just after update_endpoint(...), since the semantics are the same. The Prometheus libraries have just obsessively avoiding having any public API at all which could be used for this purpose.

simskij commented 2 years ago

My bad, I saw it in Prometheus too, but it seems to have been resolved now.