canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0

More refresh events for catalogue #405

Closed: sed-i closed this 1 year ago

sed-i commented 1 year ago

Issue

#404 was confirmed working locally, but only once... In a load test deployment, the link was still wrong even after that fix.

Solution

Add more refresh events.
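
At the charm level, this means passing extra events to the catalogue consumer so the item is re-published whenever the external URL may have changed. A minimal sketch of the wiring; the CatalogueConsumer/CatalogueItem names, the refresh_event parameter, and the IngressPerAppRequirer url/ready members are written from memory and should be treated as assumptions rather than the exact library APIs:

# Sketch only: the library class, parameter and event names below are
# assumptions from memory, not verified against the actual charm libs.
import socket

from charms.catalogue_k8s.v0.catalogue import CatalogueConsumer, CatalogueItem
from charms.traefik_k8s.v1.ingress import IngressPerAppRequirer
from ops.charm import CharmBase
from ops.main import main


class PrometheusCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.ingress = IngressPerAppRequirer(self, port=9090)
        self.catalogue = CatalogueConsumer(
            charm=self,
            item=CatalogueItem(
                name="Prometheus",
                # Falls back to the in-cluster FQDN until traefik publishes a URL.
                url=self.ingress.url or f"http://{socket.getfqdn()}:9090",
                icon="chart-line-variant",
            ),
            # "More refresh events": re-publish the item on every event after
            # which the external URL might have changed, not only when the
            # catalogue relation itself changes.
            refresh_event=[
                self.ingress.on.ready,
                self.on.ingress_relation_changed,
                self.on.update_status,  # the "sledgehammer" fallback
            ],
        )


if __name__ == "__main__":
    main(PrometheusCharm)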

Context

Addresses https://github.com/canonical/catalogue-k8s-operator/issues/3.

Testing Instructions

  1. Deploy the cos-lite bundle.
  2. Make sure Prometheus advertises the correct URL (a scripted check is sketched below).
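
For step 2, a scripted version of the check. This is a hypothetical helper, not shipped with the bundle or this repo; the relation-data keys it reads are assumptions and may need adjusting for your juju and ingress-lib versions:

# Hypothetical check script, not part of this repo. It assumes traefik
# publishes the ingressed URL as an "ingress" key (YAML with a "url" field)
# in the app databag of the prometheus "ingress" relation; the exact nesting
# may differ across juju and ingress-lib versions.
import subprocess
import urllib.request

import yaml

out = subprocess.check_output(["juju", "show-unit", "prometheus/0", "--format=yaml"])
relations = yaml.safe_load(out)["prometheus/0"]["relation-info"]

url = None
for rel in relations:
    if rel.get("endpoint") == "ingress":
        blob = rel.get("application-data", {}).get("ingress")
        if blob:
            url = yaml.safe_load(blob).get("url")

assert url, "traefik has not published an ingress URL yet"
print("advertised URL:", url)
# urlopen raises HTTPError on a 404, which is exactly the symptom this PR
# chases; /-/ready is Prometheus' readiness endpoint.
print("status:", urllib.request.urlopen(f"{url}/-/ready", timeout=5).status)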

Release Notes

Add more refresh events for catalogue.

rbarry82 commented 1 year ago

We should probably examine this more deeply to find out why it is not consistent rather than just adding more events as a sledgehammer.

sed-i commented 1 year ago

> We should probably examine this more deeply to find out why it is not consistent rather than just adding more events as a sledgehammer.

I haven't convinced myself yet but I assume the reason was:

  1. "relation-changed" was processed by the charm before it was processed by the lib.
  2. Before this change, the charm wasn't passing "ingress ready" as a refresh event.

The sledgehammer events are now "relation-changed" and "update-status". And now that I think of it, they shouldn't be needed for catalogue: even if the url in reldata is not yet reachable, it doesn't matter for catalogue! Removed in next commit.
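
Point 1 follows from how the ops framework dispatches events: observers of the same event run in the order they were registered in the charm's constructor, so a charm that wires up its own handler before instantiating a lib sees the relation data before the lib has reacted to it. A stripped-down illustration with plain ops (it assumes an "ingress" relation declared in metadata.yaml; the handler names are made up):

# Plain-ops illustration of observer ordering; handler names are made up.
from ops.charm import CharmBase
from ops.framework import Object
from ops.main import main


class SomeLib(Object):
    """Stand-in for a charm lib that also watches ingress relation-changed."""

    def __init__(self, charm):
        super().__init__(charm, "somelib")
        charm.framework.observe(
            charm.on.ingress_relation_changed, self._on_ingress_changed
        )

    def _on_ingress_changed(self, _event):
        print("lib handler runs second")


class DemoCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Registered first, so it runs first when ingress relation-changed
        # fires, i.e. before the lib has processed the new data.
        self.framework.observe(
            self.on.ingress_relation_changed, self._on_ingress_changed
        )
        self.lib = SomeLib(self)  # registers its observer afterwards

    def _on_ingress_changed(self, _event):
        print("charm handler runs first")


if __name__ == "__main__":
    main(DemoCharm)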

rbarry82 commented 1 year ago

> I haven't convinced myself yet but I assume the reason was:
>
>   1. "relation-changed" was processed by the charm before it was processed by the lib.
>   2. Before this change, the charm wasn't passing "ingress ready" as a refresh event.
>
> The sledgehammer events are now "relation-changed" and "update-status". And now that I think of it, they shouldn't be needed for catalogue: even if the url in reldata is not yet reachable, it doesn't matter for catalogue! Removed in next commit.

I think the question, from my POV, is basically: "is the URL reachable at any point?"

If it eventually becomes reachable, then when? And we're back to the Traefik "watch for accessibility before sending events" topic from a month and a half ago.

My concern is that it's the behavior we've sporadically observed where there is a "valid" URL but it 404s, and cannot be reached at all (through the catalogue or otherwise) until the relation between ingress and the charm is broken and re-established. If we're seeing that behavior, we should really root cause it. If it's just unreachable for a little while, then yeah, let's leave it and pursue a solution on the ingress side.
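
On the "watch for accessibility before sending events" idea, the shape would be something like polling the advertised URL and only publishing (or emitting an ingress-ready event) once it actually routes. A sketch of the idea only, not code from traefik-k8s or this charm:

# Sketch of the "watch for accessibility before sending events" idea; not
# code from traefik-k8s or this charm, just the shape of it.
import time
import urllib.error
import urllib.request


def wait_until_reachable(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until the route answers with something other than a 404."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True  # any 2xx/3xx means the route exists
        except urllib.error.HTTPError as err:
            if err.code != 404:
                return True  # route exists, the app is just unhappy (401/503/...)
        except urllib.error.URLError:
            pass  # not resolvable or not connectable yet
        time.sleep(interval)
    return False


# Example usage (illustrative URL and publish step):
#     if wait_until_reachable("http://<traefik-ip>/<model>-prometheus-0"):
#         publish_to_catalogue(...)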

sed-i commented 1 year ago

> My concern is that it's the behavior we've sporadically observed where there is a "valid" URL but it 404s, and cannot be reached at all (through the catalogue or otherwise) until the relation between ingress and the charm is broken and re-established.

I always encountered the same issue here: the advertised URL is the FQDN, i.e. the URL from before ingress. And re-relating fixed it.
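
That would fit a publish-once pattern: the catalogue item is written when the catalogue relation is joined, with whatever the external URL is at that moment (the FQDN, if ingress is not up yet), and nothing re-publishes it later; re-relating forces a fresh join and picks up the now-ingressed URL. A hand-written sketch of that pattern, not the catalogue lib's actual code (it assumes a "catalogue" relation in metadata.yaml):

# Hand-written sketch, not the catalogue lib's actual code: if the item is
# only (re)published on catalogue relation events, whatever the external URL
# was at join time (typically the pre-ingress FQDN) sticks until something
# forces a re-publish, which is exactly what re-relating does.
from ops.framework import Object


class NaiveCatalogueConsumer(Object):
    def __init__(self, charm, get_url, refresh_event=()):
        super().__init__(charm, "catalogue")
        self._get_url = get_url
        # Published on join only...
        self.framework.observe(charm.on.catalogue_relation_joined, self._publish)
        # ...unless extra refresh events are wired in (the point of this PR).
        for event in refresh_event:
            self.framework.observe(event, self._publish)

    def _publish(self, _event):
        if not self.model.unit.is_leader():
            return
        for relation in self.model.relations["catalogue"]:
            relation.data[self.model.app]["url"] = self._get_url()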

sed-i commented 1 year ago

Tested on 4cpu-8gb (load test) and 2cpu-7gb (GH runner) VMs, and it seems to be working now.

./render_bundle.py local.yaml \
  --prometheus ../prometheus-k8s-operator/prometheus-k8s_ubuntu-20.04-amd64.charm \
  --grafana ../grafana-k8s-operator/grafana-k8s_ubuntu-20.04-amd64.charm

sed-i commented 1 year ago

> My concern is that it's the behavior we've sporadically observed where there is a "valid" URL but it 404s, and cannot be reached at all (through the catalogue or otherwise) until the relation between ingress and the charm is broken and re-established.

This particular issue I see with grafana every single time I deploy the load test.

rbarry82 commented 1 year ago

> This particular issue I see with grafana every single time I deploy the load test.

This does not seem to be reproducible locally, and nobody from QA has reported it, so I wonder if it's something in the load test configuration.

sed-i commented 1 year ago

Can't find anything wrong with the load-test config. There too, re-relating grafana:ingress traefik fixes the 404.

rbarry82 commented 1 year ago

Ok. Do we have some time to live debug this later this afternoon or tomorrow? This is the same behavior I was seeing with Prometheus in CZ, and I can't reproduce it anymore, but it's almost certainly Traefik. There is no particular reason that re-relating Grafana would change anything, and the relation data should be identical in both cases.