canonical / traefik-k8s-operator

https://charmhub.io/traefik-k8s
Apache License 2.0
`ready_for_unit` is emitted too early #78

Open sed-i opened 2 years ago

sed-i commented 2 years ago

Bug Description

Currently, `ready_for_unit` is emitted based on relation events only, without checking the workload:

https://github.com/canonical/traefik-k8s-operator/blob/494741059d931a9827e8214401f61cde8585582e/lib/charms/traefik_k8s/v1/ingress_per_unit.py#L692-L695

This races with the traefik workload: sometimes the remote app processes the event before the traefik workload is actually ready to serve requests.
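To make the ordering problem concrete, here is a toy model of it in plain Python. It is only a sketch: `FakeWorkload`, `routes_loaded`, and `remote_app_handler` are hypothetical names made up for illustration and are not part of the charm or the `ingress_per_unit` library.

```python
class FakeWorkload:
    """Hypothetical stand-in for the traefik workload (not a real API)."""

    def __init__(self):
        # Assumption: the workload needs some time before it can route requests.
        self.routes_loaded = False

    def request(self, path):
        if not self.routes_loaded:
            raise TimeoutError(f"no route for {path} yet")
        return 200


def remote_app_handler(workload):
    """What the remote app does as soon as it sees the 'ready' event."""
    try:
        return workload.request("/prometheus-0/-/reload")
    except TimeoutError:
        return "timed out"


workload = FakeWorkload()

# The event is emitted from relation data alone, so the handler may run
# while the workload is still not routing.
early = remote_app_handler(workload)

# Later (for example on update-status) the workload has caught up.
workload.routes_loaded = True
late = remote_app_handler(workload)
```

In this model `early` is a timeout and `late` succeeds, which mirrors the log output below: the first request fails, and the same request works once the workload has caught up.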

To Reproduce

Relate prometheus and traefik.

Environment

Relevant log output

# Prometheus gets the "ready-for-unit" event, but http requests still fail

Ingress for unit ready on 'http://pd-ssd-4cpu-8gb.us-central1-a.c.lma-light-load-testing.internal:80/cos-lite-load-test-prometheus-0'

config reload error via http://localhost:9090/cos-lite-load-test-prometheus-0/-/reload: HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=2.0)

# After a while update-status fires, by which point the traefik workload is actually ready and HTTP requests to Prometheus succeed

Emitting Juju event update_status.

Starting new HTTP connection (1): localhost:9090
http://localhost:9090 "GET /cos-lite-load-test-prometheus-0/api/v1/status/buildinfo HTTP/1.1" 200 188

Additional context

No response

PietroPasotti commented 2 years ago

TODO: look at https://github.com/canonical/observability-libs/pull/10 and consider implementing a liveness check that uses that external process to wake up the charm once traefik is done, and only then publish the relation data that signals "ingress is ready".
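A minimal sketch of the "probe first, publish second" idea, assuming the liveness check can be expressed as a callable. `wait_then_publish`, `is_ready`, and `publish` are hypothetical names; how the external process would actually wake the charm is not shown, and nothing here is taken from the linked PR.

```python
import time


def wait_then_publish(is_ready, publish, timeout=30.0, interval=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() and call publish() only once it returns True.

    Returns True if publish() was called, False if the timeout expired.
    Both callables are placeholders: is_ready could be an HTTP probe against
    the workload, publish could write the ingress URL to relation data.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            publish()
            return True
        if clock() >= deadline:
            return False  # give up without publishing anything
        sleep(interval)


# Usage with a fake probe that becomes ready on the third attempt.
states = iter([False, False, True])
published = []
ok = wait_then_publish(
    is_ready=lambda: next(states),
    publish=lambda: published.append("ingress-url"),
    timeout=5.0,
    interval=0.0,
    sleep=lambda seconds: None,  # skip real sleeping in the example
)
```

A loop like this would presumably live in the external process and not in a hook, since blocking inside a charm hook delays every other event for that unit.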