danihodovic / celery-exporter

A Prometheus exporter for Celery metrics

celery_worker_up gauge is not updated when a worker drops without sending a worker-offline event #228

Closed · roganartu closed this issue 1 year ago

roganartu commented 1 year ago

celery_worker_up continues to report 1 for a given worker hostname indefinitely after the worker is killed if it never sends a worker-offline event (e.g. it is SIGKILLed).

Relevant code block: https://github.com/danihodovic/celery-exporter/blob/8b2387bf83dc2478dee7a939dc14fbabd79c458c/src/exporter.py#L242-L246

You can verify this easily by starting a worker and killing it. You'll see something like this:

tl in host3 on ~ via 🐍 v2.7.16 took 5s
❯ curl localhost:9808/metrics -s | rg celery_worker_up
# HELP celery_worker_up Indicates if a worker has recently sent a heartbeat.
# TYPE celery_worker_up gauge
celery_worker_up{hostname="host1"} 1.0
celery_worker_up{hostname="host2"} 1.0
celery_worker_up{hostname="host3"} 1.0

tl in host3 on ~ via 🐍 v2.7.16 took 1s
❯ sc-restart celery-exporter.service
[sudo] password for tl:

tl in host3 on ~ via 🐍 v2.7.16 took 34s
❯ curl localhost:9808/metrics -s | rg celery_worker_up
# HELP celery_worker_up Indicates if a worker has recently sent a heartbeat.
# TYPE celery_worker_up gauge
celery_worker_up{hostname="host1"} 1.0
celery_worker_up{hostname="host2"} 1.0

I have two ideas for how to handle this:

1) Support some kind of "worker timeout" option: after not seeing a heartbeat for that long, celery_worker_up is set to 0 for that hostname. Users would need to set this in alignment with their Celery heartbeat interval settings, but even a long default (10 minutes?) still seems better than never.

2) Export a new metric, celery_worker_last_heartbeat_timestamp: the Unix timestamp of the last heartbeat received for a given worker.

(2) seems a lot simpler than (1), which I guess needs some kind of background thread, and leaving (1) around as a user-tuned option seems like a footgun anyway. I'm happy to submit a diff for (2), as it allows greater precision in alerting than the binary up metric.
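For illustration, here is a minimal sketch of what (2) could look like, assuming the exporter uses prometheus_client and handles worker-heartbeat events in something like the existing get_worker_heartbeat; the handler and variable names below are illustrative, not the exporter's actual API:

```python
import time

from prometheus_client import Gauge

# Hypothetical gauge mirroring the label set used by celery_worker_up.
worker_last_heartbeat = Gauge(
    "celery_worker_last_heartbeat_timestamp",
    "Unix timestamp of the last heartbeat received from a worker.",
    ["hostname"],
)


def track_worker_heartbeat(event: dict) -> None:
    # Celery worker-heartbeat events carry the worker hostname and a
    # timestamp; fall back to "now" if the event has no timestamp field.
    hostname = event["hostname"]
    timestamp = event.get("timestamp", time.time())
    worker_last_heartbeat.labels(hostname=hostname).set(timestamp)
```

An alert or Grafana panel could then use a PromQL expression along the lines of `time() - celery_worker_last_heartbeat_timestamp > 600`, which keeps the query fairly simple.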

danihodovic commented 1 year ago

2) seems like the better option, but the question is whether it makes for a complex query / visualization in Grafana. Opinions, @adinhodovic?

> which I guess needs some kind of background thread

Why do we need a background thread? Can't we simply use get_worker_heartbeat? https://github.com/danihodovic/celery-exporter/blob/master/src/exporter.py#L188

roganartu commented 1 year ago

I might be misunderstanding how this all works since I'm just going off reading the source, but doesn't that function only fire when heartbeats are received, i.e. when the worker actually sends this: https://docs.celeryq.dev/en/stable/userguide/monitoring.html#worker-heartbeat? My point about a background thread was that if a worker dies without emitting an event and never comes back, then something external to the existing event-based callbacks needs to act as a watchdog.

I suppose you could, in theory, check all workers you have seen previously in every get_worker_heartbeat callback, but then 1) you're doing a lot more work than necessary (N^2 vs N), and 2) it still falls over when you go from one worker to zero.
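For comparison, here is a rough sketch of the kind of watchdog thread that (1) would need, assuming a configurable timeout and a last-seen map maintained by the heartbeat handler; all names here are illustrative and not part of the exporter today:

```python
import threading
import time

from prometheus_client import Gauge

# Mirrors the existing celery_worker_up metric for the purposes of this sketch.
worker_up = Gauge(
    "celery_worker_up",
    "Indicates if a worker has recently sent a heartbeat.",
    ["hostname"],
)

# Hypothetical map of hostname -> last heartbeat time, written by the
# heartbeat event handler and read by the watchdog below.
last_seen: dict[str, float] = {}


def heartbeat_watchdog(timeout_seconds: float = 600.0, interval: float = 10.0) -> None:
    """Mark a worker as down if no heartbeat arrived within timeout_seconds."""
    while True:
        now = time.time()
        for hostname, seen_at in list(last_seen.items()):
            if now - seen_at > timeout_seconds:
                worker_up.labels(hostname=hostname).set(0)
        time.sleep(interval)


# Started alongside the event consumer so it runs even when no events arrive.
threading.Thread(target=heartbeat_watchdog, daemon=True).start()
```

Running this as a separate daemon thread is what lets it fire even when no events arrive at all, which is exactly the one-worker-to-zero case above.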

danihodovic commented 1 year ago

I meant that we set celery_worker_last_heartbeat_timestamp in get_worker_heartbeat.

roganartu commented 1 year ago

Ah, that makes more sense. My run-on sentence was obviously unclear, sorry about that. What I meant was that part of the reason (1) seems harder is that it probably needs a background thread. You are of course right that for (2), setting the heartbeat timestamp in get_worker_heartbeat is trivial (that is also how I was thinking of doing it when I raised the issue).

danihodovic commented 1 year ago

Let's go ahead with approach (2) and open a PR. WDYT about removing the celery_worker_up metric in that case? I find it confusing to have both celery_worker_last_heartbeat_timestamp and celery_worker_up, especially if the latter is prone to erroneous readings.

roganartu commented 1 year ago

I'm happy to remove it if you think that's ok.

I agree it's confusing in the current state, but removing it would be a breaking change, so I think that's a maintainer's call. I'll start working on a PR for (2) in any case.

danihodovic commented 1 year ago

We can leave it in for a few months and then remove it. We also need to update the Grafana charts to use the new metric, but I'll leave that to @adinhodovic.

hammady commented 1 year ago

This bug also manifests in another metric, celery_worker_tasks_active, which keeps reporting active tasks for workers that are no longer up. See below:

[screenshot: celery_worker_tasks_active reporting active tasks for pods that are no longer up]

Out of all these pods belonging to the same replicaset, only one is actually running, as shown below:

[screenshot: pod list showing only one of the replicaset's pods running]