(2) seems like the better option, but the question is whether it makes for a complex query or complex visualization in Grafana. Opinions, @adinhodovic?
> which I guess needs some kind of background thread
Why do we need a background thread? Can't we simply use `get_worker_heartbeat`?
https://github.com/danihodovic/celery-exporter/blob/master/src/exporter.py#L188
I might be misunderstanding how this all works, as I'm just going off reading the source, but doesn't that function only fire when heartbeats are received, i.e. when the worker actually sends this: https://docs.celeryq.dev/en/stable/userguide/monitoring.html#worker-heartbeat? My guess re: a background thread was that if a worker dies without emitting an event and never comes back, then something external to the existing event-based callbacks needs to effectively act as a watchdog.
I suppose you could, in theory, check all workers you have seen previously in every `get_worker_heartbeat` callback, but then 1) you're doing a lot more work than necessary (O(N²) vs O(N)), and 2) it still falls over when you go from one worker to zero.
I meant that we set `celery_worker_last_heartbeat_timestamp` in `get_worker_heartbeat`.
Ah, that makes more sense. My run-on sentence was obviously unclear, sorry about that. What I meant to say was that part of the reason (1) seems harder is that it probably needs a background thread. You are of course right that for (2), adding the heartbeat timestamp in `get_worker_heartbeat` is trivial (this was also how I was thinking of doing it when I raised the issue).
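Roughly what I had in mind is the following minimal sketch. The gauge name is the one we've been discussing, but the label set and how the handler is wired into the event receiver are assumptions from my reading of the source:

```python
import time

from prometheus_client import Gauge

# The metric we discussed; the hostname label is an assumption
# based on how the other worker metrics appear to be labelled.
worker_last_heartbeat = Gauge(
    "celery_worker_last_heartbeat_timestamp",
    "Unix timestamp of the last heartbeat received from a worker",
    ["hostname"],
)

def get_worker_heartbeat(event):
    # Invoked for each worker-heartbeat event the receiver sees.
    # Recording the receive time lets alerting compare it against
    # time() with whatever staleness threshold the user prefers.
    worker_last_heartbeat.labels(hostname=event["hostname"]).set(time.time())
```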
Let's go ahead with approach #2 and open a PR. WDYT about removing the `celery_worker_up` metric in that case? I find it confusing to have both `celery_worker_last_heartbeat_timestamp` and `celery_worker_up`, especially if the latter is prone to erroneous readings.
I'm happy to remove it if you think that's ok.
I agree it's confusing in the current state, but removing it would be a breaking change, so I think it's a maintainer's call. I'll start working on a PR for (2) in any case.
We can leave it in for a few months and then remove it. We also need to update the Grafana charts to use the new metric, but I'll leave that to @adinhodovic.
The manifestation of this bug extends to another metric, `celery_worker_tasks_active`, where active tasks are reported from workers that are no longer up. See below:
Out of all these pods belonging to the same ReplicaSet, only one is running, see below:
`celery_worker_up` continues to emit `1` for a given worker hostname indefinitely after the worker is killed, if it never sends a `worker-offline` event (e.g. if it is SIGKILLed).

Relevant code block: https://github.com/danihodovic/celery-exporter/blob/8b2387bf83dc2478dee7a939dc14fbabd79c458c/src/exporter.py#L242-L246
You can verify this easily by starting up a worker and killing it. You'll see something like this:
I have two ideas for how to handle this:
1) Support some kind of "worker timeout" option: after not seeing a heartbeat for that long, `celery_worker_up` is set to `0` for that hostname. Users would need to set this in alignment with their Celery heartbeat interval settings, but a long default (10 minutes?) still seems better than never.
2) Export a new metric, `celery_worker_last_heartbeat_timestamp`: the unix timestamp of the last heartbeat received for a given worker.

(2) seems a lot simpler than (1), which I guess needs some kind of background thread (sketched below), though leaving (1) around seems like a footgun. I'm happy to submit a diff for (2) as it allows greater precision in alerting than the binary `up` metric.
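For reference, here's a rough sketch of what the background thread for (1) might look like. The timeout, check interval, and how state is shared with the event handlers are all assumptions, and `celery_worker_up` is redefined here only so the sketch is self-contained:

```python
import threading
import time

from prometheus_client import Gauge

# Redefined for the sketch; in practice this would reuse the
# exporter's existing celery_worker_up gauge.
worker_up = Gauge("celery_worker_up", "Whether a worker is up", ["hostname"])

class HeartbeatWatchdog(threading.Thread):
    """Flips celery_worker_up to 0 for workers whose heartbeats go stale."""

    def __init__(self, timeout=600.0, check_interval=10.0):
        super().__init__(daemon=True)
        self.timeout = timeout
        self.check_interval = check_interval
        self._lock = threading.Lock()
        self._last_seen = {}  # hostname -> unix timestamp of last heartbeat

    def record_heartbeat(self, hostname):
        # Called from the worker-heartbeat / worker-online event handlers.
        with self._lock:
            self._last_seen[hostname] = time.time()
        worker_up.labels(hostname=hostname).set(1)

    def run(self):
        while True:
            now = time.time()
            with self._lock:
                stale = [h for h, t in self._last_seen.items()
                         if now - t > self.timeout]
            for hostname in stale:
                # No heartbeat within the timeout: assume the worker died
                # without sending a worker-offline event (e.g. SIGKILL).
                worker_up.labels(hostname=hostname).set(0)
            time.sleep(self.check_interval)
```

The event handlers would call `record_heartbeat(event["hostname"])`, and the thread would need to be started alongside the event receiver. That extra moving part is exactly why (2) seems simpler to me.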