danihodovic / celery-exporter

A Prometheus exporter for Celery metrics
MIT License

Bug: Queue length doesn't seem right #218

Open michaelschem opened 1 year ago

michaelschem commented 1 year ago

I'm doing a bit of load testing and I'm seeing roughly correct numbers for tasks sent, but the tasks aren't always reflected in the queue length. The tasks are short (~10s), so maybe this is a queue length polling interval issue?

image

michaelschem commented 1 year ago

Not sure if it's related, but the active tasks metric is also off.

image

I've got a Python query I use to find the active tasks. I'm fairly sure it should work, and it shows no active tasks.

image
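(The query itself is only visible in the screenshot, so here is a hedged sketch of what such a check typically looks like using Celery's inspect API; the `myproject.celery` import path is an assumption, not taken from the issue.)

# Hedged sketch: list currently executing tasks per worker via Celery's inspect API.
# `myproject.celery` is a hypothetical import path for the project's Celery app.
from myproject.celery import app

inspector = app.control.inspect()
active = inspector.active() or {}  # {worker_name: [task_info, ...]}

for worker, tasks in active.items():
    print(worker, len(tasks))
    for task in tasks:
        print("  ", task["id"], task["name"])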

michaelschem commented 1 year ago

Any ideas on this? Still having this issue.

image

danihodovic commented 1 year ago

Celery prefetches 4 messages for each process by default. Each worker can run multiple processes, and you could be running multiple workers, so tasks can be reserved by workers (and thus removed from the queue) before they show up as active. See: https://docs.celeryq.dev/en/stable/userguide/optimizing.html#prefetch-limits
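(For illustration, a hedged sketch of how prefetching interacts with these counts and how the multiplier can be lowered; the broker URL and process count are assumptions.)

# Hedged sketch: with the default worker_prefetch_multiplier of 4 and a worker
# running 8 processes (-c 8), up to 4 * 8 = 32 messages can be reserved by that
# worker: no longer in the broker queue, but not yet counted as active.
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")  # hypothetical broker URL
app.conf.worker_prefetch_multiplier = 1  # fetch at most one message per process
app.conf.task_acks_late = True           # optional: acknowledge only after the task completes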

humbertowoody commented 1 year ago

Hello, @danihodovic! First of all, thank you for such fantastic software, it is really awesome.

We've been using this exporter to monitor our Celery deployment and everything works like a charm except for the queue length; I'll attach an image of how the dashboard looks. I find it hard to believe that the queue length is always zero, and it may be related to the issue @michaelschem reported.

We're using Django + Celery + Redis with events enabled, and it works perfectly (hence the other data being accurate); it's only that the queue length metric is always zero, which seems suspicious to me given the other values. We're using a single queue named "celery" (the default value, I assume), and the prefetch-multiplier value for each worker is set to 1 (down from the default of 4). I can provide any other details/info if required :)

Screenshot 2023-05-20 at 20 03 11
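(For reference, a hedged sketch of the configuration described above; the setting names are standard Celery options, but the actual settings module isn't shown in the comment and the broker URL is an assumption.)

# Hedged sketch of the setup described above (Django + Celery + Redis, one "celery" queue).
from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")
app.conf.worker_prefetch_multiplier = 1   # lowered from the default of 4, as described
app.conf.worker_send_task_events = True   # "events on": workers emit task events
app.conf.task_send_sent_event = True      # also emit task-sent events for the sent metrics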

Thanks in advance!

danihodovic commented 1 year ago

Hi Humberto,

Can you confirm the queue length using celery inspect or using redis-cli and then LLEN command directly?

humbertowoody commented 1 year ago

Hi, @danihodovic! Thank you for your response.

The response I'm getting from celery inspect active_queues is:

->  celery@celery-worker-56849cd87c-fmhnh: OK
    * {'name': 'celery', 'exchange': {'name': 'celery', 'type': 'direct', 'arguments': None, 'durable': True, 'passive': False, 'auto_delete': False, 'delivery_mode': None, 'no_declare': False}, 'routing_key': 'celery', 'queue_arguments': None, 'binding_arguments': None, 'consumer_arguments': None, 'durable': True, 'exclusive': False, 'auto_delete': False, 'no_ack': False, 'alias': None, 'bindings': [], 'no_declare': None, 'expires': None, 'message_ttl': None, 'max_length': None, 'max_length_bytes': None, 'max_priority': None}
... (repeated 40 times, since I have 40 workers) ...

And from the redis-cli's LLEN (my DB index is 9):

127.0.0.1:6379[9]> llen celery
(integer) 0

Maybe this StackOverflow question is what's happening? Basically, the queue is drained so quickly that its length is always zero by the time it's sampled? That would be a happy problem :)
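(To test that hypothesis, a hedged sketch that samples the queue length much faster than the exporter scrapes it; the Redis connection details mirror the LLEN example above and are otherwise assumptions.)

# Hedged sketch: sample the "celery" list length at a high frequency to see
# whether it is ever briefly non-zero between the exporter's polls.
import time
import redis

r = redis.Redis(host="localhost", port=6379, db=9)  # DB index 9, as in the LLEN example above
max_seen = 0
for _ in range(1000):  # roughly 10 seconds at 10 ms per sample
    max_seen = max(max_seen, r.llen("celery"))
    time.sleep(0.01)
print("max queue length observed:", max_seen)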

Thanks in advance :)

danihodovic commented 1 year ago

That's my experience in one of the larger projects where we use Celery. We scale between 10 and 20 workers, and the queue length almost always stays at 0. @adinhodovic has more context