Context
We have an app running on Django and Celery 5.3.1, with Redis as the broker. Producers submit jobs (basically an image and some context) to workers, which compress and/or convert the image and upload the result to S3 object storage.
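For concreteness, the worker task looks roughly like this (a minimal, hypothetical sketch: the function name, the compression step and the S3 details are illustrative, not our actual code):

import base64
import io

import boto3
from celery import shared_task
from PIL import Image

s3 = boto3.client("s3")

@shared_task
def process_image(image_b64: str, context: dict) -> str:
    # Decode the submitted image, compress/convert it, and upload the result to S3.
    img = Image.open(io.BytesIO(base64.b64decode(image_b64)))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=80)  # compress and/or convert
    key = f"{context['prefix']}/{context['name']}.jpg"
    s3.put_object(Bucket=context["bucket"], Key=key, Body=buf.getvalue())
    return key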
Here are our Django settings for Celery:
BROKER_URL = config("BROKER_URL")
CELERY_RESULT_BACKEND = config("CELERY_RESULT_BACKEND")
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERY_ACCEPT_CONTENT = ["json"]
CELERY_RESULT_SERIALIZER = "json"
CELERY_TASK_SERIALIZER = "json"
CELERY_ALWAYS_EAGER = False
CELERY_TASK_RESULT_EXPIRES = 600  # 600 seconds, i.e. 10 minutes
LARGE_QUEUE_THRESHOLD_BYTES = config("LARGE_QUEUE_THRESHOLD_BYTES", cast=int, default=512_000)
CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL = 20 # Not sure this is taken into account
CELERYD_PREFETCH_MULTIPLIER = 1
We start the workers with "--single-child" on the CLI.
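In case it is relevant, this is how the effective configuration can be checked at runtime, since we mix old-style setting names (a quick sketch; the import path of the Celery app instance is hypothetical):

# Quick runtime check that the old-style names above are actually applied.
from myproject.celery import app  # hypothetical import path to our Celery app instance

print(app.conf.worker_max_tasks_per_child)           # expect 100
print(app.conf.worker_prefetch_multiplier)           # expect 1
print(app.conf.result_expires)                       # expect 600
print(app.conf.redis_backend_health_check_interval)  # the value we are unsure about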
Investigation
We noticed very high network usage between the workers and the broker from time to time, ultimately leading to congestion (jobs not processed, producers stuck, Redis restarting). We investigated the issue:
The network usage always starts increasing exactly 1 hour after a worker crashes (typically because of an OOM kill); notably, 1 hour is the Redis transport's default visibility_timeout.
The network usage increases in steps, every 5 minutes: each time, another worker gets stuck with high network usage and stops logging. Its last log lines say that a job has been completed; nothing after that.
The issue is perfectly correlated with the unacked_mutex key being set, expiring 5 minutes later, being set again, and so on.
When the issue occurs, the Redis command hget is executed very heavily; it is never used otherwise.
When the issue occurs, the Redis command zrevrangebyscore is executed once every 5 minutes (the snippet below shows how we watched these keys and commands).
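For reference, this is roughly how we watched the broker while the issue was ongoing (a diagnostic sketch using redis-py; the URL is a placeholder, and unacked, unacked_index and unacked_mutex are kombu's default key names for the Redis transport):

import redis

r = redis.Redis.from_url("redis://broker:6379/0")  # placeholder for our BROKER_URL

print("unacked_mutex TTL (s):", r.ttl("unacked_mutex"))    # ~300 s and counting down while the issue is ongoing
print("unacked payloads:", r.hlen("unacked"))              # hash of delivery_tag -> full message payload
print("unacked_index entries:", r.zcard("unacked_index"))  # sorted set of delivery_tag scored by delivery time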
Issue
We think that workers try to restore unacked jobs once the visibility timeout has expired, but for some reason they never manage to, and get stuck in the attempt. Since the unacked_mutex expires after 5 minutes, another worker makes the attempt, and gets stuck in turn, every 5 minutes.
Based on the Redis commands executed, I would say the issue appears around here: https://github.com/celery/kombu/blob/3884eb9dd62bf3ee2d47dacc5f7a764936b16b54/kombu/transport/redis.py#L414
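To spell out why exactly those two commands show up, here is a standalone approximation of the access pattern we believe that code performs when restoring visible messages (a hedged sketch written with redis-py, not kombu's actual implementation; key names and defaults are the Redis transport's):

import time
import redis

r = redis.Redis.from_url("redis://broker:6379/0")  # placeholder for our BROKER_URL

VISIBILITY_TIMEOUT = 3600  # Redis transport default, in seconds
MUTEX_TTL = 300            # matches the 5-minute unacked_mutex expiry we observe

def restore_visible_approximation():
    # Only one worker at a time takes the mutex; it auto-expires after MUTEX_TTL.
    if not r.set("unacked_mutex", "1", nx=True, ex=MUTEX_TTL):
        return
    ceiling = time.time() - VISIBILITY_TIMEOUT
    # -> the zrevrangebyscore we see once every 5 minutes
    tags = r.zrevrangebyscore("unacked_index", ceiling, 0, start=0, num=10)
    for tag in tags:
        # -> the hget traffic we see: the full (possibly large) payload is re-read
        payload = r.hget("unacked", tag)
        # kombu would then re-publish the payload and remove the tag from both keys
    r.delete("unacked_mutex")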
Since we can tolerate a job not being processed, we changed the mutex TTL to 5 hours, which makes the issue propagate much more slowly and allows us to keep the service running. It worked as expected. A more robust mitigation would probably be to set the visibility timeout ridiculously high (see the sketch below).
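For concreteness, the two knobs look roughly like this (a sketch: the monkey-patch is one way the TTL can be bumped, with the attribute name taken from the linked kombu file, and BROKER_TRANSPORT_OPTIONS / visibility_timeout is the documented way to raise the visibility timeout; the exact values are illustrative):

# 1) Bump the unacked_mutex TTL from kombu's default of 300 s to 5 h
#    (a monkey-patch of the attribute used by the linked code, not an official setting).
from kombu.transport import redis as kombu_redis
kombu_redis.QoS.unacked_mutex_expire = 5 * 60 * 60

# 2) Raise the visibility timeout far above any realistic task duration
#    (old-style setting name, to match the rest of our configuration).
BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 7 * 24 * 3600}  # e.g. one week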
However, we would like to understand the actual root cause and have a proper fix for this.
Let me know if more information would be useful to investigate this issue and/or to suggest a proper fix. Thank you.