celery / kombu

Messaging library for Python.
http://kombu.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Celery worker stuck with unack mutex and high network usage with Kombu & Redis #1816

Open MathieuLamiot opened 12 months ago

MathieuLamiot commented 12 months ago

Context

We have an app running on Django and Celery 5.3.1, with Redis as the broker. The producers submit jobs (basically an image and some context) to the workers, which compress and/or convert the image and upload it to S3 object storage. Here are our Django settings for Celery:

BROKER_URL = config("BROKER_URL")
CELERY_RESULT_BACKEND = config("CELERY_RESULT_BACKEND")
CELERYD_MAX_TASKS_PER_CHILD = 100
CELERY_ACCEPT_CONTENT = ["json"]
CELERY_RESULT_SERIALIZER = "json"
CELERY_TASK_SERIALIZER = "json"
CELERY_ALWAYS_EAGER = False
CELERY_TASK_RESULT_EXPIRES = 600  # 600 seconds = 10 minutes
LARGE_QUEUE_THRESHOLD_BYTES = config("LARGE_QUEUE_THRESHOLD_BYTES", cast=int, default=512_000)
CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL = 20  # Not sure this is taken into account
CELERYD_PREFETCH_MULTIPLIER = 1
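
On the comment above about CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL: I am not sure the old-style uppercase name is mapped at all. For reference, here is a minimal sketch of the new-style (lowercase) equivalents as I read them in the Celery docs; the URLs are placeholders and the name mapping is my assumption, so please correct me if it is wrong:

from celery import Celery

app = Celery("imagify")
app.conf.update(
    broker_url="redis://localhost:6379/0",       # placeholder, same value as BROKER_URL
    result_backend="redis://localhost:6379/1",   # placeholder, same value as CELERY_RESULT_BACKEND
    redis_backend_health_check_interval=20,      # new-style name of the health check setting
    worker_prefetch_multiplier=1,                # new-style name of CELERYD_PREFETCH_MULTIPLIER
    result_expires=600,                          # new-style name of CELERY_TASK_RESULT_EXPIRES
)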

We start the workers with the following CLI arguments:

              "--single-child",
              "--",
              "celery",
              "--app",
              "imagify",
              "worker",
              "--loglevel",
              "warning",
              "--max-memory-per-child",
              "262144",  # 256 MB in KB
              "--max-tasks-per-child",
              "50",
              "--concurrency",
              "4",
              "--queues",
              "pro",
              "--heartbeat-interval",
              "30",

Investigation

We noticed very high network usage between the workers and the broker from time to time, ultimately leading to congestion (jobs not processed, producers stuck, and Redis restarting). We investigated the issue and describe our findings below.
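
For anyone investigating something similar, the relevant state can be inspected directly in Redis. The key names below are kombu's defaults for the Redis transport (unacked, unacked_index, unacked_mutex); adjust them if you override the transport options or use a non-default database:

# Size of the unacked payload hash and its index
redis-cli HLEN unacked
redis-cli ZCARD unacked_index

# Is a worker currently holding the restore mutex, and for how long?
redis-cli GET unacked_mutex
redis-cli TTL unacked_mutex

# Live view of the commands causing the traffic spike
redis-cli MONITOR | grep -i unacked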

Issue

We think that workers try to restore an unacked job once the visibility timeout is over but, for some reason, never manage to do it and get stuck. Since the unacked_mutex expires after 5 minutes, another worker gets to try, and gets stuck in turn, every 5 minutes. Based on the Redis commands executed, I would say the issue appears around here: https://github.com/celery/kombu/blob/3884eb9dd62bf3ee2d47dacc5f7a764936b16b54/kombu/transport/redis.py#L414
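
To make the suspected mechanism concrete, here is a simplified sketch of what I understand the restore path to do: one worker takes a TTL-guarded Redis mutex, reads entries older than the visibility timeout from the unacked index, and pushes their payloads back onto the queues. This is my own rough reconstruction with redis-py, not kombu's actual code; the payload layout and the queue-push step are simplified assumptions:

from time import time
import json
import redis

client = redis.Redis()        # same Redis instance as the broker (assuming db 0, no key prefix)

VISIBILITY_TIMEOUT = 3600     # seconds
MUTEX_TTL = 300               # the 5-minute unacked_mutex expiry mentioned above

def restore_visible():
    # Only one worker at a time should restore messages: take a TTL-guarded mutex.
    if not client.set("unacked_mutex", "lock", nx=True, ex=MUTEX_TTL):
        return  # another worker is (supposedly) restoring right now
    try:
        ceil = time() - VISIBILITY_TIMEOUT
        # Entries delivered before `ceil` are considered lost and must be re-queued.
        for tag in client.zrevrangebyscore("unacked_index", ceil, "-inf"):
            payload = client.hget("unacked", tag)
            if payload is None:
                continue
            message, exchange, routing_key = json.loads(payload)  # simplified payload layout
            # Push the raw message back onto its queue (simplified: queue name == routing key).
            client.lpush(routing_key, json.dumps(message))
            client.zrem("unacked_index", tag)
            client.hdel("unacked", tag)
    finally:
        client.delete("unacked_mutex")

In our case the worker holding the mutex seems to hang somewhere inside that loop, so the mutex is only released when its TTL expires, at which point the next worker repeats the pattern.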

Since we don't care if a job occasionally goes unprocessed, we changed the mutex TTL to 5 hours, which makes the issue propagate much more slowly and lets us keep the service running; it worked as expected. A more durable mitigation would probably be to set the visibility timeout extremely high. However, we would like to understand the actual root cause and get a proper fix for this.
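
For completeness, here is roughly how those mitigations look in our Django settings. visibility_timeout is a documented option of the Redis transport; raising the mutex TTL is done by overriding kombu's default, and whether that override can be passed as a transport option (rather than patching) is an assumption on my side, so double-check the name before relying on it:

# settings.py -- sketch of the two mitigations (values are examples)
BROKER_TRANSPORT_OPTIONS = {
    # Documented Redis transport option: how long a delivered-but-unacked
    # message stays invisible before workers try to restore it.
    "visibility_timeout": 60 * 60 * 24,  # 24 hours, i.e. "ridiculously high"
    # How we raised the mutex TTL to 5 hours; name assumed from kombu's
    # Redis transport attributes, please verify.
    "unacked_mutex_expire": 60 * 60 * 5,
}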

Let me know if more information would be useful to investigate this issue and/or to suggest a proper fix. Thank you.

auvipy commented 11 months ago

If you have more information to share to help find the root cause, please feel free :)