celery / kombu

Messaging library for Python.
http://kombu.readthedocs.org/
BSD 3-Clause "New" or "Revised" License
2.86k stars 927 forks source link

SQS messages can be dropped during shutdown #1819

Open mgorven opened 10 months ago

mgorven commented 10 months ago

When celery is stopped gracefully, the in-flight request to SQS is not completed. If SQS returns messages for this request it considers them delivered, but Celery doesn't do anything with them (even if task_reject_on_worker_lost is set). SQS will redeliver the message when the visibility timeout expires, but this isn't great for a graceful shutdown.

SQS uses CurlClient to make requests, which uses celery.worker.loops.asynloop to do IO. The event loop is the first thing to stop when SIGTERM/SIGINT is received, and so in-flight requests are simply dropped.

auvipy commented 10 months ago

@rafidka any insight to share?

rafidka commented 10 months ago

@auvipy , I would need to do some debugging on this. I will try to allocate some time for it, but I cannot promise anything this week.

@mgorven , it would help if you have some easy step-by-step reproduction.

mgorven commented 10 months ago

I don't have an easy repro of actually losing events, I'll work on that. What I've done is add debug prints in CurlClient._setup_request() and CurlClient._process() and I can see that a request is started but never finished during shutdown: https://gist.github.com/mgorven/f671b3b9384b2814d86f7e99451a2936

mgorven commented 10 months ago
  1. Setup celery with an SQS queue which isn't processing other events and with a visibility timeout of a few minutes
  2. In another shell have a celery call command ready to dispatch a task
  3. Add a debug print in CurlSetup._setup_request() so you can see when a request is started
  4. Run celery worker
  5. Ctrl-C the worker just after it starts an SQS request
  6. Run celery call
  7. Start the worker again

You'd expect the worker to immediately process the dispatched task, but instead it only does so after the visibility timeout expires.

Nusnus commented 2 weeks ago

Potentially fixed with https://github.com/celery/celery/pull/9213 (although it’s in Celery, not Kombu) @mgorven @rafidka

Nusnus commented 1 week ago

Potentially fixed with celery/celery#9213 (although it’s in Celery, not Kombu) @mgorven @rafidka

Celery v5.5.0b3 released.