celery / kombu

Messaging library for Python.
http://kombu.readthedocs.org/
BSD 3-Clause "New" or "Revised" License
2.82k stars 920 forks source link

SQS: when redelivering a message apply visibility timeout based on wait_time_seconds (long polling time) #2049

Open pbudzon opened 6 days ago

pbudzon commented 6 days ago

Fix for: https://github.com/celery/kombu/issues/1835 Partially also: https://github.com/celery/kombu/issues/1819 Possibly related: https://github.com/celery/celery/discussions/8583

Issue when not using short polling (wait_time_seconds > 0) and task_acks_late=True it is possible for message to become stuck with full visibility timeout when celery is shut down. I was able to replicate that consistently with cold shutdowns.

_Similar issue happens with warm shutdowns (see related discussion/bugs above). And with task_acks_late=False I believe this causes the message to become lost completely._

To reproduce

  1. Execute a long(ish) running job (long enough for you to perform those steps).
  2. Send SIGQUIT to celery master process.
  3. See Restoring X unacknowledged message(s).
  4. See that the message is still in-flight with a full visibility timeout.

Note: this is a race condition, so it doesn't happen every time, but in my tests it happened MOST of the time (like 80%).

What happens I believe the internal details are described in https://github.com/celery/kombu/issues/1819. On AWS side, you can see in CloudTrail logs the ChangeMessageVisibility call appearing correctly, but right after it there is (at least one, sometimes more, depending on your concurrency) a ReceiveMessage call which fetches the message again from the queue, causing its visibility timeout to be set back to non-zero (whatever your settings are).

Screenshot 2024-07-01 at 19 52 45

This fix This change changes the visibility timeout of the redelivered message from 0 (zero) to wait_time_seconds value. While this doesn't solve the underlying issue, it prevents the message from being re-fetched by the still-running ReceiveMessage call(s). For most scenarios this should create a considerable improvement, because:

  1. With task_acks_late=True, the message will be hidden for wait_time_seconds (default 10 seconds), instead of visibility timeout (default 30 minutes). I have tested that change in this scenario and it is working as expected.
  2. With task_acks_late=False, this should prevent the message from being lost completely. I did not test that scenario, however.
pbudzon commented 5 days ago

Issue also reported here: https://github.com/celery/celery/discussions/7283 back in 2022.

And https://github.com/celery/celery/issues/8875

pbudzon commented 2 days ago

Do you need anything else to merge this?