Issue
When long polling is used (wait_time_seconds > 0) and task_acks_late=True, a message can become stuck for the full visibility timeout when Celery is shut down. I was able to replicate this consistently with cold shutdowns.
_A similar issue happens with warm shutdowns (see the related discussion and bugs linked below). With task_acks_late=False, I believe this causes the message to be lost completely._
To reproduce
Execute a long(ish) running job (long enough for you to perform the steps below).
Send SIGQUIT to the Celery master process.
See "Restoring X unacknowledged message(s)" in the worker output.
See that the message is still in flight with a full visibility timeout.
Note: this is a race condition, so it doesn't happen every time, but in my tests it happened most of the time (roughly 80%).
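The SIGQUIT step above can be scripted, which helps when repeating the test enough times to hit the race. This is a minimal sketch using only the standard library; the helper name cold_shutdown is made up for illustration, and finding the master PID is left to you.

```python
import os
import signal


def cold_shutdown(master_pid: int) -> None:
    """Send SIGQUIT to the given process (the signal used in the steps above)."""
    os.kill(master_pid, signal.SIGQUIT)


# Example (hypothetical PID of the Celery master process):
# cold_shutdown(12345)
```

This is equivalent to running kill -QUIT against the master PID from a shell.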
What happens
I believe the internal details are described in https://github.com/celery/kombu/issues/1819.
On the AWS side, the CloudTrail logs show the ChangeMessageVisibility call arriving correctly, but right after it there is at least one ReceiveMessage call (sometimes more, depending on your concurrency) that fetches the message from the queue again. This resets its visibility timeout to the full configured value instead of zero.
This fix
This change sets the visibility timeout of the redelivered message to the wait_time_seconds value instead of 0 (zero). While this doesn't solve the underlying issue, it prevents the message from being re-fetched by the still-running ReceiveMessage call(s).
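As a rough illustration of the idea (not the actual kombu code), restoring a message boils down to a ChangeMessageVisibility call with a non-zero timeout. The function name and its parameters below are hypothetical; sqs_client is assumed to be a boto3 SQS client or anything exposing the same change_message_visibility method.

```python
def restore_with_wait_time(sqs_client, queue_url, receipt_handle, wait_time_seconds=10):
    """Make a restored message visible again after wait_time_seconds, not immediately.

    Hypothetical helper for illustration; kombu's own restore path differs.
    """
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        # Previously 0: the message became visible at once and could be
        # picked up by a long poll that was still open.
        VisibilityTimeout=wait_time_seconds,
    )
```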
For most scenarios this should create a considerable improvement, because:
With task_acks_late=True, the message will be hidden for wait_time_seconds (default 10 seconds) instead of the full visibility timeout (default 30 minutes). I tested the change in this scenario and it works as expected.
With task_acks_late=False, this should prevent the message from being lost completely. I did not test that scenario, however.
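The timing argument can be sketched with a toy model. It assumes a single ReceiveMessage long poll still in flight at shutdown and ignores network latency; the function and parameter names are invented for this illustration and do not come from kombu or SQS.

```python
def seconds_until_available(restore_timeout, poll_remaining, visibility_timeout=1800):
    """Return how long after shutdown the message stays hidden from other consumers.

    restore_timeout: VisibilityTimeout used when restoring the message.
    poll_remaining: seconds left on a ReceiveMessage long poll still in flight.
    """
    if restore_timeout < poll_remaining:
        # The message reappears while the poll is still open, so the poll
        # receives it and SQS hides it again for the full visibility timeout.
        return restore_timeout + visibility_timeout
    # The poll ends first; the message simply reappears after restore_timeout.
    return restore_timeout


wait_time_seconds = 10
poll_remaining = 7  # can never exceed wait_time_seconds

print(seconds_until_available(0, poll_remaining))                  # 1800
print(seconds_until_available(wait_time_seconds, poll_remaining))  # 10
```

Because a long poll lasts at most wait_time_seconds, using that value as the restore timeout means the open poll has always finished by the time the message becomes visible again.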
Fix for: https://github.com/celery/kombu/issues/1835
Partially also: https://github.com/celery/kombu/issues/1819
Possibly related: https://github.com/celery/celery/discussions/8583