sayanarijit closed this issue 4 years ago.
I can confirm this patch works.
Reproduced the same here: https://github.com/sayanarijit/remap-sigterm/tree/text/fix2
I might have closed that mistakenly! Would you mind coming up with a PR with a unit test for that code?
Tested here by installing the patched celery directly from GitHub and deploying to Heroku from the mentioned branch.
Works fine.
I'm new to Celery, so I'm probably totally missing something here, but I'm trying to figure out (in the context of Heroku) what the difference is between the expected behavior described here using `REMAP_SIGTERM=SIGQUIT`, and a scenario where `REMAP_SIGTERM` is not set and worker processes receive SIGTERM directly from Heroku and shut down (as described here).
Since the example enables `acks_late` and `task_reject_on_worker_lost`, isn't the Actual Behavior the same end result as the Expected Behavior once the "After 30 seconds it kills the worker." step is hit? When Heroku sends SIGKILL, won't the unfinished task remain unacknowledged, given `task_reject_on_worker_lost`? The Actual Behavior implies that the queue will be empty when the worker comes back, but I can't quite figure out why that would be the case.
Hi @krmannix, by now I've probably forgotten a lot of the details, but here's my best guess:
The difference is celery killing the worker vs heroku killing the worker.
To execute each task, the worker pops the task from the queue and starts working on it. If celery is killing the worker, celery gets to put unacknowledged tasks back into the queue before dying, but when heroku kills the worker, celery can't, resulting in lost tasks.
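To make that concrete, here's a minimal sketch of the late-ack setup under discussion (the app name and broker URL are placeholders, not from this issue):

```python
from celery import Celery

# Placeholder app and broker; any Redis or AMQP broker behaves the same way.
app = Celery("proj", broker="redis://localhost:6379/0")

# With late acks, the worker acknowledges the message only after the task
# returns, so a task that dies mid-execution stays unacknowledged.
app.conf.task_acks_late = True

@app.task
def add(x, y):
    return x + y
```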
@sayanarijit thanks for your quick response on an old issue!
I think I understand the mechanism, where celery can put unacknowledged tasks back into the queue before dying, and why that doesn't happen when Heroku kills the worker.
The thing that still gives me a little pause is that, with `acks_late` and `task_reject_on_worker_lost` enabled, tasks won't be acknowledged until they're complete, so if Heroku kills them in the middle of execution they won't be acknowledged. Celery's docs indicate that in this situation, tasks won't be lost:
> A task message is not removed from the queue until that message has been acknowledged by a worker. ... If your task is idempotent you can set the `acks_late` option to have the worker acknowledge the message after the task returns instead.
Which I'm interpreting to mean that the end result is the same: regardless of whether Celery or Heroku kills the worker, the task isn't acknowledged and thus isn't lost; however, it sounds like that interpretation might not be correct.
> Celery's docs indicate that in this situation, tasks won't be lost:
You are right, that should have been the ideal case. However, the doc mentions a "re-queue" step. So we can assume that the task is lost if it isn't re-queued, even if it's never acknowledged. And SIGTERM from celery doesn't let the re-queue happen.
> `task_reject_on_worker_lost`
> Setting this to true allows the message to be re-queued instead, so that the task will execute again by the same worker, or another worker.
But I see now, these are some conflicting docs.
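For reference, the two settings side by side in one config (a hedged sketch; only the two option names come from the docs quoted above):

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")  # placeholder broker

# Acknowledge only after the task returns...
app.conf.task_acks_late = True
# ...and if the worker process executing the task dies, reject the message
# with requeue so the broker puts it back on the queue.
app.conf.task_reject_on_worker_lost = True
```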
Ah, perhaps it's this: in the case with `acks_late` enabled, `REMAP_SIGTERM=SIGQUIT` allows Heroku users to not turn on `task_reject_on_worker_lost`, as the cold shutdown will immediately kill the worker and put unacknowledged tasks back on the queue. Otherwise, without `REMAP_SIGTERM=SIGQUIT`, one would enable `task_reject_on_worker_lost` and have the Heroku SIGTERM kill the worker for us to reach the same behavior. Does that make sense?
(btw, i appreciate your continued involvement in responding to my inquiries 😄 )
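If that hypothesis holds, the two setups would look roughly like this (a sketch; the two option names are real Celery settings, everything else is illustrative):

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")
app.conf.task_acks_late = True

# Setup A: start the worker with REMAP_SIGTERM=SIGQUIT in the environment,
# so Heroku's SIGTERM triggers a cold shutdown that re-queues unacked tasks.

# Setup B: no remap; instead rely on this setting to re-queue the task after
# Heroku's SIGKILL takes the worker down mid-execution.
app.conf.task_reject_on_worker_lost = True
```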
I think there's some confusion...
As per the docs we know:

So, we need both of them to work together.

But without the remap, `task_reject_on_worker_lost` doesn't get to do its thing: the application terminates before the task is re-queued. The mapping holds off the termination until `task_reject_on_worker_lost` has done its thing and the task is re-queued.
Update:
I forgot to clarify: the unacknowledged tasks need to be re-queued for celery to be able to attempt the execution again when the workers come back, which is the workflow for idempotent tasks.
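As an illustration of that workflow, an idempotent task is one where a re-queued retry leaves the same end state. A hypothetical sketch (only the two task options are real Celery names; the Redis usage is made up for the example):

```python
import redis
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")
store = redis.Redis()

@app.task(acks_late=True, reject_on_worker_lost=True)
def mark_paid(order_id):
    # SET is an upsert: if the worker dies and the task is re-queued,
    # running it a second time leaves exactly the same state.
    store.set(f"order:{order_id}:paid", 1)
```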
Ah, I think I got it. Thanks for your help!
I don't get this @sayanarijit @auvipy. I think the old behavior described in "Actual Behavior" is the correct one: `REMAP_SIGTERM=SIGQUIT` should do a warm shutdown, not a cold one.
@krmannix was also confused because the old "Actual Behavior" was correct.
I think https://github.com/celery/celery/pull/6249 should be reverted. This message from another user in celery-users says the same.
did you read the full discussion? also that can't be automatically reverted.
Yes, I read the full discussion. AFAIK, if the dev wants the "Expected Behavior" described by OP, they should simply drop `REMAP_SIGTERM`. The "Actual Behavior" was the correct one: it's the same thing that would happen outside of Heroku if one sent SIGTERM. So, to avoid losing tasks, the dev should either rely on warm shutdown completing cleanly, or use `acks_late` and a visibility timeout.
@fjsj could you please create a poc like https://github.com/sayanarijit/remap-sigterm to demonstrate how you'd implement idempotent tasks with warm shutdown?
@sayanarijit Why did you change the behavior from warm shutdown to cold shutdown if REMAP_SIGTERM is present? I ask because I think we're misaligned: you seem to think Cold means waiting for tasks / re-queuing, and Warm is the opposite. But the docs state otherwise:
| Signal | Description |
| --- | --- |
| TERM | Warm shutdown, wait for tasks to complete. |
| QUIT | Cold shutdown, terminate ASAP |
| USR1 | Dump traceback for all active threads. |
| USR2 | Remote debug, see celery.contrib.rdb. |
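For context, a conceptual sketch of what the remap does (this is not Celery's actual implementation, just an illustration of forwarding one signal's handling to another's):

```python
import os
import signal

remap = os.environ.get("REMAP_SIGTERM")  # e.g. "SIGQUIT"
if remap:
    target = getattr(signal, remap)

    def forward(signum, frame):
        # Invoke whatever handler is registered for the target signal,
        # so SIGTERM is treated as if the remapped signal had arrived.
        handler = signal.getsignal(target)
        if callable(handler):
            handler(target, frame)

    signal.signal(signal.SIGTERM, forward)
```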
AFAIK, if you want the "Actual Behavior" you shouldn't have changed from warm to cold shutdown. Instead, you should have:
That should be enough and wouldn't require the change you made.
Also, when you mention:
> we can assume that the task is lost if it isn't re-queued, even if it's never acknowledged.
That's wrong. `visibility_timeout` in the Redis broker handles that. I think the docs changed and this part was lost, but you can learn about it by checking the v4 docs. In RabbitMQ, it works similarly: it has an internal config to deal with that.
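For reference, this is where that setting lives with the Redis transport (a sketch; 3600 seconds is the default, the same value that appears in the celery report at the bottom of this issue):

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Unacked messages are re-delivered once this many seconds pass without an
# ack, regardless of how the worker died. It must exceed the longest expected
# task runtime, or still-running tasks will be delivered a second time.
app.conf.broker_transport_options = {"visibility_timeout": 3600}
```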
> I think we're misaligned: you seem to think Cold means waiting for tasks / re-queuing, and Warm is the opposite.
No. Cold's immediate shutdown was what we needed, because the intention was to quit as soon as heroku sends sigterm, before heroku force kills celery along with all the tasks, without ever sending sigquit.
> That should be enough and wouldn't require the change you made.
The PoC linked in the issue demonstrates it wasn't enough, with or without the remap, and that the implemented change was the fix.
Regarding the visibility timeout, it wasn't mentioned in any workflow for implementing idempotent tasks, so I'm not sure it will actually work in practice unless we have a PoC.
But I guess, since heroku no longer provides a free tier, it wouldn't be easy to create the PoC, so honestly, I'm not sure what the way forward is (paying heroku to work for them is the opposite of how it should be).
edit: celery -> heroku
> Regarding the visibility timeout, it wasn't mentioned in any workflow for implementing idempotent tasks, so I'm not sure it will actually work in practice unless we have a PoC.
Note that you would have to wait 1 hour for the task to be re-queued, because that's the default visibility timeout. It should work; if it doesn't, IMHO the fix is to solve the problem in the visibility timeout, not to force the cold shutdown behavior.
That said, I do understand Heroku could need a specific behavior that would require a specific shutdown behavior due to the way it kills processes, but in theory that shouldn't be the case because the visibility timeout should re-queue tasks regardless of how the worker process is killed. Note the visibility timeout should even handle the case of the worker vanishing immediately, w/o ever having time to send any packet to the broker, because that's exactly what happens when there's a power loss.
Unfortunately, I don't have time to write a PoC for this and I don't have the interest, because I don't use Heroku anymore.
I don't care about heroku anymore. So if you can come up with a manual revert PR referencing all the reasoning, including the re-queue / visibility timeout points, that would be great for us to make the technical decision.
https://github.com/celery/celery/pull/9213 adds documentation for `REMAP_SIGTERM` plus smoke tests, and also potentially addresses the visibility timeout problem during worker shutdown.
Celery v5.5.0b3 released.
Checklist

- I have verified that the issue exists against the `master` branch of Celery.

Mandatory Debugging Information

- I have included the output of `celery -A proj report` in the issue. (if you are not able to do this, then at least specify the Celery version affected).
- I have verified that the issue exists against the `master` branch of Celery.
- I have included the contents of `pip freeze` in the issue.

Optional Debugging Information
Related Issues and Possible Duplicates
Related Issues
Possible Duplicates
Environment & Settings
Celery version: 4.4.6

`celery report` Output:

```
software -> celery:4.4.6 (cliffs) kombu:4.6.11 py:3.7.6
            billiard:3.6.3.0 redis:3.5.3
platform -> system:Darwin arch:64bit
            kernel version:19.6.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

broker_url: 'redis://h:********@ec2-23-21-96-121.compute-1.amazonaws.com:24849//'
celery_backend_result: 'redis://h:********@ec2-23-21-96-121.compute-1.amazonaws.com:24849/'
task_acks_late: True
task_reject_on_worker_lost: True
task_time_limit: 300
broker_transport_options: {
    'visibility_timeout': 3600}
```
Steps to Reproduce
Required Dependencies
Python Packages
`pip freeze` Output:

```
amqp==2.6.0
appdirs==1.4.4
attrs==19.3.0
billiard==3.6.3.0
black==19.10b0
celery==4.4.6
click==7.1.2
flake8==3.8.3
future==0.18.2
importlib-metadata==1.7.0
kombu==4.6.11
mccabe==0.6.1
mypy==0.782
mypy-extensions==0.4.3
pathspec==0.8.0
pycodestyle==2.6.0
pyflakes==2.2.0
pytz==2020.1
redis==3.5.3
regex==2020.7.14
toml==0.10.1
typed-ast==1.4.1
typing-extensions==3.7.4.2
vine==1.3.0
zipp==3.1.0
```
Other Dependencies
N/A
Minimally Reproducible Test Case
I have reproduced it here: https://github.com/sayanarijit/remap-sigterm
Expected Behavior
https://github.com/sayanarijit/remap-sigterm#what-should-happen
Actual Behavior
https://github.com/sayanarijit/remap-sigterm#what-happens