celery / celery

Distributed Task Queue (development branch)
https://docs.celeryq.dev

REMAP_SIGTERM=SIGQUIT not working #6244

Closed sayanarijit closed 4 years ago

sayanarijit commented 4 years ago

Checklist

Mandatory Debugging Information

Optional Debugging Information

Related Issues and Possible Duplicates

Related Issues

Possible Duplicates

Environment & Settings

Celery version:

celery report Output:

```
software -> celery:4.4.6 (cliffs) kombu:4.6.11 py:3.7.6
            billiard:3.6.3.0 redis:3.5.3
platform -> system:Darwin arch:64bit
            kernel version:19.6.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

broker_url: 'redis://h:********@ec2-23-21-96-121.compute-1.amazonaws.com:24849//'
celery_backend_result: 'redis://h:********@ec2-23-21-96-121.compute-1.amazonaws.com:24849/'
task_acks_late: True
task_reject_on_worker_lost: True
task_time_limit: 300
broker_transport_options: {
    'visibility_timeout': 3600}
```
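For reference, the settings in the report correspond to a Celery configuration roughly like the sketch below (hedged reconstruction; the broker URL is redacted, and note that `celery_backend_result` in the report is not a real Celery setting name — the correct key is `result_backend`, which is consistent with `results: disabled` above):

```python
# Sketch of the configuration shown in the report (not the OP's actual code).
config = {
    "broker_url": "redis://:<password>@<host>:24849//",  # redacted in the report
    "task_acks_late": True,               # ack only after the task finishes
    "task_reject_on_worker_lost": True,   # re-queue if the worker dies mid-task
    "task_time_limit": 300,               # hard per-task limit, in seconds
    "broker_transport_options": {"visibility_timeout": 3600},
}

# With a real Celery app this would be applied as:
#   app = Celery(__name__)
#   app.conf.update(config)
```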

Steps to Reproduce

Required Dependencies

Python Packages

pip freeze Output:

```
amqp==2.6.0
appdirs==1.4.4
attrs==19.3.0
billiard==3.6.3.0
black==19.10b0
celery==4.4.6
click==7.1.2
flake8==3.8.3
future==0.18.2
importlib-metadata==1.7.0
kombu==4.6.11
mccabe==0.6.1
mypy==0.782
mypy-extensions==0.4.3
pathspec==0.8.0
pycodestyle==2.6.0
pyflakes==2.2.0
pytz==2020.1
redis==3.5.3
regex==2020.7.14
toml==0.10.1
typed-ast==1.4.1
typing-extensions==3.7.4.2
vine==1.3.0
zipp==3.1.0
```

Other Dependencies

N/A

Minimally Reproducible Test Case

I have reproduced it here: https://github.com/sayanarijit/remap-sigterm

Expected Behavior

https://github.com/sayanarijit/remap-sigterm#what-should-happen

Actual Behavior

https://github.com/sayanarijit/remap-sigterm#what-happens

sayanarijit commented 4 years ago

I can confirm this patch works.

Reproduced the same here: https://github.com/sayanarijit/remap-sigterm/tree/text/fix2

auvipy commented 4 years ago

I might have closed that mistakenly! Would you mind coming up with a PR with a unit test for that code?

sayanarijit commented 4 years ago

Tested here by installing the patched celery directly from GitHub and deploying to Heroku from the mentioned branch.

Works fine.

krmannix commented 2 years ago

I'm new to Celery, so I'm probably totally missing something here, but I'm trying to figure out (in the context of Heroku) what the difference is between the expected behavior described here using REMAP_SIGTERM=SIGQUIT and a scenario where REMAP_SIGTERM is not set and worker processes receive SIGTERM directly from Heroku and shut down (as described here)?

Since the example enables acks_late and task_reject_on_worker_lost, isn't the Actual Behavior the same end result as the Expected Behavior once the "After 30 seconds it kills the worker" step is hit? When Heroku sends SIGKILL, won't the unfinished task remain unacknowledged thanks to task_reject_on_worker_lost? The Actual Behavior implies that the queue will be empty when the worker comes back, but I can't quite figure out why that would be the case.

sayanarijit commented 2 years ago

Hi @krmannix, by now I probably forgot a lot of the details, but here's my best guess:

The difference is celery killing the worker vs heroku killing the worker.

To execute each task, the worker pops the task from the queue and starts working on it. If celery is killing the worker, celery gets to put unacknowledged tasks back into the queue before dying, but when heroku kills the worker, celery can't, resulting in lost tasks.
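The pop/ack mechanics described above can be modeled with a toy broker (hypothetical names; a real broker tracks unacked deliveries server-side rather than in Python lists):

```python
from collections import deque

class ToyBroker:
    """Minimal model of at-least-once delivery with late acks."""

    def __init__(self, tasks):
        self.queue = deque(tasks)
        self.unacked = []          # delivered but not yet acknowledged

    def pop(self):
        task = self.queue.popleft()
        self.unacked.append(task)  # hidden from other workers, not deleted
        return task

    def ack(self, task):
        self.unacked.remove(task)  # task finished; forget it for good

    def requeue_unacked(self):
        # What a controlled Celery shutdown gets to do: put unacked
        # tasks back so another worker can pick them up.
        while self.unacked:
            self.queue.append(self.unacked.pop())

broker = ToyBroker(["send_email"])
task = broker.pop()            # worker starts the task...
broker.requeue_unacked()       # controlled shutdown: re-queue before exiting
assert list(broker.queue) == ["send_email"]   # task survives

# Under an uncontrolled kill (Heroku's SIGKILL), requeue_unacked() never
# runs, so the task sits unacked until the broker itself redelivers it
# (e.g. the Redis transport's visibility timeout, discussed further down).
```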

krmannix commented 2 years ago

@sayanarijit thanks for your quick response on an old issue!

I think I understand the mechanism, where celery can put unacknowledged tasks back into the queue before dying, and why that doesn't happen when Heroku kills the worker.

The thing that still gives me a little pause is that, with acks_late and task_reject_on_worker_lost enabled, tasks won't be acknowledged until they're complete, so if Heroku kills them in the middle of execution they won't be acknowledged. Celery's docs indicate that in this situation, tasks won't be lost:

A task message is not removed from the queue until that message has been acknowledged by a worker. ... If your task is idempotent you can set the acks_late option to have the worker acknowledge the message after the task returns instead.

Which I'm interpreting to mean that the end result is the same: regardless of whether Celery or Heroku kills the worker, the task isn't acknowledged and thus isn't lost; however, it sounds like that interpretation might not be correct.

sayanarijit commented 2 years ago

Celery's docs indicate that in this situation, tasks won't be lost:

You are right, that should have been the ideal case. However, the doc mentions a "re-queue" step. So we can assume that the task is lost if it isn't re-queued, even if it's never acknowledged. And the SIGTERM, as Celery handles it, doesn't let the re-queue happen.

sayanarijit commented 2 years ago

task_reject_on_worker_lost

Setting this to true allows the message to be re-queued instead, so that the task will execute again by the same worker, or another worker.

But I see now, these are some conflicting docs.

krmannix commented 2 years ago

ah, perhaps it's this: in the case with acks_late enabled, REMAP_SIGTERM=SIGQUIT allows Heroku users not to turn on task_reject_on_worker_lost, since the cold shutdown will immediately kill the worker and put unacknowledged tasks back on the queue. Otherwise, without REMAP_SIGTERM=SIGQUIT, one would enable task_reject_on_worker_lost and have the Heroku SIGTERM kill the worker to reach the same behavior. Does that make sense?

(btw, i appreciate your continued involvement in responding to my inquiries 😄 )

sayanarijit commented 2 years ago

I think there's some confusion...

As per docs we know:

So, we need both of them to work together.

But without the remap, task_reject_on_worker_lost doesn't get to do its thing. The application terminates before the task is re-queued.

The mapping allows holding the termination till task_reject_on_worker_lost does its thing and the task is re-queued.

Update:

I forgot to clarify: the unacknowledged tasks need to be re-queued for celery to be able to attempt the execution again when the workers come back, which is the workflow for idempotent tasks.
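The idempotent-task workflow mentioned here (re-queue, then safely re-execute on the next worker) relies on a second run being harmless. A minimal sketch, with hypothetical names (`processed` stands in for a durable idempotency record such as a database row):

```python
processed = set()   # stands in for durable storage shared by workers

def handle(task_id):
    # Idempotent: running the same task twice has the same effect as once.
    if task_id in processed:
        return "skipped"
    processed.add(task_id)
    return "done"

# First delivery runs the task; a re-queued duplicate is harmless.
assert handle("task-42") == "done"
assert handle("task-42") == "skipped"
```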

krmannix commented 2 years ago

Ah, I think I got it. Thanks for your help!

fjsj commented 1 year ago

I don't get this, @sayanarijit @auvipy. I think the old behavior described in "Actual Behavior" is the correct one: REMAP_SIGTERM=SIGQUIT should do a warm shutdown, not a cold one.

@krmannix was also confused because the old "Actual Behavior" was correct.

I think https://github.com/celery/celery/pull/6249 should be reverted. This message from another user in celery-users says the same.

auvipy commented 1 year ago

I don't get this, @sayanarijit @auvipy. I think the old behavior described in "Actual Behavior" is the correct one: REMAP_SIGTERM=SIGQUIT should do a warm shutdown, not a cold one.

@krmannix was also confused because the old "Actual Behavior" was correct.

I think #6249 should be reverted. This message from another user in celery-users says the same.

Did you read the full discussion? Also, that can't be automatically reverted.

fjsj commented 1 year ago

Yes, I read the full discussion. AFAIK, if the dev wants the "Expected Behavior" described by OP, they should simply drop REMAP_SIGTERM. The "Actual Behavior" was correct: it's the same thing that would happen outside of Heroku if one sent SIGTERM. So to avoid losing tasks, the dev should either rely on the warm shutdown completing cleanly, or use acks_late together with a visibility timeout.

sayanarijit commented 1 year ago

@fjsj could you please create a poc like https://github.com/sayanarijit/remap-sigterm to demonstrate how you'd implement idempotent tasks with warm shutdown?

fjsj commented 1 year ago

@sayanarijit Why did you change the behavior from warm shutdown to cold shutdown if REMAP_SIGTERM is present?

Because I think we're misaligned: you seem to think Cold means waiting for tasks / re-queuing, and Warm is the opposite. But the docs state otherwise:

| Signal | Description |
| ------ | ----------- |
| TERM   | Warm shutdown, wait for tasks to complete. |
| QUIT   | Cold shutdown, terminate ASAP. |
| USR1   | Dump traceback for all active threads. |
| USR2   | Remote debug, see celery.contrib.rdb. |
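The remap being debated simply decides which handler an incoming SIGTERM triggers. A simplified sketch of that idea (not Celery's actual implementation; handler names and the `install_handlers` helper are hypothetical):

```python
import os
import signal

events = []

def warm_shutdown(signum, frame):
    events.append("warm")   # would wait for running tasks to finish

def cold_shutdown(signum, frame):
    events.append("cold")   # would re-queue unacked tasks and exit ASAP

def install_handlers(environ):
    # The remap decision: treat SIGTERM as SIGQUIT when requested.
    if environ.get("REMAP_SIGTERM") == "SIGQUIT":
        signal.signal(signal.SIGTERM, cold_shutdown)
    else:
        signal.signal(signal.SIGTERM, warm_shutdown)
    signal.signal(signal.SIGQUIT, cold_shutdown)

install_handlers({"REMAP_SIGTERM": "SIGQUIT"})
os.kill(os.getpid(), signal.SIGTERM)   # what Heroku sends on dyno shutdown
assert events == ["cold"]              # SIGTERM took the cold path
```

Without the remap, the same SIGTERM would take the warm path and the worker would keep running tasks until Heroku's SIGKILL arrives about 30 seconds later.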

AFAIK, if you want the "Actual Behavior" you shouldn't have changed from warm to cold shutdown. Instead, you should have:

That should be enough and wouldn't require the change you made.

Also, when you mention:

we can assume that the task is lost if it isn't re-queued, even if it's never acknowledged.

That's wrong. The visibility_timeout in the Redis broker handles that. I think the docs changed and this part was lost, but you can learn about it by checking the v4 docs. In RabbitMQ it works similarly: it has an internal config to deal with that.

sayanarijit commented 1 year ago

Because I think we're misaligned because you seem to think Cold is waiting for tasks / re-queing, and Warm is the opposite.

No. Cold's immediate shutdown was what we needed, because the intention was to quit as soon as Heroku sends SIGTERM, before Heroku force-kills Celery along with all the tasks, without ever sending SIGQUIT.

That should be enough and wouldn't require the change you made.

The PoC linked in the issue demonstrates it wasn't enough, with or without the remap, and also that the implemented change was the fix.

Regarding the visibility timeout: it wasn't mentioned in any workflow for implementing idempotent tasks, so I'm not sure it will actually work in practice unless we have a PoC.

But I guess, since Heroku no longer provides a free tier, it wouldn't be easy to create the PoC, so honestly I'm not sure what the way forward is (paying Heroku to work for them is the opposite of how it should be).

edit: celery -> heroku

fjsj commented 1 year ago

Reg visibility timeout, it wasn't mentioned in any workflow for implementing idempotent tasks, so I'm not sure if it will actually work in practice, unless we have a poc.

Note that you would have to wait one hour for the task to be re-queued, because that's the default visibility timeout. It should work; if it doesn't, IMHO the fix is to solve the problem in the visibility timeout, not to force the cold-shutdown behavior.

That said, I do understand Heroku could need a specific behavior that would require a specific shutdown behavior due to the way it kills processes, but in theory that shouldn't be the case because the visibility timeout should re-queue tasks regardless of how the worker process is killed. Note the visibility timeout should even handle the case of the worker vanishing immediately, w/o ever having time to send any packet to the broker, because that's exactly what happens when there's a power loss.
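The redelivery mechanism being described here can be modeled with a toy Redis-style broker (hypothetical names; 3600 seconds matches both the default and the `visibility_timeout` in the report above):

```python
class VisibilityBroker:
    """Toy model: unacked messages become visible again after a timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.visible = ["task-1"]
        self.invisible = {}        # task -> time it was delivered

    def pop(self, now):
        task = self.visible.pop(0)
        self.invisible[task] = now
        return task

    def restore(self, now):
        # Redelivery needs no packet from the dead worker: the broker
        # just watches the clock, which is why it also covers SIGKILL
        # and even power loss.
        for task, delivered_at in list(self.invisible.items()):
            if now - delivered_at >= self.visibility_timeout:
                del self.invisible[task]
                self.visible.append(task)

broker = VisibilityBroker(visibility_timeout=3600)
broker.pop(now=0)                  # worker takes the task, then is SIGKILLed
broker.restore(now=10)             # too early: still invisible
assert broker.visible == []
broker.restore(now=3600)           # timeout elapsed: task is re-queued
assert broker.visible == ["task-1"]
```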

Unfortunately, I don't have time to write a PoC for this and I don't have the interest, because I don't use Heroku anymore.

auvipy commented 1 year ago

I don't care about Heroku anymore. So if you can come up with a manual revert PR, referencing all the reasoning, including the re-queue visibility timeout, that would be great for us to make the technical decision.

Nusnus commented 2 months ago

https://github.com/celery/celery/pull/9213 Adds documentation to REMAP_SIGTERM + smoke tests and also potentially addresses the visibility timeout problem during worker shutdown.

Nusnus commented 1 month ago

#9213 adds documentation to REMAP_SIGTERM + smoke tests and also potentially addresses the visibility timeout problem during worker shutdown.

Celery v5.5.0b3 released.