andyneff opened this issue 2 years ago
Diving into the problem further, it is very easy to perturb the problem with print statements: adding or removing them can hide or expose the hang, so timing clearly affects the issue.
I've also replaced `@shared_task` with the following, and the problem persists, so I'm leaving `shared_task` in place for ease of debugging:

```python
app = Celery('tasks')

@app.task
```
Roughly speaking, celery creates a "task" out of your function, and then wraps that in a `Proxy` class:

```python
# bar = shared_task(foo)
app = celery_state.get_current_app()
foo_task = app._task_from_fun(foo)

def thing():
    return foo_task

bar = Proxy(thing)
```
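The lazy-resolution behavior described above can be sketched with a minimal stand-in class (a simplified illustration of the idea, not celery's actual `celery.local.Proxy` implementation):

```python
class LazyProxy:
    """Minimal sketch of a lazy proxy: every use re-resolves the
    wrapped object through a getter callable (simplified; not
    celery's real Proxy class)."""

    def __init__(self, getter):
        object.__setattr__(self, "_getter", getter)

    def _get_current_object(self):
        # Resolve the real object at use time, not at import time.
        return object.__getattribute__(self, "_getter")()

    def __getattr__(self, name):
        # Only fires for attributes not found on the proxy itself;
        # forwards them to the resolved object.
        return getattr(self._get_current_object(), name)

    def __call__(self, *args, **kwargs):
        return self._get_current_object()(*args, **kwargs)


def foo():
    return "ran foo"


# The module-level name is a proxy, not the function itself; calling
# it resolves `foo` each time.
bar = LazyProxy(lambda: foo)
result = bar()
```

The point is that the module-level object celery hands you is not your function: it is an indirection layer that resolves the real task on every use, which is exactly the kind of object that behaves differently from a plain function when handed to worker processes.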
Starting in Python 3.9.0, using a `ProcessPoolExecutor` has a good chance of deadlocking on a terra task. It almost always happens with 10 workers, and is practically guaranteed with 16 workers.
I've managed to put together a piece of code to reproduce the error:
As you can see here, the bug is actually not part of terra, but can be reproduced using just the celery task object. Something about the difference between a celery "task" and a plain function is causing workers to hang before they ever process a single job.
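One way to see the "task" vs "function" distinction is serialization: process-pool workers receive work via pickling, and a plain module-level function pickles as a small by-name reference, while a callable instance (which is closer to what a celery task object is) pickles with its state. A minimal illustration using a hypothetical `TaskLike` stand-in (not celery's real `Task` class):

```python
import pickle


def plain(x):
    """A plain module-level function: pickles by qualified name."""
    return x * 2


class TaskLike:
    """Hypothetical stand-in for a task object: a callable instance
    wrapping a function (not celery's actual Task class)."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)


task = TaskLike(plain)

# Both survive a pickle round-trip here, but they serialize differently:
# the function as a name lookup, the instance as reconstructed state.
fn_copy = pickle.loads(pickle.dumps(plain))
task_copy = pickle.loads(pickle.dumps(task))
```

This does not reproduce the hang by itself, but it shows why swapping a function for a task-like object changes what actually crosses the process boundary when a pool spins up its workers.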