Apache Airflow version
2.2.3 (latest released)
What happened
Tasks are stuck in the queued state and will not be scheduled for execution. In the logs, these tasks have a message of `could not queue task <task details>`, as they are currently in the `queued` or `running` sets in the executor.

What you expected to happen
Tasks run :)
How to reproduce
We have a Dockerized Airflow setup using Celery with a RabbitMQ broker and Postgres as the result DB. When rolling out DAG updates, we redeploy most of the components (the workers, scheduler, webserver, and RabbitMQ). We can have a few thousand DAG runs at a given time. This error seems to happen when a deployment coincides with a load spike.
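For context, the Celery connection settings implied by this setup look roughly like the following (a sketch only; the hostnames and credentials are hypothetical placeholders, not our real values):

```python
# Hypothetical Celery settings for the deployment described above:
# RabbitMQ as the message broker, Postgres as the result backend.
broker_url = "amqp://airflow:airflow@rabbit:5672//"
result_backend = "db+postgresql://airflow:airflow@postgres:5432/airflow"
```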
Looking at the code, this is what I believe is happening:
Starting from the initial debug message of `could not queue task`, I found tasks are marked as running (but in the UI they still appear as queued): https://github.com/apache/airflow/blob/main/airflow/executors/base_executor.py#L85

Tracking through our logs, I see these tasks are recovered by the adoption code, and the state there is STARTED (https://github.com/apache/airflow/blob/main/airflow/executors/celery_executor.py#L540).
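Why a task that Celery reports as STARTED stays queued on the Airflow side can be sketched as a small pure-Python simulation (the function name and state handling here are an illustrative simplification, not the executor's real API):

```python
# Celery-side state constants (string values mirroring celery.states)
PENDING, STARTED, SUCCESS, FAILURE = "PENDING", "STARTED", "SUCCESS", "FAILURE"

def sync_airflow_state(airflow_state, celery_state):
    """Illustrative sketch: only terminal Celery states feed back into
    the Airflow task state, so a non-terminal state like STARTED leaves
    the Airflow-side "queued" record untouched."""
    if celery_state == SUCCESS:
        return "success"
    if celery_state == FAILURE:
        return "failed"
    return airflow_state  # STARTED / PENDING: no Airflow-side update

# An adopted task: Celery says STARTED, Airflow still says queued.
print(sync_airflow_state("queued", STARTED))  # prints "queued"
```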
Following the state update code, I see this does not cause any state updates to occur in Airflow (https://github.com/apache/airflow/blob/main/airflow/executors/celery_executor.py#L465). Thus, if a task is marked as STARTED in the results DB but queued in the Airflow task state, it will never be moved out of queued by the scheduler. However, you can get these tasks to finally run by clicking the Run button.

Operating System
Ubuntu 20.04.3 LTS
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
Anything else
I believe this could be addressed by asking Celery whether a task is actually running in `try_adopt_task_instances`. There are obvious problems like timeouts, but `celery inspect active|reserved` can return JSON output of running and reserved tasks, which would let us verify that a STARTED task is actually running.

Are you willing to submit PR?
Code of Conduct