Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.24.0
Airflow==2.9.2
Apache Airflow version
2.9.2
Operating System
Debian GNU/Linux 12 (bookworm)
Deployment
Other Docker-based deployment
Deployment details
Airflow is deployed on AWS ECS as Docker containers, with tasks run as ECS tasks using EcsRunTaskOperator.
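For reference, the setup looks roughly like this minimal sketch (cluster, task definition, and subnet values are placeholders, not our real configuration):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(dag_id="long_running_ecs", start_date=datetime(2024, 1, 1), schedule=None):
    # Runs a Fargate task whose container init is a shell script that
    # performs work for well over an hour.
    long_job = EcsRunTaskOperator(
        task_id="long_job",
        cluster="my-cluster",             # placeholder
        task_definition="my-task-def",    # placeholder
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],  # placeholder
                "assignPublicIp": "DISABLED",
            }
        },
    )
```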
What happened
Long-running tasks (say, greater than 1 hour) run via EcsRunTaskOperator execute correctly, but after 60 minutes or so the Airflow scheduler treats them as zombies and retries them, creating duplicate ECS tasks. If we set wait_for_completion to False and use a sensor like EcsTaskStateSensor to monitor the state of the ECS task, the sensor also gets marked as a zombie and retried (a sketch of this pattern follows the log lines below).

Relevant log lines, from the Celery executor's taskmeta:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 136, in execute_command
    _execute_in_fork(command_to_exec, celery_task_id)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 151, in _execute_in_fork
    raise AirflowException(msg)
airflow.exceptions.AirflowException: Celery command failed on host: ip-10-1-27-248.us-east-2.compute.internal with celery_task_id fff791e8-62c0-4885-9d3e-e8ec46daae34 (PID: 1239, Return Code: 9)
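The sensor variant we tried looks roughly like this (a sketch; it assumes the operator pushes the started task's ARN to XCom under the key ecs_task_arn, and the sensor's default failure_states are overridden explicitly; both are worth verifying for your provider version):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.hooks.ecs import EcsTaskStates
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator
from airflow.providers.amazon.aws.sensors.ecs import EcsTaskStateSensor

with DAG(dag_id="long_running_ecs_sensor", start_date=datetime(2024, 1, 1), schedule=None):
    # Start the ECS task and return immediately instead of polling it.
    start_job = EcsRunTaskOperator(
        task_id="start_job",
        cluster="my-cluster",             # placeholder
        task_definition="my-task-def",    # placeholder
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        wait_for_completion=False,        # do not block until the task stops
    )

    # Poll the ECS task state from a separate, lightweight Airflow task.
    wait_for_job = EcsTaskStateSensor(
        task_id="wait_for_job",
        cluster="my-cluster",             # placeholder
        # ARN of the started task, pulled from XCom
        # (key assumed to be "ecs_task_arn").
        task="{{ ti.xcom_pull(task_ids='start_job', key='ecs_task_arn') }}",
        target_state=EcsTaskStates.STOPPED,
        failure_states={EcsTaskStates.NONE},
        poke_interval=60,
    )

    start_job >> wait_for_job
```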
Relevant Airflow config for zombie detection (airflow.cfg, [scheduler] section):
scheduler_zombie_task_threshold = 300
Scheduler container resources are not an issue, nor are those of the task run using EcsRunTaskOperator; both have healthy CPU and memory available. The container run as part of the ECS task uses a shell script as its init process, which in turn runs the long-running work.
What you think should happen instead
The operator should either publish heartbeats at a faster cadence or allow the heartbeat cadence to be configured. If that is not the root cause, can we have a better retry mechanism for this case? EcsRunTaskOperator has a reattach mechanism, but it generates a new startedBy value on each retry. Would it be possible to allow users to set startedBy, so that retries can attach to tasks already started by a previous task run?
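Something along these lines, where started_by is a hypothetical parameter that does not exist in the current API (reattach does exist):

```python
run_job = EcsRunTaskOperator(
    task_id="run_job",
    cluster="my-cluster",             # placeholder
    task_definition="my-task-def",    # placeholder
    launch_type="FARGATE",
    overrides={"containerOverrides": []},
    reattach=True,
    # Hypothetical: a stable, user-supplied startedBy value, so a retry can
    # find and attach to the ECS task launched by the previous try instead
    # of starting a duplicate one.
    started_by="{{ dag.dag_id }}-{{ ti.task_id }}-{{ run_id }}",
)
```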
How to reproduce
Any task run via EcsRunTaskOperator that takes more than 60 minutes ends up getting marked as a zombie, sometimes even earlier than 60 minutes. Any ECS task that takes more than 15 minutes will also fail.

We can see this in the Airflow metadata database, in the job table: latest_heartbeat for the job backing the EcsRunTaskOperator task instance never updates.
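For example, a minimal check against the metadata DB (a sketch, assuming Airflow 2.9 module paths and a configured metadata connection, run from a container with Airflow installed):

```python
from airflow.jobs.job import Job
from airflow.utils.session import create_session

# Print the most recent jobs and their heartbeats; the job backing the
# EcsRunTaskOperator task instance stops heartbeating while the long
# ECS task is still running.
with create_session() as session:
    jobs = (
        session.query(Job)
        .order_by(Job.latest_heartbeat.desc())
        .limit(20)
    )
    for job in jobs:
        print(job.id, job.job_type, job.state, job.latest_heartbeat)
```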
Anything else
No response