apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

ECS EcsTaskStateSensor and EcsRunTaskOperator marked as Zombie and retried as they don't seem to be updating their heartbeat #41487

Open prateeklohchubh opened 4 weeks ago

prateeklohchubh commented 4 weeks ago

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.24.0
Airflow==2.9.2

Apache Airflow version

2.9.2

Operating System

Debian GNU/Linux 12 (bookworm)

Deployment

Other Docker-based deployment

Deployment details

Airflow deployed on AWS ECS as docker containers with tasks run as ECS Tasks using EcsRunTaskOperator

What happened

Long-running tasks (say, longer than 1 hour) run via EcsRunTaskOperator execute correctly, but after 60 minutes or so the Airflow scheduler treats them as zombies and retries them, creating duplicate tasks.

If we set wait_for_completion to False and use a sensor such as EcsTaskStateSensor to monitor the state of the ECS task, the sensor also gets marked as a zombie and retried.
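For context, a minimal sketch of the DAG shape involved, with hypothetical cluster/task-definition names and XCom wiring (our real DAG is larger but structured the same way):

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.hooks.ecs import EcsTaskStates
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator
from airflow.providers.amazon.aws.sensors.ecs import EcsTaskStateSensor

with DAG("test_ecs_task_dag", start_date=datetime(2024, 8, 1), schedule=None, catchup=False) as dag:
    # Starts the long-running ECS task; cluster and task definition names are placeholders
    run_task = EcsRunTaskOperator(
        task_id="run_ecs_task",
        cluster="my-cluster",
        task_definition="my-task-def",
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        wait_for_completion=False,  # hand the waiting off to the sensor below
    )

    # Polls until the ECS task reaches STOPPED; this is the "await_ecs_task_run"
    # task that the scheduler reports as a zombie in the log below
    await_task = EcsTaskStateSensor(
        task_id="await_ecs_task_run",
        cluster="my-cluster",
        # assumes the task ARN is available via XCom from the run step
        task="{{ ti.xcom_pull(task_ids='run_ecs_task', key='ecs_task_arn') }}",
        target_state=EcsTaskStates.STOPPED,
    )

    run_task >> await_task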

Relevant log lines from the task logs:

[2024-08-14, 11:04:39 PDT] {ecs.py:176} INFO - Task state: RUNNING, waiting for: STOPPED
[2024-08-14, 11:05:17 PDT] {scheduler_job_runner.py:1737} ERROR - Detected zombie job: {'full_filepath': '/opt/airflow/dags/test_ecs_task_dag.py', 'processor_subdir': '/opt/airflow/dags', 'msg': "{'DAG Id': 'test_ecs_task_dag', 'Task Id': 'await_ecs_task_run', 'Run Id': 'manual__2024-08-14T10:49:23-07:00', 'Hostname': 'somehost.us-east-2.compute.internal', 'External Executor Id': '20d42f77-bd77-46ec-bf16-cd4e5faae379'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f3c22cd3dd0>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)
[2024-08-14, 11:05:39 PDT] {ecs.py:176} INFO - Task state: RUNNING, waiting for: STOPPED

From the Celery executor's taskmeta:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 136, in execute_command
    _execute_in_fork(command_to_exec, celery_task_id)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 151, in _execute_in_fork
    raise AirflowException(msg)
airflow.exceptions.AirflowException: Celery command failed on host: ip-10-1-27-248.us-east-2.compute.internal with celery_task_id fff791e8-62c0-4885-9d3e-e8ec46daae34 (PID: 1239, Return Code: 9)

Relevant Airflow config for zombie detection: scheduler_zombie_task_threshold = 300
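For reference, this setting lives under the [scheduler] section; raising it is only a stopgap sketch, since it delays zombie detection rather than fixing the missing heartbeats:

[scheduler]
# default is 300 seconds
scheduler_zombie_task_threshold = 7200

# or, for the Docker-based deployment, as an environment variable on the scheduler container:
# AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD=7200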

The scheduler container's resources are not an issue, and the same goes for the task run via EcsRunTaskOperator; both have healthy CPU and memory available to them.

The container run as part of the ECS task executes a shell script as its init process, which in turn runs long-running work.

What you think should happen instead

The operator should either publish heartbeats at a faster cadence or allow the heartbeat cadence to be configured. If that is not the root cause, can we have a better retry mechanism for this case? EcsRunTaskOperator has a reattach mechanism, but it generates a new startedBy value on each retry. Is it possible to let users set startedBy so that retries can attach to tasks that are already running from a previous try?
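For context, the reattach flag referenced above is set like this today (a sketch, reusing the hypothetical names from the earlier snippet); the request is for a user-settable startedBy on top of it, so a retry can reliably find the task started by the previous try:

run_task = EcsRunTaskOperator(
    task_id="run_ecs_task",
    cluster="my-cluster",
    task_definition="my-task-def",
    overrides={"containerOverrides": []},
    # try to attach to a task previously launched by this task instance
    # instead of starting a new one
    reattach=True,
)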

How to reproduce

Any task that takes more than 60 minutes ends up getting marked as a zombie, and sometimes even earlier than that. Any ECS task that takes more than 15 minutes will also fail.

We can see this in the Airflow DB under the jobs table: the latest_heartbeat for the EcsRunTaskOperator task never updates.
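A quick sketch of how to check this, assuming direct access to the metadata database through the Airflow ORM (the jobs for running task instances are LocalTaskJob rows):

from airflow.jobs.job import Job
from airflow.utils.session import create_session

with create_session() as session:
    running = session.query(Job).filter(Job.job_type == "LocalTaskJob", Job.state == "running")
    for job in running:
        # latest_heartbeat stays frozen for the EcsRunTaskOperator / sensor tasks
        print(job.id, job.hostname, job.latest_heartbeat)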

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 4 weeks ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.