apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow task is marked as zombie when the backend process on the remote server is still running. #13747

Closed bkumarac20 closed 3 years ago

bkumarac20 commented 3 years ago

Apache Airflow version: 1.10.9-composer

Kubernetes Version : Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.6002", GitCommit:"035184604aff4de66f7db7fddadb8e7be76b6717", GitTreeState:"clean", BuildDate:"2020-12-01T23:13:35Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

Environment: Airflow, running on top of Kubernetes - Linux version 4.19.112

What happened? A running task is marked as a zombie after the execution time crossed the latest heartbeat + 5 minutes. The task runs in the background on another application server, triggered using SSHOperator.

[2021-01-18 11:53:37,491] {taskinstance.py:888} INFO - Executing <Task(SSHOperator): load_trds_option_composite_file> on 2021-01-17T11:40:00+00:00
[2021-01-18 11:53:37,495] {base_task_runner.py:131} INFO - Running on host: airflow-worker-6f6fd78665-lm98m
[2021-01-18 11:53:37,495] {base_task_runner.py:132} INFO - Running: ['airflow', 'run', 'dsp_etrade_process_trds_option_composite_0530', 'load_trds_option_composite_file', '2021-01-17T11:40:00+00:00', '--job_id', '282759', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dsp_etrade_trds_option_composite_0530.py', '--cfg_path', '/tmp/tmpge4_nva0']
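For reference, a trimmed-down sketch of the DAG pattern in use for this task (the connection id, command, schedule and dates are placeholders, not the real values from our environment; imports follow the Airflow 1.10 contrib layout):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator  # Airflow 1.10-style import

default_args = {
    "owner": "airflow",
    "retries": 0,
}

with DAG(
    dag_id="dsp_etrade_process_trds_option_composite_0530",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),   # placeholder
    schedule_interval="40 11 * * *",   # placeholder
    catchup=False,
) as dag:
    # Kicks off a backend job on a remote application server; the job itself
    # keeps running there for roughly 2.5 hours.
    load_trds_option_composite_file = SSHOperator(
        task_id="load_trds_option_composite_file",
        ssh_conn_id="remote_app_server",        # placeholder connection id
        command="/path/to/run_backend_job.sh",  # placeholder command
    )
```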

Task execution time:
dag_id: dsp_etrade_process_trds_option_composite_0530
duration: 7270.47
start_date: 2021-01-18 11:53:37,491
end_date: 2021-01-18 13:54:47.799728+00:00

Scheduler logs during that time:

[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie
{
  textPayload: "[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie"
  insertId: "1ca8zyfg3zvma66"
  resource: { type: "cloud_composer_environment" labels: {3} }
  timestamp: "2021-01-18T13:54:54.432862699Z"
  severity: "ERROR"
  logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
  receiveTimestamp: "2021-01-18T13:54:55.714437665Z"
}

Airflow webserver log:

X.X.X.X - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0"
{
  textPayload: "172.17.0.5 - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0""
  insertId: "1sne0gqg43o95n3"
  resource: {2}
  timestamp: "2021-01-18T13:54:45.401670481Z"
  logName: "projects/asset-control-composer-prod/logs/airflow-webserver"
  receiveTimestamp: "2021-01-18T13:54:50.598807514Z"
}

Airflow info logs:

2021-01-18 08:54:47.799 EST
{
  textPayload: "NoneType: None"
  insertId: "1ne3hqgg47yzrpf"
  resource: {2}
  timestamp: "2021-01-18T13:54:47.799661030Z"
  severity: "INFO"
  logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
  receiveTimestamp: "2021-01-18T13:54:50.914461159Z"
}

[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447
{
  textPayload: "[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447"
  insertId: "1ne3hqgg47yzrpg"
  resource: {2}
  timestamp: "2021-01-18T13:54:47.800605248Z"
  severity: "INFO"
  logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
  receiveTimestamp: "2021-01-18T13:54:50.914461159Z"
}

The Airflow database shows the latest heartbeat as:

select state, latest_heartbeat from job where id=282759

state   | latest_heartbeat
running | 2021-01-18 13:48:41.891934

Airflow configurations:

[celery] worker_concurrency = 6
[scheduler] scheduler_health_check_threshold = 60, scheduler_zombie_task_threshold = 300, max_threads = 2
[core] dag_concurrency = 6

Kubernetes cluster: 6 worker nodes
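For completeness, a quick way to double-check the values that are actually in effect (assuming the snippet runs inside the same Airflow environment):

```python
# Reads back the settings listed above from the live Airflow configuration.
from airflow.configuration import conf

print(conf.getint("scheduler", "scheduler_zombie_task_threshold"))   # expected: 300
print(conf.getint("scheduler", "scheduler_health_check_threshold"))  # expected: 60
print(conf.getint("celery", "worker_concurrency"))                   # expected: 6
print(conf.getint("core", "dag_concurrency"))                        # expected: 6
```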

What we expected to happen? The backend process takes around 2 hours 30 minutes to finish. During such long-running jobs the task is detected as a zombie, even though the worker node is still processing the task and the state of the job is still marked as 'running'. The state of the task is not known during the run time.

How to reproduce it:

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

bkumarac20 commented 3 years ago

Hello Team,

Debugging further, I have noticed a pattern where the worker is not sending a heartbeat to the database. When the time difference between the scheduler's current time and the latest heartbeat is more than 5 minutes + X seconds, the task is detected as a zombie.
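Roughly, the behaviour we observe matches this check (an illustrative sketch only, not the actual scheduler source; the example values are taken from the job table and scheduler log above):

```python
from datetime import datetime, timedelta

ZOMBIE_THRESHOLD = timedelta(seconds=300)  # scheduler_zombie_task_threshold

def looks_like_zombie(latest_heartbeat: datetime, scheduler_now: datetime) -> bool:
    """True once the job's heartbeat is staler than the configured threshold."""
    return scheduler_now - latest_heartbeat > ZOMBIE_THRESHOLD

latest_heartbeat = datetime(2021, 1, 18, 13, 48, 41)  # from the job table
scheduler_now = datetime(2021, 1, 18, 13, 54, 47)     # when the task was marked FAILED
print(looks_like_zombie(latest_heartbeat, scheduler_now))  # True -> detected as zombie
```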

dsp_etrade_process_trds_forex_0530.txt dsp_etrade_process_trds_future.txt dsp_etrade_process_trds_mutual_fund.txt

eladkal commented 3 years ago

This issue is reported against an old version of Airflow which is end-of-life. If the issue still happens in the latest Airflow version, please let us know and provide updated details.

MuslimBeibytuly commented 1 year ago

@eladkal I am having the same issue:

Detected zombie job: {'full_filepath': '/src/dags/project/dag_process_example.py', 'processor_subdir': '/src/dags', 'msg': "{'DAG Id': 'process_example', 'Task Id': 'extract', 'Run Id': 'scheduled__2023-06-04T20:20:00+00:00', 'Hostname': 'consumer-847b7cc86f-txmct', 'External Executor Id': 'c6c92d3f-3b1e-4e5c-9d55-4686f84f9c4b'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7ff76f2444f0>, 'is_failure_callback': True}

with python:3.10.10-slim and AIRFLOW_VERSION="2.6.1"

zhaow-de commented 1 year ago

We had precisely the same symptoms as @MuslimBeibytuly mentioned, with the same Python interpreter and Airflow version.

In our case, the task actually succeeded; there were no errors in the worker's log. What we also observed is that the job has a dag_run_id of scheduled__2023-06-25T01:00:00+00:00, while the "Detected zombie job" message was generated at 2023-06-26T03:05:23.519.

potiuk commented 1 year ago

@MuslimBeibytuly @zhaow-de, can you please open a new issue about it and put all the details/logs/circumstances there? We try to keep order in our issues; this one was reported for an old version which has not been active for years, and probably less than 10% of the code from that time remains. There is no way we can link the two.

Opening a new issue (and providing all the necessary details: logs, extra information, frequency of occurrence, whether and how you can reproduce it, etc.) is the approach with the best chance that someone will be able to help you solve your problem, or diagnose it and assess it as an Airflow issue.

asafrobinson commented 9 months ago

Is there a solution here?

ankit48365 commented 6 months ago

Is there a solution for this?

valgayres commented 6 months ago

Bumping this as well. Is there a solution for this?

Thanks

nghilethanh09052000 commented 2 months ago

Is there a solution for this? I got:

[2024-08-29, 06:38:48 UTC] {task_context_logger.py:104} ERROR - Detected zombie job: {'full_filepath': '/opt/airflow/dags/linkedin_crawler/index.py', 'processor_subdir': '/opt/airflow/dags', 'msg': "{'DAG Id': 'linkedin_crawler', 'Task Id': 'scrape_company_data', 'Run Id': 'manual__2024-08-29T01:42:18.086424+00:00', 'Map Index': 1, 'Hostname': 'ubuntu-8gb-nbg1-1', 'External Executor Id': '500bc5f0-3121-443b-b90e-14174d8b3940'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7ed6adbf2e40>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)