Closed: bkumarac20 closed this issue 3 years ago.
Thanks for opening your first issue here! Be sure to follow the issue template!
Hello Team,
Debugging further, I have noticed a pattern where the worker is not sending heartbeats to the DB. When the difference between the scheduler's current time and the latest heartbeat is more than 5 minutes + X seconds, the task is detected as a zombie.
Attachments: dsp_etrade_process_trds_forex_0530.txt, dsp_etrade_process_trds_future.txt, dsp_etrade_process_trds_mutual_fund.txt
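To make the pattern above concrete, here is a minimal sketch (my own paraphrase, not Airflow's source) of the detection rule: the scheduler flags a task as a zombie once the gap between its clock and the job's latest heartbeat exceeds scheduler_zombie_task_threshold (300 seconds by default).

```python
from datetime import datetime, timedelta

# Sketch only -- not Airflow's actual implementation. The window mirrors the
# scheduler_zombie_task_threshold setting (300 s, i.e. 5 minutes, by default).
ZOMBIE_THRESHOLD = timedelta(seconds=300)

def looks_like_zombie(latest_heartbeat: datetime, now: datetime) -> bool:
    """True once the job's latest heartbeat is older than the threshold."""
    return (now - latest_heartbeat) > ZOMBIE_THRESHOLD

# Using the values reported further down in this thread:
# latest heartbeat at 13:48:41, zombie message at 13:54:54 -> lag ~6m13s > 5m.
print(looks_like_zombie(datetime(2021, 1, 18, 13, 48, 41),
                        datetime(2021, 1, 18, 13, 54, 54)))  # True
```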
This issue was reported against an old version of Airflow which is end-of-life. If the issue still happens in the latest Airflow version, please let us know and provide updated details.
@eladkal
We are seeing the following with python:3.10.10-slim and AIRFLOW_VERSION="2.6.1":
Detected zombie job: {'full_filepath': '/src/dags/project/dag_process_example.py', 'processor_subdir': '/src/dags', 'msg': "{'DAG Id': 'process_example', 'Task Id': 'extract', 'Run Id': 'scheduled__2023-06-04T20:20:00+00:00', 'Hostname': 'consumer-847b7cc86f-txmct', 'External Executor Id': 'c6c92d3f-3b1e-4e5c-9d55-4686f84f9c4b'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7ff76f2444f0>, 'is_failure_callback': True}
We had precisely the same symptoms as @MuslimBeibytuly mentioned, with the same Python interpreter and Airflow version.
In our case, the task actually succeeded; there were no errors in the worker's log. What we observed in addition is that the job has a dag_run_id of scheduled__2023-06-25T01:00:00+00:00, while the "Detected zombie job" message was generated at 2023-06-26T03:05:23.519.
@MuslimBeibytuly @zhaow-de - can you please open a new issue about it and put all the details/logs/circumstances there? We try to keep order in our issues; this one was reported against an old version which has not been active for years, and probably less than 10% of the code from that time is left. There is no way we can link the two.
Opening a new issue (and providing all necessary details: logs, extra information, frequency of occurrence, whether and how you can reproduce it, etc.) is the approach with the best chance that someone will be able to help you solve your problem, or diagnose it and assess it as an Airflow issue.
Is there a solution here?
Is there a solution for this?
Bumping this as well - is there a solution for this?
Thanks
Is there a solution for this? I got:
[2024-08-29, 06:38:48 UTC] {task_context_logger.py:104} ERROR - Detected zombie job: {'full_filepath': '/opt/airflow/dags/linkedin_crawler/index.py', 'processor_subdir': '/opt/airflow/dags', 'msg': "{'DAG Id': 'linkedin_crawler', 'Task Id': 'scrape_company_data', 'Run Id': 'manual__2024-08-29T01:42:18.086424+00:00', 'Map Index': 1, 'Hostname': 'ubuntu-8gb-nbg1-1', 'External Executor Id': '500bc5f0-3121-443b-b90e-14174d8b3940'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7ed6adbf2e40>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)
Apache Airflow version: 1.10.9-composer
Kubernetes Version : Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.6002", GitCommit:"035184604aff4de66f7db7fddadb8e7be76b6717", GitTreeState:"clean", BuildDate:"2020-12-01T23:13:35Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
Environment: Airflow, running on top of Kubernetes - Linux version 4.19.112
What happened? A running task is marked as a zombie after the execution time crosses the latest heartbeat + 5 minutes. The task runs in the background on another application server, triggered using SSHOperator.
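For context, a minimal, hypothetical sketch of the kind of DAG involved; the dag_id and task_id match the logs below, while the schedule, SSH connection id and command are placeholders I made up:

```python
# Hypothetical reconstruction of the setup described above. Only dag_id and
# task_id come from the logs; schedule, ssh_conn_id and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator  # Airflow 1.10.x import path

with DAG(
    dag_id="dsp_etrade_process_trds_option_composite_0530",
    start_date=datetime(2021, 1, 1),
    schedule_interval="40 11 * * *",  # assumed; execution_date in the logs is ...T11:40:00
    catchup=False,
) as dag:
    load_trds_option_composite_file = SSHOperator(
        task_id="load_trds_option_composite_file",
        ssh_conn_id="app_server_ssh",                 # placeholder connection
        command="/opt/app/load_option_composite.sh",  # placeholder; runs ~2.5 h on the remote host
    )
```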
[2021-01-18 11:53:37,491] {taskinstance.py:888} INFO - Executing <Task(SSHOperator): load_trds_option_composite_file> on 2021-01-17T11:40:00+00:00
[2021-01-18 11:53:37,495] {base_task_runner.py:131} INFO - Running on host: airflow-worker-6f6fd78665-lm98m
[2021-01-18 11:53:37,495] {base_task_runner.py:132} INFO - Running: ['airflow', 'run', 'dsp_etrade_process_trds_option_composite_0530', 'load_trds_option_composite_file', '2021-01-17T11:40:00+00:00', '--job_id', '282759', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dsp_etrade_trds_option_composite_0530.py', '--cfg_path', '/tmp/tmpge4_nva0']
Task execution time:

| dag_id | duration (s) | start_date | end_date |
| --- | --- | --- | --- |
| dsp_etrade_process_trds_option_composite_0530 | 7270.47 | 2021-01-18 11:53:37,491 | 2021-01-18 13:54:47.799728+00:00 |
Scheduler Logs during that time:
[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie
{ textPayload: "[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie" insertId: "1ca8zyfg3zvma66" resource: { type: "cloud_composer_environment" labels: {3} } timestamp: "2021-01-18T13:54:54.432862699Z" severity: "ERROR" logName: "projects/asset-control-composer-prod/logs/airflow-scheduler" receiveTimestamp: "2021-01-18T13:54:55.714437665Z" }
Airflow webserver log:
X.X.X.X - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0"
{ textPayload: "172.17.0.5 - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0" " insertId: "1sne0gqg43o95n3" resource: {2} timestamp: "2021-01-18T13:54:45.401670481Z" logName: "projects/asset-control-composer-prod/logs/airflow-webserver" receiveTimestamp: "2021-01-18T13:54:50.598807514Z" }
Airflow INFO logs:
`2021-01-18 08:54:47.799 EST { textPayload: "NoneType: None " insertId: "1ne3hqgg47yzrpf" resource: {2} timestamp: "2021-01-18T13:54:47.799661030Z" severity: "INFO" logName: "projects/asset-control-composer-prod/logs/airflow-scheduler" receiveTimestamp: "2021-01-18T13:54:50.914461159Z" }`
`[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447
{ textPayload: "[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447" insertId: "1ne3hqgg47yzrpg" resource: {2} timestamp: "2021-01-18T13:54:47.800605248Z" severity: "INFO" logName: "projects/asset-control-composer-prod/logs/airflow-scheduler" receiveTimestamp: "2021-01-18T13:54:50.914461159Z" }`
The Airflow database shows the latest heartbeat as: select state, latest_heartbeat from job where id=282759

| state | latest_heartbeat |
| --- | --- |
| running | 2021-01-18 13:48:41.891934 |
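For anyone wanting to measure the lag themselves, a small sketch that runs the same query and computes the heartbeat age in Python (the connection string is a placeholder; the job table and columns are the ones queried above, and heartbeats are stored as UTC timestamps):

```python
# Sketch: mirror the SQL query above and compute how stale the heartbeat is.
# The DSN below is a placeholder for the Airflow metadata database.
from datetime import datetime

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@metadata-db/airflow")

with engine.connect() as conn:
    state, latest_heartbeat = conn.execute(
        text("SELECT state, latest_heartbeat FROM job WHERE id = :job_id"),
        {"job_id": 282759},
    ).fetchone()

lag = datetime.utcnow() - latest_heartbeat  # heartbeats are stored as naive UTC timestamps
print(f"state={state}, heartbeat age = {lag.total_seconds():.0f}s (zombie threshold is 300s)")
```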
Airflow configurations:
[celery] worker_concurrency = 6
[scheduler] scheduler_health_check_threshold = 60, scheduler_zombie_task_threshold = 300, max_threads = 2
[core] dag_concurrency = 6
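As a sanity check, the values the scheduler actually uses can be read back through Airflow's configuration API (a small sketch; the keys are the ones listed above):

```python
# Sketch: print the effective settings relevant to zombie detection.
from airflow.configuration import conf

print(conf.getint("celery", "worker_concurrency"))                   # expected: 6
print(conf.getint("scheduler", "scheduler_health_check_threshold"))  # expected: 60
print(conf.getint("scheduler", "scheduler_zombie_task_threshold"))   # expected: 300 (the 5-minute window)
print(conf.getint("core", "dag_concurrency"))                        # expected: 6
```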
Kubernetes cluster: 6 worker nodes
What did we expect to happen? The backend process takes around 2 hours 30 minutes to finish. During such long-running jobs the task is detected as a zombie even though the worker node is still processing it; the job is still marked as 'running' in the database, and the state of the task is not known during the runtime.
How to reproduce it: