apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow on AKS with KEDA autoscaling marks a running job as a zombie and kills it while it is still running on Databricks. #35107

Open Raul824 opened 1 year ago

Raul824 commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow 2.6.1 running in Azure AKS. KEDA autoscaling 0-30 workers, worker concurrency 16. Backend DB: Postgres.

We are using Airflow to run jobs on Databricks via the submit-run API. Our jobs are being killed mid-run because they are being marked as zombies. Below is the cause I have come up with after looking at the details of a failed job; it could be inaccurate, since the reasoning is based only on observing the behavior of failed jobs.
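For context, here is a minimal sketch of the pattern described above, assuming the Databricks provider's DatabricksSubmitRunOperator (the actual DAG code is not included in this report; the DAG id, notebook path, and cluster settings are placeholders):

# Minimal sketch: the heavy work runs on Databricks, the Airflow worker only
# submits the run and then polls its status (every 30s by default).
import pendulum
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="databricks_submit_run_example",  # placeholder DAG id
    start_date=pendulum.datetime(2023, 10, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_notebook",
        databricks_conn_id="databricks_default",
        json={
            "new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 2},
            "notebook_task": {"notebook_path": "/Shared/example"},  # placeholder path
        },
        polling_period_seconds=30,  # worker-side status poll interval
    )

While such a task waits on Databricks, the Airflow worker process holding it must keep heartbeating for the whole duration of the Databricks run.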

Airflow dispatches a task to worker A, but Celery actually runs the same task on worker B. Airflow then expects heartbeats from worker A, misses them, marks the task as a zombie, and kills it.
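One mechanism that can produce exactly this picture, offered here as an assumption since the broker is not stated in the report, is the Celery broker visibility timeout: with a Redis broker, a task that runs longer than the visibility timeout is re-delivered and picked up by a second worker, while the scheduler still tracks the first. The Airflow docs recommend a visibility timeout larger than the longest-running task. A minimal sketch of raising it through a custom Celery config module (module name is hypothetical; assumes Airflow 2.6 with a Redis broker):

# custom_celery_config.py -- hypothetical module name
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    "broker_transport_options": {
        **DEFAULT_CELERY_CONFIG.get("broker_transport_options", {}),
        # Must exceed the longest expected task runtime so the broker does not
        # re-deliver a still-running task to another worker.
        "visibility_timeout": 86400,  # 24 hours; adjust to the longest Databricks run
    },
}

With this module on the PYTHONPATH, [celery] celery_config_options would point to custom_celery_config.CELERY_CONFIG; the same value can also be set directly via [celery_broker_transport_options] visibility_timeout in airflow.cfg. Whether this is the actual cause here would need confirming against the deployment's broker settings.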

Below is the log from the Airflow scheduler.

[2023-10-19T12:12:38.862+0000] {scheduler_job_runner.py:1683} WARNING - Failing (1) jobs without heartbeat after 2023-10-19 12:07:38.854638+00:00
[2023-10-19T12:12:38.862+0000] {scheduler_job_runner.py:1693} ERROR - Detected zombie job: {'full_filepath': '/opt/airflow/dags/UDPPRDAU_ODS_KEY_SCD_SCF_11.py', 'processor_subdir': '/opt/airflow/dags', 'msg': "{'DAG Id': 'UDPPRDAU_ODS_KEY_SCD_SCF_11', 'Task Id': 'SSOT_DDS_ASSG_PROD_SCD.SSOT_DDS_ASSG_PROD_SCD', 'Run Id': 'manual__2023-10-17T12:59:00+00:00', 'Hostname': 'optusairflow-worker-7458876cdf-glk6z', 'External Executor Id': 'c5746e3a-c8f8-4596-a31d-132413d5591c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f4c99bb3790>, 'is_failure_callback': True}
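For reference, the five-minute window in that warning (12:12:38 minus 12:07:38) matches the default scheduler zombie threshold. A rough sketch (not the scheduler's actual code) of the check that log line corresponds to, assuming the default [scheduler] scheduler_zombie_task_threshold of 300 seconds:

from datetime import timedelta
from airflow.configuration import conf
from airflow.utils import timezone

# Default is 300 seconds: running task instances whose local task job has not
# heartbeated within this window are reported as zombies and failed.
threshold = conf.getint("scheduler", "scheduler_zombie_task_threshold", fallback=300)
cutoff = timezone.utcnow() - timedelta(seconds=threshold)
# Conceptually: any running task instance whose job heartbeat is older than
# `cutoff` triggers the "Failing (N) jobs without heartbeat after <cutoff>" warning.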

Below is a snippet from Celery showing the same external executor id running on a different worker than the one mentioned in the Airflow scheduler log above.

[screenshot: Celery worker log]

The issue goes away if we pin the deployment at 10 always-on workers, but then workers keep running even when there are no jobs, which increases cost.

Related Issue #35056

What you think should happen instead

If Airflow assigns a task to a specific worker, Celery should run it on that same worker. If a worker is about to be shut down by autoscaling, Celery should mark it in some state so that Airflow cannot submit a task to a worker that is about to be removed.

How to reproduce

Set worker scaling through KEDA from 0 to 10 and run more than 40 jobs.
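One hypothetical way to generate that load is to trigger the runs through the stable REST API (the URL, DAG id, and credentials below are placeholders, and the basic-auth API backend is assumed to be enabled):

import requests

AIRFLOW_URL = "http://localhost:8080"     # placeholder webserver URL
DAG_ID = "databricks_submit_run_example"  # placeholder DAG id
AUTH = ("admin", "admin")                 # placeholder credentials

# POST /api/v1/dags/{dag_id}/dagRuns creates one DAG run per request.
for i in range(45):
    resp = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"dag_run_id": f"load_test_{i}"},
        auth=AUTH,
    )
    resp.raise_for_status()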

Operating System

Azure Kubernetes Services

Versions of Apache Airflow Providers

2.6.1

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

The failure rate is very high: roughly 10 out of every 30 job runs.

Are you willing to submit PR?

Code of Conduct

RNHTTR commented 1 year ago

Can you check to see if any of the workers in the cluster are running out of memory? Zombie tasks can be caused by a variety of things; one of the most common is running out of memory.

Raul824 commented 1 year ago

@RNHTTR The first thing I checked was memory, because every reference to this kind of issue points to memory as the cause. kubectl top nodes shows about 30% of memory in use and 70% free. We are not getting OOMKilled on any of the pods, as we have provided a fairly large cluster to Airflow. Our jobs push all the load to Databricks and only issue a GET request every 30 seconds to poll the run status.

Could you please help me understand why Airflow is trying to get the status of the task from one worker while Celery is running it on a different worker?

Logs and snippets are in the original post; please let me know if there are any more details I can add that would help.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.