apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

High load issues in Airflow DAGs when running on EKS cluster #42784

Open sairan-01 opened 1 week ago

sairan-01 commented 1 week ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.2

What happened?

Description:

I am encountering significant load spikes when running DAGs in an Airflow setup on an EKS cluster with the following configuration:

EKS Node Type: c5.9xlarge
OS (Architecture): linux (amd64)
OS Image: Amazon Linux 2
Kernel Version: 5.10.225-213.878.amzn2.x86_64
Container Runtime: containerd://1.7.11
Kubelet Version: v1.28.13-eks-a737599

The DAGs are responsible for ingesting data from Kafka to S3 and are triggered both periodically and by orchestration DAGs. During their execution, I observe a sharp increase in resource usage, particularly CPU and memory, on the worker nodes.

The configuration for some DAGs includes relatively high max_active_runs, which may contribute to the overload. I have attached a simplified version of the DAGs below.
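For reference, the concurrency knobs involved look roughly like this in a plain DAG definition. This is only a minimal sketch: the dag_id, schedule, and the sleeping callable are placeholders standing in for the real ingestion DAG, and the values shown are illustrative lower limits, not recommendations from the Airflow docs.

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callable standing in for the Kafka-to-S3 ingestion logic.
def ingest_batch():
    time.sleep(1)

with DAG(
    dag_id="kafka_ingest_throttled",   # hypothetical dag_id
    start_date=datetime(2024, 9, 25),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,                 # only one DAG run at a time instead of 5-10
    max_active_tasks=4,                # cap concurrent task instances within a run
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_batch)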

Steps to Reproduce:

1. Deploy an Airflow instance in an EKS cluster with at least two c5.9xlarge worker nodes.
2. Run the provided DAGs that trigger data ingestion from Kafka to S3.
3. Monitor resource usage on the worker nodes.

Expected Behavior:

The load on the worker nodes should remain stable and manageable, even during the execution of multiple DAGs.

Actual Behavior:

The load on the worker nodes increases significantly, causing performance degradation across the cluster.

Environment:

Kubernetes version: v1.28.13-eks-a737599
Airflow version: 2.5.0
Worker nodes: c5.9xlarge
Container runtime: containerd://1.7.11

What you think should happen instead?

No response

How to reproduce

1. Deploy an Airflow instance in an EKS cluster with at least two c5.9xlarge worker nodes running Amazon Linux 2.
2. Configure Airflow with the DAGs that handle ingestion of data from Kafka to S3.
3. Use a high number of max_active_runs (e.g., 10) for DAGs that trigger orchestration or batch tasks.
4. Trigger the orchestration DAG, which will start the data ingestion processes from Kafka to S3.
5. Monitor CPU and memory usage on the worker nodes during DAG execution.
6. Observe the load spike during execution, particularly when multiple DAGs run in parallel.

Expected behavior: Stable resource usage on the worker nodes without sharp increases in load.

Actual behavior: Significant increase in CPU and memory usage, causing performance degradation on the worker nodes.

Operating System

Ubuntu 22

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

import logging
from datetime import datetime

from CommonDag.L1RawCommonDag import DagRawKafkaS3Msg
from CommonUtil import CommonUtilClass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

dag_config = CommonUtilClass.DagConfig(
    dag_id="example_dag",
    default_args=CommonUtilClass.DagDefaultArgs(
        owner="airflow_user",
        start_date=datetime(2024, 9, 25),
    ),
    max_active_runs=5,
    tags=["L0", "kafka", "cdc"],
)

dag_object = DagRawKafkaS3Msg(dag_config=dag_config, log=logger)
dag = dag_object.prepare_dag()
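Since several such DAGs run in parallel, an Airflow pool could also cap the total number of ingestion task slots across all of them, independently of each DAG's max_active_runs. The sketch below assumes a hypothetical pool named kafka_ingest created beforehand (e.g. with airflow pools set kafka_ingest 8 "Kafka ingestion slots"); the real DagRawKafkaS3Msg wrapper would need to forward the pool argument to its underlying operators.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder for the actual ingestion callable.
def ingest_batch():
    ...

with DAG(
    dag_id="kafka_ingest_pooled",      # hypothetical dag_id
    start_date=datetime(2024, 9, 25),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="ingest_from_kafka",
        python_callable=ingest_batch,
        pool="kafka_ingest",           # all ingestion tasks compete for this pool's slots
        pool_slots=1,                  # each task instance occupies one slot
    )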

Anything else?

I would appreciate any guidance on how to manage these load spikes more effectively, or suggestions for optimizing DAGs in this scenario.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 week ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

gopidesupavan commented 1 week ago

How are you handling Kafka to S3 processing? It appears that you've implemented a custom operator. In general, if your implementation involves resource-intensive operations, it will certainly impact memory and CPU usage. Additionally, you're using version 2.8, which is outdated. I recommend upgrading to a more recent version, as it includes optimizations.
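To illustrate what "resource-intensive" can mean here: if the custom operator pulls the whole Kafka backlog into memory before writing to S3, memory grows with the backlog size. A minimal sketch of a memory-bounded alternative that flushes fixed-size batches to S3 is shown below; it assumes confluent-kafka and boto3 are available, and the broker, group, topic, and bucket names are all placeholders rather than anything from the reported setup.

import boto3
from confluent_kafka import Consumer

BATCH_SIZE = 1_000  # flush to S3 after this many records, bounding memory use

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",     # placeholder broker
    "group.id": "airflow-ingest",          # placeholder consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])             # placeholder topic

s3 = boto3.client("s3")
batch, part = [], 0

def flush(records, part_no):
    # Write one bounded batch as a JSON-lines object; bucket/key are placeholders.
    s3.put_object(
        Bucket="my-bucket",
        Key=f"ingest/part-{part_no:05d}.jsonl",
        Body="\n".join(records).encode("utf-8"),
    )

try:
    while True:
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            break                          # backlog drained for this run
        if msg.error():
            continue
        batch.append(msg.value().decode("utf-8"))
        if len(batch) >= BATCH_SIZE:
            flush(batch, part)
            consumer.commit()              # commit offsets only after the batch is persisted
            batch, part = [], part + 1
    if batch:
        flush(batch, part)
        consumer.commit()
finally:
    consumer.close()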