apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

standalone dag processor gets stuck when over 1k dag files #41806

Closed · awesomescot closed this issue 2 months ago

awesomescot commented 2 months ago

Apache Airflow version

2.10.0

If "Other Airflow 2 version" selected, which one?

No response

What happened?

When I have more than about 1,000 DAG files, the standalone DAG processor seems to stop functioning properly. CPU usage drops to almost zero, the number of active parsing processes is also around zero, and the DagBag never fills up. The logs are unhelpful, and I can't figure out what the DAG processor is doing; it seems as though it's silently crashing.
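For anyone hitting the same symptom, one way to see what a stuck Python process is actually doing is to dump its thread stacks with py-spy. A rough sketch, assuming a Kubernetes deployment; py-spy is not bundled with Airflow, the pod name below is made up, and the container may need the SYS_PTRACE capability for the dump to work:

    # Install py-spy inside the dag-processor pod (pod name is hypothetical)
    kubectl exec -it airflow-dag-processor-0 -- pip install py-spy

    # Dump the Python stacks of the oldest process matching "airflow dag-processor"
    kubectl exec -it airflow-dag-processor-0 -- sh -c \
      'py-spy dump --pid "$(pgrep -of "airflow dag-processor")"'

If the stacks show every parsing process blocked in the same place (for example, on a database call or an import), that narrows down where the hang is.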

What do you think should happen instead?

I think the standalone DAG processor should parse the files in the same amount of time as the scheduler's integrated DAG processor, or less.

How to reproduce

I'm not sure I can share our DAG files, but I will post my Helm values file and would love to see if others can reproduce this.

Operating System

Kubernetes (Helm chart)

Versions of Apache Airflow Providers

The ones included in the Helm chart.

Deployment

Official Apache Airflow Helm Chart

Deployment details

We are connecting to an RDS Postgres instance (which also shows very low CPU usage).
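Since the database also looks idle, one quick check is whether the processor's sessions are stuck rather than busy. A minimal sketch, assuming psql can reach the RDS instance (the connection string is a placeholder):

    # Summarise what every session against the metadata DB is doing
    psql "postgresql://airflow:PASSWORD@rds-host:5432/airflow" -c \
      "SELECT state, wait_event_type, count(*)
         FROM pg_stat_activity GROUP BY state, wait_event_type;"

Lots of sessions sitting in "idle in transaction", or all waiting on the same lock, would point at the database side rather than the parser itself.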

Anything else?

I've been trying to play around with settings to see if I can figure out what is happening, but no luck so far. I'm happy to post any logs that would be helpful.
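For anyone playing with the same settings, these are the knobs most commonly involved in DAG parsing, expressed as Airflow's AIRFLOW__<SECTION>__<KEY> environment overrides. The values shown are the stock 2.10 defaults (except logging_level, raised to get more detail while debugging), not a recommendation:

    # DAG-parsing knobs as environment overrides (values are stock defaults)
    export AIRFLOW__SCHEDULER__PARSING_PROCESSES=2            # parallel parser processes
    export AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30   # seconds between re-parses of a file
    export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=300      # seconds between rescans of the dags folder
    export AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=50       # kill a single file's parse after this
    export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=30.0          # import timeout per DAG file
    export AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG              # default is INFO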

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

jscheffl commented 2 months ago

I think the standalone DAG processor should behave the same as the integrated one.

Can you please check:

- Is it running stably when DAG processing is integrated rather than separated?
- Does the failure happen on the first run already, so parsing never completes? Or is it an instability that sometimes hits your environment?
- Can you bisect, cutting the number of DAG files in half in increments? (A rough sketch follows this list.)
- Is there a specific expensive DAG file which takes long to parse?
- Or can you create an artificial file set which makes it reproducible?
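A rough sketch of the bisection step, assuming the DAGs are flat .py files under $AIRFLOW_HOME/dags with no spaces in the filenames (the held-out directory is made up):

    # Move half of the DAG files out of the parsed folder, then observe whether
    # the processor recovers; repeat on whichever half still shows the hang.
    mkdir -p "$AIRFLOW_HOME/dags-held"
    total=$(ls "$AIRFLOW_HOME"/dags/*.py | wc -l)
    ls "$AIRFLOW_HOME"/dags/*.py | head -n "$((total / 2))" | \
      xargs -I{} mv {} "$AIRFLOW_HOME/dags-held/"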

awesomescot commented 2 months ago

Thanks for the reply. I have been testing with the scheduler's integrated processor and it seems to be struggling as well, so this should probably be a discussion rather than an issue. I'll raise something over there.

michalc commented 1 month ago

Just in case anyone else stumbles on this - we hit a very similar issue and worked around it by wrapping the dag-processor command with timeout. So in our case, where we run one dag-processor per subdirectory:

timeout --kill-after=10 600 airflow dag-processor --subdir $AIRFLOW_HOME/dags/$SUB_FOLDER -n 1

(And then surrounding code/infrastructure ensures that another dag-processor is spun up.)
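A minimal sketch of such a wrapper, matching the command above; the restart loop and backoff are just one possible choice, not what michalc's infrastructure necessarily does:

    # Run one dag-processor pass at a time; timeout kills any pass that exceeds
    # 10 minutes (exiting 124), and the loop starts a fresh one either way.
    while true; do
      timeout --kill-after=10 600 \
        airflow dag-processor --subdir "$AIRFLOW_HOME/dags/$SUB_FOLDER" -n 1
      sleep 5   # small backoff so a crash loop does not spin hot
    done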