apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Poor scheduling throughput when a low percentage of tasks are schedulable. #31185

Open rob-1126 opened 1 year ago

rob-1126 commented 1 year ago

Apache Airflow version

2.6.0

What happened

Airflow's task-scheduling throughput suffers with relatively moderate numbers of total outstanding tasks. In situations where a few thousand tasks belong to running dags but only a low fraction of them are schedulable, the scheduler spends a large portion of every cycle of the scheduler loop considering tasks before it finds the one or two that are eligible for scheduling.

Similarly, a single dag with many blocked tasks can adversely impact an otherwise simple dagbag with moderate throughput requirements.

In practice, the scheduler under nearly ideal circumstances can consider roughly 500-1000 tasks per second.

This appears to be a combination of two factors.

  1. Airflow considers all tasks that belong to a running dag, and re-considers each task on every run of the scheduler loop rather than in response to events that can cause state changes.
  2. Airflow's per-task consideration (including ti_deps) in the main scheduler loop takes 1-2 ms per task, drawn from the pool of all non-completed tasks, which is not fast enough to brute-force through without impacting scheduler latency (see the back-of-envelope figure below).
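To put the second factor in concrete terms: at 1-2 ms per task, a pool of 5000 non-completed tasks costs the scheduler roughly 5-10 seconds of consideration per loop, even when only one or two of those tasks are actually eligible to run.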

What you think should happen instead

Either:

ti_deps should be stored in the database and modified in response to events. The main scheduler loop should pull only tasks that have no unsatisfied task dependencies, and should not check task dependencies at run time.
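As a rough illustration of this first option, here is a minimal sketch of event-driven dependency bookkeeping. The table, columns, and helpers below are hypothetical (nothing like them exists in Airflow today); they only show the shape of the idea: maintain an unsatisfied-dependency counter per task instance, decrement it on upstream completion events, and let the scheduler's critical path collapse to one indexed query.

```python
# Hypothetical sketch only: this table and these helpers are NOT part of
# Airflow; they illustrate storing dep state in the database and updating
# it in response to events.
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class TaskInstanceDep(Base):
    """One row per task instance, with a counter of unsatisfied upstream deps."""

    __tablename__ = "task_instance_dep"  # hypothetical table
    id = Column(Integer, primary_key=True)
    dag_id = Column(String, nullable=False)
    task_id = Column(String, nullable=False)
    unsatisfied_deps = Column(Integer, nullable=False, default=0)


def on_upstream_success(session: Session, downstream_ids: list[int]) -> None:
    # Event handler: when a task finishes, decrement the counter of each
    # downstream task instance instead of re-evaluating its deps every loop.
    for dep in session.scalars(
        select(TaskInstanceDep).where(TaskInstanceDep.id.in_(downstream_ids))
    ):
        dep.unsatisfied_deps -= 1
    session.commit()


def fetch_schedulable(session: Session) -> list[TaskInstanceDep]:
    # The scheduler's critical path becomes a single indexed query that
    # returns only tasks that are actually eligible, skipping blocked ones.
    return list(
        session.scalars(
            select(TaskInstanceDep).where(TaskInstanceDep.unsatisfied_deps == 0)
        )
    )
```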

OR:

per-task consideration on the main scheduler loop's critical path should be reduced dramatically. Do any of the ti_deps result in further database calls for individual instances? If not, this path may not be plausible without the scheduler taking out a broad-scoped lock and keeping substantial state in memory.

How to reproduce

Construct a dag bag with two dags:
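The dag definitions from the original report do not appear to have been preserved here; the following is a hypothetical reconstruction inferred from the description that follows (a serial dag of 20 chained simple tasks, and a sleep-then-parallel dag in which a long sleep gates several thousand parallel tasks). Dag ids, operators, and exact task counts are assumptions.

```python
# Hypothetical reconstruction: the dag ids, operators, and exact shapes below
# are inferred from the prose description, not taken from the original report.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

# "serial-dag": a 20-deep chain of simple tasks with modest latency needs.
with DAG(
    dag_id="serial_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    prev = EmptyOperator(task_id="step_0")
    for i in range(1, 20):
        step = EmptyOperator(task_id=f"step_{i}")
        prev >> step
        prev = step

# "sleep-then-parallel": a long sleep gating thousands of parallel tasks.
# While the sleep runs, none of the downstream tasks are schedulable, yet
# the scheduler re-examines every one of them on each loop.
with DAG(
    dag_id="sleep_then_parallel",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    gate = BashOperator(task_id="sleep", bash_command="sleep 600")
    for i in range(5000):
        gate >> EmptyOperator(task_id=f"parallel_{i}")
```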

Each step of the 20-deep series of simple tasks is delayed in execution by several seconds, as either the previous scheduler loop takes longer to finish considering the 5000 tasks (if the sleep-then-parallel dag is processed after the serial-dag) or the new scheduler loop processes the 500 tasks first (if the sleep-then-parallel dag is processed before the serial-dag).

Operating System

MacOS against my will

Versions of Apache Airflow Providers

N/A

Deployment

Other

Deployment details

Local source with custom no-op executor benchmarking and additional instrumentation in sqlalchemy, python, and elsewhere.

Anything else

My feeling is that one or both of these items should be filed as a feature request if there is consensus on a course of action.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.

csp33 commented 1 year ago

This issue is also affecting me

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. We kindly ask you to recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

rob-1126 commented 1 week ago

Still applicable.