apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Webserver Graph View Partial Load #41620

Open collinmcnulty opened 3 weeks ago

collinmcnulty commented 3 weeks ago

Description

The Graph view currently loads all tasks in the DAG and then allows you to scroll around the graph. This is very inefficient for large DAGs because the user can only perceive a small fraction of the DAG at once. This inefficiency can crash the webserver (or require excessive resources) for DAGs with five-figure task counts.

The prior art I would take inspiration from is video games, which do not load the whole world, but instead load only the parts that you are looking at and update that set as you move around. For Airflow, I think we should show only the tasks connected to the task you are currently looking at, then the tasks connected to those, and so on until some limit on the total number of tasks (e.g. 50) is hit. To see the rest of the DAG, you would click one of the connected tasks, and Airflow would re-center the view on that task, loading in the new tasks and dropping the ones farthest away.
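To make the idea concrete, here is a minimal sketch of the "connected tasks until a limit" selection as a breadth-first search outward from the current task. It is plain Python over a list of `(upstream, downstream)` pairs, not Airflow's actual graph code; the function name `visible_tasks` and the default `limit=50` are assumptions for illustration only.

```python
from collections import deque


def visible_tasks(edges, center, limit=50):
    """Return up to `limit` task ids closest to `center`, nearest first.

    `edges` is an iterable of (upstream_id, downstream_id) pairs.
    Hypothetical helper for illustration; not part of Airflow.
    """
    # Treat dependencies as undirected so that both upstream and
    # downstream neighbours of a task are discovered.
    neighbours = {}
    for upstream, downstream in edges:
        neighbours.setdefault(upstream, set()).add(downstream)
        neighbours.setdefault(downstream, set()).add(upstream)

    seen = {center}
    order = [center]
    queue = deque([center])
    # Breadth-first search visits tasks in order of hop distance, so
    # stopping at `limit` keeps the nearest tasks.
    while queue and len(order) < limit:
        task = queue.popleft()
        for nxt in sorted(neighbours.get(task, ())):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
                if len(order) >= limit:
                    break
    return order


# A 100-task chain t0 -> t1 -> ... -> t99, viewed from the middle.
chain = [(f"t{i}", f"t{i + 1}") for i in range(99)]
window = visible_tasks(chain, "t50", limit=5)
```

With the chain above, the window holds `t50` plus its two nearest neighbours on each side, regardless of how long the chain is, so the cost of building the view depends on the limit rather than on the size of the DAG.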

Use case/motivation

Large DAGs should load intelligently in the Graph view without needing excessive resources on the webserver.

Related issues

No response

eladkal commented 2 weeks ago

> The prior art I would take inspiration from is video games, which do not load the whole world, but instead load only the parts that you are looking at

In these games the entry point is very clear. It's your city/town/castle and you can move around the map as everything is fixed.

How do you suggest actually implementing this in Airflow? Which task should be presented first? What do you imagine navigation will look like? How do I know what is to the right or left of this specific task?

I think 5K tasks for a specific DAG is really high. Did you pack all of the tasks into nested task groups? That way you can zoom in only on what is relevant to you, and possibly we can make the UI load only the specific task group?

collinmcnulty commented 2 weeks ago

I think the selected task would be the entry point, and for DAGs larger than some configured number of tasks, the Graph view should not load unless a specific task is selected from the Grid view. Or maybe you just pick any random root node and display that one.
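As a rough sketch of that entry-point rule: small DAGs render in full, large DAGs center on the selected task, and if nothing is selected a root task is used as a fallback. The name `choose_view` and the `threshold` parameter are hypothetical and do not correspond to an existing Airflow setting. The fallback here picks the alphabetically first root rather than a random one, so that the view is reproducible.

```python
def choose_view(task_ids, edges, selected=None, threshold=1000):
    """Decide how to render the graph.

    Returns ("full", None) for small DAGs, otherwise ("partial", task_id)
    where task_id is the task to center on. `threshold` is a hypothetical
    configured task count, not an existing Airflow option.
    """
    if len(task_ids) <= threshold:
        return ("full", None)
    if selected is not None:
        return ("partial", selected)
    # Fallback: a root is any task that never appears as a downstream.
    downstream_ids = {downstream for _, downstream in edges}
    roots = [t for t in task_ids if t not in downstream_ids]
    return ("partial", min(roots))


tasks = ["a", "b", "c"]
deps = [("a", "b"), ("b", "c")]
mode_small = choose_view(tasks, deps, threshold=10)
mode_selected = choose_view(tasks, deps, selected="c", threshold=2)
mode_fallback = choose_view(tasks, deps, threshold=2)
```

The stricter variant described above, where the Graph view refuses to load until a task is chosen from the Grid view, would replace the fallback branch with an error or a prompt.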

Navigation would work as it does now: you can pan around the graph, and clicking on a task changes the center point, loading new tasks as required and dropping old ones. So you can traverse the graph by clicking successive tasks.
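The load-and-drop step on each click could be as simple as a set difference between what is displayed and the window around the new center. This is only a sketch; `recenter_diff` is a hypothetical name, and how the new window is computed is left to whatever selection rule the view uses.

```python
def recenter_diff(displayed, new_window):
    """Return (to_load, to_drop) when the view moves to a new center.

    `displayed` holds the task ids currently on screen and `new_window`
    the task ids that should be visible around the newly clicked task.
    Hypothetical helper for illustration; not part of Airflow.
    """
    displayed = set(displayed)
    new_window = set(new_window)
    to_load = new_window - displayed   # tasks to fetch from the webserver
    to_drop = displayed - new_window   # tasks to remove from the view
    return to_load, to_drop


to_load, to_drop = recenter_diff(["a", "b", "c"], ["b", "c", "d"])
```

Tasks that appear in both sets stay on screen untouched, so each click only transfers the tasks that are newly in range.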

I work in support for Astronomer, so I'm speaking on behalf of users I've worked with, not for my own DAGs. I've observed that DAGs that are large and don't make great use of task groups exist in the wild, and there is the problem that on shared Airflow instances, even one DAG like this can start breaking the webserver for everyone. I'm certainly open to other ideas on how to make graph view efficient, but I don't think we can/should rely on DAG authors to keep graph view efficiency in mind when writing DAGs.