dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] Performance: prune potentially redundant edges, or document why they exist #10842

Open ttusing opened 1 month ago

ttusing commented 1 month ago

Is this your first time submitting a feature request?

Describe the feature

Following up on this issue: https://github.com/dbt-labs/dbt-core/issues/10434#issuecomment-2283408003

https://github.com/dbt-labs/dbt-core/blob/63262e93cb59ed3b5143a1194bc46ba4c03feca1/core/dbt/compilation.py#L197-L215

I am unsure why the graph needs to be built in this way. If the goal is only to maintain build order, it seems that edges from a test to just the direct (depth-1) children should be sufficient. In the current implementation, tests are direct parents of ALL non-test downstream nodes, so a project with 5,000 models and 15,000 tests might have (5k*15k/2) = 37.5 million edges, whereas limiting to a depth of 1 might keep that in the hundreds of thousands.
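To make the estimate above concrete, here is a minimal, self-contained sketch that counts the test-to-model edges under both strategies. It is not dbt's code: the dict-based graph and the helper names (`descendants`, `count_test_edges`, `linear_chain`) are hypothetical, and it assumes a worst-case linear chain of models where each test depends on exactly one model.

```python
from collections import deque


def descendants(children, node):
    """Return every node reachable from `node`, excluding `node` itself."""
    seen = set()
    queue = deque(children.get(node, ()))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(children.get(current, ()))
    return seen


def count_test_edges(children, tests_per_model, direct_only):
    """Count the test -> model edges added under one of the two strategies.

    direct_only=False: each test points at ALL downstream models.
    direct_only=True:  each test points only at the depth-1 children.
    """
    total = 0
    for model, n_tests in tests_per_model.items():
        if direct_only:
            targets = set(children.get(model, ()))
        else:
            targets = descendants(children, model)
        total += n_tests * len(targets)
    return total


def linear_chain(n_models):
    """Hypothetical worst case: model 0 -> model 1 -> ... -> model n-1."""
    return {i: [i + 1] for i in range(n_models - 1)}


# Toy project: 6 models in a chain, 3 tests attached to each model.
children = linear_chain(6)
tests_per_model = {model: 3 for model in range(6)}

all_downstream_edges = count_test_edges(children, tests_per_model, direct_only=False)
direct_only_edges = count_test_edges(children, tests_per_model, direct_only=True)
print(all_downstream_edges, direct_only_edges)
```

On this toy chain the script prints `45 15`. In general, a chain of n models with k tests per model gives k*n*(n-1)/2 edges for the all-downstream strategy and k*(n-1) for depth 1. For n = 5,000 and k = 3 that is roughly 37.5 million versus about 15 thousand. A real project DAG is wider and shallower than a chain, so its actual counts will differ from both figures.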

This has large implications for memory usage, build times, etc. for projects with lots of tests and/or lots of nodes generally.

If this construction is needed, I would like to understand why and add some comments or documentation for future readers of this code exploring performance issues. Otherwise, I would like to consider changing the construction to use a depth of 1.

Describe alternatives you've considered

No response

Who will this benefit?

All users of dbt, especially those with large projects.

Are you interested in contributing this feature?

No response

Anything else?

No response

ttusing commented 1 month ago

@dbeatty10 @gshank