dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[SPIKE+] Improve the Performance Characteristics of add_test_edges() #10950

Open peterallenwebb opened 2 weeks ago

peterallenwebb commented 2 weeks ago

Housekeeping

Short description

The add_test_edges() function is called during the dbt build command. It inserts edges into the execution graph to ensure that models downstream of a node do not run until all the tests on that node have passed.

The function is slow in certain projects, and recent data from the field show that it inflates the number of edges in the graph by a factor of six. It is slow enough to show up often in performance profiles, but its memory consumption is the bigger problem: memory use is high enough to cause out-of-memory (OOM) crashes.
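To illustrate where the extra edges come from, here is a simplified sketch of the behavior described above. It is not the dbt-core implementation: the graph is a plain dict, and the names `descendants`, `add_test_edges_naive`, `children`, and `tests` are made up for this example. The sketch assumes a test should block every non-test node that is downstream of all the nodes it tests.

```python
def descendants(children, node):
    """Return every node reachable from `node` by following edges."""
    seen = set()
    stack = [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def add_test_edges_naive(children, tests):
    """Add an edge from each test to every non-test node downstream of
    all the nodes it tests. Returns the number of edges added.

    children: dict mapping node -> set of direct downstream nodes
    tests:    dict mapping test name -> set of tested nodes
    (Both shapes are assumptions made for this sketch.)
    """
    # Collect edges first so reachability is computed on the original graph.
    new_edges = []
    for test, tested in tests.items():
        reach = [descendants(children, model) for model in tested]
        eligible = set.intersection(*reach) if reach else set()
        for node in eligible:
            if node not in tests:
                new_edges.append((test, node))

    added = 0
    for src, dst in new_edges:
        targets = children.setdefault(src, set())
        if dst not in targets:
            targets.add(dst)
            added += 1
    return added


# A four-model chain with one test on the first model.
graph = {
    "model1": {"model2", "test1"},
    "model2": {"model3"},
    "model3": {"model4"},
    "model4": set(),
    "test1": set(),
}
tests = {"test1": {"model1"}}
added = add_test_edges_naive(graph, tests)
print(added, sorted(graph["test1"]))
```

In this toy chain a single test adds one edge per downstream model, so the number of added edges grows with both the number of tests and the depth of the graph below each tested node.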

Acceptance criteria

  1. If possible, implement a new version of this function that achieves the desired test-dependency behavior while inserting fewer edges and running more quickly.
  2. Add a new behavior flag which causes the new function to be used, while retaining the old function on the default code path.
  3. Follow up by gathering data about the relative performance of the two implementations and monitoring for regressions.

Suggested Tests

Existing tests should suffice, but we should add additional tests to reduce the risks associated with the new implementation.

Impact to Other Teams

None.

Will backports be required?

No.

Context

No response

MichelleArk commented 1 week ago

As @peterallenwebb noted, a source of complexity here is that add_test_edges() currently accounts for tests that depend on multiple models, not just one. It may be difficult to take a similar approach of running test nodes "just in time" after a model completes during handle_job_queue if certain tests depend on multiple models before they can run.

ChenyuLInx commented 1 week ago

One thought here is to remove the transitive edges (for example, test1 -> model 3) that add_test_edges() adds.
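One possible reading of this suggestion, sketched under stated assumptions: connect each test only to the "frontier" of the nodes it should block, meaning the blocked nodes that have no blocked parent. Deeper nodes then wait for the test indirectly through their own upstream dependencies. The function and variable names below are hypothetical, the graph is a plain dict, and this is not dbt-core code.

```python
def descendants(children, node):
    """Return every node reachable from `node` by following edges."""
    seen = set()
    stack = [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def add_test_edges_reduced(children, tests):
    """Add test -> node edges only for the frontier of blocked nodes.

    children: dict mapping node -> set of direct downstream nodes
    tests:    dict mapping test name -> set of tested nodes
    Returns the number of edges added.
    """
    new_edges = []
    for test, tested in tests.items():
        reach = [descendants(children, model) for model in tested]
        eligible = set.intersection(*reach) if reach else set()
        eligible -= set(tests)  # tests never block other tests here

        # Nodes with at least one eligible parent are already blocked
        # transitively, so they do not need a direct edge from the test.
        covered = set()
        for node in eligible:
            covered |= children.get(node, set())

        for node in eligible - covered:
            new_edges.append((test, node))

    added = 0
    for src, dst in new_edges:
        targets = children.setdefault(src, set())
        if dst not in targets:
            targets.add(dst)
            added += 1
    return added


# test1 depends on model1 only; test2 depends on model1 and model2.
graph = {
    "model1": {"model2", "test1", "test2"},
    "model2": {"model3", "test2"},
    "model3": {"model4"},
    "model4": set(),
    "test1": set(),
    "test2": set(),
}
tests = {"test1": {"model1"}, "test2": {"model1", "model2"}}
added = add_test_edges_reduced(graph, tests)
print(added, sorted(graph["test1"]), sorted(graph["test2"]))
```

Here test1 gets a single edge to model2 instead of edges to model2, model3, and model4. The multi-model test2 gets an edge to model3 only, and never to model2, which it depends on. Whether this preserves all of the skip-on-failure behavior of dbt build would need to be checked against the existing tests.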

@gshank mentioned that we could also perform this operation only for the selected parts of the DAG, or skip it when people select tests in the build command.
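A rough sketch of the first idea, limiting the work to selected nodes: tests outside the selection are skipped, and edges are only added to downstream nodes that are themselves selected. The `selected` parameter and all function names are hypothetical and do not reflect how dbt-core represents a selection.

```python
def descendants(children, node):
    """Return every node reachable from `node` by following edges."""
    seen = set()
    stack = [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def add_test_edges_for_selection(children, tests, selected):
    """Add test -> node edges only between selected tests and selected nodes.

    children: dict mapping node -> set of direct downstream nodes
    tests:    dict mapping test name -> set of tested nodes
    selected: set of node names chosen for this run (hypothetical input)
    Returns the number of edges added.
    """
    new_edges = []
    for test, tested in tests.items():
        if test not in selected:
            continue  # unselected tests will not run, so they block nothing
        reach = [descendants(children, model) for model in tested]
        eligible = set.intersection(*reach) if reach else set()
        for node in eligible:
            if node in selected and node not in tests:
                new_edges.append((test, node))

    added = 0
    for src, dst in new_edges:
        targets = children.setdefault(src, set())
        if dst not in targets:
            targets.add(dst)
            added += 1
    return added


graph = {
    "model1": {"model2", "test1"},
    "model2": {"model3"},
    "model3": {"model4"},
    "model4": set(),
    "test1": set(),
}
tests = {"test1": {"model1"}}
selected = {"model1", "test1", "model2"}
added = add_test_edges_for_selection(graph, tests, selected)
print(added, sorted(graph["test1"]))
```

With only model1, test1, and model2 selected, a single edge is added instead of three.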