Thanks @cajubelt! Is this only for the `build` command (i.e. not `run`)?
I suspect this might be due to the large number of tests in your project (per our conversation), and the additional time that dbt spends "linking" the DAG (adding edges between a test on an upstream model -> downstream models, so that they skip on test failure). While it's not immediately clear to me what change we made between v1.5 -> v1.8 to that logic, if you only see this slowdown on `build`, it would be a strong indication that therein lies the problem.
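To make the "linking" concrete, here's a rough sketch in networkx terms of what adding those edges amounts to; this is not dbt's actual implementation, and `tests_by_model` is an assumed mapping from each model to its attached tests:

```python
import networkx as nx

def add_test_edges(graph: nx.DiGraph, tests_by_model: dict[str, list[str]]) -> None:
    # For each test attached to a model, add an edge from the test to every
    # node downstream of that model, so a test failure skips those nodes.
    for model, tests in tests_by_model.items():
        for downstream_node in nx.descendants(graph, model):
            for test in tests:
                graph.add_edge(test, downstream_node, edge_type="parent_test")
```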
@jtcohen6 yes, I just confirmed that `dbt run` doesn't have any noticeable slowdown on 1.8 with the same selector.
@cajubelt Okay! So I think the hypothesis is: the slowdown is specific to `dbt build` in v1.8 compared to v1.5, and the likely culprit is the `add_test_edges` method.
If you're up for it, there are two ways to try confirming that hypothesis by profiling dbt's performance. One of them is `py-spy` (snazzier), in the way @peterallenwebb recommends here:
```
pip install py-spy
sudo py-spy record -s -f speedscope -- dbt build
```
We have the ability to exclude tests from the build resource types, but that doesn't affect whether or not we pass `add_test_edges` to compile. It would be possible to check the resource types and, if tests are excluded, pass `add_test_edges = False`.
Of course, if you do want tests to run, that doesn't help.
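A minimal sketch of that check, with an illustrative `resource_types` set rather than the actual dbt-core call sites:

```python
def should_add_test_edges(selected_resource_types: set[str]) -> bool:
    # If tests are excluded from the build, the test -> downstream edges added
    # during compile would never be used, so they could be skipped entirely.
    return "test" in selected_resource_types

# A build that excludes tests from its resource types:
add_test_edges = should_add_test_edges({"model", "seed", "snapshot"})  # -> False
```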
We do want to run tests if possible.
@jtcohen6 here's a screenshot of a search through py-spy's output for the longest segments of the flame graph. Looks like `generic_bfs_edges` is adding 12:41; not sure if that's particularly revealing, however.
@jtcohen6 Not super sure if related to this issue, but we're seeing an extreme slowdown when `build`ing a large chunk of a fairly big graph with `--indirect-selection buildable` in CI/CD. The project has 2573 models, 13 snapshots, 7 analyses, 2254 data tests, 20 seeds, 1 operation, 359 sources, 44 exposures, 1408 macros, 1 group, 13 unit tests.
`cautious` and `eager` end up being fine, but with `buildable`, the process "hangs" for up to 20 minutes after it prints the project stats ("Found 2573 models ..." etc.) and before the "Concurrency: 4 threads (target='test')" message. No other output, even on `debug` log level.
I tried to figure out what's going on using the `trace` module, but it was a bit too much spam :). Will try to add some trace output when I find a minute.
Here's me waiting for about 5 minutes for the `generic_bfs_edges` traversal.
In a "good" run with `eager` selection this takes about two seconds with the exact same selector (the rest is dbt actually starting to execute, which I didn't wait for in the other run).
What percentage of your graph (including tests) is selected by the tag, and what % is selected by the selector?
I had someone on my team try using the `@` operator on a selector (which I don't normally do) and we both had dbt hang. I kind of pushed it out of my head; will be thinking on this now.
I don't really understand the `py-spy` output (I worked with SnakeViz instead)... if I had to guess what might be happening:
- `generic_bfs_edges` is either really slow or being called a lot
- `collect_specified_neighbors` in selector.py calls `self.graph.select_children` when passed certain selector patterns
- `self.graph.select_children` applies the `self.descendants` method for each node, which is perhaps a lot of times, or maybe it uses up a lot of memory
- `filter_edges_by_type`, which is calling `get_edge_data` in networkx

If you can share the SnakeViz output, that would help; also maybe check your memory usage? The last bug I worked on for this graph optimization stuff was having performance issues from running out of memory when the DAG got too big.
If my stab in the dark here is in the right direction, maybe there is a pruning that could be done to reduce calling `descendants` so much. Like, if you have a ton of nodes in your selector, many of which are already descendants of each other, you don't need to re-tread and evaluate each node. You could first get a list of all descendants of each node with a depth of 1, then take the difference of that and the list of nodes in the selector to restrict the search to descendants of unsearched nodes, find descendants of depth 1 from that list, and iterate.
For a graph with 20k nodes: if 1k have the tag, 10k are descendants, and each descendant is on average a descendant of 70% of the tagged nodes, then each descendant has to be evaluated ~700 times, and 700 * 10,000 is a lot of calls.
After writing this out, this actually seems pretty likely XD. I think an update to `select_children` with this logic would be the place to implement it (see the sketch below). This seems similar to https://github.com/dbt-labs/dbt-core/pull/7194.
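Here's the sketch mentioned above, over a plain networkx `DiGraph` (the real `Graph.select_children` in dbt-core differs): expand children one level at a time and never re-visit a node that is already known.

```python
import networkx as nx

def select_children(graph: nx.DiGraph, selected: set[str], max_depth: int | None = None) -> set[str]:
    descendants: set[str] = set()
    frontier = set(selected)
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        next_frontier: set[str] = set()
        for node in frontier:
            for child in graph.successors(node):
                # Skip nodes already expanded, so heavily shared descendants
                # are only visited once instead of once per selected ancestor.
                if child not in descendants and child not in selected:
                    next_frontier.add(child)
        descendants |= next_frontier
        frontier = next_frontier
        depth += 1
    return descendants
```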
Also, if the slow part is `filtered_graph = self.exclude_edge_type("parent_test")`, maybe that could be cached or passed as a parameter into `descendants`?
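And a sketch of the caching idea (class and method names mirror the snippet above, not dbt-core's actual `Graph` class):

```python
import networkx as nx

class Graph:
    def __init__(self, graph: nx.DiGraph):
        self.graph = graph
        self._filtered = None  # cached copy without "parent_test" edges

    def exclude_edge_type(self, edge_type: str) -> nx.DiGraph:
        # Build the filtered copy once and reuse it, instead of re-filtering
        # millions of edges on every descendants() call.
        # (For simplicity this caches only the one filter used here.)
        if self._filtered is None:
            filtered = nx.DiGraph()
            filtered.add_nodes_from(self.graph.nodes(data=True))
            filtered.add_edges_from(
                (u, v, d) for u, v, d in self.graph.edges(data=True)
                if d.get("edge_type") != edge_type
            )
            self._filtered = filtered
        return self._filtered

    def descendants(self, node: str) -> set[str]:
        return nx.descendants(self.exclude_edge_type("parent_test"), node)
```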
will try tests on my own project with ~500 models and ~5000 tests next week.
I added a PR! https://github.com/dbt-labs/dbt-core/pull/10526
I believe I was able to recreate the issue in my project, which has: Found 442 models, 43 snapshots, 5112 data tests, 32 seeds, 2 operations, 134 sources, 23 exposures, 1075 macros.
Running `dbt build --select tag:silver+` on my DAG on 1.8.4, it takes 132 seconds to start building. `dbt build` just takes 11 seconds.
My PR applies the pruning strategy I described. I saw the runtime for `dbt build --select tag:silver+` improve to 66 seconds. Using SnakeViz, I identified that the number of calls to `get_edge_data` dropped from 14,332,569 to 383,916.
I am unsure how much of this the dbt 1.9 / main branch already contains in its own performance improvements... the total runtime improvement was 66 seconds, but `select_children` only improved from 80 seconds to 45 seconds (a 35-second improvement). Going to run the `main` branch for a performance check.
Feedback welcome, especially if you can test/review the change!
I've reviewed and suggested some changes to @ttusing's PR but we should be close to a fix on this.
This is now resolved in 1.9+, but I'm re-opening this issue so we can consider backports.
I'm also interested in seeing performance reports from @stefankeidel and @cajubelt. This should help, but checking the edge type seems to be inherently slow. I wonder whether the runtime will improve from 20 minutes to something like 15, 10, or 5 minutes, and whether there are further improvements that can be made to the edge-type lookup speed.
I could imagine avoiding the edge-type lookup altogether with the knowledge that tests cannot depend on tests. So, don't check the edge type when searching children if the node is a test (or if the parent is a test, for ancestors), or something like that. Or, build the subgraph with all non-test nodes first; a rough sketch of that is below.
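A rough sketch of that last idea, assuming node data carries a `resource_type` attribute as in dbt's manifest (whether it is actually faster than the edge-type check would need profiling):

```python
import networkx as nx

def descendants_via_non_test_subgraph(graph: nx.DiGraph, node: str) -> set[str]:
    # Tests can't have downstream dependents, so dropping test nodes up front
    # removes all "parent_test" edges without any per-edge type lookups.
    # (Assumes `node` itself is not a test.)
    non_test_nodes = [
        n for n, data in graph.nodes(data=True) if data.get("resource_type") != "test"
    ]
    return nx.descendants(graph.subgraph(non_test_nodes), node)
```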
Just saw the great activity on this after it fell off my radar for a bit, thanks all. Will follow up on Monday with a perf report, I have personal commitments this weekend.
We'd love to see a backport to 1.8 since our project is fairly large and we are unlikely to update to 1.9 before some key deadlines this Fall.
Thanks for the update and amazing work! I am currently traveling on a sabbatical leave so I won't be able to test this before I'm back in November, but will make sure to do so then!
So I'm looking at this section:
This adds a huge number of edges. For probably all but pathological DAGs, this means an edge between every test and all nodes downstream of what it's testing. For the example of 5k models and 15k data tests, this might add approximately (5k * 15k / 2) = 37.5 million edges.
This was added in https://github.com/dbt-labs/dbt-core/pull/7276; I don't really understand the purpose of this code. @gshank can you explain why it is necessary to add all of these edges? It seems like a huge factor limiting dbt's scaling. If the goal is to get `dbt build` to run tests on parents before children, isn't it enough to just do this with a depth of 1?
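For illustration, a hypothetical "depth of 1" variant of the earlier sketch would look something like this (just to make the question concrete; `tests_by_model` is the same assumed mapping as above):

```python
import networkx as nx

def add_test_edges_depth_one(graph: nx.DiGraph, tests_by_model: dict[str, list[str]]) -> None:
    # Only link each test to the direct children of the model it tests,
    # instead of to every downstream node.
    for model, tests in tests_by_model.items():
        for child in graph.successors(model):
            for test in tests:
                graph.add_edge(test, child, edge_type="parent_test")
```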
I'm guessing the get-edge-type function is slow because of the millions of edges. There are far fewer nodes (thousands), so getting the node type should be much faster, and I am nearly certain nothing can directly depend on tests (only the `parent_test` edge type), so I think checking node types is another potentially huge optimization.
Somewhat separately, I wonder about a feature enhancement to print simple DAG stats with something like `--dag-debug` to help with cases like these. It might show the total number of edges, the highest-degree node (or distribution stats on degree), the highest number of tests on a single model (or a distribution), etc.
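A sketch of what such a stats dump could print (`--dag-debug` is a hypothetical flag, and `resource_type` node data is assumed):

```python
import networkx as nx

def print_dag_stats(graph: nx.DiGraph) -> None:
    print(f"nodes: {graph.number_of_nodes()}")
    print(f"edges: {graph.number_of_edges()}")
    degrees = dict(graph.degree())
    if degrees:
        node, degree = max(degrees.items(), key=lambda kv: kv[1])
        print(f"highest-degree node: {node} ({degree} edges)")
    # Count tests per tested node via each test's parent edges.
    tests_per_node: dict[str, int] = {}
    for n, data in graph.nodes(data=True):
        if data.get("resource_type") == "test":
            for parent in graph.predecessors(n):
                tests_per_node[parent] = tests_per_node.get(parent, 0) + 1
    if tests_per_node:
        node, count = max(tests_per_node.items(), key=lambda kv: kv[1])
        print(f"most tests on a single node: {node} ({count} tests)")
```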
Ok, I did some profiling last night on `dbt build --select tag:large_tag+`. Looks like `get_edge_data` accounted for about 36 minutes and was called 710,951,628 times. Total time to build was 81 minutes. (I used `--empty` and it errored out immediately when writing to tables, so none of this was time waiting for the db writes.) I used dbt 1.8.2. Will try again tonight on 1.9 if that's helpful.
Ok, ran it again with `1.9.0-a1` and the improvement was dramatic. The build time was just 10 minutes. `get_edge_data` was called 11,778,492 times, a 98% reduction, and accounted for just 21 seconds.
Some stats about my project: 5799 models, 20 snapshots, 91 analyses, 18763 data tests, 186 seeds, 1 operation, 1086 sources, 69 exposures, 1899 macros
Resolved for dbt-core 1.9+ in https://github.com/dbt-labs/dbt-core/pull/10526 and backported to dbt-core 1.8 in https://github.com/dbt-labs/dbt-core/pull/10567.
Is this a regression in a recent version of dbt-core?
Current Behavior
`dbt build -s tag:my_tag+` takes about 20 minutes longer to start on dbt 1.8.2 than it does on 1.5.9 with the same tag. The tag used has a lot of downstream nodes in our project, about 11k. Generally we're seeing better performance on 1.8, so we were surprised to see this big regression in performance.
Expected/Previous Behavior
Previously, building everything downstream of a tag with lots of nodes would take a couple of minutes of startup time and then begin running queries against our db. Now it takes 20+ minutes.
Steps To Reproduce
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
bigquery
Additional Context
The reason we need this is that we have a selector used in CI/CD that excludes everything downstream of a tag that is upstream of many nodes. The example given is a simpler version of the original issue we found with that selector. (We tested the simpler version and found it to also have the same performance issue.) The selector was something like