dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-779] Listing tables on big datasets on every compile can be abnormally long (and incomplete?) #205

Closed github-christophe-oudar closed 11 months ago

github-christophe-oudar commented 2 years ago

Describe the bug

I was troubleshooting a 2-3 minute compile time for a small dbt project, and it turned out it contained a model that outputs to a dataset with 60K tables. Since dbt-bigquery provides list_relations_without_caching, it will use the list tables API. However, that call is slow with a lot of tables (2-3 minutes). Also, as the max results is set to 100K and there appears to be no pagination, the listing might be incomplete.
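
For context, here is a minimal sketch of what that listing looks like, assuming the google-cloud-bigquery client (this is not the actual dbt-bigquery code). The iterator pages through the API one request at a time, which is slow for a 60K-table dataset, and a max_results cap like the 100K mentioned above silently truncates any dataset with more tables than the cap.

# Minimal sketch, assuming the google-cloud-bigquery client; not the actual
# dbt-bigquery implementation.
from typing import Iterator, Optional
from google.cloud import bigquery

client = bigquery.Client()

def list_dataset_tables(project: str, dataset: str,
                        cap: Optional[int] = None) -> Iterator[str]:
    """Yield every table ID in project.dataset, up to cap results."""
    dataset_ref = bigquery.DatasetReference(project, dataset)
    # Each iteration may trigger another HTTP page fetch behind the scenes;
    # with cap set (e.g. 100K), anything past the cap is silently dropped.
    for table in client.list_tables(dataset_ref, max_results=cap):
        yield table.table_id

# Example (hypothetical project/dataset names): enumerating a 60K-table
# dataset this way takes minutes.
# tables = list(list_dataset_tables("my-project", "my_crowded_dataset"))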

Steps To Reproduce

Expected behavior

I would expect this to be faster, though I understand it's not straightforward to improve. There are a few leads:

Obviously, a simple workaround is to use a dataset that is not that crowded, especially since all dataset table listings run concurrently; but that might not be applicable for all users. Another workaround would be a config setting that assumes the table does (or does not) exist, so the scan can be skipped entirely, as sketched below.
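
A purely hypothetical sketch of that second workaround (neither the flag nor the helper exists in dbt-bigquery today; it only illustrates the idea of trusting the caller instead of scanning the dataset):

# Hypothetical sketch only: 'assume_exists' and 'resolve_relation' are made up
# for illustration and are not part of dbt-bigquery.
def resolve_relation(adapter, database: str, schema: str, identifier: str,
                     assume_exists: bool = False):
    """Return a relation, optionally skipping the slow dataset listing."""
    if assume_exists:
        # Trust the configuration: build the relation object directly from its
        # coordinates instead of listing every table in the dataset.
        return adapter.Relation.create(
            database=database, schema=schema, identifier=identifier
        )
    # Default path: existing behavior, which relies on the cache populated by
    # list_relations_without_caching (the slow scan discussed above).
    return adapter.get_relation(database, schema, identifier)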

System information

The output of dbt --version:

Core:
  - installed: 1.2.0-a1
  - latest:    1.1.1    - Ahead of latest version!

I'm using dbt-bigquery 1.2.0-b1 (but I had the issue with 1.1.1 as well).

The operating system you're using: MacOS

The output of python --version: Python 3.10.4

kevinhoe commented 2 years ago

Hi @github-christophe-oudar - thanks for opening this issue. We are experiencing the same slowness with dbt Cloud v1.0 with BigQuery... actually even worse, since it takes us 35 minutes to compile 😢. Regarding your comment:

Also as the max results is set to 100K and it looks like there is no pagination, it might be incomplete.

...I believe that explains the log message we have started to see in dbt Cloud when we compile:

[screenshot: dbt compile run results]

I found your issue thanks to Jeremy Yeo (dbt Labs), who has been helping on https://github.com/dbt-labs/dbt-bigquery/issues/115, which is about dbt docs generate taking a long time for BigQuery. As I note in that ticket, we are experiencing very long processing times with both the compile and docs generate [--no-compile] (i.e., catalog) commands. For context, we have many BigQuery projects that each have many datasets with many tables that we scan/union.

Wondering if the folks at dbt Labs can point us in the right direction for overriding any macros in the compile step? I'd be interested to know if we can exclude certain dbt directories from being compiled (or even exclude them from docs generation)?

jtcohen6 commented 2 years ago

This issue deserves a thorough look from the relevant team.

Just popping in to offer, as a potential mechanism for temporary relief, you could look into trying out the cache_selected_only config: https://docs.getdbt.com/reference/global-configs#cache-database-objects-for-selected-resource

This is marked "experimental" because it hasn't been extensively battle-tested. It also won't help with full dbt run, but could stand to help with dbt run -s specific_model if you materialize your models across many different projects/datasets. I could foresee a future where we want to limit caching to the selected resources only by default.

kevinhoe commented 2 years ago

Hi @jtcohen6 - 🙏 thanks for your time and support here! Appreciate the idea about using cache_selected_only... reading up on that now and wondering if that flag also works with dbt compile and/or dbt docs generate [--no-compile]? FYI - we're presently running a standalone dbt Cloud job for generating docs that does not include the dbt run command that the "getdbt docs" link references.

I have some additional learnings to share regarding workarounds for our...workarounds 😅.

Recently, our "initial workaround" to generating docs was as follows (dbt Cloud v1.2 w/BigQuery):

As our model count has continued to grow, the above workaround (of splitting the compile and docs generate steps) started failing, so we adjusted the thread count down to 8. This worked for a day or two (with the compile step completing in 50+ minutes).

But then we continued to hit memory limits in dbt Cloud again...and so here's the "latest" workaround for getting docs to complete successfully:

Some followup thoughts/questions:

  1. It was great to learn that the compile command takes the arguments --select, --exclude, and --selector: https://docs.getdbt.com/reference/node-selection/syntax
  2. Thus using dbt compile with --select ____ will still enable us to generate the manifest.json file that's needed to serve the docs online. BUT I'm not 100% sure if this workaround is comprehensive for building the full manifest.json?
  3. Curious to know if we can also pass the --select or --exclude arguments to the command dbt docs generate?
  4. Once served, our docs are now crashing the Chrome browser much more frequently.

Many thanks and interested to hear if folks have any questions/comments!

jtcohen6 commented 2 years ago

@kevinhoe Thanks for all those details! The config I linked is "global," so it should be supported for all commands, if not as a CLI flag in the place you expect then at least as an env var.

Out of curiosity:

Obviously a simple workaround is to use a dataset that is not that crowded.

Agreed! Christophe put this well: we encourage materializing dbt models in namespaces (schemas/datasets) entirely owned by dbt, containing only dbt-produced objects. But I also appreciate that, as dbt projects mature and get larger, "only dbt-produced objects" may still number in the hundreds/thousands.

Just to keep ourselves organized, let's keep this issue specific to the slowdown encountered during caching, when dbt uses the "list tables API" at the start of compile, run, and other commands. (This includes docs generate since it includes a compile step, by default, but not if you add --no-compile.) Your workaround makes sense, insofar as DBT_CACHE_SELECTED_ONLY=1 dbt compile --select one_model would only run caching queries for the schema containing one_model, and dbt docs generate --no-compile would avoid running caching queries entirely (since no compile step). DBT_CACHE_SELECTED_ONLY=1 dbt docs generate --select one_model should get the same job done in a single command.

Let's keep https://github.com/dbt-labs/dbt-bigquery/issues/115 focused on the scaling limits encountered during docs generate. I think it's the same basic phenomenon at play—dbt is running catalog queries that are hitting datasets with tons of objects in them, thereby loading too many of them into memory—something we could look to limit by trimming down the exact tables needing cataloging: https://github.com/dbt-labs/dbt-core/issues/4997. (We cannot take the same table-filtered approach for caching, because dbt caches schemas on an "all or nothing" basis: if the schema is cached, and a specific table supposedly residing in that schema is missing from the cache, dbt assumes it doesn't exist. The stakes for missing a catalog entry are much lower.)
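
To make that distinction concrete, a simplified illustration follows; this is not dbt-core's actual cache code, just the shape of the argument:

# Simplified illustration, not dbt-core's actual cache implementation: once a
# schema is cached, a relation missing from the cached set is treated as
# nonexistent, so a truncated table listing would silently "lose" relations.
from typing import Dict, Optional, Set

class RelationCache:
    def __init__(self) -> None:
        self.cached_schemas: Set[str] = set()
        self.tables_by_schema: Dict[str, Set[str]] = {}

    def populate_schema(self, schema: str, tables: Set[str]) -> None:
        # Caching is "all or nothing": listing the schema records every table at once.
        self.cached_schemas.add(schema)
        self.tables_by_schema[schema] = set(tables)

    def relation_exists(self, schema: str, table: str) -> Optional[bool]:
        if schema not in self.cached_schemas:
            return None  # unknown: fall back to querying the warehouse
        # Cached schema: a miss means "does not exist", which is why an
        # incomplete listing is riskier here than a missing catalog entry.
        return table in self.tables_by_schema[schema]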

Separately, we're aware of the scaling limits encountered using dbt-docs (https://github.com/dbt-labs/dbt-docs/issues/170), which is a static site that has to load the full contents of manifest.json + catalog.json into the browser. Approximately how big (in bytes) are your manifest and catalog?

kevinhoe commented 2 years ago

@jtcohen6 - thank you for the extremely thorough and helpful response! Apologies for the delay from my end 🙏

  1. Thanks for confirming that the config you linked is "global".

  2. Regarding your question about the size/scope of our dbt instance:

    • I've shared some concrete numbers with folks at dbt Labs via support tickets, etc. Would be happy to connect offline regarding specifics, but as a proxy, our manifest.json file is ~40MB and our catalog.json is also ~40MB.
  3. Regarding the best practice of setting up dedicated BigQuery namespaces for "only dbt-produced objects":

    • We have a couple of different BigQuery Projects/Datasets where we're materializing the dbt models; and most of the content is dbt specific (if not all of the content). Though as you alluded to, the sheer volume of our dbt models is substantial (and growing).
    • Also, I'm a bit unclear as to whether this next detail plays a role in the caching challenges... but our dbt models in "BigQuery Project XYZ" reference many, many, many upstream tables from "BigQuery Project ABC" (which is not a dbt project). Not only are we referencing many tables from "BigQuery Project ABC," but "BigQuery Project ABC" has even more tables that we are not even using (magnitudes more).
    • Definitely not ideal! But something we're working through given the legacy backend tables which will take time to be refactored (away from date-sharded to partitioned/clustered).
  4. Thanks for also confirming the workaround I shared. It's good to know about the equivalent command of DBT_CACHE_SELECTED_ONLY=1 dbt docs generate --select one_model. The workaround has been reliably and successfully generating our docs with all the requisite components.

  5. Also helpful to know that the "table-filtered approach" is neither possible nor the right solution for the caching slowness.

  6. Appreciate the heads up about https://github.com/dbt-labs/dbt-docs/issues/170. Will follow that as well.

Thank you again for all your time and insights! Happy to help beta test any ideas, too.

Fleid commented 1 year ago

I'm having a hard time wrapping my head around the overall caching performance topic. Should we tackle it case-by-case, for each adapter, since it's mostly an optimization conversation? Or should we do it top-down, re-designing the entire workflow of caching across the board?

In any case, I will re-label this specific issue as an enhancement rather than a bug.

It's as much a statement on whether, in this specific instance, performance is a must-have or a nice-to-have, as it is an acknowledgement that there is no easy fix here.

comp615 commented 1 year ago

Not sure how to help push on this or if anything will come of it, but I wanted to add another data point and see if I can help encourage further research here. I'm not sure why our performance is so bad, so hopefully some of this helps.

In CI we run:

dbt --cache-selected-only build --target staging --select state:modified+ --store-failures --fail-fast  --exclude tag:limited
dbt --cache-selected-only docs generate --target staging --no-compile

I created a sample PR with just a trivial change in a leaf node, and this results in the following output:

00:15:21  Running with dbt=1.2.2
00:17:37  Found 1968 models, 3505 tests, 1 snapshot, 2 analyses, 1050 macros, 0 operations, 25 seed files, 743 sources, 8 exposures, 0 metrics
00:21:30  
00:21:32  Concurrency: 12 threads (target='staging')
00:21:32  
00:21:32  1 of 1 START table model mixer_9380.xpt_table_test ........ [RUN]
00:21:39  1 of 1 OK created table model mixer_9380.xpt_table_test ... [CREATE TABLE (137.6k rows, 14.1 MB processed) in 6.24s]
00:21:39  
00:21:39  Finished running 1 table model in 0 hours 0 minutes and 8.95 seconds (8.95s).
00:21:40  
00:21:40  Completed successfully
00:21:40  
00:21:40  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
00:21:49  Running with dbt=1.2.2
00:22:53  Building catalog
00:29:29  Catalog written to /workspace/dbt/target/catalog.json

I'm trying to debug and understand what happens:

Not shown: our CI copies the manifest and run_results from the previous prod run. We do have 120 +schema: usages and ~126 sources.yml files. So based on what others have written, it seems like spanning so many datasets may be problematic? I tried --cache-selected-only but didn't really see any gains from it.
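
One rough way to confirm where the time goes, assuming the google-cloud-bigquery client and a hypothetical project ID (this bypasses dbt entirely), is to time the raw table listing for each dataset:

# Rough diagnostic sketch: times the raw table listing per dataset so you can
# see whether one or two crowded datasets dominate dbt's cache-building step.
import time
from google.cloud import bigquery

client = bigquery.Client()

def time_dataset_listings(project: str) -> None:
    for dataset in client.list_datasets(project=project):
        start = time.monotonic()
        table_count = sum(1 for _ in client.list_tables(dataset.reference))
        elapsed = time.monotonic() - start
        print(f"{dataset.dataset_id}: {table_count} tables in {elapsed:.1f}s")

# time_dataset_listings("my-gcp-project")  # hypothetical project ID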

github-actions[bot] commented 11 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 11 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.