dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-779] Listing tables on big datasets on every compile can be abnormally long (and incomplete?) #205

Closed github-christophe-oudar closed 11 months ago

github-christophe-oudar commented 2 years ago

Describe the bug

I was troubleshooting a 2-3 minute compile time for a small dbt project, and it turned out it contained a model that outputs to a dataset with 60K tables. Since dbt-bigquery provides list_relations_without_caching, it will use the list tables API. However, that call is slow with a lot of tables (2-3 minutes). Also, as the max results is set to 100K and there appears to be no pagination, the listing might be incomplete.
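
For context, here is a minimal sketch of what that listing looks like, assuming the google-cloud-bigquery client (this is not the actual dbt-bigquery code). The iterator pages through the API one request at a time, which is slow for a 60K-table dataset, and a max_results cap like the 100K mentioned above silently truncates any dataset with more tables than the cap.

# Minimal sketch, assuming the google-cloud-bigquery client; not the actual
# dbt-bigquery implementation.
from typing import Iterator, Optional
from google.cloud import bigquery

client = bigquery.Client()

def list_dataset_tables(project: str, dataset: str,
                        cap: Optional[int] = None) -> Iterator[str]:
    """Yield every table ID in project.dataset, up to cap results."""
    dataset_ref = bigquery.DatasetReference(project, dataset)
    # Each iteration may trigger another HTTP page fetch behind the scenes;
    # with cap set (e.g. 100K), anything past the cap is silently dropped.
    for table in client.list_tables(dataset_ref, max_results=cap):
        yield table.table_id

# Example (hypothetical project/dataset names): enumerating a 60K-table
# dataset this way takes minutes.
# tables = list(list_dataset_tables("my-project", "my_crowded_dataset"))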

Steps To Reproduce

Expected behavior

I would expect this to be faster, though I understand it's not straightforward to improve. There are a few leads:

Obviously, a simple workaround is to use a dataset that is not that crowded, especially since all dataset table listings run concurrently; but that might not be applicable for all users. Another workaround would be a config setting that assumes the table does (or does not) exist, so the scan can be skipped entirely, as sketched below.
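
A purely hypothetical sketch of that second workaround (neither the flag nor the helper exists in dbt-bigquery today; it only illustrates the idea of trusting the caller instead of scanning the dataset):

# Hypothetical sketch only: 'assume_exists' and 'resolve_relation' are made up
# for illustration and are not part of dbt-bigquery.
def resolve_relation(adapter, database: str, schema: str, identifier: str,
                     assume_exists: bool = False):
    """Return a relation, optionally skipping the slow dataset listing."""
    if assume_exists:
        # Trust the configuration: build the relation object directly from its
        # coordinates instead of listing every table in the dataset.
        return adapter.Relation.create(
            database=database, schema=schema, identifier=identifier
        )
    # Default path: existing behavior, which relies on the cache populated by
    # list_relations_without_caching (the slow scan discussed above).
    return adapter.get_relation(database, schema, identifier)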

System information

The output of dbt --version:

Core:
  - installed: 1.2.0-a1
  - latest:    1.1.1    - Ahead of latest version!

I'm using dbt-bigquery 1.2.0-b1 (but I had the issue with 1.1.1 as well).

The operating system you're using: MacOS

The output of python --version: Python 3.10.4

kevinhoe commented 2 years ago

Hi @github-christophe-oudar - thanks for opening this issue. We are experiencing the same slowness with dbt Cloud v1.0 with BigQuery... actually even worse, since it takes us 35 minutes to compile 😢. Regarding your comment:

Also as the max results is set to 100K and it looks like there is no pagination, it might be incomplete.

...I believe that explains the log message we have started to see in dbt Cloud when we compile:

[screenshot: dbt compile run results]

I found your issue thanks to Jeremy Yeo (dbt Labs), who has been helping on https://github.com/dbt-labs/dbt-bigquery/issues/115, which is about dbt docs generate taking a long time for BigQuery. As I note in that ticket, we are experiencing very long processing times with both the compile and docs generate [--no-compile] (i.e., catalog) commands. For context, we have many BigQuery projects that each have many datasets with many tables that we scan/union.

Wondering if the folks at dbt Labs can point us in the right direction for overriding any macros in the compile step? I'd be interested to know if we can exclude certain dbt directories from being compiled (or even exclude them from docs generation)?

jtcohen6 commented 2 years ago

This issue deserves a thorough look from the relevant team.

Just popping in to offer, as a potential mechanism for temporary relief, you could look into trying out the cache_selected_only config: https://docs.getdbt.com/reference/global-configs#cache-database-objects-for-selected-resource

This is marked "experimental" because it hasn't been extensively battle-tested. It also won't help with full dbt run, but could stand to help with dbt run -s specific_model if you materialize your models across many different projects/datasets. I could foresee a future where we want to limit caching to the selected resources only by default.

kevinhoe commented 2 years ago

Hi @jtcohen6 - 🙏 thanks for your time and support here! Appreciate the idea about using cache_selected_only... reading up on that now and wondering if that flag also works with dbt compile and/or dbt docs generate [--no-compile]? FYI - we're presently running a standalone dbt Cloud job for generating docs that does not include the dbt run command that the "getdbt docs" link references.

I have some additional learnings to share regarding workarounds for our...workarounds 😅.

Recently, our "initial workaround" to generating docs was as follows (dbt Cloud v1.2 w/BigQuery):

As our model count has continued to grow, the above workaround (of splitting the compile and docs generate steps) started failing, so we adjusted the thread count down to 8. This worked for a day or two (with the compile step completing in 50+ minutes).

But then we continued to hit memory limits in dbt Cloud again...and so here's the "latest" workaround for getting docs to complete successfully:

Some followup thoughts/questions:

  1. It was great to learn that the compile command takes the arguments --select, --exclude, and --selector: https://docs.getdbt.com/reference/node-selection/syntax
  2. Thus using dbt compile with --select ____ will still enable us to generate the manifest.json file that's needed to serve the docs online. BUT I'm not 100% sure if this workaround is comprehensive for building the full manifest.json?
  3. Curious to know if we can also pass the --select or --exclude arguments to the command dbt docs generate?
  4. Once served, our docs are now crashing the Chrome browser much more frequently.

Many thanks and interested to hear if folks have any questions/comments!

jtcohen6 commented 2 years ago

@kevinhoe Thanks for all those details! The config I linked is "global," so it should be supported for all commands, if not as a CLI flag in the place you expect then at least as an env var.

Out of curiosity:

Obviously a simple workaround is to use a dataset that is not that crowded.

Agreed! Christophe put this well: we encourage materializing dbt models in namespaces (schemas/datasets) entirely owned by dbt, containing only dbt-produced objects. But I also appreciate that, as dbt projects mature and get larger, "only dbt-produced objects" may still number in the hundreds/thousands.

Just to keep ourselves organized, let's keep this issue specific to the slowdown encountered during caching, when dbt uses the "list tables API" at the start of compile, run, and other commands. (This includes docs generate since it includes a compile step, by default, but not if you add --no-compile.) Your workaround makes sense, insofar as DBT_CACHE_SELECTED_ONLY=1 dbt compile --select one_model would only run caching queries for the schema containing one_model, and dbt docs generate --no-compile would avoid running caching queries entirely (since no compile step). DBT_CACHE_SELECTED_ONLY=1 dbt docs generate --select one_model should get the same job done in a single command.

Let's keep https://github.com/dbt-labs/dbt-bigquery/issues/115 focused on the scaling limits encountered during docs generate. I think it's the same basic phenomenon at play—dbt is running catalog queries that are hitting datasets with tons of objects in them, thereby loading too many of them into memory—something we could look to limit by trimming down the exact tables needing cataloging: https://github.com/dbt-labs/dbt-core/issues/4997. (We cannot take the same table-filtered approach for caching, because dbt caches schemas on an "all or nothing" basis: if the schema is cached, and a specific table supposedly residing in that schema is missing from the cache, dbt assumes it doesn't exist. The stakes for missing a catalog entry are much lower.)
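
To make that distinction concrete, a simplified illustration follows; this is not dbt-core's actual cache code, just the shape of the argument:

# Simplified illustration, not dbt-core's actual cache implementation: once a
# schema is cached, a relation missing from the cached set is treated as
# nonexistent, so a truncated table listing would silently "lose" relations.
from typing import Dict, Optional, Set

class RelationCache:
    def __init__(self) -> None:
        self.cached_schemas: Set[str] = set()
        self.tables_by_schema: Dict[str, Set[str]] = {}

    def populate_schema(self, schema: str, tables: Set[str]) -> None:
        # Caching is "all or nothing": listing the schema records every table at once.
        self.cached_schemas.add(schema)
        self.tables_by_schema[schema] = set(tables)

    def relation_exists(self, schema: str, table: str) -> Optional[bool]:
        if schema not in self.cached_schemas:
            return None  # unknown: fall back to querying the warehouse
        # Cached schema: a miss means "does not exist", which is why an
        # incomplete listing is riskier here than a missing catalog entry.
        return table in self.tables_by_schema[schema]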

Separately, we're aware of the scaling limits encountered using dbt-docs (https://github.com/dbt-labs/dbt-docs/issues/170), which is a static site that has to load the full contents of manifest.json + catalog.json into the browser. Approximately how big (in bytes) are your manifest and catalog?

kevinhoe commented 2 years ago

@jtcohen6 - thank you for the extremely thorough and helpful response! Apologies for the delay from my end 🙏

  1. Thanks for confirming that the config you linked is "global".

  2. Regarding your question about the size/scope of our dbt instance:

    • I've shared some concrete numbers with folks at dbt Labs via support tickets, etc. Would be happy to connect offline regarding specifics, but as a proxy, our manifest.json file is ~40MB and our catalog.json is also ~40MB.
  3. Regarding the best practice of setting up dedicated BigQuery namespaces for "only dbt-produced objects":

    • We have a couple of different BigQuery Projects/Datasets where we're materializing the dbt models; and most of the content is dbt specific (if not all of the content). Though as you alluded to, the sheer volume of our dbt models is substantial (and growing).
    • Also, I'm a bit unclear as to whether this next detail plays a role in the caching challenges... but our dbt models in "BigQuery Project XYZ" reference many, many, many upstream tables from "BigQuery Project ABC" (which is not a dbt project). Not only are we referencing many tables from "BigQuery Project ABC," but "BigQuery Project ABC" has even more tables that we are not even using (magnitudes more).
    • Definitely not ideal! But something we're working through given the legacy backend tables which will take time to be refactored (away from date-sharded to partitioned/clustered).
  4. Thanks for also confirming the workaround I shared. It's good to know about the equivalent command of DBT_CACHE_SELECTED_ONLY=1 dbt docs generate --select one_model. The workaround has been reliably and successfully generating our docs with all the requisite components.

  5. Also helpful to know that the "table-filtered approach" is neither possible nor the right solution for the caching slowness.

  6. Appreciate the heads up about https://github.com/dbt-labs/dbt-docs/issues/170. Will follow that as well.

Thank you again for all your time and insights! Happy to help beta test any ideas, too.

Fleid commented 1 year ago

I'm having a hard time wrapping my head around the overall caching performance topic. Should we tackle it case-by-case, for each adapter, since it's mostly an optimization conversation? Or should we do it top-down, re-designing the entire workflow of caching across the board?

In any case, I will re-label this specific issue as an enhancement rather than a bug.

It's as much a statement on whether, in this specific instance, performance is a must-have or a nice-to-have, as it is an acknowledgement that there is no easy fix here.

comp615 commented 1 year ago

Not sure how to help push on this or if anything will come of it, but I wanted to add another data point and see if I can help encourage further research here. I'm not sure why our performance is so bad, so hopefully some of this helps.

In CI we run:

dbt --cache-selected-only build --target staging --select state:modified+ --store-failures --fail-fast  --exclude tag:limited
dbt --cache-selected-only docs generate --target staging --no-compile

I created a sample PR with just a trivial change in a leaf node, and this results in the following output:

00:15:21  Running with dbt=1.2.2
00:17:37  Found 1968 models, 3505 tests, 1 snapshot, 2 analyses, 1050 macros, 0 operations, 25 seed files, 743 sources, 8 exposures, 0 metrics
00:21:30  
00:21:32  Concurrency: 12 threads (target='staging')
00:21:32  
00:21:32  1 of 1 START table model mixer_9380.xpt_table_test ........ [RUN]
00:21:39  1 of 1 OK created table model mixer_9380.xpt_table_test ... [CREATE TABLE (137.6k rows, 14.1 MB processed) in 6.24s]
00:21:39  
00:21:39  Finished running 1 table model in 0 hours 0 minutes and 8.95 seconds (8.95s).
00:21:40  
00:21:40  Completed successfully
00:21:40  
00:21:40  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
00:21:49  Running with dbt=1.2.2
00:22:53  Building catalog
00:29:29  Catalog written to /workspace/dbt/target/catalog.json

I'm trying to debug and understand what happens:

Not shown: our CI copies the manifest and run_results from the previous prod run. We do have 120 +schema: usages and ~126 sources.yml files. So based on what others have written, it seems like spanning so many datasets may be problematic? I tried --cache-selected-only but didn't really see any gains from it.
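
One rough way to confirm where the time goes, assuming the google-cloud-bigquery client and a hypothetical project ID (this bypasses dbt entirely), is to time the raw table listing for each dataset:

# Rough diagnostic sketch: times the raw table listing per dataset so you can
# see whether one or two crowded datasets dominate dbt's cache-building step.
import time
from google.cloud import bigquery

client = bigquery.Client()

def time_dataset_listings(project: str) -> None:
    for dataset in client.list_datasets(project=project):
        start = time.monotonic()
        table_count = sum(1 for _ in client.list_tables(dataset.reference))
        elapsed = time.monotonic() - start
        print(f"{dataset.dataset_id}: {table_count} tables in {elapsed:.1f}s")

# time_dataset_listings("my-gcp-project")  # hypothetical project ID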

github-actions[bot] commented 11 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 11 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.