Closed github-christophe-oudar closed 11 months ago
Hi @github-christophe-oudar - thanks for opening this issue. We are experiencing the same performance slowness with dbt Cloud v1.0 with BigQuery...actually even worse performance since it takes us 35 minutes to compile 😢 . Regarding your comment:
Also as the max results is set to 100K and it looks like there is no pagination, it might be incomplete.
...I believe that explains the log message we have started to see in dbt Cloud when we compile:
I found your issue thanks to Jeremy Yeo (dbt Labs) who has been helping on https://github.com/dbt-labs/dbt-bigquery/issues/115 which is about dbt docs generate
taking a long time for BigQuery. As I note in that ticket, we are experiencing very long processing times with both the compile
and docs generate [--no-compile]
(ie, catalog
) commands. For context, we have many BigQuery projects that each have many datasets with many tables that we scan/union.
Wondering if maybe the folks at dbt Labs can point us in the right direction for how to override any macros in the compile
step? I'd be interested to know if we can exclude certain dbt directories from being compiled (or even excluded from the docs generation)?
This issue deserves a thorough look from the relevant team.
Just popping in to offer, as a potential mechanism for temporary relief, you could look into trying out the cache_selected_only
config: https://docs.getdbt.com/reference/global-configs#cache-database-objects-for-selected-resource
This is marked "experimental" because it hasn't been extensively battle-tested. It also won't help with full dbt run
, but could stand to help with dbt run -s specific_model
if you materialize your models across many different projects/datasets. I could foresee a future where we want to limit caching to the selected resources only by default.
Hi @jtcohen6 - 🙏 thanks for your time and support here! Appreciate the idea about using cach_selected_only
...reading up on that now and wondering if that flag also works with dbt compile
and/or dbt docs generate [--no-comiple]
? FYI - we're presently running a standalone cloud job for generating docs that does not include the dbt run
command which the "getdbt docs" link references with the dbt run
command.
I have some additional learnings to share regarding workarounds for our...workarounds 😅.
Recently, our "initial workaround" to generating docs was as follows (dbt Cloud v1.2 w/BigQuery):
dbt parse
dbt compile
dbt docs generate --no-compile
The above steps worked for a while but have since hit memory limits again as our model count continues to grow. For reference, we were using 16 threads which yielded a 40 min compile time and an 8ish min docs generate time.As our model count has continued to grow, the above workaround (of splitting the compile
and docs generate
steps) started failing so the thread count was adjusted down to 8. This worked for a day or two (with the compile
step completing in 50+ mins).
But then we continued to hit memory limits in dbt Cloud again...and so here's the "latest" workaround for getting docs to complete successfully:
dbt parse
dbt compile --select [one_of_our_models]
dbt docs generate --no-compile
The compile
step now completes in less than a minute. My initial concern was that selecting only one model to compile
would yield a manifest.json
file of just that one model...but it seems that we are getting the full manifest.json
file of all our models!? (I may be overlooking something here...but my spot checks in the served docs make me think we're representing all our models in the docs still.) Thus now end-to-end, this cloud job takes about 10 minutes...and the final step to serve the docs is to use the run ID number in the url format of: https://cloud.getdbt.com/accounts/[account_number]/runs/[run_number]/docs/#!/overview
(since we are unable to sync this cloud job to the built-in "Documentation" link in the web UI's hamburger menu).Some followup thoughts/questions:
compile
command takes the arguments --select
, --exclude
, and --selector
: https://docs.getdbt.com/reference/node-selection/syntaxdbt compile
with --select ____
will still enable us to generate the manifest.json
file that's needed to serve the docs online. BUT I'm not 100% sure if this workaround is comprehensive for building the full manifest.json
?--select
or --exclude
arguments to the command dbt docs generate
?Many thanks and interested to hear if folks have any questions/comments!
@kevinhoe Thanks for all those details! The config I linked is "global," so it should be supported for all commands, if not as a CLI flag in the place you expect then at least as an env var.
Out of curiosity:
Obviously a simple workaround is to use a dataset that is not that crowded.
Agreed! Christophe put this well: we encourage materializing dbt models in namespaces (schemas/datasets) entirely owned by dbt, containing only dbt-produced objects. But I also appreciate that, as dbt projects mature and get larger, "only dbt-produced objects" may still number in the hundreds/thousands.
Just to keep ourselves organized, let's keep this issue specific to the slowdown encountered during caching, when dbt uses the "list tables API" at the start of compile
, run
, and other commands. (This includes docs generate
since it includes a compile step, by default, but not if you add --no-compile
.) Your workaround makes sense, insofar as DBT_CACHE_SELECTED_ONLY=1 dbt compile --select one_model
would only run caching queries for the schema containing one_model
, and dbt docs generate --no-compile
would avoid running caching queries entirely (since no compile step). DBT_CACHE_SELECTED_ONLY=1 dbt docs generate --select one_model
should get the same job done in a single command.
Let's keep https://github.com/dbt-labs/dbt-bigquery/issues/115 focused on the scaling limits encountered during docs generate
. I think it's the same basic phenomenon at play—dbt is running catalog queries that are hitting datasets with tons of objects in them, thereby loading too many of them into memory—something we could look to limit by trimming down the exact tables needing cataloging: https://github.com/dbt-labs/dbt-core/issues/4997. (We cannot take the same table-filtered approach for caching, because dbt caches schemas on an "all or nothing" basis: if the schema is cached, and a specific table supposedly residing in that schema is missing from the cache, dbt assumes it doesn't exist. The stakes for missing a catalog entry are much lower.)
Separately, we're aware of the scaling limits encountered using dbt-docs
(https://github.com/dbt-labs/dbt-docs/issues/170), which is a static site that has to load the full contents of manifest.json
+ catalog.json
into the browser. Approximately how big (in bytes) are your manifest and catalog?
@jtcohen6 - thank you for the extremely thorough and helpful response! Apologies for the delay from my end 🙏
Thanks for confirming that the config you linked is "global".
Regarding your question about the size/scope of our dbt instance:
manifest.json
file is ~40MB and our catalog.json
is also ~40MB.Regarding the best practice of setting up dedicated BigQuery namespaces for "only dbt-produced objects":
Thanks for also confirming the workaround I shared. It's good to know about the equivalent command of DBT_CACHE_SELECTED_ONLY=1 dbt docs generate --select one_model
. The workaround has been reliably and successfully generating our docs with all the requisite components.
Also helpful to know that the "table-filtered approach" is not possible nor the right solution for the caching slowness.
Appreciate the heads up about https://github.com/dbt-labs/dbt-docs/issues/170. Will follow that as well.
Thank you again for all your time and insights! Happy to help beta test any ideas, too.
I'm having a hard time wrapping my head around the overall caching performance topic. Should we tackle it case-by-case, for each adapter, since it's mostly an optimization conversation? Or should we do it top-down, re-designing the entire workflow of caching across the board?
In any case, this specific issue I will re-label as enhancement
rather than a bug
.
It's as much a statement on whether in this specific instance, performance is a must-have or a nice-to-have; that it is a acknowledgement that there is no easy fix here:
Not sure how to help push on this or if anything will come, but wanted to throw another data point on and see if I can help encourage further research here. I'm not sure why our performance is so bad, so hopefully some of this helps.
In CI we run:
dbt --cache-selected-only build --target staging --select state:modified+ --store-failures --fail-fast --exclude tag:limited
dbt --cache-selected-only docs generate --target staging --no-compile
I created a sample PR with just a trivial change in a leaf note, and this results in the following output:
00:15:21 Running with dbt=1.2.2
00:17:37 Found 1968 models, 3505 tests, 1 snapshot, 2 analyses, 1050 macros, 0 operations, 25 seed files, 743 sources, 8 exposures, 0 metrics
00:21:30
00:21:32 Concurrency: 12 threads (target='staging')
00:21:32
00:21:32 1 of 1 START table model mixer_9380.xpt_table_test ........ [RUN]
00:21:39 1 of 1 OK created table model mixer_9380.xpt_table_test ... [CREATE TABLE (137.6k rows, 14.1 MB processed) in 6.24s]
00:21:39
00:21:39 Finished running 1 table model in 0 hours 0 minutes and 8.95 seconds (8.95s).
00:21:40
00:21:40 Completed successfully
00:21:40
00:21:40 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
00:21:49 Running with dbt=1.2.2
00:22:53 Building catalog
00:29:29 Catalog written to /workspace/dbt/target/catalog.json
I'm trying to debug and understand what happens:
Not shown: our CI copies the manifest and run_results from the previous prod run. We do have 120 +schema:
usages, and ~126 sources.yml
files. So based on others reading, it seems like spanning across so many datasets may be problematic? I tried --cache-selected-only
but didn't really see any gains from that.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Describe the bug
I was troubleshooting a 2/3 min compile time for a small dbt project and it appears that it contained a model that output to a dataset that has 60K tables. As dbt-bigquery is providing list_relations_without_caching, it will use the list tables API. However the calls is slow with a lot of tables (2/3 minutes). Also as the max results is set to 100K and it looks like there is no pagination, it might be incomplete.
Steps To Reproduce
Expected behavior
I would expect the behavior to be faster. Yet I understand that it's not straightforward to improve. There are few leads:
INFORMATION_SCHEMA
to retrieve the information such as usingSELECT table_name FROM mydataset.INFORMATION_SCHEMA.TABLES WHERE table_name IN (...)
. You get the performance hit from query (2-3s overhead) and you're billed 10 MB though.Obviously a simple workaround is to use a dataset that is not that crowded. Especially since all datasets table listings are running concurrently but it might not be applicable for all users. Another workaround would be to have a setting in the config that would assume the table exists or not directly to avoid scanning.
System information
The output of
dbt --version
:I'm using dbt-bigquery 1.2.0-b1 (but had the issues as well with 1.1.1)
The operating system you're using: MacOS
The output of
python --version
: Python 3.10.4