dbt-labs / dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake
https://getdbt.com
Apache License 2.0

[ADAP-379] [Feature] Improve startup performance #528

Closed: virgilp closed this issue 11 months ago

virgilp commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

dbt is fairly slow with Snowflake in general (especially on small datasets), and this is particularly annoying at startup. This is largely because dbt issues a lot of metadata queries, and metadata queries are orders of magnitude slower in Snowflake than "direct queries" - even when those queries would need to look at the same metadata internally. (I'm guessing this is due to how Snowflake implements things: they chose to make explicit queries against INFORMATION_SCHEMA much slower.)

Take, for example, this dbt run:

09:18:31  Running with dbt=1.4.4
09:18:31  Found 9 models, 37 tests, 0 snapshots, 0 analyses, 342 macros, 0 operations, 1 seed file, 3 sources, 0 exposures, 0 metrics
09:18:31  
09:18:55  Concurrency: 10 threads (target='cloud')
09:18:55  
09:18:55  1 of 1 START sql table model 27ec6563-d2b5-49a6-af3b-ec321f6b655d.stats ........ [RUN]
09:18:57  1 of 1 OK created sql table model 27ec6563-d2b5-49a6-af3b-ec321f6b655d.stats ... [SUCCESS 1 in 2.21s]
09:18:57  
09:18:57  Finished running 1 table model in 0 hours 0 minutes and 26.10 seconds (26.10s).
09:18:57  
09:18:57  Completed successfully
09:18:57  
09:18:57  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

26s of total runtime, 2s of "useful" runtime. Now, this was an extreme case, but it always takes a few seconds just to start. Why? Here's a hint (I modified dbt to print the queries as they are issued):

10:20:59  Running with dbt=1.4.4
10:20:59  Found 9 models, 37 tests, 0 snapshots, 0 analyses, 342 macros, 0 operations, 1 seed file, 3 sources, 0 exposures, 0 metrics
10:20:59  
ADAPTER: <dbt.adapters.snowflake.impl.SnowflakeAdapter object at 0x10703f1c0>
EXECUTE SQL 
    show terse schemas in database "MY_DATABASE"
    limit 10000

EXECUTE SQL 
    show terse objects in "MY_DATABASE"."27ec6563-d2b5-49a6-af3b-ec321f6b655d"

10:21:03  Concurrency: 1 threads (target='cloud')

The first thing it does is `create_schemas()`: in order to see which schemas already exist, it lists the existing schemas in the database. We can have A TON of schemas in the database - this generally works fine with Snowflake, but it makes the dbt startup slow (also, if we had more than 10,000 schemas, which is not completely unimaginable, dbt would break, since the listing query is capped at `limit 10000`).

The thing is, it's completely avoidable: `create_schemas()` could simply issue `CREATE SCHEMA IF NOT EXISTS`. There's no need to use "reflection", i.e. to list all the schemas, just to find out whether the one we're interested in exists.
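To make the comparison concrete, here is a minimal sketch in Python contrasting the two SQL statements. The helper names below are illustrative only - they are not dbt-core or dbt-snowflake API:

```python
# Sketch only: these helpers just build the two SQL texts being compared;
# they are not part of dbt-core or dbt-snowflake.

def list_schemas_sql(database: str, limit: int = 10_000) -> str:
    """The introspective query dbt currently issues at startup.
    Note the hard cap: databases with more schemas than `limit`
    would not be fully listed."""
    return f'show terse schemas in database "{database}" limit {limit}'


def create_schema_sql(database: str, schema: str) -> str:
    """The proposed idempotent alternative: no listing needed,
    Snowflake itself checks whether the schema already exists."""
    return f'create schema if not exists "{database}"."{schema}"'
```

With the first approach, dbt has to fetch the full schema list and diff it client-side; with the second, one statement per required schema does the existence check server-side.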

Describe alternatives you've considered

No response

Who will this benefit?

Everybody using dbt with Snowflake for development - it would make dbt builds much faster, especially on small datasets.

Are you interested in contributing this feature?

Maybe, but I don't have a lot of time right now.

Anything else?

In general, it's much faster for Snowflake to simply attempt to execute a query than to query the information schema explicitly. Unfortunately, from a quick glance, the dbt implementation is heavily skewed toward querying the information schemas - so I don't think this is easily fixable in dbt-snowflake alone (and I'm not sure anybody would want to "fix"/change the approach in dbt-core).

jtcohen6 commented 1 year ago

It's a fair question whether it's faster to run show terse schemas, or to just run create schema if not exists. If you're running many models, the latter might need to run many times; but if you're only running one model, it would run once, at most, and probably as a faster query!

I think that would look like a change to the create_schemas method in dbt-core: if len(required_schemas) == 1, jump straight to create schema if not exists and skip the whole introspective step.
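That branch might look roughly like the sketch below. The function and parameter names are hypothetical; the real `create_schemas` lives in dbt-core and is structured differently:

```python
# Sketch of the suggested shortcut, not actual dbt-core code.
# `list_existing` stands in for the "show terse schemas" round trip.

def plan_schema_creation(required_schemas, list_existing, threshold=1):
    """Return the SQL statements to run for schema setup.

    required_schemas: set of (database, schema) tuples the run needs.
    list_existing: callable returning (database, schema) tuples that exist.
    """
    if len(required_schemas) <= threshold:
        # Fast path: skip introspection entirely and emit idempotent
        # CREATE statements; Snowflake handles the existence check.
        return [
            f'create schema if not exists "{db}"."{schema}"'
            for db, schema in sorted(required_schemas)
        ]
    # Slow path: one listing query, then create only the missing schemas.
    existing = set(list_existing())
    return [
        f'create schema if not exists "{db}"."{schema}"'
        for db, schema in sorted(required_schemas - existing)
    ]
```

The fast path never calls `list_existing`, which is the whole point: for a single-model run, no metadata query is issued at all.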

(An older issue in a similar vein, since it relates to the caching queries that run at startup, which tend to be among the more expensive: https://github.com/dbt-labs/dbt-snowflake/issues/83)

virgilp commented 1 year ago

FYI, in my tests it was about an order of magnitude faster to run CREATE SCHEMA IF NOT EXISTS than to run a select against INFORMATION_SCHEMA. So your test should probably be something like if len(required_schemas) > 10... and even then, it depends a lot on the actual number of schemas in your database (if you have thousands of unrelated schemas, show terse will actually be way slower - more than one order of magnitude). The variance is also larger: show terse schemas can take tens of seconds in outlier cases; it doesn't happen often, but when it does it's annoying.
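Given that variance, any threshold would ideally be backed by measurement on your own account. A throwaway harness for comparing the two statements might look like this sketch, where `run_query` is whatever executes SQL on your Snowflake connection (it is a placeholder, not a dbt API):

```python
import statistics
import time


def time_sql(run_query, sql, repeats=5):
    """Run `sql` several times and return the median latency in seconds.

    Median rather than mean, because metadata queries have long-tail
    outliers (the tens-of-seconds cases mentioned above).
    """
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

You would then compare, e.g., `time_sql(run, 'show terse schemas in database "MY_DATABASE" limit 10000')` against `time_sql(run, 'create schema if not exists "MY_DATABASE"."probe"')` at various schema counts.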

github-actions[bot] commented 12 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 11 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.