Closed: virgilp closed this issue 11 months ago
It's a fair question whether it's faster to run `show terse schemas`, versus just running `create schema if not exists`. If you're running many models, the latter might need to run many times; but if you're only running one model, it would run once, at most, and probably as a faster query!
I think that would look like a change to the `create_schemas` method in dbt-core. If `len(required_schemas) == 1`, then just jump straight to `create schema if not exists`, and skip the whole introspective step.
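A minimal sketch of that branch (names here are illustrative, not the actual dbt-core code: `adapter.execute` and `adapter.list_schemas` stand in for the real adapter calls, and the real `create_schemas` is structured differently):

```python
def create_schemas(adapter, required_schemas):
    """Illustrative sketch only, not the actual dbt-core implementation.

    With a single required schema, skip the introspective listing and
    issue CREATE SCHEMA IF NOT EXISTS directly; otherwise, list the
    existing schemas once and create only the missing ones.
    """
    if len(required_schemas) == 1:
        schema = next(iter(required_schemas))
        adapter.execute(f"create schema if not exists {schema}")
        return
    # Introspective path: one listing query, then one CREATE per missing schema.
    existing = set(adapter.list_schemas())
    for schema in sorted(set(required_schemas) - existing):
        adapter.execute(f"create schema if not exists {schema}")
```

Since `CREATE SCHEMA IF NOT EXISTS` is idempotent, the single-schema fast path needs no existence check at all.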
(An older issue in a similar vein, since it relates to the caching queries that run at startup, which tend to be among the more expensive: https://github.com/dbt-labs/dbt-snowflake/issues/83)
FYI in my tests, it was about an order of magnitude faster to run `CREATE SCHEMA IF NOT EXISTS` than to run a select against INFORMATION_SCHEMA. So your test should probably be something like `if len(required_schemas) > 10`... and even then, it depends a lot on the actual number of schemas in your database (if you have thousands of unrelated schemas, then the `show terse` will actually be way slower, not just one order of magnitude). The variation will also be larger: `show terse schemas` can take tens of seconds in the outlier case. That doesn't happen often, but when it does, it's annoying.
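In other words, the cutover between the two paths could be a small heuristic rather than a hard `== 1`. A sketch, where the threshold of 10 is just a ballpark derived from the rough order-of-magnitude timings above, not a measured constant:

```python
# Ballpark cutover: below this, issuing one CREATE SCHEMA IF NOT EXISTS
# per schema tends to beat a single introspective metadata query.
# (Illustrative value; the right threshold should come from benchmarks.)
INTROSPECTION_THRESHOLD = 10

def should_introspect(required_schemas):
    """Return True when listing existing schemas first is likely cheaper."""
    return len(required_schemas) > INTROSPECTION_THRESHOLD
```

The threshold would need tuning per warehouse, since (as noted above) the cost of `show terse schemas` scales with the total number of schemas in the database, not with the number dbt needs.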
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Is this your first time submitting a feature request?
Describe the feature
dbt is fairly slow in general on Snowflake (especially for small datasets), and this is particularly annoying at startup. This is largely because dbt issues a lot of metadata queries, and metadata queries are orders of magnitude slower in Snowflake than "direct queries", even when those direct queries would need to look at the same metadata internally. (I guess this is due to how Snowflake implements things: they chose to make explicit queries against INFORMATION_SCHEMA much slower.)
Take for example this dbt run: 26s total runtime, 2s of "useful" runtime. This was an extreme case, but it always takes a few seconds just to start. Why? Here's a hint (I modified dbt to print the queries as they are issued).
The first thing it does is `create_schemas()`: to see which schemas already exist, it lists the existing schemas in the database. We can genuinely have A TON of schemas in the database, and this generally works fine with Snowflake, but it makes dbt startup slow (also, if we were to have more than 10k schemas, which is not completely unimaginable, dbt would break down).
The thing is, it's completely avoidable: `create_schemas()` could simply issue `CREATE SCHEMA IF NOT EXISTS`. There's no need to use "reflection" / list all the schemas just to find out whether the one we're interested in exists.
Describe alternatives you've considered
No response
Who will this benefit?
Everybody using dbt with Snowflake for development. It would make dbt builds much faster, especially for small datasets.
Are you interested in contributing this feature?
Maybe, but I don't have a lot of time right now.
Anything else?
In general, it's much faster for Snowflake to attempt to execute a query than to query the information schema explicitly. Unfortunately, from a quick glance, the dbt implementation seems heavily skewed towards querying information schemas, so I don't think this is easily fixable in dbt-snowflake alone (and I'm not sure anybody would want to "fix"/change the approach in dbt-core).