Closed alexanderstautner closed 1 year ago
@alexanderstautner Thanks for opening, it's worth revisiting this question.
Initially, I did not think it ought to be the responsibility of this package to create the schema, for a couple of reasons: schema creation for each built-in dbt task (`run`, `seed`, `snapshot`) is defined within that task. At such time as this package's code might be integrated into a real dbt task (`dbt source stage-external`?), that's the right place to put that logic. In any event, I could be in favor of automating `create schema if not exists` on Snowflake, BigQuery, Spark, etc., while keeping the current setup requirement in place for Redshift.
@jtcohen6 Thanks for your answer. I see your point with Redshift and I see your point with the source.
So, to recapitulate: the module has two natures depending on the database system. There are database systems like Redshift where it behaves like a source: the external table and the schema must be created outside dbt, and the module only references an existing object.
And then there are systems like Snowflake, where the external table is created by the module like a normal table or view. There, at least, I would expect the schema to be handled as in `dbt run` and created if it does not already exist.
So your suggestion would be to do exactly that: treat it as a source on database systems where it's an external source, and treat it like `dbt run` on database systems where it's possible to create schemas?
If yes, then this would be a cool feature for me, as long as it's documented properly for the different database systems.
@alexanderstautner I'd say that this package takes the opinionated view that external tables should always be treated like sources, and not like models. To that end, there's likely to be some advance setup required (e.g. a Redshift schema, a Snowflake external stage). On other data technologies (e.g. Apache Spark), there's a much narrower distinction between an "external table" and a "managed table." Even then, though, this package still encourages thinking about them separately: external tables are sources (no business logic, just pointers to data in storage), while models are opinionated, logic-bearing transformations.
The question is: on databases where that advance setup does not require manual schema creation—where `dbt run-operation stage_external_sources` could easily include a `create schema if not exists` step—should it?
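Concretely, the extra step under discussion would just be a single idempotent DDL statement run before staging. A sketch, in Snowflake syntax; the `analytics` and `src_external` names below are placeholders, not part of the package:

```sql
-- Sketch only: ensure the target schema exists before staging external tables.
-- "analytics" (database) and "src_external" (schema) are placeholder names.
create schema if not exists analytics.src_external;
```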
Hi @jtcohen6, I actually think that adding `create schema if not exists` could help quite a lot. Sometimes, precisely because these are external tables coming from a particular location, you want to cluster them using something like a schema so that you can easily recognize them (`src_external`, for example). It's a bit painful to do this manually!
If you agree I can try to make a PR for this!
@guillesd Sorry for the delayed response — I'm convinced! I'd welcome a PR for this.
Let's keep this automatic schema creation limited to databases that do not require additional setup at schema creation time — which means no to Redshift, yes to Snowflake and BigQuery. What's the right approach for Spark?
Thanks for doing this, this would make life much easier on Snowflake.
Okay, I will pick this up on Friday, @jtcohen6. For Spark it will have to be the following statement:
CREATE DATABASE IF NOT EXISTS <my_db_or_schema>;
The Hive metastore does not support schemas as a separate level of abstraction (see the docs).
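Putting the two dialects together, a cross-adapter helper could look roughly like the following dbt macro. This is a hypothetical sketch, not the package's actual implementation; the macro name `ensure_external_schema` is invented for illustration:

```sql
{% macro ensure_external_schema(source_node) %}
    {# Hypothetical helper: Spark's Hive metastore has no separate schema
       level, so it uses CREATE DATABASE; other adapters that support
       automatic creation (Snowflake, BigQuery) use CREATE SCHEMA. #}
    {% if target.type == 'spark' %}
        create database if not exists {{ source_node.schema }}
    {% else %}
        create schema if not exists {{ source_node.database }}.{{ source_node.schema }}
    {% endif %}
{% endmacro %}
```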
@guillesd I am creating the schemas manually right now to organize source tables for BigQuery, so I would be very interested in this feature. Please let me know if there is any way I can help with it.
I would be interested in this feature too. Right now we are creating this as an extra step and this would make the whole pipeline simpler.
@jtcohen6 Would this be as simple as adding it to `get_external_build_plan` in the Snowflake plugin's build plan? I can quickly craft a PR for that if that's all.
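Roughly, the idea would be to prepend the schema-creation DDL to the list of statements the build plan already returns. A hedged sketch (the existing macro body is elided; this is not the package's actual code):

```sql
{% macro snowflake__get_external_build_plan(source_node) %}
    {# Sketch: start the plan with an idempotent schema-creation step,
       then let the existing create-external-table statements follow. #}
    {% set build_plan = [
        "create schema if not exists " ~ source_node.database ~ "." ~ source_node.schema
    ] %}
    {# ...the package's existing external-table DDL would be appended here... #}
    {{ return(build_plan) }}
{% endmacro %}
```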
Resolved in #167
Describe the bug
When I use a schema in the sources block of `spec.yml` that does not exist in the Snowflake database, and then run `dbt run-operation stage_external_sources`, I get a "schema not existing or not authorized" error.
Expected results
As with dbt models: when I name a schema that does not exist yet, the schema should be created.
Actual results
An error that the schema does not exist or is not authorized.
Which database are you using dbt with? Snowflake
The output of `dbt --version`: 0.19.1