dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Package does not create the schema in Snowflake when the schema does not exist #100

Closed: alexanderstautner closed this issue 1 year ago

alexanderstautner commented 3 years ago

Describe the bug

When I reference, in the sources block of spec.yml, a schema that does not yet exist in the Snowflake database and then run dbt run-operation stage_external_sources, I get a "schema does not exist or not authorized" error.
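
For illustration, a minimal spec.yml of the kind that triggers this (the source, schema, stage, and file format names here are hypothetical):

sources:
  - name: raw_events
    database: my_db
    schema: external_landing    # this schema does not yet exist in Snowflake
    tables:
      - name: events
        external:
          location: "@my_db.public.my_stage"
          file_format: "( type = parquet )"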

Expected results

As with dbt models: when I name a schema that does not yet exist, the schema is created automatically.

Actual results

An error stating that the schema does not exist or is not authorized.

Which database are you using dbt with? Snowflake

The output of dbt --version: 0.19.1

jtcohen6 commented 3 years ago

@alexanderstautner Thanks for opening; it's worth revisiting this question.

Initially, I did not think it ought to be the responsibility of this package to create the schema, for a couple of reasons: this package treats external tables as sources, and sources presuppose some advance setup performed outside of dbt; and on Redshift, an external schema requires its own dedicated setup (create external schema), which a plain create schema statement cannot stand in for.

In any event, I could be in favor of automating create schema if not exists on Snowflake, BigQuery, Spark, etc., while keeping the current setup requirement in place for Redshift.
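
On Snowflake and BigQuery, for example, the statement in question is a one-liner and idempotent (schema name hypothetical):

-- safe to run on every invocation; a no-op if the schema already exists
create schema if not exists my_db.external_landing;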

alexanderstautner commented 3 years ago

@jtcohen6 Thanks for your answer. I see your point about Redshift, and I see your point about treating external tables as sources.

So, to recapitulate: the package has two natures, depending on the database system. There are database systems like Redshift where it behaves like a pure source: the external table and the schema must be created outside dbt, and the package only references an existing object.

And then there are systems like Snowflake, where the package creates the external table itself, much like a normal table or view. There, at least, I would expect the schema to be handled as it is in dbt run: created if it does not already exist.

So your suggestion would be to do exactly that: treat it as a source on database systems where it is an external source, and treat it like dbt run on database systems where creating schemas is possible?

If so, this would be a great feature for me, provided it is documented properly for the different database systems.

jtcohen6 commented 3 years ago

@alexanderstautner I'd say that this package takes the opinionated view that external tables should always be treated like sources, and not like models. To that end, there's likely to be some advance setup required (e.g. a Redshift external schema, a Snowflake external stage). On other data technologies (e.g. Apache Spark), there's a much narrower distinction between an "external table" and a "managed table." Even then, though, this package still encourages thinking about them separately: external tables are sources (no business logic, just pointers to data in storage), as opposed to models (opinionated, logic-bearing transformations).
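
As an example of that advance setup on Snowflake: an external stage must already exist before the package can lay external tables over it. A minimal sketch, with hypothetical bucket and stage names, and credentials / storage integration omitted:

-- one-time setup, performed outside this package
create stage if not exists my_db.public.my_stage
  url = 's3://my-bucket/events/'
  file_format = ( type = parquet );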

The question is: on databases where that advance setup does not require manual schema creation, where dbt run-operation stage_external_sources could easily include a create schema if not exists step, should it?

guillesd commented 3 years ago

Hi @jtcohen6, I actually think that adding create schema if not exists could help quite a lot. Sometimes, precisely because these are external tables coming from a particular location, you want to group them under something like a dedicated schema so that you can recognize them easily (src_external, for example). It's a bit painful to do this manually!

If you agree I can try to make a PR for this!

jtcohen6 commented 2 years ago

@guillesd Sorry for the delayed response — I'm convinced! I'd welcome a PR for this.

Let's keep this automatic schema creation limited to databases that do not require additional setup at schema creation time — which means no to Redshift, yes to Snowflake and BigQuery. What's the right approach for Spark?

alexanderstautner commented 2 years ago

Thanks for doing this; it would make life much easier on Snowflake.

guillesd commented 2 years ago

Okay, I will pick this up on Friday, @jtcohen6. For Spark it will have to be the following statement:

CREATE DATABASE IF NOT EXISTS <my_db_or_schema>;

The Hive metastore does not support schemas as a separate level of abstraction (see the Spark docs).

scarrucciu commented 2 years ago

@guillesd I am creating the schemas manually right now to organize source tables in BigQuery, so I would be very interested in this feature. Please let me know if there is any way I can help with it.

sertaco commented 2 years ago

I would be interested in this feature too. Right now we create the schema as an extra step, and this change would make the whole pipeline simpler.

pecigonzalo commented 2 years ago

@jtcohen6 Would this be as simple as adding it to get_external_build_plan in the Snowflake plugin's build plan? I can quickly craft a PR for that if that's all.
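
For reference, a sketch of how such a step might slot into the Snowflake build plan; the macro name follows the package's dispatch convention, but the body is purely illustrative, not the implementation that was eventually merged:

{% macro snowflake__get_external_build_plan(source_node) %}
    {% set build_plan = [] %}

    {# illustrative first step: ensure the target schema exists
       before any DDL runs against the external table #}
    {% do build_plan.append(
        'create schema if not exists ' ~ source_node.database ~ '.' ~ source_node.schema
    ) %}

    {# ...followed by the package's existing create / refresh
       external table steps... #}

    {% do return(build_plan) %}
{% endmacro %}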

jeremyyeo commented 1 year ago

Resolved in #167