dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Create an external table using the infer_schema function in Snowflake #191

Closed: PTMLSkippy closed this issue 1 year ago

PTMLSkippy commented 1 year ago

Describe the feature

I would like to be able to define an external table (via a source.yml) with an infer_schema attribute, defaulting to false. When set to true, it would generate SQL in the following format, as documented for Snowflake: https://docs.snowflake.com/en/sql-reference/functions/infer_schema#examples

    CREATE OR REPLACE EXTERNAL TABLE DB.schema.external_table
      USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(
          INFER_SCHEMA(
            LOCATION => '@my_S3_Stage',
            FILE_FORMAT => 'MY_PARQUET_File_Format',
            IGNORE_CASE => TRUE  -- another optional parameter for the infer_schema attribute
          )
        )
      )
      LOCATION = @my_S3_Stage
      FILE_FORMAT = MY_PARQUET_File_Format
      AUTO_REFRESH = true;

I would anticipate that with this new attribute set to true, you would not have to define any columns. A sketch of such a source definition follows.
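As a purely illustrative sketch of the proposal (the infer_schema and ignore_case attributes below are hypothetical here, not confirmed package options; location, file_format, and auto_refresh follow the package's existing Snowflake external config):

    sources:
      - name: my_external_source
        schema: schema
        tables:
          - name: external_table
            external:
              location: "@my_S3_Stage"
              file_format: MY_PARQUET_File_Format
              auto_refresh: true
              infer_schema: true    # proposed attribute, default false
              ignore_case: true     # optional, passed through to INFER_SCHEMA
            # no columns block would be needed when infer_schema is true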

Describe alternatives you've considered

Defining the above SQL in a model for each source table; this can then be built as either an external or a persisted table.

Additional context

Specific to Snowflake, and to certain file formats. At the time of writing, the supported file formats are Apache Parquet, Apache Avro, and ORC.

Who will this benefit?

This means schema changes (new columns) in externally sourced data can be handled easily within the dbt codebase. For example, schema changes become automatically accessible to downstream models, or are automatically processed when using a select from {{ ref('ext_table') }} in a downstream incremental model, where they are then handled by the on_schema_change: append_new_columns model config. A sketch of such a downstream model follows.
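A minimal sketch of such a downstream incremental model (names are placeholders; loaded_at is an assumed timestamp column, and a table staged by dbt-external-tables would normally be referenced via source() rather than ref()):

    {{ config(
        materialized='incremental',
        on_schema_change='append_new_columns'
    ) }}

    select *
    from {{ source('my_external_source', 'external_table') }}
    {% if is_incremental() %}
      -- only pick up rows newer than the latest already loaded
      where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}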

b-per commented 1 year ago

This should be fixed in the new 0.8.5 release.

You can check this file to see an example of inferring schemas on parquet files
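As a rough sketch of what the released configuration for inferring the schema of parquet files may look like (attribute names should be verified against the linked example file):

    sources:
      - name: my_external_source
        tables:
          - name: parquet_table
            external:
              location: "@my_S3_Stage"
              file_format: MY_PARQUET_File_Format
              infer_schema: true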