dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
288 stars 116 forks

Add support to infer schemas on Snowflake #211

Closed b-per closed 1 year ago

b-per commented 1 year ago

Description & motivation

Fixes #139 and #191, allowing the package to infer columns for parquet files on Snowflake

This Snowflake page in the docs mentions the following code to create an external table using the schema of a parquet file:

CREATE EXTERNAL TABLE mytable
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
          LOCATION=>'@mystage',
          FILE_FORMAT=>'my_parquet_format'
        )
      )
  )
  LOCATION=@mystage
  FILE_FORMAT=my_parquet_format
  AUTO_REFRESH=false;

But with that approach, I couldn't get the DDL to work when defining partitions. Since partitions are a huge saver for external tables, that didn't look ideal.

What I have done instead is call INFER_SCHEMA in a run_query() and then loop through the returned columns. The DDL executed is actually the same as if all the columns were maintained manually.
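
For reference, the query run through run_query() looks roughly like the following sketch (it reuses the placeholder stage and file format names from the docs example above; the exact query in the macro may differ slightly):

SELECT COLUMN_NAME, TYPE
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION=>'@mystage',
      FILE_FORMAT=>'my_parquet_format'
    )
  );

The returned rows are then looped over to emit one column definition per inferred column, so the resulting CREATE EXTERNAL TABLE statement matches what a manually maintained column list would produce.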

Example of source

A new key infer_schema is introduced.

We can still list columns for documentation purposes, but if infer_schema is set to true they are not used in the DDL.

version: 2

sources:
  - name: snowflake_external
    loader: s3

    tables:

      - name: bper_parquet
        external: 
location: '@mydb.myschema.mystage'
          file_format: 'mydb.myschema.myformat'
          infer_schema: true
          partitions:
            - name: section
              data_type: varchar(64)
              expression: "substr(split_part(metadata$filename, 'section=', 2), 1, 1)"
        columns:
          - name: id
            description: this is an id
          - name: name
            description: and this is a name
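
For a source defined like the above, the generated DDL should end up roughly equivalent to the manually-listed case, along these lines (the id and name columns, their types, and the value: expressions are illustrative only, since the real ones come from INFER_SCHEMA at run time):

CREATE OR REPLACE EXTERNAL TABLE bper_parquet(
    section varchar(64) AS (substr(split_part(metadata$filename, 'section=', 2), 1, 1)),
    id number AS (value:id::number),
    name varchar AS (value:name::varchar)
)
PARTITION BY (section)
LOCATION = @mydb.myschema.mystage
FILE_FORMAT = mydb.myschema.myformat;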

Checklist

There are currently no details about inferring schemas in the README and no tests with parquet data on Snowflake.

b-per commented 1 year ago

The test failure on Redshift doesn't seem to be related to this change.

jeremyyeo commented 1 year ago

Hey @b-per - thanks for this :)

Given the nature of the change to one of the "core" macros of this package, I think it would make sense to add some tests to check the syntax?

We can probably do that by adding a few more sources to https://github.com/dbt-labs/dbt-external-tables/blob/main/integration_tests/models/plugins/snowflake/snowflake_external.yml

I've converted the partitioned CSV files from s3://dbt-external-tables-testing/csv/, which are the following:

s3://dbt-external-tables-testing/csv/section=a/people_a.csv
s3://dbt-external-tables-testing/csv/section=b/people_b.csv
s3://dbt-external-tables-testing/csv/section=c/people_c.csv
s3://dbt-external-tables-testing/csv/section=d/people_d.csv

Using:

from pyarrow import csv
from pyarrow import parquet

people = csv.read_csv("people_a.csv")
parquet.write_table(people, "people_a.parquet")

And I have uploaded those to S3, so they're available at:

s3://dbt-external-tables-testing/parquet/section=a/people_a.parquet
s3://dbt-external-tables-testing/parquet/section=b/people_b.parquet
s3://dbt-external-tables-testing/parquet/section=c/people_c.parquet
s3://dbt-external-tables-testing/parquet/section=d/people_d.parquet
b-per commented 1 year ago

@jeremyyeo Done. Let me know what you think!

I fixed the tests for all adapters. Redshift is still failing, but that is not related to my change.