duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Schema question when performing an external materialization and registering with Glue #422

Open firewall413 opened 1 month ago

firewall413 commented 1 month ago

I'm trying to understand how schema registration works when using the glue.py plugin.

You run your dbt logic, e.g. {{ config(materialized='external', location='s3://mybucket/hello', glue_register=true, ... ) }}

select 1,2,3 from source

After that, it looks like the table is materialized -> a parquet file is written to s3 -> a view is built on top of this location using select * from s3://mybucket/*/*.parquet -> columns are extracted from this view -> this schema is registered in Glue.

This works neatly when all files have the same columns.

However, when you add new columns to your materialized parquet files (which Glue/Athena supports) and save them in a new partition, the next time you run this model it still registers the old schema (likely because of the */*.parquet glob over the s3 location) and seems to ignore the new schema.
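To make the suspected cause concrete, here is a toy model in plain Python. It makes no DuckDB or Glue calls. It assumes the glob-based view takes its columns from the first file it matches, which is a guess about the cause and not something confirmed in this thread. The partition paths, column names, and function names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: file path -> ordered column names stored in that file.
# The second partition was written after a new column "d" was added.
PARTITIONS: Dict[str, List[str]] = {
    "s3://mybucket/hello/run=1/data.parquet": ["a", "b", "c"],
    "s3://mybucket/hello/run=2/data.parquet": ["a", "b", "c", "d"],
}


def schema_from_first_file(files: Dict[str, List[str]]) -> List[str]:
    """Assumed glob behavior: columns come from the first matching file."""
    first = sorted(files)[0]
    return list(files[first])


def schema_from_latest_file(files: Dict[str, List[str]]) -> List[str]:
    """Alternative: use only the most recently written partition."""
    latest = sorted(files)[-1]
    return list(files[latest])


def schema_union_by_name(files: Dict[str, List[str]]) -> List[str]:
    """Alternative: merge columns across all files, keeping first-seen order."""
    merged: List[str] = []
    for path in sorted(files):
        for col in files[path]:
            if col not in merged:
                merged.append(col)
    return merged


if __name__ == "__main__":
    print("first file :", schema_from_first_file(PARTITIONS))
    print("latest file:", schema_from_latest_file(PARTITIONS))
    print("union      :", schema_union_by_name(PARTITIONS))
```

If the first-file assumption holds, only the first strategy would miss the new column. DuckDB's read_parquet does have a union_by_name option that merges columns by name across files; whether the view built by the materialization can use it is not established here.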

Wouldn't it be better to register the schema of your last-run model? Is this a matter of reordering/adapting the macros in materializations/external.sql? Or would this be undesirable?

jwills commented 1 month ago

I thought the intended behavior was to update the schema if the columns change (viz. https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/glue.py#L328 )-- so if that isn't happening sometimes, it seems like a bug?
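As a rough illustration of what "update the schema if the columns change" could mean, here is a minimal sketch. It is not the code in glue.py; the function name and the (name, type) column representation are assumptions made only for this example.

```python
from typing import List, Tuple

# Hypothetical column representation: (name, type), both plain strings.
Column = Tuple[str, str]


def columns_changed(registered: List[Column], current: List[Column]) -> bool:
    """Return True when the registered columns differ from the current ones.

    Names and types are compared case-insensitively, and order matters.
    """
    def normalize(cols: List[Column]) -> List[Column]:
        return [(name.lower(), dtype.lower()) for name, dtype in cols]

    return normalize(registered) != normalize(current)


if __name__ == "__main__":
    old = [("a", "int"), ("b", "int"), ("c", "int")]
    new = old + [("d", "int")]
    print(columns_changed(old, old))  # no change, so no update needed
    print(columns_changed(old, new))  # a column was added, so update
```

A check like this can only detect a change if the "current" columns it is given already include the new column, so it would not help if the view feeding it still reports the old schema.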