aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
101 stars 69 forks source link

Add on_schema_change possibility #426

Closed Jeremynadal33 closed 2 months ago

Jeremynadal33 commented 3 months ago

resolves #330

Description

Adds a call to the macro process_schema_change after creating the temp relation and before merging. Similar to dbt-spark's incremental materialization.

It will allow the "ignore", "fail" and "append_new_columns" as on_schema_change strategy but not the "sync_all_columns" because the macro spark__alter_relation_add_remove_columns does not allow for dropping columns. It could because Delta allow for dropping columns if table properties are :

{
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5'

It will be possible once dbt-sprak PR #1088 is merged.

WARNING : there is still a limitation. For example, if using "ignore" strategy and removing columns, update will fail because spark__get_merge_sql uses update set * which looks for all columns.

Checklist

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

FYI : I am not sure wether this change needs testing, if it does, I would be interested in understanding how to!

Tests that were run locally

Add a simple model with incremental strategy (test_change_schema) :

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=['id'],
        file_format='delta',
        on_schema_change='append_new_columns'
    )
}}

with incoming_data as (
    select 1 as id
    union all
    select 2 as id
)

select * from incoming_data

Run it a first time and add a new columns :

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=['id'],
        file_format='delta',
        on_schema_change='append_new_columns'
    )
}}

with incoming_data as (
    select 1 as id, 'a' as name
    union all
    select 2 as id, 'b' as name
)

select * from incoming_data

Column is added.

moomindani commented 2 months ago

Thank you for your contribution!