databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Handle dbt-spark incremental on_schema_change behaviour #780

Open sp-cveeragandham opened 2 months ago

sp-cveeragandham commented 2 months ago

Describe the bug

We are using the merge incremental strategy in dbt and have set on_schema_change = "sync_all_columns" so that column additions and deletions are handled automatically. However, whenever a column deletion is detected, the spark__alter_relation_add_remove_columns macro in dbt-spark raises an exception: "Delta Lake does not support dropping columns from tables". In fact, Delta Lake does support dropping columns, provided the table has the required table properties set. To work around this, we created an override macro that does the following (a sketch is included after the list):

  1. Checks whether the required tblproperties are set and, if not, sets them.
  2. Builds an alter statement to drop the column(s) and runs it when a column deletion is detected.
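
For reference, here is a minimal sketch of such an override, written as a dbt macro. The dispatch name databricks__alter_relation_add_remove_columns and the choice to set the tblproperties unconditionally are assumptions for illustration, not the exact macro we run:

    {% macro databricks__alter_relation_add_remove_columns(relation, add_columns, remove_columns) %}

      {% if remove_columns %}
        {# Assumption: setting these tblproperties unconditionally is acceptable.
           Delta Lake requires column mapping before columns can be dropped. #}
        {% call statement('enable_column_mapping') %}
          alter table {{ relation }} set tblproperties (
            'delta.minReaderVersion' = '2',
            'delta.minWriterVersion' = '5',
            'delta.columnMapping.mode' = 'name'
          )
        {% endcall %}

        {# Drop all removed columns in a single alter statement. #}
        {% set drop_sql %}
          alter table {{ relation }} drop columns (
            {%- for column in remove_columns -%}
              {{ column.name }}{{ ', ' if not loop.last }}
            {%- endfor -%}
          )
        {% endset %}
        {% do run_query(drop_sql) %}
      {% endif %}

      {% if add_columns %}
        {# Mirror the upstream behaviour for added columns. #}
        {% set add_sql %}
          alter table {{ relation }} add columns (
            {%- for column in add_columns -%}
              {{ column.name }} {{ column.data_type }}{{ ', ' if not loop.last }}
            {%- endfor -%}
          )
        {% endset %}
        {% do run_query(add_sql) %}
      {% endif %}

    {% endmacro %}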

Filing this bug so that the error can be handled properly, perhaps via an override macro in the dbt-databricks incremental materialization.

Steps To Reproduce

  1. Create a dbt incremental model with at least two columns and set on_schema_change="sync_all_columns" in the model config:
    {{
    config(
        materialized="incremental",
        unique_key="id",
        on_schema_change="sync_all_columns",
    )
    }}
    with
    sample_data as (
        select 1 as id, 'name1' as name
        union
        select 2 as id, 'name2' as name
    )
    select *
    from sample_data
  2. Run the dbt model (e.g. dbt run --select test_on_schema_change) to materialize it in Databricks.
  3. Remove a column from the model and run it again:
    {{
    config(
        materialized="incremental",
        unique_key="id",
        on_schema_change="sync_all_columns",
    )
    }}
    with
    sample_data as (
        select 1 as id
        union
        select 2 as id
    )
    select *
    from sample_data

Expected behavior

When a column is deleted from an incremental model, we expect the column to be dropped from the target table as well. Implement a macro in dbt-databricks incrementals that overrides the default behaviour of spark__alter_relation_add_remove_columns in the dbt-spark repo. It should:

  1. Remove the exception for Delta Lake (line 406 in the dbt-spark repo).
  2. Set the required tblproperties (delta.minReaderVersion: 2, delta.minWriterVersion: 5, and delta.columnMapping.mode: name).
  3. Add an alter statement for removing columns (see the example below).
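
For illustration, these are the statements such a macro would need to issue. The schema name is a placeholder; the table and column names match the repro above:

    alter table my_schema.test_on_schema_change set tblproperties (
      'delta.minReaderVersion' = '2',
      'delta.minWriterVersion' = '5',
      'delta.columnMapping.mode' = 'name'
    );

    alter table my_schema.test_on_schema_change drop column name;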

Screenshots and log output

Error when trying to drop a column:

    Compilation Error in model test_on_schema_change (models/stage/br/test_on_schema_change.sql)
      Delta Lake does not support dropping columns from tables

      in macro spark__alter_relation_add_remove_columns (macros/adapters.sql)
      called by macro alter_relation_add_remove_columns (macros/adapters/columns.sql)
      called by macro sync_column_schemas (macros/materializations/models/incremental/on_schema_change.sql)
      called by macro process_schema_changes (macros/materializations/models/incremental/on_schema_change.sql)
      called by macro materialization_incremental_databricks (macros/materializations/incremental/incremental.sql)
      called by model test_on_schema_change (models/stage/br/test_on_schema_change.sql)

System information

Core:

Plugins:

Additional context

I found a PR that fixes this in dbt-spark, but I'm not sure what the lead time is for getting it merged and released in the dbt-spark repo.

AlexVialaBellander commented 2 months ago

Is this still a thing?

benc-db commented 2 months ago

Can you submit a PR with your macro override? I was unaware that this limitation had been lifted. I wonder which Databricks Runtime versions it's compatible with.