dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-1153] [Feature] BigQuery Incremental - `on_schema_change: ignore` cannot ignore missing columns #1102

Open awong-evoiq opened 2 years ago

awong-evoiq commented 2 years ago

Is this a new bug in dbt-core?

Current Behavior

"If you remove a column from your incremental model, and execute a dbt run, this column will not be removed from your target table." https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#default-behavior

It doesn't remove the column, but it doesn't succeed in loading the other columns either. The generated SQL tries to insert all of the columns from the target table; when it can't find a column in the source table, it fails with Query error: Unrecognized name: <deleted_col>.

Expected Behavior

It should only insert the columns that are defined in the source table. BigQuery will leave any extra columns in the target schema as null (unless they are REQUIRED, in which case the insert should fail).
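For illustration (hypothetical project, dataset, and column names, based on the config and log output below): if the target has `test` and `missing_col` but the source only has `test`, a merge built only from the source's columns would still be valid BigQuery SQL:

-- sketch of the expected DML, built only from columns present in the source;
-- `missing_col` exists on the target but is simply left untouched (NULL for new rows)
merge into `my_proj`.`my_dataset`.`my_model` as DBT_INTERNAL_DEST
using (select * from `my_proj`.`my_dataset`.`my_model__dbt_tmp`) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.`id_uuid` = DBT_INTERNAL_DEST.`id_uuid`
when matched then update set
    `test` = DBT_INTERNAL_SOURCE.`test`
when not matched then insert
    (`id_uuid`, `test`)
values
    (`id_uuid`, `test`)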

Steps To Reproduce

For an existing BigQuery table, use a materialized='incremental' config that does not have all the target columns in the source data.

E.g.

{{
    config(
        materialized='incremental',
        unique_key='id_uuid',
        on_schema_change='ignore',
    )
}}

or

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
          "field": "snapshot_ts",
          "data_type": "timestamp",
          "granularity": "hour"
        },
        on_schema_change='ignore',
    )
}}

Then execute dbt run.
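As a concrete sketch (hypothetical model and column names), the failing state looks like a model whose select no longer includes a column the target table already has:

-- models/example/my_model.sql (hypothetical): the target table was built
-- with a `missing_col` column that a later revision dropped from the select
{{
    config(
        materialized='incremental',
        unique_key='id_uuid',
        on_schema_change='ignore',
    )
}}

select
    id_uuid,
    test
    -- missing_col was removed here; the target table still has it
from {{ ref('stg_events') }}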

Relevant log output


16:29:37  Database Error in model <my_model> (models/example/<my_model>.sql)
16:29:37    Name <missing_col> not found inside DBT_INTERNAL_SOURCE at [70:444]
16:29:37    compiled SQL at target/run/<my_proj>/models/example/<my_model>.sql

target/run/<my_proj>/models/example/<my_model>.sql

...
when matched then update set
    `test` = DBT_INTERNAL_SOURCE.`test`, 
    `missing_col` = DBT_INTERNAL_SOURCE.`missing_col`
when not matched then insert
    (`test`, `missing_col`)
values
    (`test`, `missing_col`)

Environment

- OS: Debian GNU/Linux 11 (bullseye)
- Python: 3.9.13
- dbt: 1.2.1

Which database adapter are you using with dbt?

dbt-bigquery

Additional Context

We have per-customer data pipelines, which are all meant to ingest customer data into a unified schema. Since each customer's upstream data looks different, we are writing different dbt models for each. This is making it hard to maintain a unified schema across customers, so we want to manage the schema outside of dbt and use dbt's "incremental" materialization to append snapshots of data to an existing table. We are ingesting to BigQuery.

We'd like to be able to add a column to our BigQuery schemas without breaking the dbt pipelines, but we can't, because adding a column to the target schema causes the same mismatch as removing a column from the source schema.

mbarugelCA commented 1 year ago

I am seeing this same behavior in Redshift.

For further reference, if you set on_schema_change='append_new_columns', the materialization works just fine even if there are columns in the target table that do not exist in the source.
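e.g. a config along these lines (a sketch, reusing the hypothetical unique key from the original report) runs cleanly even when the target has columns the source lacks:

{{
    config(
        materialized='incremental',
        unique_key='id_uuid',
        on_schema_change='append_new_columns',
    )
}}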

Fleid commented 1 year ago

I can reproduce this. After spending some time looking at the code, I feel like the actual descriptions of the options are:

| on_schema_change | In case of new columns... | In case of removed columns... |
| --- | --- | --- |
| ignore | columns ignored | run failed |
| fail | run failed | run failed |
| append_new_columns | columns added | columns ignored |
| sync_all_columns | columns added | columns removed |

@jtcohen6 it may just be that ignore is mislabeled and should really be original? It looks like everybody took the original behavior for granted, but now that the doc isn't there anymore, it's hard to say looking at it from the future ;) Do you remember from back then?


Anyway, in incremental.sql, we call process_schema_changes to get the target schema, and if we get nothing we pull the existing one (with the removed columns). https://github.com/dbt-labs/dbt-core/blob/e8da84fb9e177d9eee3d7722d3d6906bb283183d/core/dbt/include/global_project/macros/materializations/models/incremental/incremental.sql#L45-L49
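Roughly paraphrased (not verbatim from that commit):

{# paraphrase of the incremental.sql logic linked above: when
   process_schema_changes returns nothing (the ignore case), dest_columns
   falls back to the target's existing columns, removed ones included #}
{% set dest_columns = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
{% if not dest_columns %}
    {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
{% endif %}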

In on_schema_change.sql, we abandon ship if the option is ignore:

https://github.com/dbt-labs/dbt-core/blob/7efb6ab62dae3ea9da4e6e28e9a30aa4e0e74919/core/dbt/include/global_project/macros/materializations/models/incremental/on_schema_change.sql#L102-L106
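Which is roughly (paraphrased):

{# paraphrase of process_schema_changes: with 'ignore' it returns early,
   so no reconciled column list is ever computed #}
{% if on_schema_change == 'ignore' %}
    {{ return({}) }}
{% endif %}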

So yes, ignore will result in a DML statement being built with dest_columns listing the removed columns.


Speaking of the DML statement, looking at the incremental merge (through 3 layers of dispatch): https://github.com/dbt-labs/dbt-core/blob/f841a7ca76b44429eb94174bc5d0c2fecbf2763b/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L40-L44

We're looping on update_columns which comes from:

https://github.com/dbt-labs/dbt-core/blob/f841a7ca76b44429eb94174bc5d0c2fecbf2763b/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L11-L12
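i.e., roughly (paraphrased):

{# paraphrase of merge.sql: unless merge_update_columns is configured,
   update_columns defaults to every column in dest_columns #}
{%- set update_columns = config.get('merge_update_columns',
      default = dest_columns | map(attribute='quoted') | list) -%}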

So really just dest_columns. Note, I don't think we have documented merge_exclude_columns, only merge_update_columns.

And back to the merge, for inserted columns:

https://github.com/dbt-labs/dbt-core/blob/f841a7ca76b44429eb94174bc5d0c2fecbf2763b/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L47-L50

Which can only fail, since it tries to pull columns from the source that have been removed.
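Roughly (paraphrased), that insert clause is:

{# paraphrase of merge.sql: both the column list and the values list come
   from dest_cols_csv (built from dest_columns), so a column absent from
   DBT_INTERNAL_SOURCE makes the generated statement invalid #}
when not matched then insert
    ({{ dest_cols_csv }})
values
    ({{ dest_cols_csv }})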


I'd love a sanity check here, but my thinking is that this bug is actually a documentation clarification (ignore really is original or classic) plus an enhancement (a new mode that actually does ignore/ignore, per the table at the top).
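One possible shape for that new mode (a hypothetical sketch, not a design) would be to intersect the target's columns with the source's before building the DML:

{# hypothetical sketch: keep only target columns that also exist in the
   source temp relation, so removed columns are truly ignored #}
{% set source_names = adapter.get_columns_in_relation(tmp_relation) | map(attribute='name') | map('lower') | list %}
{% set shared_columns = [] %}
{% for col in dest_columns %}
    {% if col.name | lower in source_names %}
        {% do shared_columns.append(col) %}
    {% endif %}
{% endfor %}
{% set dest_columns = shared_columns %}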

dbeatty10 commented 1 year ago

@Fleid nice sleuth work :shipit:

Although I didn't actually sanity check this for you, I did create the following issue for the documentation clarification:

We can always yank that issue if it turns out to not be needed.

How are you thinking about accomplishing the full switch from ignore to original or classic? One PR or multiple? How would we communicate the change?

awong-evoiq commented 1 year ago

Can we make this into a feature request, then? It doesn't make much sense, semantically, that append_new_columns will ignore extra columns in the db but ignore will fail. I think the use case mentioned in the ticket is a reasonable thing to support (i.e. dbt allowing another system to manage db schemas).

dbeatty10 commented 1 year ago

@awong-evoiq I just switched this issue from a bug to a feature request. Could you take a look at the following subject line and see if you want to adjust it accordingly?

BigQuery Incremental - on_schema_change: ignore cannot ignore missing columns

@Fleid would you be willing to re-assess this through the lens of a feature request?

Fleid commented 1 year ago

So what we want is:

| on_schema_change | In case of new columns... | In case of removed columns... |
| --- | --- | --- |
| classic | columns ignored | run failed |
| ignore | columns ignored | columns ignored |
| fail | run failed | run failed |
| append_new_columns | columns added | columns ignored |
| sync_all_columns | columns added | columns removed |

I'm good for it!

sudo-pradip commented 12 months ago

Hello @Fleid,

We're really excited about this feature! 😄

Could you please let us know whether development is currently underway for it? Has the dbt community reached a decision on this issue and approved updates to the behavior? And if the work hasn't started yet, would it be possible for me to get involved and contribute to its development? 🚀

github-actions[bot] commented 1 month ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

suchac commented 1 month ago

Hi, we still want this feature! 🙏 🙏 🙏