awong-evoiq opened this issue 2 years ago
I am seeing this same behavior in Redshift.
For further reference: if you set `on_schema_change='append_new_columns'`, the materialization works just fine even if there are columns in the target table that do not exist in the source.
I can reproduce this. After spending some time looking at the code, I feel like the actual behavior of each option is:
| on_schema_change | In case of new columns... | In case of removed columns... |
|---|---|---|
| `ignore` | columns ignored | run failed |
| `fail` | run failed | run failed |
| `append_new_columns` | columns added | columns ignored |
| `sync_all_columns` | columns added | columns removed |
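For context, these modes are selected per model via the `config` block. A minimal example (the model name and `ref` are hypothetical, but the `on_schema_change` values are the documented ones):

```sql
-- models/my_model.sql (hypothetical)
-- on_schema_change accepts 'ignore' (the default), 'fail',
-- 'append_new_columns', or 'sync_all_columns'
{{
  config(
    materialized='incremental',
    on_schema_change='append_new_columns'
  )
}}

select id, created_at
from {{ ref('stg_events') }}
```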
@jtcohen6 it may just be that `ignore` is mislabeled and should really be `original`? Looks like everybody took for granted what the original behavior was, but now that the doc's not there anymore, it's hard to say looking at it from the future ;) Do you remember from back then?
Anyway, in `incremental.sql`, we call `process_schema_changes` to get the target schema, and if we get nothing back we fall back to the existing one (with the removed columns):
https://github.com/dbt-labs/dbt-core/blob/e8da84fb9e177d9eee3d7722d3d6906bb283183d/core/dbt/include/global_project/macros/materializations/models/incremental/incremental.sql#L45-L49
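The linked lines boil down to roughly this (a paraphrased sketch, not the exact source):

```sql
{# sketch: process_schema_changes returns the reconciled column set,
   or nothing when the option tells it not to act #}
{% set dest_columns = process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
{% if not dest_columns %}
  {# fall back to the existing table's columns, removed ones included #}
  {% set dest_columns = adapter.get_columns_in_relation(existing_relation) %}
{% endif %}
```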
In `on_schema_change.sql`, we abandon ship if the option is `ignore`:
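That early exit is shaped roughly like this (a sketch of `process_schema_changes`, not the verbatim macro):

```sql
{% macro process_schema_changes(on_schema_change, source_relation, target_relation) %}
  {# sketch: with 'ignore', bail out before computing any schema diff,
     so the caller gets nothing back and falls back to the existing columns #}
  {% if on_schema_change == 'ignore' %}
    {{ return({}) }}
  {% endif %}
  {# ... otherwise compute and apply the schema changes ... #}
{% endmacro %}
```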
So yes, `ignore` will result in a DML statement being built with `dest_columns` listing the removed columns.
Speaking of the DML statement, looking at the incremental merge (through 3 layers of dispatch): https://github.com/dbt-labs/dbt-core/blob/f841a7ca76b44429eb94174bc5d0c2fecbf2763b/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L40-L44
We're looping on `update_columns`, which comes from:
So really just `dest_columns`. Note, I don't think we have documented `merge_exclude_columns`, only `merge_update_columns`.
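Concretely, for a removed column the generated `when matched` clause would look something like this (hypothetical column names; `DBT_INTERNAL_SOURCE` is the alias dbt gives the staged data):

```sql
when matched then update set
    id = DBT_INTERNAL_SOURCE.id,
    kept_col = DBT_INTERNAL_SOURCE.kept_col,
    deleted_col = DBT_INTERNAL_SOURCE.deleted_col  -- no longer exists in the source
```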
And back to the merge, for inserted columns:
Which can only fail, since it tries to pull columns from the source that have been removed.
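I.e. something along these lines (hypothetical columns), where BigQuery errors out on the missing source column:

```sql
when not matched then insert (id, kept_col, deleted_col)
values (DBT_INTERNAL_SOURCE.id, DBT_INTERNAL_SOURCE.kept_col, DBT_INTERNAL_SOURCE.deleted_col)
-- fails: deleted_col no longer exists in DBT_INTERNAL_SOURCE
```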
I'd love a sanity check here, but my thinking is that this bug is actually a documentation clarification (`ignore` really is `original` or `classic`) plus an enhancement (a new mode that actually does ignore/ignore, per the table at the top).
@Fleid nice sleuth work :shipit:
Although I didn't actually sanity check this for you, I did create the following issue for the documentation clarification:
We can always yank that issue if it turns out to not be needed.
How are you thinking about accomplishing the full switch from `ignore` → `original` or `classic`? One PR or multiple? How would we communicate the change?
Can we make this into a feature request then? It doesn't make a lot of sense, semantically, that `append_new_columns` will ignore extra columns in the db, but `ignore` will fail. I think the use case mentioned in the ticket is kind of a reasonable thing to support (i.e. dbt being able to allow another system to manage db schemas).
@awong-evoiq I just switched this issue from a bug to a feature request. Could you take a look at the following subject line and see if you want to adjust it accordingly?

BigQuery Incremental - `on_schema_change`: `ignore` cannot ignore missing columns
@Fleid would you be willing to re-assess this through the lens of a feature request?
So what we want is:

| on_schema_change | In case of new columns... | In case of removed columns... |
|---|---|---|
| `classic` | columns ignored | run failed |
| `ignore` | columns ignored | columns ignored |
| `fail` | run failed | run failed |
| `append_new_columns` | columns added | columns ignored |
| `sync_all_columns` | columns added | columns removed |
That is:

- a `classic` mode, wired as the current `ignore` mode
- `ignore` changed to actually ignore removed columns

I'm good for it!
Hello @Fleid,
We're really excited about this feature! 😄
Could you please let us know if development is currently underway for it? Has the dbt community reached a decision on this issue and approved updates to the behavior? And if the work hasn't started yet, would it be possible for me to get involved and contribute to its development? 🚀
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Hi, we still want this feature! 🙏 🙏 🙏
Is this a new bug in dbt-core?
Current Behavior
"If you remove a column from your incremental model, and execute a dbt run, this column will not be removed from your target table." https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#default-behavior
It doesn't remove the column, but it doesn't succeed in loading the other columns either. The generated SQL tries to insert all the columns from the target table. Not finding a column in the source table, it fails with `Query error: Unrecognized name: <deleted_col>`.

Expected Behavior
It should only insert the columns that are defined in the source table. BigQuery will leave any extra columns in the target schema as null (unless they are REQUIRED, in which case the insert should fail).
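For example (hypothetical table and column names), the insert should be built from the source's columns only:

```sql
insert into `proj.dataset.target` (id, kept_col)  -- deleted_col omitted
select id, kept_col
from `proj.dataset.target__dbt_tmp`
-- BigQuery leaves deleted_col NULL for the new rows
-- (unless it is REQUIRED, in which case the insert should fail)
```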
Steps To Reproduce
For an existing BigQuery table, use a `materialized='incremental'` config that does not have all the target columns in the source data. E.g.
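The original snippet wasn't preserved in this thread; a minimal hypothetical model that reproduces the mismatch might look like:

```sql
-- models/example/my_incremental.sql (hypothetical)
{{ config(materialized='incremental') }}

select 1 as id, 'a' as kept_col
-- the existing target table also has deleted_col,
-- which this select no longer produces
```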
or
Then execute `dbt run`.

Relevant log output
Logs
target/run//models/example/.sql
Environment
Which database adapter are you using with dbt?
dbt-bigquery
Additional Context
We have per-customer data pipelines, which are all meant to ingest customer data into a unified schema. Since each customer's upstream data looks different, we are writing different dbt models for each. This is making it hard to maintain a unified schema across customers, so we want to manage the schema outside of dbt and use dbt's "incremental" materialization to append snapshots of data to an existing table. We are ingesting to BigQuery.
We'd like to be able to add a column to our BigQuery schemas without breaking the dbt pipelines, but we can't, because adding a column to the target schema causes the same mismatch as removing a column from the source schema.