databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Changing a non-incremental to incremental model leaves old data in the table #696

Closed: ajsquared closed this issue 3 weeks ago

ajsquared commented 3 weeks ago

Describe the bug

When converting a non-incremental table to an incremental table, a --full-refresh leaves old data in the table.

Steps To Reproduce

We had a dbt model that was initially non-incremental. We decided to add partitioning and turn it into an incremental model to improve performance. After making this change, we ran the model with the --full-refresh flag.
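For illustration, the change was roughly of this shape (model, source, and column names here are placeholders, not the real ones):

  -- previously: materialized = 'table', no partitioning
  {{
    config(
      materialized = 'incremental',
      incremental_strategy = 'insert_overwrite',
      partition_by = ['event_date']
    )
  }}

  select * from {{ ref('stg_events') }}
  {% if is_incremental() %}
  where event_date = current_date()
  {% endif %}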

However, the old table data was still present after the full refresh.

This might be the same underlying issue as #695, but I'm not sure so I've filed this separately.

Expected behavior

Based on https://docs.getdbt.com/docs/build/incremental-models#how-do-i-rebuild-an-incremental-model, I'd expect dbt to drop the existing table on a full refresh, so any data currently in the table would be removed.

System information

The output of dbt --version:

Core:
  - installed: 1.8.0
  - latest:    1.8.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.8.0 - Update available!
  - spark:      1.8.0 - Up to date!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: Linux

The output of python --version:

Python 3.11.7

ajsquared commented 3 weeks ago

The model is also using the insert_overwrite incremental strategy.

benc-db commented 3 weeks ago

A couple of things of interest on this issue:

  1. insert_overwrite was somewhat nerfed in 1.8.0: I removed the dynamic partition overwrite setting to make the strategy work with SQL warehouses, but in hindsight that turned a feature that only worked for clusters into one that didn't work as expected for either clusters or SQL warehouses (see the sketch after this list). This will be reverted in 1.8.2.
  2. It's also possible that the issue is related to not being able to detect existing tables when the catalog has been renamed. If that's the case, it would also be fixed in 1.8.2.
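To illustrate what the setting in (1) changes, roughly (table names are just examples):

  -- Static mode (the default): with no PARTITION clause, INSERT OVERWRITE
  -- replaces the entire table, even if the query only returns one day of data.
  SET spark.sql.sources.partitionOverwriteMode = static;
  INSERT OVERWRITE TABLE analytics.events
  SELECT * FROM analytics.events_staging
  WHERE event_date = current_date();

  -- Dynamic mode (what the insert_overwrite strategy relies on): only the
  -- partitions present in the incoming rows are rewritten.
  SET spark.sql.sources.partitionOverwriteMode = dynamic;
  INSERT OVERWRITE TABLE analytics.events
  SELECT * FROM analytics.events_staging
  WHERE event_date = current_date();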

If possible, can you try 1.8.2rc1 and see if it fixes the issue? If not, please share the relevant log with me via email: ben.cassell@databricks.com

ajsquared commented 3 weeks ago

Okay, I'll try out 1.8.2rc1 and let you know

ajsquared commented 3 weeks ago

I see the same behavior on 1.8.2rc1. I'll share log files after I've tested #695 too

ajsquared commented 3 weeks ago

I've sent the logs over now

benc-db commented 3 weeks ago

Ok, here is what is going on. This is a consequence of spark.sql.sources.partitionOverwriteMode, which surprisingly affects CREATE OR REPLACE in addition to INSERT OVERWRITE. What I can do in the adapter is set the property to static on full refresh and to dynamic on incremental runs; I think that should work for most reasonable use cases.
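Roughly, the adapter would end up doing something like this (table names are just examples, not the actual generated SQL):

  -- On --full-refresh: force static mode so CREATE OR REPLACE actually
  -- replaces all existing data instead of leaving old partitions behind.
  SET spark.sql.sources.partitionOverwriteMode = static;
  CREATE OR REPLACE TABLE analytics.events
  PARTITIONED BY (event_date)
  AS SELECT * FROM analytics.events_staging;

  -- On incremental runs: switch back to dynamic mode so insert_overwrite
  -- only rewrites the partitions present in the new data, as in the
  -- sketch above.
  SET spark.sql.sources.partitionOverwriteMode = dynamic;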

benc-db commented 3 weeks ago

Thanks for reporting this bug. The mitigation on your end is easy enough (just blow away the existing table outside of dbt), but this report will improve the adapter.

ajsquared commented 3 weeks ago

Ah I see, that makes sense. I think that approach sounds good; it will be great if --full-refresh just works™!

ajsquared commented 3 weeks ago

Thanks for the quick fix on this! Out of curiosity, do you have an estimate of when 1.8.2 will be released?

benc-db commented 3 weeks ago

Most likely the week after next, as we have Data and AI Summit next week. I can push a second release candidate if it would help to have this available as a prerelease on PyPI before then.

ajsquared commented 3 weeks ago

After the summit is totally fine; I just wanted to plan when to push out the next upgrade on my end.