[Bug] Updating partitioning/clustering on a table self referencing itself will fail and delete the table

github-christophe-oudar commented 4 months ago

Is this a new bug in dbt-bigquery?

[X] I believe this is a new bug in dbt-bigquery
[X] I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

In some cases, models can be designed to self reference itself ({{ this }}) as part of a "sliding window" approach.

Running that kind model AFTER adding a change to clustering/partitioning results in triggering: https://github.com/dbt-labs/dbt-bigquery/blob/0627aa2ec5283fdc3f7df43865d37e3612f3df5b/dbt/include/bigquery/macros/materializations/table.sql#L27-L30

dbt will drop your current "state" and your query will fail (because {{ this }} won't exist).

If you add an is_incremental backup it might backfire as you would recreate the table "from scratch" without previous state making it a "silent" bug.

It would happen both in table and incremental materialization

Expected Behavior

This code is still relevant as trying to reproduce the root cause would still throw something like:

Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is clustering(id) and existing spec is none

I would suggest to:

create an option to produce an error to let the user deal with the incompatibility
create a temp table, delete the current table, create a clone with the right name from the temp table, delete the temp table (note: it would be still tricky for the partitioned table we would need to copy first the whole table in the temp table before running the "usual incremental process")

Steps To Reproduce

A simple example would be to use following model:

WITH base AS (
SELECT website, clicks
FROM {{ this }}
UNION ALL
SELECT website, COUNT(*) clicks
FROM `my_project.my_dataset.click_events`
)
SELECT website, SUM(clicks) clicks
FROM base
GROUP BY website

Run the model.

Then you are adding a config block to cluster_by = ['website'].

Rerun the model.

Relevant log output

No response

Environment

- OS: Mac OS
- Python: 3.11
- dbt-core: 1.8.2
- dbt-bigquery: 1.8.1

Additional Context

That bug is some unexpected "chaos engineering" that's not so easy to figure out

amychen1776 commented 2 months ago

@github-christophe-oudar Could you provide some use cases where you want to have a model self referencing itself? This is by default a pretty anti-pattern to dbt's idempotency.

github-christophe-oudar commented 2 months ago

Sure, I've few cases in production where I build self referencing models that are, for instance, daily partitioned tables that contains a 30 day sliding window on which I add new elements and expire old ones. Practically, I read my {{ this }} on previous day partition (where day = DATE_SUB({{ var("date") }}, INTERVAL 1 DAY)) along expiration logic (AND event_ts > DATE_SUB(var("date") }}, INTERVAL 29 DAY)) then it's idempotent to process the same partition as I'm inserting TIMESTAMP({{ var("date") }}) day to insert in the right partition. I'm also able to join reference data that match the day using snapshots for instance. Creating a single table makes it for downstream consumers (data apps) that just need to read the latest partition (or rollback to the previous one if something is fishy or provide a "time travel" experience).

amychen1776 commented 2 months ago

Thank you for the explanation! That use case makes sense to me (I've done something similar before).That said, I will close this issue for now because we do have a broader project that we are currently working on to be announced close to Coalesce that should resolve this use case more delightfully.

dbt-labs / dbt-bigquery