dbt-labs / dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake
https://getdbt.com
Apache License 2.0

[Bug] Cartesian Join based deletion is causing performance problems when it hits a certain scale for microbatch models #1228

Closed: graciegoheen closed this issue 2 weeks ago

graciegoheen commented 1 month ago

Is this a new bug in dbt-core?

Current Behavior

Cartesian-join-based deletion causes data to spill to disk, which heavily bogs down performance.

The delete statement looks like:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
    where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    );

But nothing in the where clause references my_model__dbt_tmp, so the using clause contributes only a cartesian join against the target rows.

We can simplify this logic and improve performance by instead doing:

delete from <existing> where <date range>;
insert into <existing> from <new data for same date range>;
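
For the batch above, that could look something like this (a sketch only, reusing the table names and event_at bounds from the example, and assuming the tmp relation holds the recomputed batch):

delete from analytics_dev.dbt_aescay.my_model
    where event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and event_at < TIMESTAMP '2024-10-15 00:00:00+00:00';

insert into analytics_dev.dbt_aescay.my_model
    select * from analytics_dev.dbt_aescay.my_model__dbt_tmp;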

One advantage of microbatch is that we know in advance the exact boundaries of every batch (time range, cf. "static" insert_overwrite).
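
For reference, those boundaries come from the model's microbatch config; a minimal sketch (config keys per dbt-core's microbatch strategy; the model body and ref name are made up):

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_at',
    batch_size='day',
    begin='2024-10-01',
    lookback=1
) }}

select * from {{ ref('raw_events') }}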

In a world where we support "microbatch merge" models (i.e., updating batches by upserting on unique_key rather than replacing the full batch), we would want the using join to match on unique_key, like so:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
    where DBT_INTERNAL_TARGET.event_id = DBT_TMP_TARGET.event_id
    and (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    );

But this shouldn't be the default assumption.

Expected Behavior

We should delete the unused "using ... my_model__dbt_tmp" line from the generated delete statement, so the delete filters only on the batch's time range.

Steps To Reproduce

See here.

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

QuentinCoviaux commented 4 weeks ago

This seems like a real deal-breaker at the moment, unless I'm facing some other config issue.

To put some numbers in perspective (on Snowflake, with an X-Small warehouse):

  1. Take one table with ~30 million records (~0.5GB) spread roughly over 30 days
  2. Lookback of 3 days
  3. The delete query takes more than 10 minutes (I cancelled it after that), and it has to run for each lookback batch (sketched below)
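
For illustration, with daily batches and a 3-day lookback, dbt reprocesses three batches on each run, and each batch currently issues its own join-based delete (table name hypothetical):

-- one of three per-batch deletes issued for a 3-day lookback
delete from analytics.my_table DBT_INTERNAL_TARGET
    using analytics.my_table__dbt_tmp DBT_TMP_TARGET
    where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-28 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-29 00:00:00+00:00'
    );
-- ...repeated for 2024-10-29 and 2024-10-30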

Hoping we can get this one prioritized 😊