dbt-labs / dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake

[Feature] Incremental strategies for delete+insert and microbatch cause unnecessary cross joins #1198

Open · SystemOfaDrow opened 1 month ago

Describe the feature

Currently, the delete+insert incremental strategy requires a "unique_key", but nothing actually enforces that this key is unique. I've used it this way before, and the issue is that the generated delete creates a cartesian join between all the records with matching (non-unique) unique_key values because of the USING clause. The new microbatch strategy does something similar: if you don't supply an additional predicate, the USING clause is unnecessary. In both cases, the join explosion can consume a significant amount of warehouse resources.
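For context, the delete step that delete+insert generates today has roughly this shape (a paraphrase with hypothetical relation names, not the exact macro output):

```sql
-- Paraphrased shape of the current delete step; names are hypothetical:
delete from analytics.orders
using analytics.orders__dbt_tmp
where analytics.orders__dbt_tmp.order_date = analytics.orders.order_date;
-- If n staged rows and m existing rows share the same order_date value,
-- the USING join produces n x m candidate pairs even though each target
-- row only needs to be identified for deletion once.
```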

Describe alternatives you've considered

For the delete+insert strategy, I would propose renaming the "unique_key" config to "incremental_key" (since uniqueness isn't actually required) and modifying the query to use a correlated EXISTS subquery:

delete from {{ target }}
where exists (
    select 1
    from {{ source }}
    where
    {# when incremental_key is a list, match on every column in it #}
    {% if incremental_key is sequence and incremental_key is not string %}
        {% for key in incremental_key %}
            {{ source }}.{{ key }} = {{ target }}.{{ key }}
            {{ "and " if not loop.last }}
        {% endfor %}
    {% else %}
        {{ source }}.{{ incremental_key }} = {{ target }}.{{ incremental_key }}
    {% endif %}
)
{# predicates restrict which target rows are eligible for deletion,
   so they must be ANDed onto the outer where clause #}
{% if incremental_predicates %}
    {% for predicate in incremental_predicates %}
        and {{ predicate }}
    {% endfor %}
{% endif %};
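For illustration, with a hypothetical incremental_key = ['customer_id', 'order_date'] and a single incremental predicate, the template above would render to roughly:

```sql
-- Hypothetical rendering; relation, column, and predicate names are made up:
delete from analytics.orders
where exists (
    select 1
    from analytics.orders__dbt_tmp
    where
        analytics.orders__dbt_tmp.customer_id = analytics.orders.customer_id
        and analytics.orders__dbt_tmp.order_date = analytics.orders.order_date
)
and analytics.orders.order_date >= '2024-06-01';
```

Since EXISTS only asks whether at least one matching source row exists, each target row is deleted at most once no matter how many duplicate-key rows the source contains, which is what avoids the join explosion.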

The "using {{ source }}" line in the microbatch strategy's delete statement should simply be removed, as shown in the sketch below.
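Concretely, that change would take the microbatch delete from the first statement below to the second (again a paraphrase with hypothetical names; the batch's event-time window already identifies every row being replaced, so the join adds nothing):

```sql
-- Current shape: the USING join does not narrow the delete at all
delete from analytics.events
using analytics.events__dbt_tmp
where analytics.events.event_time >= '2024-06-01'
  and analytics.events.event_time < '2024-06-02';

-- Proposed shape: the window predicate alone selects the batch's rows
delete from analytics.events
where event_time >= '2024-06-01'
  and event_time < '2024-06-02';
```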

Who will this benefit?

This would benefit everyone who uses these incremental strategies by significantly lowering warehouse load (and therefore cost) for these types of queries.

Are you interested in contributing this feature?

I think the code above should work, but it may take me a while to find time for the requisite testing before opening a PR.

Anything else?

I ran one query of the form DELETE FROM target USING source, with no additional predicates, where the target table had 10 million records and the source table had a billion. The query profile shows how expensive that cartesian join is relative to the rest of the query:

Most expensive nodes: CartesianJoin 47.3%, Delete 27.0%, TableScan 1.3%

graciegoheen commented 1 month ago

This is related to: