dbt-labs / dbt-utils

Utility functions for dbt projects.
https://hub.getdbt.com/dbt-labs/dbt_utils/latest/
Apache License 2.0
1.34k stars 491 forks source link

BigQuery `deduplicate` macro doesn't support downstream partition pruning #928

Open austinclooney opened 3 months ago

austinclooney commented 3 months ago

Describe the feature

The Bigquery deduplicate macro uses array_agg to deduplicate. The way it is currently set up, any query using the macro will not be able to partition prune downstream of the macro due to the way array_agg interacts with partition pruning because the partition column is not explicitly selected separate from the array agg. This can be avoided in a table by doing the partition filtering in a CTE prior to using the deduplicate macro, but it can't be avoided in a view.

Describe alternatives you've considered

For models materialized as tables, we can add the partition filtering above the deduplication. For views, the alternative is to write the deduplication step ourselves to manual select the partition column outside of the array_agg.

Basically it is written like this:

    select unique.*
    from (
        select
            array_agg (
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by {{ partition_by }}
    )

and instead it needs to look like this:

    select <partition pruning column>, unique.* except(<partition pruning column>)
    from (
        select
            <partition pruning column>,
            array_agg (
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by <partition pruning column>, {{ partition_by }}
    )

Additional context

This is specific to the BQ deuplicate macro.

Who will this benefit?

BQ users who want to deduplicate in a view.

Are you interested in contributing this feature?

I can create a PR that implements the fix we used.

austinclooney commented 3 months ago

I've created a PR with an example solution, but will be happy to work with someone if it doesn't meet standards.