dbt-labs / dbt-utils

Utility functions for dbt projects.
https://hub.getdbt.com/dbt-labs/dbt_utils/latest/
Apache License 2.0
1.33k stars 487 forks source link

Create/update a macro to verify sum of row counts of n relations #940

Open aranke opened 1 month ago

aranke commented 1 month ago

Describe the feature

There are situations where a model simply is a union all of various other models:

-- int_my_source
select * from {{ ref('stg_my_source_a') }}
union all
select * from {{ ref('stg_my_source_b') }}
union all
select * from {{ ref('stg_my_source_c') }}

In this example, it would be useful to have a 1:many test to verify that:

count(int_my_source) = count(stg_my_source_a) + count(stg_my_source_b) + count(stg_my_source_c)

Describe alternatives you've considered

equal_rowcount is the closest thing in dbt-utils, but only does 1:1 relations

Additional context

I don't think this feature is database-specific.

Who will this benefit?

This feature has been requested several times over the years (dbt internal Slack, previous GitHub issue); this feature will benefit those users.

Are you interested in contributing to this feature?

Sure, happy to help.

dbeatty10 commented 1 month ago

@aranke Thanks for raising this issue and linking those two other places where this has come up!

If the model is really that simple, what would be the value of a data test here? Wouldn't the counts be guaranteed to be equal? I don't see how the count of rows could be anything but exactly the same in this case.

edmei1 commented 1 month ago

Hi adding my 2 cents here and I raised this with dbt support too. It's useful in our case when the unioned model is incremental. i.e. we're selecting from stg_my_source_a where timestamp > now - 2 days for example. And want to see if the unioned model still has all the rows in the individual staging models. Hope that makes sense.

dbeatty10 commented 1 month ago

Thanks for that info @edmei1 !

Could you share an example model that shows more detail? Something like this:

models/int_my_source.sql

{{
    config(
        materialized='incremental'
    )
}}

with unioned_models as (

    select * from {{ ref('stg_my_source_a') }}
    union all
    select * from {{ ref('stg_my_source_b') }}
    union all
    select * from {{ ref('stg_my_source_c') }}

)

select *
from unioned_models

{% if is_incremental() %}

where event_time >= (select coalesce(max(event_time),'1900-01-01') from {{ this }} )

{% endif %}