dbt-labs / dbt-utils

Utility functions for dbt projects.
https://hub.getdbt.com/dbt-labs/dbt_utils/latest/
Apache License 2.0

Alias column selection in group_by_columns to enable timestamp wrapping #946

Open macklin-fluehr opened 2 months ago

macklin-fluehr commented 2 months ago

Describe the feature

group_by_columns does not allow you to group by an expression over a column, only by the column itself. For example, if I have a timestamp column, I might want to group by date_trunc('week', timestamp) to monitor that values show up as expected each week.

Right now, if my yml is set to:

- dbt_utils.not_null_proportion: 
      at_least: 0.95
      group_by_columns: [date_trunc('week', timestamp)]

The code compiles to:

select
      count(*) as failures,
      count(*) != 0 as should_warn,
      count(*) != 0 as should_error
    from (

with validation as (
  select
    date_trunc('week' ,timestamp), 
    sum(case when visit_reasons is null then 0 else 1 end) / cast(count(*) as numeric) as not_null_proportion
  from <my_table>
  group by date_trunc('week',timestamp)
),
validation_errors as (
  select
    date_trunc('week' ,timestamp), 
    not_null_proportion
  from validation
  where not_null_proportion < 0.95 or not_null_proportion > 1
)
select
  *
from validation_errors
    ) dbt_internal_test

As you can see, the validation_errors cte does not work: timestamp only exists on the source table, so re-evaluating date_trunc('week', timestamp) against the validation cte fails. That said, if we just aliased the expression in the validation cte and referenced the alias in validation_errors, we'd be all set. My suggested change in macros/generic_tests/not_null_proportion.sql:

with validation as (
  select
    {% for c in group_by_columns %}
    {{ c }} as group_col_{{ loop.index }},
    {% endfor %}
    sum(case when {{ column_name }} is null then 0 else 1 end) / cast(count(*) as {{ dbt.type_numeric() }}) as not_null_proportion
  from {{ model }}
  {{ groupby_gb_cols }}
),
validation_errors as (
  select
    {% for c in group_by_columns %}
    group_col_{{ loop.index }},
    {% endfor %}
    not_null_proportion
  from validation
  where not_null_proportion < {{ at_least }} or not_null_proportion > {{ at_most }}
)
select
  *
from validation_errors
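
For illustration, with that aliasing in place the example above should compile to something roughly like this (group_col_1 being the generated alias), so validation_errors only references columns that actually exist in validation:

with validation as (
  select
    date_trunc('week', timestamp) as group_col_1,
    sum(case when visit_reasons is null then 0 else 1 end) / cast(count(*) as numeric) as not_null_proportion
  from <my_table>
  group by date_trunc('week', timestamp)
),
validation_errors as (
  select
    group_col_1,
    not_null_proportion
  from validation
  where not_null_proportion < 0.95 or not_null_proportion > 1
)
select
  *
from validation_errors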

Describe alternatives you've considered

I could make my own custom test to do this; a rough sketch is below
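
For reference, here is a minimal sketch of such a custom generic test, assuming dbt's {% test %} block syntax; the test name not_null_proportion_by_period and the period / timestamp_column arguments are hypothetical, not part of dbt-utils:

{% test not_null_proportion_by_period(model, column_name, at_least, period='week', timestamp_column='timestamp') %}

with validation as (
  select
    -- bucket rows into the requested period before computing the proportion
    date_trunc('{{ period }}', {{ timestamp_column }}) as period_start,
    sum(case when {{ column_name }} is null then 0 else 1 end) / cast(count(*) as {{ dbt.type_numeric() }}) as not_null_proportion
  from {{ model }}
  group by 1
)

select *
from validation
where not_null_proportion < {{ at_least }}

{% endtest %}

It would be invoked from a column's tests in yml with at_least, period, and timestamp_column as keyword arguments, but it duplicates most of not_null_proportion, which is why aliasing inside the dbt-utils macro seems preferable.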

Additional context

I don't think the database or adapter will impact anything here

Who will this benefit?

anybody trying to group by time

Are you interested in contributing this feature?

I hope the above is enough to get started

Let us know if you want to contribute the feature, and whether you would need a hand getting started

marinewign commented 1 month ago

My team and I need this feature as well: we would use DATE_TRUNC('day', received_date) to assess data quality on a day-by-day basis for data that we receive every hour!