dbt-labs / dbt-utils

Utility functions for dbt projects.
https://hub.getdbt.com/dbt-labs/dbt_utils/latest/
Apache License 2.0

Add `grouping_sets` macro to dbt-utils #709

Closed: yauhen-sobaleu closed this issue 1 year ago

yauhen-sobaleu commented 2 years ago

Describe the feature

I would like to have a database-agnostic macro mimicking the GROUPING SETS feature that was initially introduced in Postgres 9.5 (https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS).

The feature is super useful when you want to combine multiple GROUP BY clauses into a single query, without having to write N separate GROUP BY queries and union them together.

Unfortunately, neither Redshift nor BigQuery supports GROUPING SETS out of the box (nor the other multi-dimensional aggregations like ROLLUP and CUBE).

Example in Postgres SQL:

SELECT county, city, sum(margin) AS total
FROM sale_view
GROUP BY GROUPING SETS ((county), (city))
ORDER BY 1, 2;

Example in Redshift SQL:

SELECT county, NULL AS city, sum(margin) AS total
FROM dbo.sale_view
GROUP BY county

UNION ALL

SELECT NULL AS county, city, sum(margin) AS total
FROM dbo.sale_view
GROUP BY city

ORDER BY 1, 2;

Additional context

Who will this benefit?

It will be useful for all analytics engineers who work with multi-dimensional reporting, sparing them precious time otherwise spent copying UNION ALL expressions with minuscule differences over and over again.

joellabes commented 2 years ago

Hey @yauhen-sobaleu, as discussed on the forum I'm very interested in this! I think it would be very similar to the width_bucket macro we already have, which has a native Snowflake version; for other warehouses we emulate it with more basic SQL primitives.

I'll mark this as help wanted; it's a pretty beefy contribution, but if you or someone else wants to have a go at building it, I'm happy to provide pointers.

I think the macro signature and implementation would look something like this. I haven't tested this at all (just wrote it on a plane), so it will need some pretty serious finessing from here, as well as tests.

{% macro grouping_sets(group_sets, data_source, aggregate_expression) %}
    {{ adapter.dispatch('grouping_sets', 'dbt_utils')(group_sets, data_source, aggregate_expression) }}
{% endmacro %}

{% macro default__grouping_sets(group_sets, data_source, aggregate_expression) %}
    {# flatten the grouping sets into a single list of column names #}
    {% set all_list_elements = [] %}

    {% for item in group_sets %}
        {% do all_list_elements.extend(item) %}
    {% endfor %}

    {# deduplicate while preserving first-seen order #}
    {% set unique_grouping_columns = all_list_elements | unique | list %}

    select {{ unique_grouping_columns | join(", ") }}, {{ aggregate_expression }}
    from {{ data_source }}
    group by grouping sets (
        {% for grp in group_sets %}
            ( {{ grp | join(", ") }} )
            {% if not loop.last %}
            ,
            {% endif %}
        {% endfor %}
    )
{% endmacro %}

{% macro redshift__grouping_sets(group_sets, data_source, aggregate_expression) %}
    {# flatten the grouping sets into a single list of column names #}
    {% set all_list_elements = [] %}

    {% for item in group_sets %}
        {% do all_list_elements.extend(item) %}
    {% endfor %}

    {# deduplicate while preserving first-seen order #}
    {% set unique_grouping_columns = all_list_elements | unique | list %}

    {% for group_set in group_sets %}

    select

        {% for column in unique_grouping_columns %}
            {% if column in group_set %}
                {{ column }}
            {% else %}
                --TODO: this won't work if it's the first usage of the column, would need to be cast as the right data type.
                --union_relations does this well but means these have to be Relations instead of just a CTE.
                --Alternative is to require a dictionary of data types to be passed in, but that's janky
                null as {{ column }}
            {% endif %}
            ,
        {% endfor %}
        {{ aggregate_expression }}

        from {{ data_source }}
        group by {{ group_set | join(", ") }}
        {% if not loop.last %}
            union all
        {% endif %}
    {% endfor %}
{% endmacro %}

Then it will be called as:

{{ grouping_sets([['county'], ['city'], ['county', 'city']], 'sale_view', 'sum(margin) as total') }}

which will render to something that looks like one of these, depending on the warehouse:

-- native GROUPING SETS (e.g. Postgres):

select county, city, sum(margin) as total
from sale_view
group by grouping sets ((county), (city), (county, city))

-- emulated with union all (e.g. Redshift):

select county, null as city, sum(margin) as total
from sale_view
group by county

union all

select null as county, city, sum(margin) as total
from sale_view
group by city

union all

select county, city, sum(margin) as total
from sale_view
group by county, city

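For a bit more context on usage, here's a rough sketch of how a model might call this (hedged: it assumes the macro ends up shipping in dbt_utils, and the model file name and the sale_view model are just illustrative):

-- models/sales_by_grouping_set.sql (hypothetical model name)
{{ dbt_utils.grouping_sets(
    group_sets=[['county'], ['city'], ['county', 'city']],
    data_source=ref('sale_view'),
    aggregate_expression='sum(margin) as total'
) }}

Passing a Relation from ref() as data_source would also line up with the union_relations point in the TODO above.
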
yauhen-sobaleu commented 1 year ago

Amazon Redshift released support (in preview) for multi-dimensional aggregate functions! https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-redshift-sql-capabilities-speed-data-warehouse-migrations-preview/
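
If that makes it to general availability, the Redshift-specific override sketched above could presumably just defer to the default implementation (or be dropped entirely, since adapter.dispatch falls back to default__ anyway). A rough sketch, assuming native GROUPING SETS support on Redshift:

{# hypothetical: only valid once Redshift's GROUPING SETS support is generally available #}
{% macro redshift__grouping_sets(group_sets, data_source, aggregate_expression) %}
    {{ dbt_utils.default__grouping_sets(group_sets, data_source, aggregate_expression) }}
{% endmacro %}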

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.