dbt-labs / dbt-audit-helper

Useful macros when performing data audits
https://hub.getdbt.com/dbt-labs/audit_helper/latest/
Apache License 2.0
329 stars 40 forks source link

compare_column_values not working with Athena/Presto #87

Open evabilski opened 1 year ago

evabilski commented 1 year ago

Describe the bug

Function compare_column_values not working on Athena/Presto.

Steps to reproduce

Create test in Presto environment:

{% set old_etl_relation_query %}
    select
        'a' AS col_a,
        'b' AS col_b
{% endset %}

{% set new_etl_relation_query %}
    select
        'a' AS col_a,
        'b' AS col_b
{% endset %}

{{ audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="col_a",
    column_to_compare="col_b"
)
}}

Log output

COLUMN_NOT_FOUND: line 59:10: Column 'column_name' cannot be resolved or requester is not authorized to access requested resources

System information

dbt-labs/audit_helper version: 0.9.0

Which database are you using dbt with?

Additional context

Presto does not support grouping by aliases. Line 53 of compare_column_values.sql: group by column_name should be removed in order to work with Presto

Are you interested in contributing the fix?

I can try making a PR

sanga8 commented 6 months ago

I still have this issue in version 0.11.0

popcornylu commented 3 months ago

The root cause is that the group by clause does not allow to use the alias. See presto discussion

A workaround solution is to add a macro at macros/compare_column_values.sql. The content is

replace athena__compare_column_values to xxx__compare_column_values according to your warehouse adapter.

{% macro athena__compare_column_values(a_query, b_query, primary_key, column_to_compare, emojis, a_relation_name, b_relation_name) -%}
with a_query as (
    {{ a_query }}
),

b_query as (
    {{ b_query }}
),

joined as (
    select
        coalesce(a_query.{{ primary_key }}, b_query.{{ primary_key }}) as {{ primary_key }},
        a_query.{{ column_to_compare }} as a_query_value,
        b_query.{{ column_to_compare }} as b_query_value,
        case
            when a_query.{{ column_to_compare }} = b_query.{{ column_to_compare }} then '{% if emojis %}✅: {% endif %}perfect match'
            when a_query.{{ column_to_compare }} is null and b_query.{{ column_to_compare }} is null then '{% if emojis %}✅: {% endif %}both are null'
            when a_query.{{ primary_key }} is null then '{% if emojis %}🤷: {% endif %}missing from {{ a_relation_name }}'
            when b_query.{{ primary_key }} is null then '{% if emojis %}🤷: {% endif %}missing from {{ b_relation_name }}'
            when a_query.{{ column_to_compare }} is null then '{% if emojis %}🤷: {% endif %}value is null in {{ a_relation_name }} only'
            when b_query.{{ column_to_compare }} is null then '{% if emojis %}🤷: {% endif %}value is null in {{ b_relation_name }} only'
            when a_query.{{ column_to_compare }} != b_query.{{ column_to_compare }} then '{% if emojis %}❌: {% endif %}‍values do not match'
            else 'unknown' -- this should never happen
        end as match_status,
        case
            when a_query.{{ column_to_compare }} = b_query.{{ column_to_compare }} then 0
            when a_query.{{ column_to_compare }} is null and b_query.{{ column_to_compare }} is null then 1
            when a_query.{{ primary_key }} is null then 2
            when b_query.{{ primary_key }} is null then 3
            when a_query.{{ column_to_compare }} is null then 4
            when b_query.{{ column_to_compare }} is null then 5
            when a_query.{{ column_to_compare }} != b_query.{{ column_to_compare }} then 6
            else 7 -- this should never happen
        end as match_order

    from a_query

    full outer join b_query on a_query.{{ primary_key }} = b_query.{{ primary_key }}
),

aggregated as (
    select
        '{{ column_to_compare }}' as column_name,
        match_status,
        match_order,
        count(*) as count_records
    from joined

    group by '{{ column_to_compare }}', match_status, match_order
)

select
    column_name,
    match_status,
    count_records,
    round(100.0 * count_records / sum(count_records) over (), 2) as percent_of_total

from aggregated

order by match_order

{% endmacro %}