data-mie / dbt-profiler

Macros for generating dbt model data profiles
Apache License 2.0
81 stars 33 forks source link

Error when profiling a table in Redshift #84

Closed dico-teberfre closed 5 months ago

dico-teberfre commented 11 months ago

When using print_profile in Redshift with Version 0.8.1 I get the following error:

Using LISTAGG/PERCENTILE_CONT/MEDIAN aggregate functions with other distinct aggregate function not supported

When stepping down the versions I get a successful run with version 0.6.4.

Best, Frederik

stumelius commented 10 months ago

@dico-teberfre Thanks for reporting this, and sorry for the late reply. The profiler is using the MEDIAN function in Redshift to calculate the median measure and there are currently 3 measures (distinct_proportion, distinct_count and is_unique) that use distinct. I see two options here:

  1. We find another way to calculate the median in Redshift to work around the MEDIAN<->DISTINCT incompatibility issue
  2. We refactor the existing distinct measures in some way, maybe calculating them separately and then joining with the others (see https://stackoverflow.com/questions/67994522/using-median-aggregate-in-same-query-as-avg)

I don't currently have access to a Redshift instance so I can't test this out. Would you be interested in contributing this fix? :)

odikia commented 6 months ago

Hi @stumelius ,

I came upon this issue just now. Primary DB is Redshift. Happy to contribute towards the fix! :)

stumelius commented 5 months ago

@odikia Thank you for the offer ❤️ I'd like to take you up on that. I laid out two options to resolve this in the previous comments. What are your thoughts on those? Do you have other ideas? And is there anything you need from me to get started?

odikia commented 5 months ago

Hi @stumelius ,

Let me dig into your suggestions a bit, and see if I have others. Will start now, in fact! Nothing more needed at the moment. Will fork, and start playing. Thank you!

odikia commented 5 months ago

I worked through the suggestions, and found a subquery with median to be the most performant method. In testing locally on a dataset with 7m rows, I found the subquery to be 1.41x faster than the join method.

I think joining, as described in the stackexchange conversation mentioned above, would also require a lot of refactoring to get_profile and measures, so made as few changes as possible to get around the problem in redshift.

addition to measures.sql

...
{%- macro redshift__measure_median(column_name, data_type) -%}

{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
    select percentile_cont(0.5) within group (order by {{ adapter.quote(column_name) }}) from source_data
{%- else -%}
    cast(null as {{ dbt.type_numeric() }})
{%- endif -%}

{%- endmacro -%}
....

perhaps there is a better method than using source_data at the level of measures ? The challenge with the simplest method, is that I'm using a cte's name (source_data defined in get_profiles macro) at the level of measures , and this might not be ideal. I think an alternative would be to create a third table parameter for each median macro, and refactor the call of the measure_median macro in get_profile to have the correct cte (as of v0.8.1, source_data) , included as the third parameter. This might make it easier to change a table name in a single place and keep up with those changes, rather than digging around elsewhere. On the other hand, measure_median would no longer be consistent with the other measures if it were the only one to have a third parameter. I opted for the simple change in the above jinja block for the time being!

Let me know your thoughts, @stumelius , and I'd be happy to submit a PR

odikia commented 5 months ago

In further testing, I also needed to change the following repeating median call in get_profiles macro to include parentheses. This shouldn't have a negative impact on other adapter types:

...
{% if "median" not in exclude_measures -%}
    ({{ dbt_profiler.measure_median(column_name, data_type) }}) as median,
{%- endif %}
stumelius commented 5 months ago

@odikia Wow, this is really thorough - amazing! 💯 I'm okay with the subquery method as it requires less changes to the code base. Also, having a third "table_name"/"cte_name"/"relation_name" argument only in the measure_median macro is okay for me. It's not consistent with the other measures but I don't see much harm in that. If we ever need to add similar subquery logic to any of the other measures to work around limitations, we'll use that as an example.

Please go ahead and open a PR :)

odikia commented 5 months ago

@stumelius , pr submitted! I submitted pr 88 which didn't pass circe, as I only passed 2 parameters into the non redshift macros. I fixed it in pr 89, and resubmitted. Passed circe (postgres check it seems?) this time around and is awaiting the approval at data-mie.

stumelius commented 5 months ago

Thank you for your contribution! https://github.com/data-mie/dbt-profiler/releases/tag/0.8.2