Closed dico-teberfre closed 5 months ago

When using `print_profile` in Redshift with version 0.8.1 I get the following error:

```
Using LISTAGG/PERCENTILE_CONT/MEDIAN aggregate functions with other distinct aggregate function not supported
```

When stepping down the versions, I get a successful run with version 0.6.4.

Best, Frederik
@dico-teberfre Thanks for reporting this, and sorry for the late reply. The profiler uses the `MEDIAN` function in Redshift to calculate the median measure, and there are currently 3 measures (`distinct_proportion`, `distinct_count` and `is_unique`) that use `distinct`; Redshift does not allow `MEDIAN` to be combined with distinct aggregates in the same select list. I see two options here:

1. Compute the median in a subquery so that it is no longer mixed with the distinct aggregates.
2. Compute the median separately and join it onto the other measures, as described in the Stack Exchange conversation.
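For context, a minimal sketch of the failing pattern, using a hypothetical `sales` table (not the profiler's actual SQL):

```sql
-- Hypothetical table and columns, for illustration only.
-- Redshift rejects MEDIAN combined with a DISTINCT aggregate in one select list:
select
    median(price)       as median_price,
    count(distinct sku) as distinct_skus
from sales;
-- ERROR: Using LISTAGG/PERCENTILE_CONT/MEDIAN aggregate functions
-- with other distinct aggregate function not supported
```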
I don't currently have access to a Redshift instance so I can't test this out. Would you be interested in contributing this fix? :)
Hi @stumelius,

I came upon this issue just now. My primary DB is Redshift. Happy to contribute towards the fix! :)
@odikia Thank you for the offer ❤️ I'd like to take you up on that. I laid out two options to resolve this in the previous comments. What are your thoughts on those? Do you have other ideas? And is there anything you need from me to get started?
Hi @stumelius,

Let me dig into your suggestions a bit and see if I have others. Will start now, in fact! Nothing more needed at the moment. Will fork and start playing. Thank you!
I worked through the suggestions and found a subquery with median to be the most performant method. In testing locally on a dataset with 7M rows, I found the subquery to be 1.41x faster than the join method. I think joining, as described in the Stack Exchange conversation mentioned above, would also require a lot of refactoring of `get_profile` and `measures`, so I made as few changes as possible to get around the problem in Redshift.
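For reference, a rough sketch of the two approaches on the same hypothetical `sales` table as above (illustrative only):

```sql
-- Subquery method: a scalar subquery keeps MEDIAN out of the
-- select list that contains the DISTINCT aggregate.
select
    count(distinct sku) as distinct_skus,
    (select median(price) from sales) as median_price
from sales;

-- Join method: compute the median in a derived table and cross join it.
select
    agg.distinct_skus,
    med.median_price
from (select count(distinct sku) as distinct_skus from sales) as agg
cross join (select median(price) as median_price from sales) as med;
```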
Addition to `measures.sql`:
```sql
...
{%- macro redshift__measure_median(column_name, data_type) -%}
{#- Compute the median in a scalar subquery so it is not mixed with the distinct aggregates -#}
{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
select percentile_cont(0.5) within group (order by {{ adapter.quote(column_name) }}) from source_data
{%- else -%}
cast(null as {{ dbt.type_numeric() }})
{%- endif -%}
{%- endmacro -%}
...
```
Perhaps there is a better method than using `source_data` at the level of `measures`? The challenge with the simplest method is that I'm referencing a CTE name (`source_data`, defined in the `get_profile` macro) at the level of `measures`, and this might not be ideal. I think an alternative would be to add a third table parameter to each median macro and refactor the call to `measure_median` in `get_profile` so that the correct CTE (as of v0.8.1, `source_data`) is passed as the third argument. That would make it easy to change the table name in a single place and keep up with those changes, rather than digging around elsewhere. On the other hand, `measure_median` would no longer be consistent with the other measures if it were the only one to take a third parameter. I opted for the simple change in the above Jinja block for the time being! A sketch of the three-parameter alternative follows.
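For illustration, a minimal sketch of that alternative; the `relation_name` parameter and the call site shown here are hypothetical, not the merged implementation:

```sql
{%- macro redshift__measure_median(column_name, data_type, relation_name) -%}
{#- relation_name is a hypothetical third parameter naming the CTE to select from -#}
{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
select percentile_cont(0.5) within group (order by {{ adapter.quote(column_name) }}) from {{ relation_name }}
{%- else -%}
cast(null as {{ dbt.type_numeric() }})
{%- endif -%}
{%- endmacro -%}

{#- Hypothetical call site in get_profile, passing the CTE name explicitly -#}
({{ dbt_profiler.measure_median(column_name, data_type, "source_data") }}) as median,
```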
Let me know your thoughts, @stumelius, and I'd be happy to submit a PR.
In further testing, I also needed to change the following median call in the `get_profile` macro to include parentheses, so that the subquery becomes a valid scalar expression in the outer select list. This shouldn't have a negative impact on other adapter types:
```sql
...
{% if "median" not in exclude_measures -%}
({{ dbt_profiler.measure_median(column_name, data_type) }}) as median,
{%- endif %}
...
```
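Roughly what the Redshift branch compiles to with the added parentheses (the `"price"` column is hypothetical, and the other measures are elided):

```sql
select
    ...
    -- "price" stands in for whichever column is being profiled
    (select percentile_cont(0.5) within group (order by "price") from source_data) as median,
    ...
from source_data
```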
@odikia Wow, this is really thorough - amazing! 💯 I'm okay with the subquery method as it requires fewer changes to the code base. Also, having a third "table_name"/"cte_name"/"relation_name" argument only in the `measure_median` macro is okay for me. It's not consistent with the other measures, but I don't see much harm in that. If we ever need to add similar subquery logic to any of the other measures to work around limitations, we'll use that as an example.

Please go ahead and open a PR :)
@stumelius, PR submitted! I submitted PR 88, which didn't pass CircleCI, as I only passed 2 parameters into the non-Redshift macros. I fixed it in PR 89 and resubmitted. It passed CircleCI (the Postgres check, it seems?) this time around and is awaiting approval at data-mie.
Thank you for your contribution! https://github.com/data-mie/dbt-profiler/releases/tag/0.8.2