Question: add TOP x column values and distribution.

data-mie / dbt-profiler

Macros for generating dbt model data profiles

Apache License 2.0

Question: add TOP x column values and distribution. #65

Open diegodewilde opened 1 year ago

diegodewilde commented 1 year ago

Hi,

I was looking at this project and I must say: it's awesome, and it's something dbt docs is currently missing.

One question came to mind: why is there no option to add the TOP x column values and their distribution? Is there a reason not to include this in the docs?

Like in this example where you show TOP 2 for example:

Column Name	Top 1 Value	Distribution	Top 2 Value	Distribution
Column 1	Value 1 A	0.50 ("number"/"total")	Value 1 B	0.20 ("number"/"total")
Column 2	Value 2 A	0.50 ("number"/"total")	Value 2 B	0.30 ("number"/"total")
Column 3	Value 3 A	0.10 ("number"/"total")	Value 3 B	0.05 ("number"/"total")
Column 4	Value 4 A	0.10 ("number"/"total")	Value 4 B	0.05 ("number"/"total")
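
To make the "number"/"total" idea concrete, here is a minimal sketch in plain Python rather than as a dbt macro. The function name `top_values` and the sample data are hypothetical and are not part of dbt-profiler; the sketch only illustrates how the top values and their share of all rows could be computed for one column.

```python
from collections import Counter


def top_values(values, n=2):
    """Return the n most common values with their share of all rows.

    Hypothetical helper for illustration only; not a dbt-profiler API.
    """
    total = len(values)
    if total == 0:
        return []
    counts = Counter(values)
    # most_common(n) returns (value, count) pairs sorted by count, descending
    return [(value, count / total) for value, count in counts.most_common(n)]


# Made-up sample column: 10 rows, "A" appears 5 times and "B" twice
column_1 = ["A"] * 5 + ["B"] * 2 + ["C", "D", "E"]
print(top_values(column_1, n=2))  # [('A', 0.5), ('B', 0.2)]
```

Each proportion is simply the value's row count divided by the total row count, matching the 0.50 and 0.20 shown for Column 1.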

Looking forward to your thoughts!

stumelius commented 1 year ago

@diegodewilde I've thought about adding a "mode" (most common value) profiling metric to the package but never got around to implementing it. This proposal expands the mode concept into the N most common values, and I think it's a good idea.

Just throwing thoughts here:

What would be a sensible default for the number of top values? 1, 2, 3?
How should we name the columns? top_1_value, top_1_value_proportion, top_2_value, top_2_value_distribution, etc?
Is there a better way to display the distributions than the (value, proportion) pairs for each top value?
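
As a rough illustration of the first two questions, the sketch below flattens the top values into one wide row with a configurable N. It is plain Python, not a dbt macro; the function name `top_value_columns`, the default of 3, and the choice of the `top_{i}_value` / `top_{i}_value_proportion` naming are assumptions picked from the options floated above, not a decided design.

```python
from collections import Counter


def top_value_columns(values, n=3):
    """Flatten the n most common values into one wide row (a dict).

    Hypothetical helper; the column names follow one of the naming
    options floated in this thread and are not an existing API.
    """
    total = len(values)
    ranked = Counter(values).most_common(n)
    row = {}
    for i in range(1, n + 1):
        if i <= len(ranked):
            value, count = ranked[i - 1]
            row[f"top_{i}_value"] = value
            row[f"top_{i}_value_proportion"] = count / total
        else:
            # Fewer than n distinct values: pad so every row has the same columns
            row[f"top_{i}_value"] = None
            row[f"top_{i}_value_proportion"] = None
    return row


print(top_value_columns(["x", "x", "y", "x"], n=2))
# {'top_1_value': 'x', 'top_1_value_proportion': 0.75,
#  'top_2_value': 'y', 'top_2_value_proportion': 0.25}
```

One trade-off this makes visible: a wide layout adds two columns per top value, so a larger N quickly widens the profile table, which is part of why the third question about alternative displays matters.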

Would you be interested in implementing this? :)

diegodewilde commented 1 year ago

Hi stumelius,

It would make sense to make this dynamic, so you can choose the number of top values you want to see in your docs. Not sure if that's possible?
Sounds like a good suggestion. I actually think this goes hand in hand with the visualization you would like to see here.

stumelius commented 1 year ago

@diegodewilde Circling back to this. Are you still interested in this feature, and if so, would you like to contribute? :)