awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[FEATURE] Supporting Aggregation metrics for a group #528

Open theajay87 opened 6 months ago

theajay87 commented 6 months ago

Is your feature request related to a problem? Please describe.

We have a use case where we need to generate aggregation metrics such as SUM and MEAN, and scannable metrics such as MAX, MIN, MIN-LENGTH and MAX-LENGTH, per group, where a group is defined by a column (or columns) in a DataFrame.
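To make the request concrete, here is a minimal sketch in plain Python (not Deequ or Spark code) of the per-group output we have in mind. The function name `grouped_metrics`, the sample rows and the column names are all made up for illustration; none of them come from the Deequ API.

```python
from collections import defaultdict
from statistics import mean


def grouped_metrics(rows, group_cols, numeric_col, text_col):
    """Compute SUM/MEAN/MIN/MAX and MIN-/MAX-LENGTH per group.

    Hypothetical helper for illustration only; it is not part of Deequ.
    `rows` is a list of dicts, `group_cols` the columns that define a group.
    """
    # Bucket the rows by the tuple of grouping-column values.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        groups[key].append(row)

    # Compute every metric once per group.
    result = {}
    for key, members in groups.items():
        values = [r[numeric_col] for r in members]
        lengths = [len(r[text_col]) for r in members]
        result[key] = {
            "sum": sum(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "min_length": min(lengths),
            "max_length": max(lengths),
        }
    return result


# Made-up sample data: one group column, one numeric and one text column.
rows = [
    {"country": "DE", "amount": 10.0, "name": "Anna"},
    {"country": "DE", "amount": 30.0, "name": "Maximilian"},
    {"country": "US", "amount": 5.0, "name": "Bo"},
]
metrics = grouped_metrics(rows, ["country"], "amount", "name")
```

Here `metrics` holds one entry per distinct `country` value, each carrying the six metrics listed above. In Deequ we would want the same shape of result, but computed by analyzers over a Spark DataFrame rather than in driver-side Python.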

Describe the solution you'd like

Currently, ScanShareableFrequencyBasedAnalyzer is implemented only by CountDistinct, Distinctness, Entropy, Uniqueness and UniqueValueRatio. I would like to add similar implementations for all other scannable and aggregation metrics, so that each metric can be computed at group level.

Describe alternatives you've considered