awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[FEATURE] Extend RatioOfSums to support other aggregations #556

Open mentekid opened 3 months ago

mentekid commented 3 months ago

Is your feature request related to a problem? Please describe.

PR 552 introduced a RatioOfSums analyzer, which computes the ratio between the sums of two columns, for example to check that both columns add up to roughly the same total. We can generalize this analyzer into a RatioOfAggregation that accepts any kind of Spark aggregation, e.g. average.

Describe the solution you'd like

There should be a generic RatioOfAggregation check that accepts two columns and an aggregation function. RatioOfSums would then be one implementation of it, with the aggregation set to sum.
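
To make the proposal concrete, here is a minimal sketch of the idea using only the plain Spark DataFrame API. It is not Deequ's analyzer interface: the object name, `ratioOfAggregation`, and its signature are hypothetical, chosen for illustration. The aggregation is passed as a `Column => Column` function, and a sum-based ratio falls out as a special case.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, sum}

object RatioOfAggregationSketch {

  // Hypothetical helper, not part of Deequ: applies the same Spark aggregation
  // to two columns and returns aggregation(numerator) / aggregation(denominator).
  def ratioOfAggregation(
      df: DataFrame,
      numerator: String,
      denominator: String,
      aggregation: Column => Column): Option[Double] = {
    val ratio = aggregation(col(numerator)) / aggregation(col(denominator))
    val row = df.agg(ratio.cast("double")).head()
    // The result is null when no ratio can be computed, e.g. for an empty input.
    if (row.isNullAt(0)) None else Some(row.getDouble(0))
  }

  // The sum-based ratio is the special case where the aggregation is sum.
  def ratioOfSums(df: DataFrame, numerator: String, denominator: String): Option[Double] =
    ratioOfAggregation(df, numerator, denominator, c => sum(c))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("ratio-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq((1.0, 2.0), (3.0, 2.0), (15.0, 16.0)).toDF("col1", "col2")

    println(ratioOfSums(df, "col1", "col2"))                     // sum(col1) / sum(col2)
    println(ratioOfAggregation(df, "col1", "col2", c => avg(c))) // avg(col1) / avg(col2)

    spark.stop()
  }
}
```

Taking the aggregation as a function keeps the ratio logic in one place; a real analyzer would additionally need to plug into Deequ's state and metric handling, which this sketch leaves out.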

Describe alternatives you've considered

The alternative would be to let users define Check assertions as a function of another aggregator's value. Rather than saying this:

VerificationSuiteBuilder()
    ...
    .ratioOfSums("col1", "col2", _ > 0.9)

they could define their checks as

VerificationSuiteBuilder()
    ...
    .sum("col1", _ > 0.9 * sum("col2"))

(This is pseudocode; the idea is to pass an analyzer as part of the assertion.)
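
For comparison, something close to this alternative can be approximated with two passes over the data: compute the reference sum first, then capture it in an ordinary `hasSum` assertion. The sketch below assumes Deequ's `Check.hasSum` and `VerificationSuite` APIs; the object and the helper `sumExceedsFractionOf` are hypothetical names, and this is not the single-pass, analyzer-in-assertion design the pseudocode describes.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object SumAgainstSumSketch {

  // Hypothetical helper: true if sum(left) > factor * sum(right).
  def sumExceedsFractionOf(df: DataFrame, left: String, right: String, factor: Double): Boolean = {
    // First pass: compute the reference value with plain Spark.
    val row = df.agg(sum(col(right)).cast("double")).head()
    if (row.isNullAt(0)) {
      false // no reference value (e.g. empty input), so treat the check as failed
    } else {
      val rightSum = row.getDouble(0)
      // Second pass: the assertion closes over the precomputed value.
      val check = Check(CheckLevel.Error, s"sum($left) > $factor * sum($right)")
        .hasSum(left, _ > factor * rightSum)
      VerificationSuite().onData(df).addCheck(check).run().status == CheckStatus.Success
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sum-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq((1.0, 2.0), (3.0, 2.0), (15.0, 16.0)).toDF("col1", "col2")
    println(sumExceedsFractionOf(df, "col1", "col2", 0.9))

    spark.stop()
  }
}
```

The extra pass is the cost of this workaround, which is part of what a dedicated ratio analyzer or analyzer-aware assertions would avoid.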