awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

New analyzer, RatioOfSums #552

Closed scott-gunn closed 5 months ago

scott-gunn commented 6 months ago

Issue #, if available:

Description of changes: This PR adds a new analyzer called RatioOfSums. It sums each of two columns across the dataset, then divides one sum by the other.
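To make the computation concrete, here is a minimal plain-Python sketch of a ratio of sums. It is not the analyzer's actual code or Deequ's API. The function name `ratio_of_sums`, the column names, and the choice to return `None` for a zero denominator are all assumptions made for illustration.

```python
from typing import Iterable, Mapping, Optional


def ratio_of_sums(
    rows: Iterable[Mapping[str, float]],
    numerator: str,
    denominator: str,
) -> Optional[float]:
    """Sum two columns over all rows, then divide the sums.

    Hypothetical helper for illustration only; returning None when the
    denominator sums to zero is an assumption of this sketch.
    """
    num_total = 0.0
    den_total = 0.0
    for row in rows:
        num_total += row[numerator]
        den_total += row[denominator]
    if den_total == 0:
        return None
    return num_total / den_total


# One row per player per game (made-up sample data).
rows = [
    {"hits": 1, "at_bats": 4},
    {"hits": 3, "at_bats": 4},
    {"hits": 0, "at_bats": 2},
]

# Ratio of sums: (1 + 3 + 0) / (4 + 4 + 2)
result = ratio_of_sums(rows, "hits", "at_bats")

# For contrast: the mean of per-row ratios weights every row equally,
# regardless of how many at-bats it represents.
mean_of_ratios = sum(r["hits"] / r["at_bats"] for r in rows) / len(rows)
```

The two quantities differ on this sample (0.4 versus roughly 0.33). That difference is why a ratio of sums cannot in general be recovered from per-row ratios alone.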

For example, imagine a baseball dataset that contains hits and total at-bats, with a row for every player in every game. If a player's batting average (total hits / total at-bats) changes by more than 0.2 in one week, an error or warning could be raised. Another example is a percent-of-total calculation: in a table with many rows of facts val1, val2, val3, and total, we may want to ensure that val1's share of the total doesn't change by more than X in a given time period.
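The week-over-week check described above could be sketched in plain Python as follows. This is an illustration under stated assumptions, not how Deequ wires the analyzer into its checks. The helper names `batting_average` and `change_within_limit`, the sample data, and the 0.2 threshold taken from the example are all hypothetical.

```python
from typing import List, Tuple

# Each row is (hits, at_bats) for one player in one game (made-up data).
Row = Tuple[int, int]


def batting_average(rows: List[Row]) -> float:
    """Ratio of sums: total hits divided by total at-bats."""
    total_hits = sum(hits for hits, _ in rows)
    total_at_bats = sum(at_bats for _, at_bats in rows)
    if total_at_bats == 0:
        # Assumption of this sketch: treat an empty denominator as an error.
        raise ValueError("total at-bats is zero")
    return total_hits / total_at_bats


def change_within_limit(
    previous: List[Row], current: List[Row], max_change: float
) -> Tuple[bool, float]:
    """Return (passed, absolute change) between two periods."""
    change = abs(batting_average(current) - batting_average(previous))
    return change <= max_change, change


last_week = [(4, 20), (6, 20)]    # 10 hits / 40 at-bats
this_week = [(12, 20), (8, 20)]   # 20 hits / 40 at-bats

passed, change = change_within_limit(last_week, this_week, max_change=0.2)
if not passed:
    print(f"warning: batting average changed by {change:.2f}")
```

The same shape covers the percent-of-total case: substitute val1 for hits and total for at-bats.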

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

akalotkin commented 5 months ago

@rdsharma26, could you take a look at this PR, please?

rdsharma26 commented 5 months ago

Thanks @akalotkin for the PR. Can you add the copyright header to the new file, in order to unblock the build?

scott-gunn commented 5 months ago

@rdsharma26 The copyright has been added.

rdsharma26 commented 5 months ago

Thank you. The changes look good and we are internally reviewing as well. We will get back to you by tomorrow.

mentekid commented 5 months ago

Opened this to track https://github.com/awslabs/deequ/issues/556