awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Custom user analyzers #549

Open sonofagunn opened 6 months ago

sonofagunn commented 6 months ago

Is your feature request related to a problem? Please describe.

I would like to create my own Analyzer, but I can't serialize its result to or from a repository. If this is a direction the maintainers would like to go, I could take a shot at a PR.

Describe the solution you'd like

I think the only missing piece is a way to register custom analyzers with AnalysisResultSerde. Otherwise, it is easy enough to create your own.

Describe alternatives you've considered

I've also considered submitting a PR that adds the analyzer I need to the project, but letting users create their own seems more powerful and useful.

Additional context

Maybe there is another way to get what I want? I want what I'm calling a RatioOfSums analyzer: it sums each of two columns and then divides one sum by the other to produce the final result.

For example, imagine a baseball dataset that contains hits and at-bats, with a row for every player in every game. If a player's batting average (total hits / total at-bats) changes by more than 0.2 in one week, an error or warning could be raised. Another example is a percent-of-total calculation on a table that has many rows with facts val1, val2, val3, and total, where we want to ensure that val1's percentage of the total doesn't change by more than X in a given time period.
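
To make the request concrete, here is a minimal sketch of the computation I have in mind, written in plain Scala with no Spark or Deequ dependency. It does not use Deequ's Analyzer or AnalysisResultSerde APIs, and every name in it (RatioOfSumsState, RatioOfSums, fromRows, changedByMoreThan) is hypothetical and only for illustration. The state keeps the two running sums so that partial results can be merged before the division happens.

```scala
// Hypothetical state for a "ratio of sums" metric: the running sums of the
// numerator column (e.g. hits) and the denominator column (e.g. at-bats).
final case class RatioOfSumsState(numeratorSum: Double, denominatorSum: Double) {

  // States computed on different partitions or time slices combine by
  // adding the sums; the division is deferred until the very end.
  def merge(other: RatioOfSumsState): RatioOfSumsState =
    RatioOfSumsState(
      numeratorSum + other.numeratorSum,
      denominatorSum + other.denominatorSum)

  // The final metric value; None when the denominator sums to zero.
  def ratio: Option[Double] =
    if (denominatorSum == 0.0) None else Some(numeratorSum / denominatorSum)
}

object RatioOfSums {

  // Builds the state from (numerator, denominator) pairs, one pair per row.
  def fromRows(rows: Seq[(Double, Double)]): RatioOfSumsState =
    rows.foldLeft(RatioOfSumsState(0.0, 0.0)) { case (acc, (num, den)) =>
      RatioOfSumsState(acc.numeratorSum + num, acc.denominatorSum + den)
    }

  // Some(true) when the ratio moved by more than `threshold` between two
  // periods; None when either ratio is undefined.
  def changedByMoreThan(
      previous: RatioOfSumsState,
      current: RatioOfSumsState,
      threshold: Double): Option[Boolean] =
    for {
      p <- previous.ratio
      c <- current.ratio
    } yield math.abs(c - p) > threshold
}
```

With made-up data for the baseball case, a player going 3 for 8 one week (`fromRows(Seq((2.0, 4.0), (1.0, 4.0)))`, ratio 0.375) and 1 for 8 the next (`fromRows(Seq((0.0, 4.0), (1.0, 4.0)))`, ratio 0.125) would be flagged by `changedByMoreThan(week1, week2, 0.2)`. The point of keeping the sums rather than the ratio is that the value I would want stored in a repository is reproducible from a small, mergeable state.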