awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[FEATURE] Support Wilson Score Interval for RetainCompletenessRule #563

Closed zeotuan closed 1 month ago

zeotuan commented 2 months ago

Is your feature request related to a problem? Please describe.

Currently, RetainCompletenessRule uses the Wald interval (the normal approximation interval) to calculate the confidence interval for completeness. This method performs poorly when the estimated proportion p is close to either 0 or 1, which is already noted as a TODO in the code itself.
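For reference, these are the standard textbook forms of the two intervals, not taken from the Deequ source. Here $\hat{p}$ is the observed completeness, $n$ the number of rows, and $z$ the normal quantile for the chosen confidence level (the symbols are my own notation):

```latex
% Wald (normal approximation) interval
\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

% Wilson score interval
\frac{\hat{p} + \dfrac{z^2}{2n} \pm z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
```

When $\hat{p}$ is exactly 0 or 1, the Wald half-width is zero, so the interval collapses to a single point regardless of $n$. The Wilson interval keeps a nonzero width in that case and always stays within $[0, 1]$.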

Describe the solution you'd like

Switch to the Wilson score interval, or provide a strategy pattern so that users can choose between Wilson, Wald, and any other interval calculation techniques added in the future.
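A minimal sketch of what the strategy approach could look like. All names here (`ConfidenceIntervalStrategy`, `WaldInterval`, `WilsonScoreInterval`, `calculate`) are hypothetical and only illustrate the idea; they are not Deequ's existing API, and the real rule would need to be wired to accept such a strategy.

```scala
// Hypothetical strategy interface; not part of Deequ's current API.
trait ConfidenceIntervalStrategy {
  /** Returns (lower, upper) bounds for a proportion pHat observed over n rows,
    * using the normal quantile z (e.g. 1.96 for roughly 95% confidence). */
  def calculate(pHat: Double, n: Long, z: Double): (Double, Double)
}

// Wald (normal approximation) interval.
object WaldInterval extends ConfidenceIntervalStrategy {
  def calculate(pHat: Double, n: Long, z: Double): (Double, Double) = {
    require(n > 0 && pHat >= 0.0 && pHat <= 1.0, "need n > 0 and 0 <= pHat <= 1")
    val halfWidth = z * math.sqrt(pHat * (1 - pHat) / n)
    (pHat - halfWidth, pHat + halfWidth)
  }
}

// Wilson score interval.
object WilsonScoreInterval extends ConfidenceIntervalStrategy {
  def calculate(pHat: Double, n: Long, z: Double): (Double, Double) = {
    require(n > 0 && pHat >= 0.0 && pHat <= 1.0, "need n > 0 and 0 <= pHat <= 1")
    val z2 = z * z
    val denominator = 1 + z2 / n
    val center = (pHat + z2 / (2.0 * n)) / denominator
    val halfWidth =
      z * math.sqrt(pHat * (1 - pHat) / n + z2 / (4.0 * n * n)) / denominator
    (center - halfWidth, center + halfWidth)
  }
}
```

For a fully complete column of 100 rows, `WaldInterval.calculate(1.0, 100L, 1.96)` collapses to a zero-width interval at 1.0, while `WilsonScoreInterval.calculate(1.0, 100L, 1.96)` gives a lower bound of roughly 0.963. That difference is the motivation for making the interval pluggable.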

I am happy to help with this implementation.