awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.27k stars 536 forks source link

Adding Wilson Score Confidence Interval Strategy #567

Closed zeotuan closed 4 months ago

zeotuan commented 5 months ago

Fixes #563

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zeotuan commented 4 months ago

@rdsharma26 Hi can you help review this PR.

rdsharma26 commented 4 months ago

Thank you for the PR! @zeotuan The changes look good to me. Can you fix the failing build ?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

zeotuan commented 4 months ago

Thank you for the PR! @zeotuan The changes look good to me. Can you fix the failing build ?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

I think making Wilson Score Confidence Interval default right now might introduce some surprising changes to existing data quality pipelines. I will add example usage documentation for this. We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

rdsharma26 commented 4 months ago

We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

This seems like a safe approach. Could we configure the default to be the Wald strategy in the following line?

 private val defaultIntervalStrategy: ConfidenceIntervalStrategy = WilsonScoreIntervalStrategy()

The build also failed due to:

error file=/home/runner/work/deequ/deequ/src/test/scala/com/amazon/deequ/suggestions/rules/interval/IntervalStrategyTest.scala message=expected start of definition, but was Token(VAL,val,1285,val)
rdsharma26 commented 4 months ago

Left a comment and a unit test needs fixing.