awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Adding chi-square distance method for categorical variables #444

Closed bevhanno closed 1 year ago

bevhanno commented 1 year ago

Description of changes:

The distance method in src/main/scala/com/amazon/deequ/analyzers/Distance.scala applies the Kolmogorov–Smirnov (KS) test to both numerical and categorical variables. Because the KS test is mostly suited to numerical variables, this PR adds a chi-square test method that can optionally be used for categorical variables. The function signature of categoricalDistance remains unchanged, and the KS test is still applied as the default method.
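
As a rough illustration of the chi-square idea for categorical data, the sketch below computes Pearson's chi-square statistic from two maps of category counts. It is plain Scala and is not the code added to Distance.scala (the PR adds a spark-mllib dependency, presumably for this purpose). The names ChiSquareSketch and chiSquareStatistic, and the choice to ignore categories that are missing from the expected sample, are assumptions made only for this example.

```scala
object ChiSquareSketch {

  // Pearson chi-square statistic of `observed` counts against `expected` counts.
  // Hypothetical helper for illustration; not part of the deequ API.
  def chiSquareStatistic(expected: Map[String, Long], observed: Map[String, Long]): Double = {
    val expectedTotal = expected.values.sum.toDouble
    val observedTotal = observed.values.sum.toDouble
    require(expectedTotal > 0 && observedTotal > 0,
      "both samples must contain at least one observation")

    // Rescale expected counts so that both samples have the same total.
    val scale = observedTotal / expectedTotal

    // Assumption: only categories with a positive expected count contribute,
    // because a zero expected count would lead to a division by zero.
    expected.iterator.collect {
      case (category, expectedCount) if expectedCount > 0 =>
        val e = expectedCount * scale
        val o = observed.getOrElse(category, 0L).toDouble
        (o - e) * (o - e) / e
    }.sum
  }
}
```

For example, with expected counts Map("a" -> 50L, "b" -> 50L) and observed counts Map("a" -> 60L, "b" -> 40L), each category contributes (10 * 10) / 50 = 2.0, so the statistic is 4.0. A real test would additionally compare this value against the chi-square distribution with the appropriate degrees of freedom.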

Modified files:
src/main/scala/com/amazon/deequ/analyzers/Distance.scala : added chi-square method
src/test/scala/com/amazon/deequ/KLL/KLLDistanceTest.scala : added tests for chi-square method
pom.xml : added dependency to spark-mllib

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rdsharma26 commented 1 year ago

Please add build logs to the PR, since we do not have auto build configured.

bevhanno commented 1 year ago

The original version of selectMetrics() used a hard-coded value of 1.8 for c(alpha). It is more convenient to pass the alpha value directly as a parameter and calculate c(alpha) from it. This option has been added in the latest commit:

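       // c(alpha) = sqrt(-ln(alpha / 2) / 2); fall back to the default when alpha is not given
       val cAlpha: Double = alpha match {
         case Some(a) => Math.sqrt(-Math.log(a / 2) / 2)
         case None => defaultCAlpha
       }
       // subtract the KS critical value for sample sizes n and m, clamping at zero
       val linfRobust = Math.max(0.0, linfSimple - cAlpha * Math.sqrt((n + m) / (n * m)))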
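
To make the effect of the alpha parameter concrete, here is a small self-contained sketch of the same calculation. The object name KsThresholdSketch, the method names, the parameter types, and the range check on alpha are assumptions for illustration; only the two formulas and the default of 1.8 are taken from the comment above.

```scala
object KsThresholdSketch {

  // Previously hard-coded value of c(alpha) mentioned above.
  val defaultCAlpha: Double = 1.8

  // c(alpha) = sqrt(-ln(alpha / 2) / 2) for the two-sample KS test.
  def cAlpha(alpha: Option[Double]): Double = alpha match {
    case Some(a) =>
      // Assumption: alpha is a significance level strictly between 0 and 1.
      require(a > 0.0 && a < 1.0, "alpha must lie strictly between 0 and 1")
      math.sqrt(-math.log(a / 2) / 2)
    case None => defaultCAlpha
  }

  // Subtracts the critical value for sample sizes n and m and clamps at zero.
  def robustDistance(linfSimple: Double, n: Double, m: Double, alpha: Option[Double]): Double =
    math.max(0.0, linfSimple - cAlpha(alpha) * math.sqrt((n + m) / (n * m)))
}
```

With alpha = 0.05, cAlpha evaluates to roughly 1.358. For n = m = 100 and a raw distance of 0.3, the robust distance is then roughly 0.108, while a raw distance of 0.1 is clamped to 0.0. Solving the same formula for the default of 1.8 gives an alpha of roughly 0.003, so the hard-coded value is considerably stricter than the common 0.05 level.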
bevhanno commented 1 year ago

Thanks for your reviews, @rdsharma26 and @shehzad-qureshi, and thanks for merging!

shehzad-qureshi commented 1 year ago

thank you for your contribution!