Closed (bevhanno closed 1 year ago)
Please add build logs to the PR, since we do not have auto build configured.
The original version of the function selectMetrics() used a hard-coded value of 1.8 for c(alpha). It is more convenient to pass the alpha value directly as a parameter and calculate c(alpha) from it. This option has been added in the latest commit:
val cAlpha: Double = alpha match {
  case Some(a) => Math.sqrt(-Math.log(a / 2) * 1 / 2)
  case None => defaultCAlpha
}
val linfRobust = Math.max(0.0, linfSimple - cAlpha * Math.sqrt((n + m) / (n * m)))
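For reviewers who want to check the numbers, here is a minimal standalone sketch of the same calculation. The object name `KsCorrectionSketch`, the helper names, the `require` check, and the assumption that `n` and `m` are `Double` sample sizes are illustrative and not taken from the PR; `defaultCAlpha` is set to 1.8 only because that is the hard-coded value mentioned above.

```scala
object KsCorrectionSketch {
  // Assumed default: the previously hard-coded value mentioned in this PR.
  val defaultCAlpha: Double = 1.8

  // c(alpha) = sqrt(-ln(alpha / 2) / 2); falls back to the default when alpha is None.
  def cAlpha(alpha: Option[Double]): Double = alpha match {
    case Some(a) =>
      // Hypothetical guard, not part of the PR snippet.
      require(a > 0.0 && a < 1.0, "alpha must lie in (0, 1)")
      math.sqrt(-math.log(a / 2) / 2)
    case None => defaultCAlpha
  }

  // Subtracts the critical value from the raw L-infinity distance, floored at 0.
  // n and m are taken as Double so that (n + m) / (n * m) is not integer division.
  def robustDistance(linfSimple: Double, n: Double, m: Double, alpha: Option[Double]): Double =
    math.max(0.0, linfSimple - cAlpha(alpha) * math.sqrt((n + m) / (n * m)))
}
```

With this formula, alpha = 0.05 gives c(alpha) of about 1.358, and the old default of 1.8 corresponds to an alpha of roughly 0.003.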
Thanks for your reviews, @rdsharma26 and @shehzad-qureshi, and thanks for merging!
Thank you for your contribution!
Description of changes:
The distance method used in src/main/scala/com/amazon/deequ/analyzers/Distance.scala applies the Kolmogorov–Smirnov (KS) test to numerical and categorical variables. As the KS test is mostly suited for numerical variables, this PR adds the chi-square test method, which can optionally be used with categorical variables. The function signature of categoricalDistance remains unchanged and still applies the KS test as the default method.

Modified files:
- src/main/scala/com/amazon/deequ/analyzers/Distance.scala: added chi-square method
- src/test/scala/com/amazon/deequ/KLL/KLLDistanceTest.scala: added tests for chi-square method
- pom.xml: added dependency to spark-mllib

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
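
To illustrate the idea behind the chi-square option described above, here is a rough, dependency-free sketch of Pearson's chi-square statistic for categorical counts. It is not the code added to Distance.scala and does not use spark-mllib; the object name `ChiSquareSketch`, the function name, and the `Map[String, Long]` inputs are assumptions made for illustration. It returns only the statistic and the degrees of freedom, not a p-value.

```scala
object ChiSquareSketch {
  // Pearson chi-square statistic for a sample's category counts against
  // expected counts. Expected counts are rescaled to the sample's total so
  // that the two inputs may come from data sets of different sizes.
  // Returns (statistic, degrees of freedom).
  def chiSquareStatistic(
      sample: Map[String, Long],
      expected: Map[String, Long]): (Double, Int) = {
    require(expected.nonEmpty && expected.values.forall(_ > 0),
      "expected counts must be positive")
    require(sample.keySet.subsetOf(expected.keySet),
      "sample contains categories missing from expected")

    val sampleTotal = sample.values.sum.toDouble
    val expectedTotal = expected.values.sum.toDouble

    val statistic = expected.map { case (category, expectedCount) =>
      val e = expectedCount / expectedTotal * sampleTotal // rescaled expected count
      val o = sample.getOrElse(category, 0L).toDouble     // observed count, 0 if absent
      (o - e) * (o - e) / e
    }.sum

    (statistic, expected.size - 1)
  }
}
```

For example, a sample with counts a = 30, b = 70 compared against expected counts a = 50, b = 50 yields a statistic of 16 with 1 degree of freedom. Turning the statistic into a p-value requires the chi-square distribution, which the Scala standard library does not provide.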