awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Using KLL to compare 2 distributions #378

Open marcostong17 opened 3 years ago

marcostong17 commented 3 years ago

I have a similar use case: I need to compare two distributions, for example the same column over two date periods, to detect a significant change (i.e. anomaly detection). I understand that KL divergence is a well-known method for comparing two distributions, but it seems to require that both distributions share the same set of bins/buckets.

Can KLL be configured to produce output that can be used for KL divergence? Or are there other methods (distance functions) for comparing two distributions that produce a consistent number indicating the difference between them?

Thanks in advance
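
To make the shared-bins requirement concrete, here is a minimal Python sketch (not Deequ code, and not using KLL). It assumes the raw values of both periods are available, builds one set of equal-width bin edges spanning both samples, and computes the KL divergence over those bins. The helper names (`shared_bin_edges`, `histogram`, `kl_divergence`) and the small `smoothing` constant are hypothetical choices for illustration; the smoothing only keeps empty bins from producing a division by zero.

```python
import math

def shared_bin_edges(a, b, num_bins):
    """Equal-width bin edges that cover the value range of both samples."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:  # degenerate case: every value is identical
        hi = lo + 1.0
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

def histogram(values, edges, smoothing=1e-9):
    """Bin probabilities over the given edges, smoothed so no bin is zero."""
    num_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    counts = [0] * num_bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * num_bins)
        # Clamp so the maximum value lands in the last bin.
        counts[min(max(idx, 0), num_bins - 1)] += 1
    total = len(values) + smoothing * num_bins
    return [(c + smoothing) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two probability vectors defined over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Example: the same column sampled over two periods (made-up values).
last_week = [1.0, 2.0, 2.5, 3.0, 4.0]
this_week = [3.0, 4.0, 4.5, 5.0, 6.0]
edges = shared_bin_edges(last_week, this_week, num_bins=5)
p = histogram(last_week, edges)
q = histogram(this_week, edges)
print(kl_divergence(p, q))
```

Note that KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; the result also depends on the number of bins and on the smoothing constant.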

sscdotopen commented 3 years ago

https://github.com/awslabs/deequ/blob/db63229e83bf60da0f7cff323f081b2490578b38/src/main/scala/com/amazon/deequ/analyzers/Distance.scala
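
As an illustration of the distance-function idea in general, here is a minimal Python sketch of one common choice: the largest absolute gap between the two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic). It is an independent example and is not claimed to reproduce the contents of the linked file. The function name `linf_cdf_distance` is hypothetical, and the sketch assumes the raw values are available rather than a KLL sketch. Because it compares CDFs, it does not need the two distributions to share bins.

```python
from bisect import bisect_right

def linf_cdf_distance(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    # The gap between two step functions can only change at observed values.
    points = sorted(set(sa) | set(sb))
    best = 0.0
    for x in points:
        fa = bisect_right(sa, x) / len(sa)  # fraction of a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # fraction of b that is <= x
        best = max(best, abs(fa - fb))
    return best

# Example with made-up values for two periods of the same column.
print(linf_cdf_distance([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]))
```

The result always lies between 0 (identical empirical distributions) and 1 (completely separated value ranges), which makes it convenient to threshold for anomaly detection.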