awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.32k stars 539 forks

[FEATURE] Improve performance of KLLSketch and DataType Analyzer #583

Open zeotuan opened 2 months ago

zeotuan commented 2 months ago

Is your feature request related to a problem? Please describe. Currently, the KLLSketch and DataType analyzers are implemented using UserDefinedAggregateFunction:

https://github.com/awslabs/deequ/blob/3b1a3ec5d1aac8e5e15e694be709530fd343d8a3/src/main/scala/com/amazon/deequ/analyzers/catalyst/StatefulKLLSketch.scala#L29

https://github.com/awslabs/deequ/blob/3b1a3ec5d1aac8e5e15e694be709530fd343d8a3/src/main/scala/com/amazon/deequ/analyzers/catalyst/StatefulDataType.scala#L26

UserDefinedAggregateFunction is deprecated as of Spark 3.0 and should be replaced with Aggregator, which offers much better performance, as outlined here: https://github.com/apache/spark/pull/25024#issue-293548866

Describe the solution you'd like: Reimplement StatefulDataType and StatefulKLLSketch using Aggregator.
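To illustrate the general shape such a migration could take, here is a minimal sketch of a typed Aggregator. It is not deequ's actual implementation: `TypeCounts` and `TypeCountAggregator` are hypothetical names, and the classification logic is a simplified stand-in for what StatefulDataType does. The point is only that the buffer becomes a plain case class instead of a mutable Row-based buffer.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer type for illustration; not deequ's real DataType state.
case class TypeCounts(nulls: Long, integral: Long, fractional: Long, other: Long)

// Counts how many string values look like integers, fractions, nulls, or anything else.
object TypeCountAggregator extends Aggregator[String, TypeCounts, TypeCounts] {
  // Regex extractors match the whole input string.
  private val Integral = """[-+]?\d+""".r
  private val Fractional = """[-+]?\d*\.\d+""".r

  // Empty buffer: the identity element for merge.
  def zero: TypeCounts = TypeCounts(0L, 0L, 0L, 0L)

  // Fold a single input value into the buffer.
  def reduce(b: TypeCounts, value: String): TypeCounts = value match {
    case null         => b.copy(nulls = b.nulls + 1)
    case Integral()   => b.copy(integral = b.integral + 1)
    case Fractional() => b.copy(fractional = b.fractional + 1)
    case _            => b.copy(other = b.other + 1)
  }

  // Combine two partial buffers, e.g. from different partitions.
  def merge(a: TypeCounts, b: TypeCounts): TypeCounts =
    TypeCounts(
      a.nulls + b.nulls,
      a.integral + b.integral,
      a.fractional + b.fractional,
      a.other + b.other)

  // The final result is the buffer itself in this sketch.
  def finish(reduction: TypeCounts): TypeCounts = reduction

  def bufferEncoder: Encoder[TypeCounts] = Encoders.product[TypeCounts]
  def outputEncoder: Encoder[TypeCounts] = Encoders.product[TypeCounts]
}
```

On Spark 3.0 and later, an Aggregator like this can be applied to untyped DataFrame columns by wrapping it with `org.apache.spark.sql.functions.udaf`, for example `df.agg(udaf(TypeCountAggregator)(col("value")))`, where `df` and the column name `value` are assumed here for illustration.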

I am happy to help with this implementation.