awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[Bugfix] Improve histogram performance #458

Closed: mentekid closed this 1 year ago

mentekid commented 1 year ago

Issue #, if available:

Description of changes:

While benchmarking Deequ, I noticed that our CountDistinct spec has a few irregularities compared to Distinctness and ShareableAnalyzers in general:

  1. It is not a shareable analyzer, so it is forced to run on its own, which makes it less efficient.
  2. It replaces null values with a sentinel value, which Distinctness does not.
  3. When computing the Histogram that it uses to derive the number of distinct values, Histogram performs an extraneous count() on the input dataframe. That count is useful when the caller is interested in relative frequencies, but it is wasted work when only the number of distinct values is needed.

This PR aims to solve the third issue.
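
To illustrate the third issue, here is a minimal sketch in plain Python, not Deequ's actual Scala/Spark code. The function names (`value_frequencies`, `histogram_with_ratios`, `count_distinct`) are hypothetical, and a Python list stands in for the input dataframe. It only shows why the total row count matters for relative frequencies and is unnecessary for the number of distinct values.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def value_frequencies(values: List[Hashable]) -> Counter:
    """Group the values and count occurrences (stands in for the group-by pass)."""
    return Counter(values)


def histogram_with_ratios(values: List[Hashable]) -> Dict[Hashable, Tuple[int, float]]:
    """Absolute and relative frequency per value.

    The relative frequency needs the total number of rows, which here
    stands in for the extra count() on the input dataframe.
    """
    total = len(values)  # the additional pass over the input
    freqs = value_frequencies(values)
    return {value: (n, n / total) for value, n in freqs.items()}


def count_distinct(values: List[Hashable]) -> int:
    """Number of distinct values.

    Only the number of groups matters, so the total row count is never
    computed. Null handling is deliberately left out of this sketch.
    """
    return len(value_frequencies(values))


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "a"]
    print(histogram_with_ratios(data))
    print(count_distinct(data))
```

In this model, `count_distinct` does strictly less work than `histogram_with_ratios`: both group the values, but only the latter needs the total. On a distributed dataframe the skipped step would be a separate job over the input, which is where the saving would be expected to come from.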

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.