While benchmarking Deequ, I noticed that our CountDistinct spec has a few irregularities compared to Distinctness and ShareableAnalyzers in general:
1. It is not a shareable analyzer, so it is forced to run on its own, which makes it less efficient.
2. It replaces null values with a sentinel value, which Distinctness does not do.
3. When computing the Histogram that it uses to derive the number of distinct values, Histogram performs an extraneous count() on the input dataframe. That count is useful when the caller is interested in relative frequencies, but it is unnecessary when only the number of distinct values is needed.
This PR aims to solve the third issue.
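To illustrate why the extra count() is avoidable, here is a minimal sketch in plain Python, not Deequ or Spark code. The function names (`frequencies`, `relative_frequencies`, `count_distinct`) are hypothetical and exist only for this example. It shows that relative frequencies need the total row count as a denominator, while the number of distinct values is just the number of groups. Null handling is deliberately left aside: `None` is simply treated as one more value here.

```python
from collections import Counter


def frequencies(values):
    """Group values and count occurrences (roughly what a group-by does)."""
    return Counter(values)


def relative_frequencies(values):
    """Relative frequencies need the total row count as a denominator."""
    counts = frequencies(values)
    total = len(values)  # stands in for the extra count() over the full input
    return {value: n / total for value, n in counts.items()}


def count_distinct(values):
    """The distinct count is the number of groups; no total row count is needed."""
    return len(frequencies(values))


rows = ["a", "b", "a", None, "c", "a"]
print(count_distinct(rows))             # number of groups
print(relative_frequencies(rows)["a"])  # share of rows equal to "a"
```

In a distributed setting the total row count is typically a separate job over the input, so skipping it when only the number of groups is needed should save one pass, assuming the grouped result is computed the same way in both cases.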
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.