awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.32k stars 539 forks source link

Feature/uniqueness row level results #471

Closed eycho-am closed 1 year ago

eycho-am commented 1 year ago

Issue #, if available: N/A

Description of changes: Building on this PR to add more analyzers that support row level results.

This PR adds row level results for the Uniqueness analyzer, which is used by several checks (isUnique, isPrimaryKey, hasUniqueness, uniqueValueRatio)

Row level results for other ScanShareableFrequencyBasedAnalyzers will be added in a future PR.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

eycho-am commented 1 year ago

Note: Tests for FileSystemMetricsRepositoryTest are currently failing. I'm working on a way to not serialize the fullColumn field of metrics, which is currently causing the issue:

Syntax Error
== SQL ==
CASE WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 1) THEN true WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 0) THEN NULL ELSE false END
        ----------^^^
eycho-am commented 1 year ago

Note: Tests for FileSystemMetricsRepositoryTest are currently failing. I'm working on a way to not serialize the fullColumn field of metrics, which is currently causing the issue:

Syntax Error
== SQL ==
CASE WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 1) THEN true WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 0) THEN NULL ELSE false END
        ----------^^^

Resolved this issue by not serializing the full column field in metrics for now