awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[MinLength/MaxLength] Apply filtered row behavior at the row level evaluation #547

rdsharma26 closed this 4 months ago

rdsharma26 commented 4 months ago

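Description of changes:

Apply the configured filtered row behavior when MinLength and MaxLength are evaluated at the row level. In the example below, filteredRow is set to FilteredRowOutcome.TRUE, yet the rows excluded by the where("ID > 2") filter are reported as false in the row-level results.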

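import com.amazon.deequ.analyzers.{AnalyzerOptions, FilteredRowOutcome, NullBehavior}
import com.amazon.deequ.checks.{Check, CheckLevel}

// Treat null values as empty strings; rows excluded by a filter should evaluate to true.
val analyzerOptions = AnalyzerOptions(
  nullBehavior = NullBehavior.EmptyString,
  filteredRow = FilteredRowOutcome.TRUE
)

// Both constraints apply only to rows with ID > 2.
val check = new Check(CheckLevel.Error, "test-check")
  .hasMinLength("Company", _ == 8, analyzerOptions = Some(analyzerOptions)).where("ID > 2")
  .hasMaxLength("Company", _ == 8, analyzerOptions = Some(analyzerOptions)).where("ID > 2")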

+---+----------------+-------+-----+-----------+----------+
|ID |Company         |ZipCode|State|City       |test-check|
+---+----------------+-------+-----+-----------+----------+
|1  |Acme            |90210  |CA   |Los Angeles|false     |   <-- Incorrect outcome for filtered row
|2  |Acme            |90211  |CA   |Los Angeles|false     |   <-- Incorrect outcome for filtered row 
|3  |Robocorp        |NULL   |NJ   |NULL       |true      |
|4  |Robocorp        |NULL   |NY   |New York   |true      |
+---+----------------+-------+-----+-----------+----------+
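To make the intended outcome concrete, here is a minimal, Spark-free sketch of the row-level semantics this change aims for. It is not Deequ's implementation: `FilteredRowSketch`, `Row`, `AsTrue`, `AsNull`, and `rowOutcome` are hypothetical names introduced only for illustration, with `AsTrue` and `AsNull` standing in for `FilteredRowOutcome.TRUE` and `FilteredRowOutcome.NULL`. Null handling (`NullBehavior`) is left out of the model.

```scala
// Hypothetical model of the expected row-level behavior; none of these names are Deequ APIs.
object FilteredRowSketch {
  sealed trait FilteredRow
  case object AsTrue extends FilteredRow // stands in for FilteredRowOutcome.TRUE
  case object AsNull extends FilteredRow // stands in for FilteredRowOutcome.NULL

  final case class Row(id: Int, company: String)

  // Outcome for a single row; None models a null outcome.
  def rowOutcome(
      row: Row,
      inFilter: Row => Boolean,   // the where(...) condition
      assertion: Int => Boolean,  // the length assertion, e.g. _ == 8
      filteredRow: FilteredRow): Option[Boolean] =
    if (inFilter(row)) {
      // Rows that pass the filter are judged by the length assertion.
      Some(assertion(row.company.length))
    } else {
      // Rows excluded by the filter take the configured outcome instead.
      filteredRow match {
        case AsTrue => Some(true)
        case AsNull => None
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1, "Acme"), Row(2, "Acme"), Row(3, "Robocorp"), Row(4, "Robocorp"))
    rows.foreach { r =>
      println(s"${r.id} ${r.company} ${rowOutcome(r, _.id > 2, _ == 8, AsTrue)}")
    }
  }
}
```

Under this model, rows 1 and 2 fall outside the `ID > 2` filter and therefore take the configured outcome rather than being judged on the length of "Acme", which is what the annotations in the table above flag as missing.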

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rdsharma26 commented 4 months ago

Thanks for the review, @eycho-am. I've addressed the comments in the latest commit.