awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[BUG] Row based output incorrect when using satisfies check and assertion with upper bound < 1 #519

Closed arsenalgunnershubert777 closed 7 months ago

arsenalgunnershubert777 commented 1 year ago

Describe the bug When using a satisfies check, the row-level (columnar) output does not match what the column condition implies. This happens specifically when the assertion passed in has an upper bound of the form d < 1.

To Reproduce Steps to reproduce the behavior:

  1. Create custom check using satisfies, with some sql column condition.
  2. Pass in an assertion function with bounds where the upper bound < 1
  3. Run check on input dataframe where some rows pass and some rows fail the column condition.
  4. The row based output when calling rowLevelResultsAsDataFrame will show all rows as false/fail

Code:

Check(CheckLevel.Error, id.value)
  .satisfies(
    sqlColumnCondition,
    "name",
    (d: Double) => d > 0 && d < 1.0
  )

Output row based dataframe:

+-----+------+------+
|index|values|result|
+-----+------+------+
|    1|  blue| false|
|    2| green| false|
|    3|  blue| false|
|    4|   red| false|
|    5|purple| false|
+-----+------+------+

However, if the assertion is adjusted so that the upper bound is d < 1.1 (instead of d < 1.0), the row-level results look correct.

Code:

Check(CheckLevel.Error, id.value)
  .satisfies(
    sqlColumnCondition,
    "name",
    (d: Double) => d > 0 && d < 1.1
  )

Output row based dataframe (this is correct behavior):

+-----+------+------+
|index|values|result|
+-----+------+------+
|    1|  blue|  true|
|    2| green|  true|
|    3|  blue|  true|
|    4|   red| false|
|    5|purple| false|
+-----+------+------+

Expected behavior The row-level output should mark each row as passed or failed based on the columnCondition alone, and it shouldn't be affected by the assertion. In particular, it shouldn't show every row as false when some rows satisfy the columnCondition. The second output above is the expected one.

Screenshots N/A

Additional context This row output issue may be caused by a line in constraintResultToColumn in VerificationResult. I'm not sure whether that line is needed for some other functionality. Also, the overall verification result check status (Success or Error) seems to be working correctly. Thanks for the help!
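
Both outputs in this report are consistent with the assertion being evaluated against each row's own outcome (1.0 when the condition holds, 0.0 when it does not) instead of against the fraction of rows that satisfy the condition. The plain-Scala sketch below models that hypothesis only. It does not use Deequ or Spark, the 1.0/0.0 encoding is an assumption, and the object and value names are made up for illustration.

```scala
// Plain-Scala model of the suspected behavior. No Spark or Deequ involved.
// All names here are hypothetical and exist only for this illustration.
object RowLevelAssertionSketch {
  // Per-row outcome of the column condition, copied from the "correct" output
  // in this report (rows 1-3 pass, rows 4-5 fail).
  val conditionHolds: Seq[Boolean] = Seq(true, true, true, false, false)

  // Assumption: each row is encoded as 1.0 (condition holds) or 0.0 (it does not).
  val indicators: Seq[Double] = conditionHolds.map(b => if (b) 1.0 else 0.0)

  // The two assertions used in this report.
  val strictBound: Double => Boolean = d => d > 0 && d < 1.0
  val looseBound: Double => Boolean = d => d > 0 && d < 1.1

  // Fraction of rows that satisfy the condition: 3 of 5.
  val compliance: Double = indicators.sum / indicators.size

  def main(args: Array[String]): Unit = {
    // Applying the assertion per row: 1.0 fails "d < 1.0" and 0.0 fails "d > 0",
    // so every row comes out false, as in the first output.
    println(indicators.map(strictBound)) // List(false, false, false, false, false)

    // With "d < 1.1", 1.0 passes and 0.0 still fails, as in the second output.
    println(indicators.map(looseBound)) // List(true, true, true, false, false)

    // Applied to the aggregate fraction, the strict assertion passes, which
    // matches the overall check status being reported correctly.
    println(strictBound(compliance)) // true
  }
}
```

If this model is right, the row-level column would only look correct for assertions that happen to accept 1.0 and reject 0.0, which would explain why changing the bound from 1.0 to 1.1 changes the row output.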

Sat30 commented 8 months ago
arsenalgunnershubert777 commented 8 months ago

Hi @Sat30, thanks for the response. Can you clarify what you mean by those bullet points? Yes, the row-level result should depend on the sqlCondition only, but changing the assertionFunction affects the result when it shouldn't.

rdsharma26 commented 8 months ago

@arsenalgunnershubert777

Thank you so much for reporting this issue. It has been fixed as part of PR #553. We will be releasing this to Maven as part of our next release cycle.