awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Empty state for analyzer - all input values were NULL #395

Open srinivasvinnakota opened 3 years ago

srinivasvinnakota commented 3 years ago

Running Deequ rules on an empty DataFrame produces an error.

Sometimes we have empty DataFrames that we pass through Deequ to explicitly verify that the "Size" check passes for a row count of 0, as expected.

Below is an example that runs a Deequ size check for zero rows on an empty DataFrame, with a filter applied as well. It fails, but I expect it to pass.

val dataFrame = getNumberDataFrame(13).filter(col("Number") === 100) // this returns an empty DataFrame

val result1 = VerificationSuite()
  .onData(dataFrame)
  .addCheck(Check(CheckLevel.Error, "")
    .hasSize(_ == 0)
    .where("Number=10"))
  .run()

After executing the logic above, I expect the result to be "Success", but I see the error below. Is this by design? I feel this should work, since it is a fairly common scenario.

But when I run the same code without .where("Number=10"), it reports success.
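Since the check succeeds when no `.where` is attached, one possible workaround is to apply the predicate to the DataFrame itself before handing it to Deequ, so that the size constraint carries no filter. The following is a minimal sketch rather than a confirmed fix. `EmptyFrameSizeCheck`, `verifyNoMatchingRows`, and the `toDF`-based stand-in for `getNumberDataFrame(13)` are hypothetical names introduced for illustration, and the sketch assumes Spark and Deequ are on the classpath.

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyFrameSizeCheck {

  // Hypothetical helper: filter with Spark first, then run an unfiltered size check.
  def verifyNoMatchingRows(data: DataFrame, predicate: String): VerificationResult = {
    val filtered = data.filter(predicate)
    VerificationSuite()
      .onData(filtered)
      .addCheck(Check(CheckLevel.Error, "no rows match predicate").hasSize(_ == 0))
      .run()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("deequ-empty-size")
      .getOrCreate()
    import spark.implicits._

    // Assumption: a stand-in for getNumberDataFrame(13), whose definition is not shown.
    val numbers = (1 to 13).toList.toDF("Number")
    val empty = numbers.filter($"Number" === 100) // no row has Number == 100

    val result = verifyNoMatchingRows(empty, "Number = 10")
    println(result.status)

    spark.stop()
  }
}
```

The trade-off is that the predicate no longer travels with the constraint, so each differently filtered check needs its own `VerificationSuite` run.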

Error : VerificationResult(Error,Map(Check(Error,,List(UniquenessConstraint(Uniqueness(List(Number),None)))) -> CheckResult(Check(Error,,List(UniquenessConstraint(Uniqueness(List(Number),None)))),Error,List(ConstraintResult(UniquenessConstraint(Uniqueness(List(Number),None)),Failure,Some(Empty state for analyzer Uniqueness(List(Number),None), all input values were NULL.),Some(DoubleMetric(Column,Uniqueness,Number,Failure(com.amazon.deequ.analyzers.runners.EmptyStateException: Empty state for analyzer Uniqueness(List(Number),None), all input values were NULL.))))))),Map(Uniqueness(List(Number),None) -> DoubleMetric(Column,Uniqueness,Number,Failure(com.amazon.deequ.analyzers.runners.EmptyStateException: Empty state for analyzer Uniqueness(List(Number),None), all input values were NULL.))))
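The nested `VerificationResult` dump above is hard to scan. As a rough sketch, failing constraints could instead be listed one per line, following the pattern shown in the Deequ README (`checkResults`, `constraintResults`, `ConstraintStatus`). `FailureReport` and `failureLines` are hypothetical names, and the sketch assumes Deequ is on the classpath.

```scala
import com.amazon.deequ.VerificationResult
import com.amazon.deequ.constraints.ConstraintStatus

object FailureReport {

  // Collects one line per constraint that did not succeed, with its message if present.
  def failureLines(result: VerificationResult): Seq[String] =
    result.checkResults.toSeq
      .flatMap { case (_, checkResult) => checkResult.constraintResults }
      .filter(_.status != ConstraintStatus.Success)
      .map(r => s"${r.constraint}: ${r.message.getOrElse("no message")}")
}
```

Calling `FailureReport.failureLines(result1).foreach(println)` would then print only the constraints that failed, each with its message.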