awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Check isContainedIn does not recognize string in quotes as allowed value #462

Closed (markushc closed this issue 1 year ago)

markushc commented 1 year ago

Steps to reproduce:

Run the unit test shown below.

import org.scalatest.flatspec.AnyFlatSpec
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.Check
import com.amazon.deequ.checks.CheckLevel
import com.amazon.deequ.checks.CheckStatus

private case class SomeData(data: String)

class IsContainedInQuotesTest extends AnyFlatSpec {
  it should "accept string in quotes as allowed value" in {
    // The single quotes are part of the value: the string is 'a', not a.
    val someData = Seq("'a'")
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.createDataFrame(someData.map(SomeData))
    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "myCheck")
          // The allowed values are exactly the values stored in the column.
          .isContainedIn("data", someData.toArray)
      )
      .run()
    assert(verificationResult.status == CheckStatus.Success)
  }
}

Expected outcome: The unit test passes, because 'a' (including the quotes) is both the only value in the column being checked and the only entry in the list of allowed values.

Actual outcome: The unit test fails. This seems to be related to the allowed value containing single quotes.

Deequ version: 2.0.3-spark-3.3

Java version: 11

mentekid commented 1 year ago

Thanks for reporting this. We will look into it.

marcantony commented 1 year ago

To add some context: it looks like the isContainedIn check tries to escape single quotes in the allowed values list by replacing ' with '': https://github.com/awslabs/deequ/blob/d8bfb9c71bdce712d64d861343a801a5c5a9562c/src/main/scala/com/amazon/deequ/checks/Check.scala#L1007

Doubling the quote works in standard SQL, but Spark SQL seems to require a backslash escape (\') instead: https://spark.apache.org/docs/latest/sql-ref-literals.html#parameters.
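
For illustration, here is a minimal sketch of what backslash-style escaping could look like when building the allowed-values list. The object and method names (SparkSqlLiteral, quote, inList) are hypothetical and not part of Deequ's API, and this is not necessarily how the library does or should implement it. It assumes Spark's default string-literal parsing, where a backslash escapes the character that follows it.

```scala
// Hypothetical helper, not part of Deequ: builds Spark SQL string literals
// using backslash escapes instead of doubled single quotes.
object SparkSqlLiteral {

  // Wraps a value in single quotes and escapes the characters that would
  // otherwise end or alter the literal.
  def quote(value: String): String = {
    val escaped = value
      .replace("\\", "\\\\") // escape backslashes first, so the ones added next stay intact
      .replace("'", "\\'")   // then escape single quotes with a backslash
    s"'$escaped'"
  }

  // Renders a sequence of values as the right-hand side of an IN predicate.
  def inList(values: Seq[String]): String =
    values.map(quote).mkString("(", ", ", ")")
}

object SparkSqlLiteralDemo extends App {
  println(SparkSqlLiteral.quote("'a'"))          // prints '\'a\''
  println(SparkSqlLiteral.inList(Seq("'a'", "b"))) // prints ('\'a\'', 'b')
}
```

With the doubled-quote approach, the same value would be rendered as '''a''', which Spark SQL does not appear to read back as the original string 'a'.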