awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Is Redshift supported as a data source? #522

Open jbleduigou opened 7 months ago


Hello,

I have been testing Deequ. So far I have had mixed results when using Redshift as a data source.

I am using the spark-redshift community library to load a DataFrame from Redshift.
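
For context, a load of this kind might look roughly like the sketch below. It is only an illustration: the format name matches the connector package that appears in the stack trace, but the JDBC URL, table name, S3 temp directory, IAM role, and the `RedshiftLoad` / `loadFromRedshift` names are hypothetical placeholders, not the actual setup.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RedshiftLoad {
  // Connector options. Every value here is a hypothetical placeholder.
  val redshiftOptions: Map[String, String] = Map(
    "url"          -> "jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS",
    "dbtable"      -> "event", // assumed table name, chosen because it has an eventid column
    "tempdir"      -> "s3a://example-bucket/spark-redshift-tmp/",
    "aws_iam_role" -> "arn:aws:iam::123456789012:role/example-redshift-role"
  )

  // Reads the table through the spark-redshift community connector,
  // which unloads the data to the S3 temp directory before Spark reads it.
  def loadFromRedshift(spark: SparkSession): DataFrame =
    spark.read
      .format("io.github.spark_redshift_community.spark.redshift")
      .options(redshiftOptions)
      .load()
}
```

The resulting DataFrame would then be passed to Deequ as `df`, for example `val df = RedshiftLoad.loadFromRedshift(spark)`.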

One example is a uniqueness check on a column, which fails with the following error:

java.sql.SQLException: Exception thrown in awaitResult: 
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:172) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:145) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.UnloadDataToS3(RedshiftRelation.scala:328) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.$anonfun$buildScanFromSQL$1(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at scala.Option.orElse(Option.scala:447) ~[scala-library-2.12.17.jar:?]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.buildScanFromSQL(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:53) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:49) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: cannot cast type boolean to double precision
        at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2613) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2281) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1878) ~[redshift-jdbc42-2.1.0.23.jar:?]

The data is the sample database provided by AWS. The verification code is as follows:

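    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel}

    // df is the DataFrame loaded from Redshift
    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "Data Quality Checks")
          .isUnique("eventid") // every eventid value must occur exactly once
      )
      .run()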

Is using Redshift as a data source supported by Deequ?