awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Getting Error name 'isComplete' is not defined while running deequ code in Azure Databricks #504

Closed dilkushpatel closed 1 year ago

dilkushpatel commented 1 year ago


I'm trying to implement basic checks on the columns of a table stored in Azure SQL DW.

Everything up to reading the data works fine.

I can also run ConstraintSuggestionRunner.

When I run VerificationSuite with a single check, isComplete, it gives an error.

Error: name 'isComplete' is not defined

Code:

import sagemaker_pyspark
import pydeequ
from pyspark.sql import SparkSession
from pydeequ.analyzers import *
from pydeequ.checks import *
from pydeequ.verification import *
from pydeequ.anomaly_detection import *

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

check = Check(spark, CheckLevel.Error, "Data QC")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(isComplete("month_id")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

I searched Google but did not find anything relevant.

I get the same error with any other check as well.

rdsharma26 commented 1 year ago

Change

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        isComplete("month_id")
    ) \
    .run()

to

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.isComplete("month_id")
    ) \
    .run()

See full code example here: https://github.com/awslabs/python-deequ#constraint-verification

dilkushpatel commented 1 year ago

Interesting! I was actually trying that.

I still get an error, though.

Error: Check.isComplete() missing 1 required positional argument: 'column'

Code:

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(Check.isComplete("month_id")) \
    .run()

dilkushpatel commented 1 year ago

Please ignore my last comment.

I changed Check to check and that worked.

Thanks.

rdsharma26 commented 1 year ago

Thanks for confirming. Since you have the line check = Check(spark, CheckLevel.Error, "Data QC"), check is an instance of the Check class, so check.isComplete is correct, as opposed to Check.isComplete.
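
For anyone landing here with the same errors, the difference between the three call styles in this thread can be reproduced without Spark. The sketch below uses a hypothetical stand-in class named Check; it is not pydeequ's implementation and only mimics the shape of an instance method that takes a column name. The helper error_of is likewise made up for this illustration.

```python
# Hypothetical stand-in for a fluent check builder. This is NOT pydeequ's Check class.
class Check:
    def __init__(self, description):
        self.description = description
        self.constraints = []

    def isComplete(self, column):
        # Record the constraint and return self so calls can be chained.
        self.constraints.append(("isComplete", column))
        return self


def error_of(thunk):
    """Run thunk and return (exception type name, message), or None if it succeeds."""
    try:
        thunk()
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


# 1. Bare name: there is no module-level function called isComplete,
#    so Python raises a NameError.
bare_name_error = error_of(lambda: isComplete("month_id"))

# 2. Called on the class: "month_id" is bound to `self`,
#    so the `column` argument is reported as missing (TypeError).
class_call_error = error_of(lambda: Check.isComplete("month_id"))

# 3. Called on an instance: `self` is supplied automatically and the call succeeds.
check = Check("Data QC")
result = check.isComplete("month_id")

print(bare_name_error)
print(class_call_error)
print(result.constraints)
```

The first two cases correspond to the two error messages reported above; the exact wording of the TypeError message varies slightly between Python versions. Only the third form, calling the method on the check object that was created earlier, passes the column name to the right parameter.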