awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Issue adding multiple constraint/checks to Verification run #171

Open Awes35 opened 7 months ago

Awes35 commented 7 months ago

**Describe the bug**
When repeatedly calling addConstraint() to add constraints to a single Check object, the constraint thresholds appear to get mixed up during the verification run: the output reports failures for constraints that succeed when run in isolation.

**To Reproduce**
Steps to reproduce the behavior:

1. See the example below:
    
```python
from pydeequ.checks import *
from pydeequ.verification import *

max_vals_dict = {
    "TELCO98_SCORE": 999,
    "ADVANCEDENERGYRISK_SCORE": 999,
    "BANKRUPTCYNAVIGATOR_SCORE": 300,
    "EQUIFAXRISK_SCORE": 999,
    "VANTAGE_SCORE": 999,
    "WIRELESS2000_SCORE": 999,
    "AUTOFINANCEPREDICTOR_SCORE": 650,
}

check = Check(spark, CheckLevel.Warning, "Review Check")

data_df = spark.read.table("mydb.mytablename")

for c, mval in max_vals_dict.items():
    check.addConstraint(check.hasMax(c.lower(), lambda x: x <= mval))

checkResult = VerificationSuite(spark).onData(data_df).addCheck(check).run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)
```


Outputs:

```
+------------+-----------+------------+-----------------------------------------------------------+-----------------+------------------------------------------------------+
|check       |check_level|check_status|constraint                                                 |constraint_status|constraint_message                                    |
+------------+-----------+------------+-----------------------------------------------------------+-----------------+------------------------------------------------------+
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(telco98_score,None))             |Failure          |Value: 988.0 does not meet the constraint requirement!|
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(advancedenergyrisk_score,None))  |Failure          |Value: 979.0 does not meet the constraint requirement!|
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(bankruptcynavigator_score,None)) |Success          |                                                      |
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(equifaxrisk_score,None))         |Failure          |Value: 829.0 does not meet the constraint requirement!|
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(vantage_score,None))             |Failure          |Value: 844.0 does not meet the constraint requirement!|
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(wireless2000_score,None))        |Failure          |Value: 997.0 does not meet the constraint requirement!|
|Review Check|Warning    |Warning     |MaximumConstraint(Maximum(autofinancepredictor_score,None))|Failure          |Value: 702.0 does not meet the constraint requirement!|
+------------+-----------+------------+-----------------------------------------------------------+-----------------+------------------------------------------------------+
```


**Expected behavior**
1. If I instead perform a check on just one column in isolation, e.g. "telco98_score", then I get:

```python
max_vals_dict = {"TELCO98_SCORE": 999}

check = Check(spark, CheckLevel.Warning, "Review Check")

for c, mval in max_vals_dict.items():
    check.addConstraint(check.hasMax(c.lower(), lambda x: x <= mval))

checkResult = VerificationSuite(spark).onData(data_df).addCheck(check).run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)
```

```
+------------+-----------+------------+----------------------------------------------+-----------------+------------------+
|check       |check_level|check_status|constraint                                    |constraint_status|constraint_message|
+------------+-----------+------------+----------------------------------------------+-----------------+------------------+
|Review Check|Warning    |Success     |MaximumConstraint(Maximum(telco98_score,None))|Success          |                  |
+------------+-----------+------------+----------------------------------------------+-----------------+------------------+
```



**Version info**
I am on AWS Databricks, using:
- Spark 3.3.2 (Scala 2.12)
- PyDeequ 1.1.1
- Deequ 2.0.3 (deequ:2.0.3-spark-3.3)

**Additional Context**
I don't know if I have applied this incorrectly, but I need to be able to add constraints separately or dynamically: given a list of columns to check, add one constraint per column (i.e. N columns, N constraints).

chenliu0831 commented 5 months ago

@Awes35 Have you tried the addConstraints interface, which supports a list of constraints?
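A minimal, untested sketch of that approach, reusing max_vals_dict, data_df, and spark from the original report (each lambda binds its threshold through a default argument, to sidestep the late-binding issue described further down):

```python
from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

# Build one hasMax constraint per column; m=mval freezes the threshold
# at definition time instead of sharing the loop variable.
constraints = [
    check.hasMax(c.lower(), lambda x, m=mval: x <= m)
    for c, mval in max_vals_dict.items()
]
check.addConstraints(constraints)

checkResult = VerificationSuite(spark).onData(data_df).addCheck(check).run()
```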

MarionJHolloway commented 1 month ago

I have had the same problem as outlined above. I have found it occurs with hasMin(), hasMax(), and hasNumberOfDistinctValues(), but not with isContainedIn(), when I add multiple constraints of the same type on different columns. I also get different results depending on the order in which I add constraints of the same type. There seems to be no problem when each constraint is of a different type.

addConstraints() did not solve this for me. I have been able to work around it for now by adding each constraint as a separate check to the run, as sketched below.

I'm using pydeequ 1.2.0 and pyspark 3.3.4.
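An untested sketch of that workaround, assuming the same max_vals_dict and data_df as in the original report:

```python
from pydeequ.checks import *
from pydeequ.verification import *

suite = VerificationSuite(spark).onData(data_df)

# One single-constraint Check per column, added as its own check so the
# constraints cannot interfere with each other. The m=mval default
# argument pins each threshold at definition time.
for c, mval in max_vals_dict.items():
    check = Check(spark, CheckLevel.Warning, f"Max check for {c}")
    suite = suite.addCheck(check.hasMax(c.lower(), lambda x, m=mval: x <= m))

checkResult = suite.run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
```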

iWantToKeepAnon commented 3 weeks ago

Python closures capture the variable mval, not its value at the moment the lambda is created (late binding). By the time the constraints are evaluated, the loop has already finished, so every lambda sees the last value in your dictionary. That is consistent with the output above: every column appears to be checked against the final threshold, 650.
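The pitfall is easy to reproduce in plain Python:

```python
>>> fns = [lambda x: x <= m for m in (1, 2, 3)]
>>> [f(2) for f in fns]
[True, True, True]   # every lambda sees the final m == 3
>>> fns = [lambda x, m=m: x <= m for m in (1, 2, 3)]
>>> [f(2) for f in fns]
[False, True, True]  # m=m freezes each threshold at definition time
```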

I had the same issue and found this helpful: https://stackoverflow.com/a/2295372

Basically use [warning, untested]:

```python
for c, mval in max_vals_dict.items():
    # mval=mval binds the current loop value as a default argument,
    # so each lambda keeps its own threshold.
    check.addConstraint(check.hasMax(c.lower(), lambda x, mval=mval: x <= mval))
```
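Default argument values are evaluated once, when the lambda is defined, so each constraint compares against the threshold for its own column rather than whatever mval holds after the loop ends.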