awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Suggestion code for isContainedIn does not work #142

Closed cullse closed 9 months ago

cullse commented 11 months ago

Describe the bug

After running the Constraint Suggestion Runner, the code it suggests includes a constraint with extra parameters that are not accepted by the isContainedIn function. Example: `isContainedIn("column", ["1", "2", "3"], lambda x: x >= 0.99, "It should be above 0.99!")`
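To make the mismatch concrete, here is a minimal sketch that needs neither Spark nor PyDeequ. The `isContainedIn` below is a hypothetical stand-in that only mimics a two-parameter signature like the one described in this report; it is not the library's implementation. The `accepts` helper is likewise an illustrative name.

```python
import inspect


# Hypothetical stand-in, NOT the PyDeequ implementation: it only mimics a
# two-parameter signature like the one this report describes.
def isContainedIn(column, allowed_values):
    return ("isContainedIn", column, tuple(allowed_values))


def accepts(func, *args, **kwargs):
    """Return True if func's signature can bind the given arguments."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
        return True
    except TypeError:
        return False


# Arguments taken from the suggested code quoted above.
suggested_args = (
    "column",
    ["1", "2", "3"],
    lambda x: x >= 0.99,
    "It should be above 0.99!",
)

two_arg_ok = accepts(isContainedIn, *suggested_args[:2])   # column + values only
four_arg_ok = accepts(isContainedIn, *suggested_args)      # plus assertion and hint
print(two_arg_ok, four_arg_ok)  # True False
```

With a two-parameter function, the trailing lambda and hint string cannot be bound, which is the kind of `TypeError` the suggested code would trigger when pasted into a check.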

To Reproduce

  1. Follow the tutorial constraint_suggestion_example.ipynb.
  2. Add a new row to the test DataFrame with productName=None or status=None.

Expected behavior

It looks like the hasCompleteness and isContainedIn suggestions have been combined. They should stay as two separate suggestions for the user to test and use.

Setup: Python 3.10.10, PySpark 3.2.2, PyDeequ 1.1.0, PyDeequ jar 2.0.1-spark-3.2

chenliu0831 commented 9 months ago

Thanks for reporting the issue!

> After running the Constraint Suggestion Runner, the code it suggests includes a constraint with extra parameters that are not accepted by the isContainedIn function.

This is a bug; the fix is in https://github.com/awslabs/python-deequ/pull/157

> It looks like the hasCompleteness and isContainedIn suggestions have been combined. They should stay as two separate suggestions for the user to test and use.

I ran the example with Spark 3.3, and the suggestions appear to be separate:

Constraint suggestion for 'productName': 'productName' has value range 'thingC', 'thingA', 'thingB', 'thingE', 'thingD' for at least 82.0% of values
The corresponding Python code is: `.isContainedIn("productName", ["thingC", "thingA", "thingB", "thingE", "thingD"], lambda x: x >= 0.82, "It should be above 0.82!")`

Constraint suggestion for 'productName': 'productName' has less than 18% missing values
The corresponding Python code is: `.hasCompleteness("productName", lambda x: x >= 0.82, "It should be above 0.82!")`
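For anyone who wants to check their own output the same way, the following is a rough sketch that inspects suggested code strings with the standard-library `ast` module and reports, for each one, the method name and how many positional arguments it carries. The helper name `describe_call` is made up for illustration, and the two strings are copied from the output above; nothing here calls PyDeequ.

```python
import ast


def describe_call(code):
    """Return (method_name, positional_arg_count) for a '.method(...)' snippet."""
    # Drop the leading dot so the snippet parses as a plain call expression.
    expr = ast.parse(code.strip().lstrip("."), mode="eval").body
    if not isinstance(expr, ast.Call) or not isinstance(expr.func, ast.Name):
        raise ValueError(f"not a simple call: {code!r}")
    return expr.func.id, len(expr.args)


# Suggested code strings copied from the output above.
suggestions = [
    '.isContainedIn("productName", ["thingC", "thingA", "thingB", "thingE", "thingD"], '
    'lambda x: x >= 0.82, "It should be above 0.82!")',
    '.hasCompleteness("productName", lambda x: x >= 0.82, "It should be above 0.82!")',
]

summary = [describe_call(s) for s in suggestions]
print(summary)  # [('isContainedIn', 4), ('hasCompleteness', 3)]
```

Two distinct method names confirm the suggestions are separate, and the count of four on `isContainedIn` shows it is the snippet carrying the assertion and hint arguments discussed in this issue.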