awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Enablement of constraints parameter in pydeequ.checks.Check #37

Closed ianclari closed 3 years ago

ianclari commented 3 years ago

Is your feature request related to a problem? Please describe. When running VerificationSuite, I have to list every constraint I want to check one by one. There should be a way to register a list of constraints so that VerificationSuite is easier to run.

Describe the solution you'd like. Running VerificationSuite currently requires instantiating pydeequ.checks.Check (https://pydeequ.readthedocs.io/en/latest/pydeequ.html#module-pydeequ.checks) and adding each constraint we want to evaluate to it individually.

(In the code below, I want to implement column checks using isComplete(), isUnique() and isNonNegative() on certain columns.)

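```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Each constraint has to be chained onto the Check individually
check = Check(spark_session=spark, level=CheckLevel.Warning, description="Review Check")
checkResult = VerificationSuite(spark) \
    .onData(data_df) \
    .addCheck(
        check.isComplete("gender") \
        .isUnique("id") \
        .isUnique("gender") \
        .isNonNegative("income")) \
    .run()

# Convert the verification result to a DataFrame for inspection
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```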

I saw that pydeequ.checks.Check has a placeholder parameter called constraints (Check(spark_session=spark, level=CheckLevel.Warning, description="Review Check", constraints=[])), which could be the way to register the list of constraints and make the call to VerificationSuite more generic and simpler.
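Until a list-based parameter is available, one possible workaround is to drive the chained calls from data. The sketch below is only an illustration: `apply_constraints` and `RecordingCheck` are hypothetical names, not part of PyDeequ, and the sketch assumes that each constraint method on a check returns the check itself so that calls can be chained, as in the example above. `RecordingCheck` is a stand-in so the snippet runs without Spark.

```python
def apply_constraints(check, specs):
    """Chain constraint methods onto `check` from (method_name, args) pairs.

    Hypothetical helper, not a PyDeequ API. Assumes every constraint
    method returns a check object that supports further chaining.
    """
    for method_name, args in specs:
        # Look up the constraint method by name and keep the returned check
        check = getattr(check, method_name)(*args)
    return check


class RecordingCheck:
    """Stand-in for pydeequ.checks.Check that only records the calls made."""

    def __init__(self):
        self.calls = []

    def _record(self, name, column):
        self.calls.append((name, column))
        return self  # mimic the chaining behaviour assumed above

    def isComplete(self, column):
        return self._record("isComplete", column)

    def isUnique(self, column):
        return self._record("isUnique", column)

    def isNonNegative(self, column):
        return self._record("isNonNegative", column)


# The constraints from the example above, expressed as plain data
specs = [
    ("isComplete", ("gender",)),
    ("isUnique", ("id",)),
    ("isUnique", ("gender",)),
    ("isNonNegative", ("income",)),
]

demo_check = apply_constraints(RecordingCheck(), specs)
```

With a real Check in place of `RecordingCheck()`, the returned object would be passed to `.addCheck(...)` exactly as in the chained version.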

I believe enabling this functionality will increase adoption of pydeequ for Python developers dabbling in data quality use cases.

gucciwang commented 3 years ago

To keep supporting the functional-style chaining shown above, PyDeequ needs you to construct a Check object first and then reference that object inside the list when you add your constraints as a list. Take a look at this and let me know if that works!

```python
check = Check(self.spark, CheckLevel.Warning, "test list constraints")
check.addConstraints([check.isComplete('c'),
                      check.isUnique('b')])

result = VerificationSuite(self.spark).onData(self.df) \
    .addCheck(check) \
    .run()
```

ianclari commented 3 years ago

Thank you, @gucciwang!

EDIT: Disregard the notes below; I understand now that the change still has to be applied. The logic presented above does make sense to me!

_I am trying this code following your sample and am getting an error ("'Check' object has no attribute 'addConstraints'"). Is it a version issue?_

_I'm using the following:_

```python
check = Check(spark, CheckLevel.Warning, "Review Check 2")

check.addConstraints([check.isComplete('gender')])

checkResult = VerificationSuite(spark) \
    .onData(data_df) \
    .addCheck(check) \
    .run()

checkResult_df2 = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df2.show()
```