awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Passing constraints list to check object #23

Closed annsajee closed 3 years ago

annsajee commented 3 years ago

check = Check(spark, CheckLevel.Error, "Test", constraints)

I have a list of constraints that need to be passed to the check object so they can be verified. However, because the check.addConstraint feature is not implemented yet, it gives me an error as follows:

check = Check(spark, CheckLevel.Error, "Test", constraints)
    self.addConstraint(constraint)
        raise NotImplementedError("Private factory method for other check methods")

Is there a workaround for this? I have a dynamic list of constraints that needs to be passed, and hence this feature is important to me.

Kindly reply.

Thanks

gucciwang commented 3 years ago

Hi @annsajee ! Unfortunately, constraints have not been included in our first release, but it is in our long list of to-do's! In the meantime, are you able to use checks as a workaround? Example like so:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# df is the Spark DataFrame under test; spark is an existing SparkSession
check = Check(spark, CheckLevel.Error, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3000000) \
        .hasMin("star_rating", lambda x: x == 1.0) \
        .hasMax("star_rating", lambda x: x == 5.0) \
        .isComplete("review_id") \
        .isUnique("review_id") \
        .isComplete("marketplace") \
        .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"]) \
        .isNonNegative("year")) \
    .run()
annsajee commented 3 years ago

Hello @gucciwang,

My requirement was to pass a list of constraints, with addConstraint able to handle it, since the constraints are dynamic. For now I have a workaround: using getattr() on the check object for the individual method calls.

miltad commented 3 years ago

@annsajee could you provide the full solution for your workaround? It seems that I'm struggling with exactly the same issue.

gucciwang commented 3 years ago

Hi @annsajee & @miltad !

Apologies for forgetting to link this issue -- but the feature you requested has been completed and released with PyDeequ-0.1.7 on pip! Please give it a try and let me know how it goes!

# self.spark / self.df come from the test class this snippet was taken from
check = Check(self.spark, CheckLevel.Warning, "test list constraints")
check.addConstraints([check.isComplete('c'),
                      check.isUnique('b')])

result = VerificationSuite(self.spark).onData(self.df) \
    .addCheck(check) \
    .run()

Linking this issue for a similar thread: https://github.com/awslabs/python-deequ/issues/37
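
For the dynamic case from the original question, the list handed to addConstraints can be built at runtime. A minimal sketch mirroring the test snippet above; constraint_specs and its (method name, column) shape are hypothetical, and spark / df stand for an existing SparkSession and DataFrame:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# Hypothetical dynamic input, e.g. loaded from configuration
constraint_specs = [("isComplete", "review_id"), ("isUnique", "review_id")]

check = Check(spark, CheckLevel.Error, "dynamic constraints")
# Resolve each Check method by its string name and call it, as in the test snippet above
check.addConstraints([getattr(check, method)(column) for method, column in constraint_specs])

result = VerificationSuite(spark).onData(df).addCheck(check).run()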

annsajee commented 3 years ago

@miltad @gucciwang Create a check object: check = Check(spark, CheckLevel.Error, "Test")

You can then use getattr() to look up a constraint method by its name (as a string) and call it on the check object: getattr(check, "hasCompleteness")(column, assertion, hint=None)

You can add any number of constraints this way and then run the suite, e.g.: VerificationSuite(spark).onData(data).addCheck(check).run()
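
A minimal sketch of this getattr() workaround, assuming spark and data are an existing SparkSession and DataFrame; note that getattr() needs the Check method name as a string (e.g. "isComplete" or "hasCompleteness"), not the metric name "Completeness":

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

check = Check(spark, CheckLevel.Error, "Test")

# Look up each constraint method by its string name and call it on the check object
getattr(check, "isComplete")("review_id")
getattr(check, "hasCompleteness")("marketplace", lambda x: x >= 0.95)

result = VerificationSuite(spark).onData(data).addCheck(check).run()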

ml6cz commented 3 years ago

How would you pass the assertion if you only have it as a string? For example, I got the suggested constraint "hasDataType" on a specific column with the assertion "ConstrainableDataTypes.Fractional". However, passing this as a string currently throws an error.

annsajee commented 3 years ago

Hello @ml6cz,

You can use eval(), since the value that gets passed in the code has to be the actual object (for example a lambda assertion or an enum member), not a string.

For example: getattr(check, constraint)(column, eval(str(datatype)))
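
A hedged sketch of that, assuming the suggestion arrives as the string "ConstrainableDataTypes.Fractional", that spark is an existing SparkSession, and that hasDataType() accepts the enum member directly:

from pydeequ.checks import Check, CheckLevel, ConstrainableDataTypes

suggested = "ConstrainableDataTypes.Fractional"  # string produced by the suggestion run

check = Check(spark, CheckLevel.Warning, "datatype check")
# eval() turns the string back into the enum member that hasDataType() expects
getattr(check, "hasDataType")("year", eval(suggested))
# An eval-free alternative: ConstrainableDataTypes[suggested.split(".")[-1]]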

Sairam90 commented 2 years ago

Hi @annsajee, could you please provide an example? I am unable to fetch the Completeness attribute using getattr.

rajasekaranmpomelo commented 2 years ago

Could you please share the code for this, as I have a similar requirement.