awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.31k stars 539 forks

How to add the suggested constraints to a verification run using PySpark? #383

Open mkviswanadh opened 3 years ago

mkviswanadh commented 3 years ago

I'm trying to add the suggested constraints to a verification run using PySpark, but I keep getting a compilation error. I have tried several approaches without success. All the references I found on Google cover only Scala examples; I couldn't find any PySpark reference for dynamically passing generated constraints to a VerificationSuite, so I'm seeking help on how to add the suggested constraints to the verification run method. Here is the Scala code snippet.

PySpark has no pattern matching with case statements or Scala's Seq methods, so I am not able to translate the Scala code below into a PySpark version:

```scala
val allConstraints = suggestionResult.constraintSuggestions
  .flatMap { case (_, suggestions) => suggestions.map { _.constraint } }
  .toSeq
```
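For reference, the Scala `flatMap` over a `Map` can be mirrored in plain Python with a nested comprehension. This is only a sketch of the pattern; the dict below is hypothetical sample data shaped like `constraintSuggestions`, not pydeequ's actual result object:

```python
# Hypothetical data shaped like constraintSuggestions: column -> list of suggestions
suggestions_by_column = {
    "id": [{"constraint": "isComplete(id)"}, {"constraint": "isUnique(id)"}],
    "name": [{"constraint": "isComplete(name)"}],
}

# Python equivalent of the Scala:
# .flatMap { case (_, suggestions) => suggestions.map { _.constraint } }.toSeq
all_constraints = [
    s["constraint"]
    for suggestions in suggestions_by_column.values()
    for s in suggestions
]
print(all_constraints)
```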

could you please suggest pyspark code for the same?

poudelankit commented 1 year ago

I hope this will solve the issue:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT
from pydeequ.verification import VerificationSuite, VerificationResult

suggestion_result = ConstraintSuggestionRunner(spark) \
    .onData(test) \
    .addConstraintRule(DEFAULT()) \
    .run()

# Concatenate the suggested constraint snippets, e.g. '.isComplete("col")'
suggestion_string = ""
for suggestion in suggestion_result['constraint_suggestions']:
    suggestion_string = suggestion_string + suggestion['code_for_constraint']

# Prepend the variable name so the string becomes a chained Check expression
suggestion_string = 'check' + suggestion_string

check = Check(spark, CheckLevel.Error, 'Validation')
verification_result = VerificationSuite(spark) \
    .onData(test) \
    .addCheck(eval(suggestion_string)) \
    .run()
verification_df = VerificationResult.checkResultsAsDataFrame(spark, verification_result)
verification_df.show(truncate=False, n=20)
```
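The `eval` trick works because each `code_for_constraint` string is a method-chain fragment like `.isComplete("id")`; concatenating the fragments onto the name `check` yields one fluent expression. A minimal, Spark-free sketch of the same idea, where `FakeCheck` is a hypothetical stand-in for pydeequ's `Check`:

```python
class FakeCheck:
    """Hypothetical stand-in for pydeequ's Check, recording chained calls."""
    def __init__(self):
        self.calls = []

    def isComplete(self, column):
        self.calls.append(f"isComplete({column})")
        return self  # return self so calls chain, as on pydeequ's Check

    def isUnique(self, column):
        self.calls.append(f"isUnique({column})")
        return self

# Fragments as they might appear in suggestion['code_for_constraint']
fragments = ['.isComplete("id")', '.isUnique("id")']

suggestion_string = "check" + "".join(fragments)
check = FakeCheck()
result = eval(suggestion_string)  # evaluates check.isComplete("id").isUnique("id")
print(result.calls)
```

Note that `eval` executes arbitrary code, so this approach should only be used on suggestion strings generated by deequ itself, never on untrusted input.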