awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Do we have anything in Deequ similar to the Expect/ExpectAll function in scalaSql? #438

Reshmakoganti commented 2 years ago

Hello,

I have a use case where I have to compare two datasets for similarities/changes in both their schema and their data. Example:

Dataset 1:

              Name   Email          ssn
              John   John@abc.com   111-111-1111
              Doe    doe@abc.com    222-222-222

Dataset 2:

              Name   Email          ssn
              John   John@abc.com   111-211-1112
              Doe    doe@abc.com    222-222-222

When comparing the above two datasets, we have to get the result that the ssn has changed. I know that in Scala we can use the ExpectAll function to get the data mismatch. I am wondering how I can do this comparison of both schema (data type, column name) and data (row-level and column-level) mismatches in AWS Deequ for large data with a large set of columns.
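
For the schema part, I think I can diff the two schemas directly in Spark before running any Deequ checks; a rough sketch (schemaDiff is just a helper name I made up, not a Deequ API):

```scala
import org.apache.spark.sql.DataFrame

// Report columns that were added or removed, and columns whose
// data type changed between the old and the new dataset.
def schemaDiff(oldDf: DataFrame, newDf: DataFrame): Seq[String] = {
  val oldFields = oldDf.schema.fields.map(f => f.name -> f.dataType).toMap
  val newFields = newDf.schema.fields.map(f => f.name -> f.dataType).toMap

  val removed = (oldFields.keySet -- newFields.keySet).toSeq.sorted
    .map(c => s"column removed: $c")
  val added = (newFields.keySet -- oldFields.keySet).toSeq.sorted
    .map(c => s"column added: $c")
  val changed = oldFields.keySet.intersect(newFields.keySet).toSeq.sorted
    .filter(c => oldFields(c) != newFields(c))
    .map(c => s"type changed for $c: ${oldFields(c)} -> ${newFields(c)}")

  removed ++ added ++ changed
}
```

For the data part, I would then still need something row- and column-level like the check below.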

I am trying the following, but I am not sure whether it is suitable for a large dataset:

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.DataFrame

val datasetJoin = dataset.join(
  newDataset,
  dataset("old_id").equalTo(newDataset("new_id")),
  "full_outer")

val verificationResult = getVerificationResult(datasetJoin)

def getVerificationResult(df: DataFrame): VerificationResult = {
  VerificationSuite()
    // data to run the verification on
    .onData(df)
    // define a data quality check
    .addCheck(
      Check(CheckLevel.Error, "Data Validation Check")
        .hasSize(_ == 1000)
        .satisfies("old_first_name == new_first_name", "both are equal")
        .hasUniqueness(Seq("old_id", "new_id"), Check.IsOne))
    // compute metrics and verify check conditions
    .run()
}
```
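
I assume I can then see which constraints failed by converting the result to a DataFrame with Deequ's checkResultsAsDataFrame helper (assuming a SparkSession named spark is in scope):

```scala
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame

// One row per constraint, with its status and a failure message
val resultsDf = checkResultsAsDataFrame(spark, verificationResult)
resultsDf.show(truncate = false)
```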

But with the above implementation I would have to spell out hundreds of columns by hand, and I am also not sure about the memory usage. Can this implementation be simplified? One idea is sketched below. Thank you in advance for your help.
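
One idea I had is to generate the satisfies constraints programmatically from the column names instead of listing them; a rough sketch, assuming the old_/new_ prefix convention from my join above (buildComparisonCheck is just a name I made up):

```scala
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.DataFrame

// Build one null-safe equality constraint per shared column,
// assuming the joined DataFrame has old_<col> / new_<col> pairs.
def buildComparisonCheck(joined: DataFrame): Check = {
  val comparedColumns = joined.columns
    .filter(_.startsWith("old_"))
    .map(_.stripPrefix("old_"))
    .filter(c => joined.columns.contains(s"new_$c"))

  comparedColumns.foldLeft(Check(CheckLevel.Error, "Old vs new comparison")) {
    (check, col) =>
      // <=> is Spark SQL's null-safe equality, so NULL <=> NULL is true
      check.satisfies(s"old_$col <=> new_$col", s"$col unchanged")
  }
}
```

Since each satisfies constraint by default requires the condition to hold on 100% of rows, the verification result would flag exactly the columns (such as ssn above) where any value differs. Is that a reasonable approach, or is there a built-in way to do this in Deequ?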