I have a use case where I have to compare two datasets for their similarities/changes based on both schema and data.
Example:

Dataset 1:

    Name   email          ssn
    John   John@abc.com   111-111-1111
    Doe    doe@abc.com    222-222-222

Dataset 2:

    Name   Email          ssn
    John   John@abc.com   111-211-1112
    Doe    doe@abc.com    222-222-222
When comparing the above two datasets, the result should show that the ssn value has changed.
I know that in Scala we can use the ExpectAll function to get the data mismatches. I am wondering how I can do this comparison of both schema (data type, column name) and data (row-level and column-level) mismatches in AWS Deequ for large data with a large set of columns.
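For the schema part, I assume a plain Spark comparison of column names and data types would be enough, something like the sketch below (dataset and newDataset are the two DataFrames I join later; the other names are just placeholders of mine):

    // Compare column names and data types of the two DataFrames.
    val oldSchema = dataset.schema.map(f => (f.name, f.dataType)).toSet
    val newSchema = newDataset.schema.map(f => (f.name, f.dataType)).toSet

    // (name, type) pairs present in one schema but missing from the other.
    val onlyInOld = oldSchema.diff(newSchema)
    val onlyInNew = newSchema.diff(oldSchema)

    if (onlyInOld.nonEmpty || onlyInNew.nonEmpty) {
      println(s"Schema mismatch: only in old = $onlyInOld, only in new = $onlyInNew")
    }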
For the data comparison, I am trying the approach below, but I am not sure whether it is suitable for a large dataset:
    import com.amazon.deequ.{VerificationResult, VerificationSuite}
    import com.amazon.deequ.checks.{Check, CheckLevel}
    import org.apache.spark.sql.DataFrame

    // Full outer join of the old and new datasets on their id columns.
    val datasetJoin = dataset.join(newDataset, dataset("old_id").equalTo(newDataset("new_id")), "full_outer")
    val verificationResult = getVerificationResult(datasetJoin)

    def getVerificationResult(df: DataFrame): VerificationResult = {
      val verificationResult: VerificationResult =
        VerificationSuite()
          // data to run the verification on
          .onData(df)
          // define a data quality check
          .addCheck(
            Check(CheckLevel.Error, "Data Validation Check")
              .hasSize(_ == 1000)
              .satisfies("old_first_name == new_first_name", "both are equal")
              .hasUniqueness(Seq("old_id", "new_id"), Check.IsOne)
          )
          // compute metrics and verify check conditions
          .run()
      verificationResult
    }
But with the above implementation, I have to literally spell out hundreds of columns, and I am also not sure about the memory usage. Can this implementation be simplified? Thank you for your help in advance.
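For what it is worth, the direction I was considering to avoid listing every column by hand is to build the Check programmatically. The old_/new_ prefixes and the columnsToCompare list below are assumptions based on how I renamed the columns before the join, not anything Deequ provides:

    // Build one satisfies() constraint per business column instead of
    // spelling each one out by hand. Assumes every compared column exists
    // twice in the joined DataFrame, prefixed with old_ and new_.
    def buildComparisonCheck(columnsToCompare: Seq[String]): Check = {
      columnsToCompare.foldLeft(Check(CheckLevel.Error, "Data Validation Check")) {
        (check, columnName) =>
          check.satisfies(
            s"old_$columnName <=> new_$columnName", // null-safe equality in Spark SQL
            s"$columnName matches between old and new")
      }
    }

    // Placeholder list; in practice it could be derived from dataset.columns.
    val columnsToCompare = Seq("first_name", "email", "ssn")

    val result = VerificationSuite()
      .onData(datasetJoin)
      .addCheck(buildComparisonCheck(columnsToCompare))
      .run()

I am not sure whether this scales any better memory-wise, which is part of my question.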