I have a use case where I have to compare two datasets for their similarities/changes based on both schema and data.
Example:

Dataset 1:

    Name   email          ssn
    John   John@abc.com   111-111-1111
    Doe    doe@abc.com    222-222-222

Dataset 2:

    Name   Email          ssn
    John   John@abc.com   111-211-1112
    Doe    doe@abc.com    222-222-222
When comparing the above two datasets, the result should show that the ssn value has changed.
I know that in Scala we can use the ExpectAll function to get the data mismatches. I am wondering how I can do this comparison of both schema (data type, column name) and data (row-level and column-level) mismatches in AWS Deequ for large data with a large set of columns.
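For the schema part, I assume a plain Spark comparison of column names and data types would be enough, something like the sketch below (dataset and newDataset are the two DataFrames I join later; the other names are just placeholders of mine):

    // Compare column names and data types of the two DataFrames.
    val oldSchema = dataset.schema.map(f => (f.name, f.dataType)).toSet
    val newSchema = newDataset.schema.map(f => (f.name, f.dataType)).toSet

    // (name, type) pairs present in one schema but missing from the other.
    val onlyInOld = oldSchema.diff(newSchema)
    val onlyInNew = newSchema.diff(oldSchema)

    if (onlyInOld.nonEmpty || onlyInNew.nonEmpty) {
      println(s"Schema mismatch: only in old = $onlyInOld, only in new = $onlyInNew")
    }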
For the data comparison, I am trying the approach below, but I am not sure whether it is suitable for a large dataset:
    import com.amazon.deequ.{VerificationResult, VerificationSuite}
    import com.amazon.deequ.checks.{Check, CheckLevel}
    import org.apache.spark.sql.DataFrame

    // Full outer join of the old and new datasets on their id columns.
    val datasetJoin = dataset.join(newDataset, dataset("old_id").equalTo(newDataset("new_id")), "full_outer")
    val verificationResult = getVerificationResult(datasetJoin)

    def getVerificationResult(df: DataFrame): VerificationResult = {
      val verificationResult: VerificationResult =
        VerificationSuite()
          // data to run the verification on
          .onData(df)
          // define a data quality check
          .addCheck(
            Check(CheckLevel.Error, "Data Validation Check")
              .hasSize(_ == 1000)
              .satisfies("old_first_name == new_first_name", "both are equal")
              .hasUniqueness(Seq("old_id", "new_id"), Check.IsOne)
          )
          // compute metrics and verify check conditions
          .run()
      verificationResult
    }
But with the above implementation, I have to literally spell out hundreds of columns, and I am also not sure about the memory usage. Can this implementation be simplified? Thank you for your help in advance.
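For what it is worth, the direction I was considering to avoid listing every column by hand is to build the Check programmatically. The old_/new_ prefixes and the columnsToCompare list below are assumptions based on how I renamed the columns before the join, not anything Deequ provides:

    // Build one satisfies() constraint per business column instead of
    // spelling each one out by hand. Assumes every compared column exists
    // twice in the joined DataFrame, prefixed with old_ and new_.
    def buildComparisonCheck(columnsToCompare: Seq[String]): Check = {
      columnsToCompare.foldLeft(Check(CheckLevel.Error, "Data Validation Check")) {
        (check, columnName) =>
          check.satisfies(
            s"old_$columnName <=> new_$columnName", // null-safe equality in Spark SQL
            s"$columnName matches between old and new")
      }
    }

    // Placeholder list; in practice it could be derived from dataset.columns.
    val columnsToCompare = Seq("first_name", "email", "ssn")

    val result = VerificationSuite()
      .onData(datasetJoin)
      .addCheck(buildComparisonCheck(columnsToCompare))
      .run()

I am not sure whether this scales any better memory-wise, which is part of my question.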