awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Add data synchronization test to VerificationSuite. #526

Closed VenkataKarthikP closed 10 months ago

VenkataKarthikP commented 11 months ago

Issue, if available: #501

Description of changes: Adds a data synchronization check to the verification suite. With this change, users can define an isDataSynchronized check.

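Example usage:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// `data` and `dfToCompare` are Spark DataFrames defined elsewhere
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(Check(CheckLevel.Error, "must have data in sync")
    .isDataSynchronized(dfToCompare, Map("id" -> "id"), _ > 0.7))
  .run()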
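
To make the `_ > 0.7` assertion in the example above more concrete, here is a minimal, Spark-free sketch of the kind of ratio it is applied to. It assumes the check evaluates the fraction of rows in the primary dataset that have a match in the comparison dataset on the mapped columns; that reading is an assumption for illustration and is not taken from the Deequ source. `MatchRatioSketch`, `matchRatio`, and `Row` are hypothetical names and are not part of the Deequ API.

```scala
// Hypothetical illustration only: none of these names exist in Deequ.
object MatchRatioSketch {
  // A row is modelled as a simple column-name -> value map.
  type Row = Map[String, Any]

  /** Fraction of rows in `primary` whose mapped column values also occur in `reference`.
    * `columnMappings` maps a primary column name to its reference column name.
    */
  def matchRatio(
      primary: Seq[Row],
      reference: Seq[Row],
      columnMappings: Map[String, String]): Double = {
    if (primary.isEmpty) {
      0.0 // assumption: an empty primary dataset counts as a ratio of 0
    } else {
      val mappings = columnMappings.toSeq
      // Collect the key tuples present in the reference dataset.
      val referenceKeys: Set[Seq[Any]] =
        reference.map(r => mappings.map { case (_, refCol) => r(refCol) }).toSet
      // Count primary rows whose key tuple appears in the reference dataset.
      val matched = primary.count { p =>
        referenceKeys.contains(mappings.map { case (col, _) => p(col) })
      }
      matched.toDouble / primary.size
    }
  }

  def main(args: Array[String]): Unit = {
    val primary = Seq(1, 2, 3, 4).map(i => Map[String, Any]("id" -> i))
    val reference = Seq(1, 2, 3, 5).map(i => Map[String, Any]("id" -> i))

    val ratio = matchRatio(primary, reference, Map("id" -> "id"))
    val assertion: Double => Boolean = _ > 0.7

    // 3 of the 4 primary ids are found in the reference data.
    println(s"ratio=$ratio passes=${assertion(ratio)}") // prints "ratio=0.75 passes=true"
  }
}
```

Under this reading, a threshold of 0.7 tolerates up to 30% of primary rows having no counterpart in the comparison dataset before the check fails.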

cc: @mentekid @rdsharma26

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

VenkataKarthikP commented 10 months ago

@rdsharma26 Thanks for the review. I have updated the PR to address the review comments.