awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[FEATURE] Can we enhance `VerificationSuite` to support more than one DataFrame? #548

Open Sat30 opened 8 months ago

Sat30 commented 8 months ago

Is your feature request related to a problem? Please describe.
Many quality checks run against a table that is the result of one or more joins.

Describe the solution you'd like
I'd like an optimized approach for running checks across multiple DataFrames.

Describe alternatives you've considered
Currently I join the two DataFrames and pass the resulting DataFrame to `VerificationSuite` for verification. This approach is not efficient for large-scale data quality checks, even though Deequ is built to handle large-scale data.
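
For reference, the join-then-verify workaround described above might look roughly like the sketch below. The table names, column names, sample rows, and the `JoinedVerification` object are all made up for illustration; only `VerificationSuite`, `Check`, `CheckLevel`, and `CheckStatus` come from Deequ. Caching the joined DataFrame is an assumption on my side about how to keep the join from being recomputed if Deequ needs more than one pass; it is not something the library requires.

```scala
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// Hypothetical example object; not part of Deequ.
object JoinedVerification {

  def run(spark: SparkSession): VerificationResult = {
    import spark.implicits._

    // Made-up sample data standing in for two large tables.
    val orders = Seq((1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.5))
      .toDF("order_id", "customer_id", "amount")
    val customers = Seq((10, "DE"), (11, "US"))
      .toDF("customer_id", "country")

    // The workaround: join first, then verify the single joined DataFrame.
    // cache() is an assumption here, meant to avoid re-running the join
    // if the checks need more than one pass over the data.
    val joined = orders.join(customers, Seq("customer_id"), "inner").cache()

    val result = VerificationSuite()
      .onData(joined)
      .addCheck(
        Check(CheckLevel.Error, "checks on joined orders")
          .hasSize(_ == 3)          // no rows lost or duplicated by the join
          .isComplete("country")    // every order matched a customer
          .isUnique("order_id"))
      .run()

    joined.unpersist()
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-join-sketch")
      .getOrCreate()
    try {
      val result = run(spark)
      if (result.status != CheckStatus.Success) {
        println("Some checks on the joined data failed")
      }
    } finally {
      spark.stop()
    }
  }
}
```

The cost concern is visible here: the join has to be materialized (or recomputed) before any check can run, even when a check only needs columns from one side.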

Additional context
I'm always looking for ways to optimize. If anyone has ideas or would like to collaborate on optimizing this process, I'd be happy to connect and discuss further.