awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Question: DQ over time #554

Open jonathanapp opened 3 months ago

jonathanapp commented 3 months ago

We have years of historical data in addition to daily updated data streams. Can I use Deequ to view metrics over time? For us, data quality is a trajectory, not a point. Based on the examples I've seen, I'd have to create a separate DataFrame for each day of data (thousands of them) and run the analysis on each one. Is there no way to run the metrics over all my data and disaggregate by date? TIA.
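To make "disaggregate by date" concrete, here is a minimal sketch in plain Python. It is not Deequ's API; it only illustrates the shape of the result being asked for: one pass over all rows, producing one metric value per day. The function name `completeness_by_date`, the column names `event_date` and `price`, and the sample rows are all hypothetical, and completeness here means the fraction of non-null values.

```python
from collections import defaultdict


def completeness_by_date(rows, date_key, column):
    """Return {date: fraction of rows on that date whose `column` is not None}."""
    totals = defaultdict(int)    # rows seen per day
    non_null = defaultdict(int)  # rows per day with a non-null value in `column`
    for row in rows:
        day = row[date_key]
        totals[day] += 1
        if row.get(column) is not None:
            non_null[day] += 1
    # One metric value per day, ordered by date string.
    return {day: non_null[day] / totals[day] for day in sorted(totals)}


# Hypothetical sample data: two days, one missing value on the first day.
rows = [
    {"event_date": "2024-01-01", "price": 10.0},
    {"event_date": "2024-01-01", "price": None},
    {"event_date": "2024-01-02", "price": 12.5},
]

result = completeness_by_date(rows, "event_date", "price")
print(result)  # {'2024-01-01': 0.5, '2024-01-02': 1.0}
```

The point of the sketch is the grouping key: the whole dataset is scanned once and the metric is keyed by date, rather than building a separate dataset per day and analyzing each one.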