awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Anomaly checks when fails #531

Open dinjazelena opened 8 months ago

dinjazelena commented 8 months ago

Hey, I have a question about anomaly checks, which compare a DataFrame's metrics to those of the previous DataFrame. Let's say we have batch jobs with pydeequ checks, and one run fails an anomaly check. I go back and repair it, but when I rerun the batch job, the new run is compared to the failed metric and fails again.

How can I avoid this? Or is there an option to compare only against a baseline DataFrame?

To sum it up:

I have monthly jobs with anomaly checks that allow, let's say, a relative change of ±20%. If a check fails, the job fails. I repair it and rerun, but the new run is then compared to the failed metric, so it fails again.
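
To make the scenario concrete, here is a minimal pure-Python sketch of the behaviour I mean. It does not use the Deequ or pydeequ API; `relative_change_ok`, `run_job`, and the `keep_failed` flag are hypothetical names made up for illustration, and the metric values are invented. It assumes the check always compares against the most recently stored metric.

```python
from typing import List, Optional


def relative_change_ok(previous: Optional[float], current: float,
                       max_change: float = 0.20) -> bool:
    """Return True if `current` is within +-max_change of `previous`."""
    if previous is None or previous == 0:
        return True  # no usable baseline, so nothing to compare against
    return abs(current - previous) / abs(previous) <= max_change


def run_job(history: List[float], metric: float, keep_failed: bool) -> bool:
    """Check `metric` against the last stored metric, then maybe store it.

    keep_failed=True stores the metric even when the check fails
    (the behaviour described in this issue); keep_failed=False stores
    only metrics from passing runs (the behaviour being asked for).
    """
    previous = history[-1] if history else None
    ok = relative_change_ok(previous, metric)
    if ok or keep_failed:
        history.append(metric)
    return ok


# Failed metrics are kept: the rerun is compared to the bad value.
history_a = [100.0]
first_a = run_job(history_a, 150.0, keep_failed=True)   # +50% vs 100
rerun_a = run_job(history_a, 105.0, keep_failed=True)   # -30% vs 150

# Failed metrics are dropped: the rerun is compared to the last good value.
history_b = [100.0]
first_b = run_job(history_b, 150.0, keep_failed=False)  # +50% vs 100
rerun_b = run_job(history_b, 105.0, keep_failed=False)  # +5% vs 100
```

In the first case both the bad run and the repaired rerun fail; in the second case the repaired rerun passes because the failed metric never became the comparison point. I am looking for a way to get the second behaviour.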