awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

[FEATURE] Exposing Anomaly Strategy Calculation Thresholds for Users #521

Open arsenalgunnershubert777 opened 8 months ago

arsenalgunnershubert777 commented 8 months ago

Is your feature request related to a problem? Please describe.

Right now, Anomaly Checks only return a success or failure response, but I'd like to retrieve the thresholds used in the Anomaly Strategy calculations. This would let Deequ users see exactly which numbers the Anomaly Check used in its calculation.

Describe the solution you'd like

I'd like the Constraint Result to return a field containing the thresholds used by the Anomaly Strategy. I'm planning to open a pull request that implements this feature and would love to hear your feedback on it.

Describe alternatives you've considered

N/A

Additional context

My current plan for implementing this feature is as follows:

  1. The isNewestPointNonAnomalous function is currently an assertion function that takes a Double metric and returns a Boolean indicating whether anomalies were detected. Change it to return an AnomalyAssertionResult, which contains that Boolean plus Doubles representing the thresholds used in the Anomaly Strategy calculations.
  2. Create an AnomalyBasedConstraint whose pickValueAndAssert function retrieves the Boolean from the AnomalyAssertionResult.
  3. Create an AnomalyConstraintResult with a field (or fields) for those thresholds, populated by pickValueAndAssert from the AnomalyAssertionResult, so that the user can view them in the results.
  4. In Constraint, make the anomalyConstraint function use the AnomalyBasedConstraint class.
  5. Make any upstream or downstream changes needed for the functionality to work.
  6. Use inheritance and traits, and keep the code as clean as possible.
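
To make the plan more concrete, here is a rough, self-contained sketch of the shape I have in mind. It is plain Scala with simplified stand-ins, not Deequ's actual classes or signatures: the field names (isNonAnomalous, lowerThreshold, upperThreshold, passed, metricValue), the toy mean-plus-or-minus-deviation strategy, and the maxDeviation parameter are all assumptions made only for illustration.

```scala
// Hypothetical, simplified stand-ins. None of this is Deequ's real API.

// Step 1: the assertion returns its verdict together with the bounds it used.
final case class AnomalyAssertionResult(
    isNonAnomalous: Boolean,
    lowerThreshold: Double,
    upperThreshold: Double)

// Step 3: a constraint result that exposes those thresholds to the user.
final case class AnomalyConstraintResult(
    passed: Boolean,
    metricValue: Double,
    lowerThreshold: Double,
    upperThreshold: Double)

object AnomalySketch {
  // Toy strategy (an assumption for this sketch): the newest value must lie
  // within maxDeviation of the mean of the historical values.
  def isNewestPointNonAnomalous(
      history: Seq[Double],
      maxDeviation: Double): Double => AnomalyAssertionResult = {
    require(history.nonEmpty, "history must not be empty")
    val mean = history.sum / history.size
    val lower = mean - maxDeviation
    val upper = mean + maxDeviation
    newest => AnomalyAssertionResult(newest >= lower && newest <= upper, lower, upper)
  }
}

// Steps 2 and 3: the constraint reads the Boolean from the assertion result
// and copies the thresholds into the result shown to the user.
final class AnomalyBasedConstraint(assertion: Double => AnomalyAssertionResult) {
  def pickValueAndAssert(metricValue: Double): AnomalyConstraintResult = {
    val result = assertion(metricValue)
    AnomalyConstraintResult(
      passed = result.isNonAnomalous,
      metricValue = metricValue,
      lowerThreshold = result.lowerThreshold,
      upperThreshold = result.upperThreshold)
  }
}

object Demo extends App {
  val constraint = new AnomalyBasedConstraint(
    AnomalySketch.isNewestPointNonAnomalous(Seq(10.0, 12.0, 11.0), maxDeviation = 2.0))

  // The result now carries the bounds alongside the pass/fail flag.
  println(constraint.pickValueAndAssert(15.0))
}
```

Returning a small case class instead of a bare Boolean keeps the existing pass/fail semantics intact while giving callers the numbers behind the decision. The real change would need to fit Deequ's existing Constraint and ConstraintResult types, which this sketch does not attempt to model.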