awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.27k stars, 536 forks

Exposing Helpful Anomaly Detection Metadata from Anomaly Strategies (ie Anomaly Thresholds) #525

Open arsenalgunnershubert777 opened 9 months ago

arsenalgunnershubert777 commented 9 months ago

Issue #, if available: 521

Description of changes:

This PR adds functionality to expose anomaly detection metadata, such as anomaly thresholds, for anomaly checks.

Would love to hear any thoughts, feedback, things to change. Whether relating to overall design, or small renaming and code changes. Thanks so much!

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rdsharma26 commented 9 months ago

Thank you @arsenalgunnershubert777 for the PR and happy new year! We will be reviewing the PR this week.

arsenalgunnershubert777 commented 8 months ago

@rdsharma26 thanks for the heads up! I am working on fixing the merge conflicts and should be done soon.

rdsharma26 commented 8 months ago

@arsenalgunnershubert777 Thanks again for the PR and for your patience while we reviewed this PR.

Having looked deeper into the changes, I have one high level comment. This PR introduces changes that are backwards incompatible. Any consumer of Deequ who is using the Anomaly class or the isNewestPointNonAnomalous method in Check.scala will see their code fail to compile when they upgrade to this version of Deequ.

Could you update the PR such that the new class is added alongside Anomaly? We can also create a new method called isNewestPointNonAnomalousWithExtendedResult which uses your new class as its return type.
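
To make the suggestion concrete, here is a minimal sketch of the additive approach. It uses simplified stand-in types and signatures rather than Deequ's actual definitions: `ExtendedAnomaly`, `SimpleThresholdCheck`, and all parameter lists are assumptions made for illustration only. The point is that the existing class and method keep their shape while the new ones sit next to them.

```scala
// Stand-in for the existing class (NOT Deequ's real definition).
// It is left unchanged so current callers still compile.
case class Anomaly(value: Option[Double], confidence: Double)

// Hypothetical new class added next to Anomaly; it also carries the thresholds.
case class ExtendedAnomaly(
    value: Double,
    isAnomaly: Boolean,
    lowerBound: Double,
    upperBound: Double)

object SimpleThresholdCheck {

  // New method (simplified signature): reports the decision plus the bounds used.
  def isNewestPointNonAnomalousWithExtendedResult(
      newest: Double,
      lowerBound: Double,
      upperBound: Double): ExtendedAnomaly = {
    val anomalous = newest < lowerBound || newest > upperBound
    ExtendedAnomaly(newest, anomalous, lowerBound, upperBound)
  }

  // Existing method: same name and Boolean result as before.
  // It delegates to the new method so both share one decision rule.
  def isNewestPointNonAnomalous(
      newest: Double,
      lowerBound: Double,
      upperBound: Double): Boolean =
    !isNewestPointNonAnomalousWithExtendedResult(newest, lowerBound, upperBound).isAnomaly

  // The old-style result can still be derived from the extended one.
  // The confidence of 1.0 is a placeholder value for this sketch.
  def toAnomaly(result: ExtendedAnomaly): Option[Anomaly] =
    if (result.isAnomaly) Some(Anomaly(Some(result.value), 1.0)) else None
}
```

In this sketch, code that only calls `isNewestPointNonAnomalous` or only reads `Anomaly` is unaffected, while new callers can opt in to the extended result.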

arsenalgunnershubert777 commented 8 months ago

Hi @rdsharma26, thank you for the response. I have some questions for clarity. I will try to create a new class and a new method so that the change stays backwards compatible. But I'm wondering: in the VerificationRunBuilder, in order to make the Anomaly Check use the new feature, I would also have to update the getAnomalyCheck method to use .isNewestPointNonAnomalousWithExtendedResult, right? Would that change affect backwards compatibility as well? Let me know if my understanding of any of this is correct, thanks again!

rdsharma26 commented 8 months ago

Instead of updating the existing method, you will add a new method, which then uses the new isNewestPointNonAnomalousWithExtendedResult method. Think of the following three scenarios:

  1. Anyone who is using Deequ's anomaly detection faces no interruption when upgrading their Deequ version to the one that contains your changes.
  2. Anyone who wishes to use Deequ's anomaly detection can use your new method and reference the documentation on which one of the two methods to pick (vanilla results vs extended results). Therefore, do update the documentation as well.
  3. Anyone who is using Deequ's anomaly detection feature and wishes to use your new feature can simply switch to the new method and everything should continue to work as expected. Again, updated documentation will help users with the transition.

arsenalgunnershubert777 commented 8 months ago

@rdsharma26 got it, that makes sense! I think I will add an optional parameter to both addAnomalyCheck and getAnomalyCheck to choose which method to use, and it can default to the original method. That way everything should stay compatible.

rdsharma26 commented 8 months ago

Sounds good. It looks like the existing AnomalyCheckConfig is a place where this new config parameter could be added.
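
As a rough illustration of the opt-in parameter discussed above, the sketch below adds a flag with a default value to a config class. Everything here is hypothetical: `CheckConfigSketch`, `useExtendedResults`, and the outcome types are stand-ins, and the real AnomalyCheckConfig has different fields. The sketch only shows why a defaulted parameter keeps existing call sites working.

```scala
// Stand-in for a check config (NOT the real AnomalyCheckConfig).
case class CheckConfigSketch(
    description: String,
    // Hypothetical flag; the default preserves the original behaviour.
    useExtendedResults: Boolean = false)

// Two result shapes: the original one and the extended one with thresholds.
sealed trait CheckOutcome
case class BasicOutcome(passed: Boolean) extends CheckOutcome
case class ExtendedOutcome(passed: Boolean, lowerBound: Double, upperBound: Double)
    extends CheckOutcome

object AnomalyCheckSketch {
  // Picks the result shape from the config; the pass/fail rule is shared.
  def run(
      newest: Double,
      lowerBound: Double,
      upperBound: Double,
      config: CheckConfigSketch): CheckOutcome = {
    val passed = newest >= lowerBound && newest <= upperBound
    if (config.useExtendedResults) ExtendedOutcome(passed, lowerBound, upperBound)
    else BasicOutcome(passed)
  }
}

// Existing-style call site: no flag given, so the original result shape is returned.
val basic = AnomalyCheckSketch.run(12.0, 0.0, 10.0, CheckConfigSketch("size check"))

// Opt-in call site: the thresholds are exposed alongside the pass/fail result.
val extended = AnomalyCheckSketch.run(
  12.0, 0.0, 10.0, CheckConfigSketch("size check", useExtendedResults = true))
```

One caveat with this design: adding a defaulted parameter to a Scala case class or method is source compatible for callers that recompile, but it can still break binary compatibility for code compiled against the older version.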