Is your feature request related to a problem? Please describe.
I am unable to leverage the full benefits of the MetricsRepository feature in PyDeequ - especially persisting the metrics to a file on S3 and reloading them in a subsequent job run for historical comparison and anomaly detection.
Describe the solution you'd like
An example code snippet where Deequ analyzer metrics are persisted to a file on S3 and then reloaded in a later job run to perform anomaly detection. Currently, the examples only cover persisting to a JSON file; it would be great to know whether the metrics can also be persisted as a DataFrame in parquet files and then reloaded to query the historical metrics repository.
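For reference, here is a minimal sketch of the JSON-on-S3 flow being asked about, using PyDeequ's FileSystemMetricsRepository. It assumes a running SparkSession (`spark`) with the Deequ jar on the classpath, an input DataFrame `df`, and write access to the S3 path; the bucket/prefix and the `tag` value are placeholders, and the anomaly-check threshold is arbitrary.

```python
from pydeequ.analyzers import AnalysisRunner, Size, Completeness
from pydeequ.repository import FileSystemMetricsRepository, ResultKey
from pydeequ.verification import VerificationSuite
from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy

# Hypothetical S3 location for the repository's JSON file
metrics_file = "s3://my-bucket/deequ/metrics.json"
repository = FileSystemMetricsRepository(spark, metrics_file)

# Tag each run so historical results can be filtered later
key = ResultKey(spark, ResultKey.current_milli_time(), {"tag": "daily_run"})

# Run analyzers and append this run's metrics to the repository
AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("id")) \
    .useRepository(repository) \
    .saveOrAppendResult(key) \
    .run()

# In a subsequent job run: compare the current Size against the
# history stored in the repository and flag anomalous jumps
result = VerificationSuite(spark) \
    .onData(df) \
    .useRepository(repository) \
    .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=1.5), Size()) \
    .saveOrAppendResult(key) \
    .run()
```

This covers the JSON case only; the open question in this issue is whether the same round trip is supported for parquet.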
Describe alternatives you've considered
Currently, the workaround is to recompute the metrics per partition on every job run and then compare the analyzer metrics across runs - but this is inefficient, especially when the number of historical partitions is large.
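A partial alternative to recomputing everything: the repository's accumulated history can be loaded back as a Spark DataFrame and exported to parquet for cheap querying. This is a sketch under the same assumptions as above (live `spark` session, Deequ jar, placeholder S3 paths); note the parquet copy is one-way - Deequ's FileSystemMetricsRepository itself still reads and writes JSON.

```python
from pydeequ.repository import FileSystemMetricsRepository

# Hypothetical S3 path to the existing JSON-backed repository
repository = FileSystemMetricsRepository(spark, "s3://my-bucket/deequ/metrics.json")

# Load all historical analyzer results as a DataFrame
# (one row per metric per run, tagged with the ResultKey)
history_df = repository.load().getSuccessMetricsAsDataFrame()

# Persist the history as parquet for efficient historical queries
history_df.write.mode("overwrite").parquet("s3://my-bucket/deequ/metrics_parquet/")
```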