awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Example for persisting metrics from metricsRepository to file on S3 and reloading the metrics file in Glue to perform anomaly detection #144

Open jayashreeraman opened 11 months ago

jayashreeraman commented 11 months ago

Is your feature request related to a problem? Please describe. I am unable to take full advantage of the MetricsRepository feature in PyDeequ, especially persisting the metrics to a file on S3 and reloading them in a subsequent job run for historical comparison and anomaly detection.

Describe the solution you'd like An example code snippet in which Deequ and Analyzer metrics are persisted to a file on S3 and then reloaded in another job run to perform anomaly detection. Currently, the example only covers persisting to a JSON file. It would be great to know whether the metrics can also be persisted to Parquet files as a DataFrame and then reloaded to recover the historical metrics repository details.
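To make the request concrete, here is a rough sketch of the flow I have in mind, based on the `FileSystemMetricsRepository` and anomaly-check calls shown in the PyDeequ tutorials. The bucket and prefix layout, the helper names `metrics_path` and `run_with_history`, the choice of the `Size` analyzer, and the `maxRateIncrease=2.0` threshold are all placeholders of mine, not anything PyDeequ prescribes. I am also assuming that an `s3://` URI resolves inside Glue; other environments may need `s3a://`.

```python
def metrics_path(bucket: str, prefix: str, name: str = "metrics.json") -> str:
    """Build the S3 URI of the repository file (hypothetical layout)."""
    return f"s3://{bucket}/{prefix.strip('/')}/{name}"


def run_with_history(spark, df, path: str, tags: dict):
    """Append this run's metrics to the repository file at `path` and
    flag an anomaly if the row count grew too fast versus earlier runs."""
    # Imported lazily so metrics_path() can be used without Spark installed.
    from pydeequ.analyzers import Size
    from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
    from pydeequ.repository import FileSystemMetricsRepository, ResultKey
    from pydeequ.verification import VerificationResult, VerificationSuite

    # Pointing every job run at the same file is what carries the history over.
    repository = FileSystemMetricsRepository(spark, path)
    key = ResultKey(spark, ResultKey.current_milli_time(), tags)

    result = (
        VerificationSuite(spark)
        .onData(df)
        .useRepository(repository)
        .saveOrAppendResult(key)
        # Assumed threshold: anomaly if Size more than doubles between runs.
        .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=2.0), Size())
        .run()
    )
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```

Each job run would call `run_with_history(spark, df, metrics_path("my-bucket", "deequ/orders"), {"dataset": "orders"})` with the same path, so that the second and later runs see the metrics written by earlier ones. Whether this behaves as expected against S3 from Glue is exactly what I would like the example to confirm.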

Describe alternatives you've considered Currently, the workaround is to recompute the metrics for each partition on every job run and then compare the analyzer metrics. This is inefficient, especially when the number of historical partitions is large.
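Instead of recomputing per partition, I would expect to be able to read the stored history back from the repository, roughly as sketched below. This relies on the `load()`, `after()` and `getSuccessMetricsAsDataFrame()` calls on the PyDeequ repository object. The helper names, the 30-day window, and the Parquet export step are my own assumptions. In particular, I do not know of a supported way to turn Parquet files back into a MetricsRepository, so the export here is only a copy for querying, not a replacement for the JSON repository file.

```python
MILLIS_PER_DAY = 86_400_000


def days_ago_millis(now_ms: int, days: int) -> int:
    """Epoch-millisecond cutoff `days` before `now_ms` (hypothetical helper)."""
    return now_ms - days * MILLIS_PER_DAY


def export_history(spark, repo_path: str, parquet_path: str, now_ms: int, days: int = 30):
    """Load stored metrics newer than the cutoff and copy them to Parquet."""
    # Imported lazily so days_ago_millis() can be used without Spark installed.
    from pydeequ.repository import FileSystemMetricsRepository

    repository = FileSystemMetricsRepository(spark, repo_path)
    history = (
        repository.load()
        .after(days_ago_millis(now_ms, days))  # keep only recent result keys
        .getSuccessMetricsAsDataFrame()
    )
    # Plain Spark write; this is a queryable copy of the metrics, not a
    # format the repository itself can be reloaded from (assumption).
    history.write.mode("overwrite").parquet(parquet_path)
    return history
```

If something like this is the intended usage, the historical comparison would read one small metrics DataFrame rather than rescanning every historical partition.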