awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Removal of a metric from a metric repository #359

Open JonathonShields opened 3 years ago

JonathonShields commented 3 years ago

Is it possible to remove a metric from a metric repository? I have looked at the APIs and see no obvious way of doing this. Even the ability to update a metric would help, e.g. I could add a status tag to its key and set it to 'cancelled', but again I see no means to do that.

The use case is that I receive data in batches. Each batch is assessed for data quality as it comes in, and its metrics are saved during ingestion. At some point later, it may be decided that a particular batch needs to be removed from the system; this removal should also remove any metrics related to that batch from the metric repository.

Any recommendations on the best way to achieve this given the current APIs?

Many thanks.
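
For illustration, one possible workaround for the batch-removal use case is to rewrite the stored JSON outside of the library. The sketch below is plain Python and does not use any Deequ API. It assumes the repository file is a JSON array whose entries carry a `resultKey` object with a `tags` map, and that each batch was saved with a `batch` tag; the `remove_batch` helper, the tag name, and the sample data are all hypothetical, so check the layout of your own file before relying on it.

```python
import json


def remove_batch(serialized: str, tag_key: str, tag_value: str) -> str:
    """Return the JSON array without the results tagged tag_key=tag_value.

    Assumption: each entry looks like
    {"resultKey": {"dataSetDate": ..., "tags": {...}}, "analyzerContext": {...}}.
    """
    results = json.loads(serialized)
    kept = [
        result
        for result in results
        if result.get("resultKey", {}).get("tags", {}).get(tag_key) != tag_value
    ]
    return json.dumps(kept)


# Hypothetical repository content: two batches, each saved with a "batch" tag.
repository_json = json.dumps([
    {"resultKey": {"dataSetDate": 1, "tags": {"batch": "A"}},
     "analyzerContext": {"metricMap": []}},
    {"resultKey": {"dataSetDate": 2, "tags": {"batch": "B"}},
     "analyzerContext": {"metricMap": []}},
])

# Drop everything recorded for batch "A" and inspect what is left.
cleaned = json.loads(remove_batch(repository_json, "batch", "A"))
remaining_batches = [r["resultKey"]["tags"]["batch"] for r in cleaned]
```

This only works if every batch was tagged consistently when its metrics were saved, and the rewritten file has to be written back while no ingestion job is appending to it.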

jameskyle commented 3 years ago

Chiming in on this. There doesn't seem to be an obvious way to handle recomputed data. E.g. I might process and store metrics for a year's data, then a problem with the underlying data requires reprocessing and, thus, recomputing the metrics.

There doesn't seem to be a clear way to remove or replace a subset of the old metrics with the new ones for this data.

It also raises the question of why JSON was chosen as the default storage medium over a partitioned table format. A partitioned format would allow something like overwriting the data for a partition when it is recomputed.
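
To make the partitioning idea concrete, here is a minimal sketch of a partitioned layout in plain Python, with one directory per batch. It is not how Deequ stores metrics and uses no Deequ or Spark API; the `batch_id=...` directory naming, the helper functions, and the metric values are all assumptions made for illustration.

```python
import json
import shutil
import tempfile
from pathlib import Path


def partition_dir(root: Path, batch_id: str) -> Path:
    # One directory per batch, named in the common key=value partition style.
    return root / f"batch_id={batch_id}"


def write_partition(root: Path, batch_id: str, metrics: dict) -> None:
    """Replace whatever is stored for this batch with the new metrics."""
    target = partition_dir(root, batch_id)
    if target.exists():
        shutil.rmtree(target)
    target.mkdir(parents=True)
    (target / "metrics.json").write_text(json.dumps(metrics))


def drop_partition(root: Path, batch_id: str) -> None:
    """Remove all metrics for a batch; other batches are untouched."""
    shutil.rmtree(partition_dir(root, batch_id), ignore_errors=True)


def read_all(root: Path) -> dict:
    """Load every partition into a {batch_id: metrics} mapping."""
    loaded = {}
    for path in sorted(root.glob("batch_id=*/metrics.json")):
        batch_id = path.parent.name.split("=", 1)[1]
        loaded[batch_id] = json.loads(path.read_text())
    return loaded


root = Path(tempfile.mkdtemp())
write_partition(root, "2021-01", {"Completeness(id)": 0.91})
write_partition(root, "2021-02", {"Completeness(id)": 0.97})
write_partition(root, "2021-01", {"Completeness(id)": 1.0})  # recomputed batch
drop_partition(root, "2021-02")                              # withdrawn batch
snapshot = read_all(root)
```

Because each batch lives in its own directory, recomputing or withdrawing a batch touches only that directory, which is the property a single JSON file lacks.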