aws / amazon-managed-service-for-prometheus-roadmap

Amazon Managed Service for Prometheus Public Roadmap

Ability to delete individual and/or multiple metrics/time series #16

Open barthel opened 2 years ago

barthel commented 2 years ago

In addition to regular deletion after the (custom #2) retention period, the ability to delete individual and/or multiple metrics/time series is needed.

Deletion becomes necessary when

ampabhi-aws commented 2 years ago

Hey folks, we're in the process of better understanding the experience for this feature. If anyone would be interested in providing more detailed feedback on API designs, we'd love to schedule some 1:1 conversations. Please email ampabhi@amazon.com if you'd be interested in participating in design deep dives.

rsheldon-ansira commented 1 year ago

Some other use-cases:

AndrewFarley commented 1 year ago

A rogue developer has unintentionally poisoned our metrics with labels causing sharding and breaking our alerting and dashboarding and without the ability to delete metrics we're kinda stuck. Our options are...

And, because AMP doesn't let us insert metrics going backwards more than an hour, we can't really export (backup) all the metrics somewhere else (eg: a file, or another data store), then re-create AMP, and re-insert them.

Without the ability to delete, we are in a really tough spot. It makes me wish we had just self-hosted Prom instead of using AMP, so we'd have control over this. Please add delete ASAP?

jeromeinsf commented 1 year ago

Hi @AndrewFarley

Could you clarify the extent of the poison pill? Would deleting specific time series help, or do you need to wipe a specific time range of the whole workspace? In other words, would https://cortexmetrics.io/docs/proposals/block-storage-time-series-deletion/ actually work for your use case?

BTW, AMP does allow inserting metrics older than one hour; the limitation is a little more subtle. You cannot insert metrics more than one hour before the most recent data point ingested, but you can effectively replay ingestion.
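A minimal sketch of one reading of that rule, for illustration only. It assumes the window is exactly one hour measured back from the newest sample already ingested; the comment does not say whether this applies per series or per workspace, and the function names here are hypothetical, not part of any AMP API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: samples are accepted if they are no more than one hour older
# than the newest sample already ingested, as described in the comment above.
INGESTION_WINDOW = timedelta(hours=1)


def would_be_accepted(sample_time, newest_ingested):
    """Return True if sample_time is within the window behind newest_ingested."""
    return sample_time >= newest_ingested - INGESTION_WINDOW


def replay_order(samples):
    """Sort (timestamp, value) pairs oldest first.

    Replaying in this order means each sample is at least as new as the
    previous one, so it never falls behind the window.
    """
    return sorted(samples, key=lambda sample: sample[0])


newest = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
print(would_be_accepted(newest - timedelta(minutes=30), newest))
print(would_be_accepted(newest - timedelta(hours=2), newest))
```

Under this reading, a backfill into an empty workspace could start arbitrarily far in the past as long as samples are sent oldest first, while a backfill into a workspace that already holds current data would be rejected.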

AndrewFarley commented 1 year ago

@jeromeinsf

First, regarding the poison pill issue: at the moment I only need to delete a specific series. In my use case I don't really need to limit the time range, but I could see the value of this. Reviewing that proposal, I believe it would satisfy our current need.

Second, I'm not sure that the older than one-hour point you made is accurate. I've tried ingesting items older than an hour, and I get rejections from the remote_write API, saying specifically it won't accept metrics older than one hour. Is there some trick to this I'm unaware of?

rsheldon-ansira commented 1 year ago

Additionally, I would also love to be able to set up a rule-like process to periodically delete metrics according to labels, effectively allowing me to do things like:

With an API I could set up a Lambda function to call the delete API, but it would be ideal to define "delete" or "expiry" rules in the same way as you can for recording and alerting rules.

jeromeinsf commented 1 year ago

@rsheldon-ansira that's a cool idea! While deleting time series would allow customers to set up such rules, I think that could be addressed as a slightly different request, e.g. to specify storage retention by label matchers instead of for the whole workspace. No rules needed, fully managed by the service. WDYT?

rsheldon-ansira commented 1 year ago

@jeromeinsf - I like that: different retention periods specified by label matchers/query. Adding it as an extension or as rules makes it a little more generic for Prometheus/Cortex (should anyone care) vs. adding it as an AMP workspace-specific feature.

If I was not using AMP, I would have the same requirements, and in fact have something similar in place with our current graphite metrics system.
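To make the idea of retention by label matchers concrete, here is a small sketch. The rule format, the example labels, and the function name are all hypothetical; nothing here reflects an existing AMP, Prometheus, or Cortex feature. Only equality matchers are modeled, and the first matching rule wins.

```python
from datetime import timedelta

# Hypothetical rules: each entry is (label matchers, retention period).
# A rule matches when every listed label has exactly the listed value.
RETENTION_RULES = [
    ({"env": "dev"}, timedelta(days=7)),
    ({"job": "batch"}, timedelta(days=30)),
]

# Fallback for series that match no rule (150 days is the default
# retention mentioned elsewhere in this thread).
DEFAULT_RETENTION = timedelta(days=150)


def retention_for(labels):
    """Return the retention period for a series with the given labels."""
    for matchers, retention in RETENTION_RULES:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return retention
    return DEFAULT_RETENTION


print(retention_for({"env": "dev", "job": "api"}))
print(retention_for({"env": "prod", "job": "api"}))
```

A real implementation would presumably accept full PromQL-style matchers (`=~`, `!=`, and so on) rather than plain equality.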

AndrewFarley commented 1 year ago

Additionally, I would also love to be able to setup a rule-like process to periodically delete metrics according to labels, effectively allowing me to do things like:

I also LOVE this feature. Fantastic idea. Right now we use separate AMP instances because of the lack of this feature; this would allow us to consolidate!

jfharden commented 1 year ago

There's an additional use case for compliance.

If some PII accidentally (or on purpose I suppose) makes it into metric labels, names, or values, it's important it can be redacted. Especially in a heavily regulated environment like PCI-DSS.

I notice on https://github.com/aws/amazon-managed-service-for-prometheus-roadmap/issues/25#issuecomment-1317540131 it was mentioned last year that you're looking into providing support for the prometheus delete-metrics api. I think that would solve any compliance redaction concerns for me.

jjti commented 1 year ago

@ampabhi-aws we could definitely make use of this feature. In our particular case we want to communicate to our users how long their data is being retained. The default -- 150 days -- is extremely long given that we cap our visualization/usage at 1 month.

I had tried calling the /api/v1/admin/tsdb/delete_series endpoint assuming it was already supported because of this blog post: https://aws.amazon.com/blogs/opensource/building-a-series-deletion-api-in-cortex/
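For reference, a sketch of how a request to that endpoint is shaped in upstream Prometheus, where it takes repeated `match[]` series selectors plus optional `start` and `end` parameters and requires the admin API to be enabled. Per this thread, AMP did not accept the call at the time of writing. The base URL and helper name below are placeholders, and the sketch only builds the URL without sending anything.

```python
from urllib.parse import urlencode


def delete_series_url(base_url, matchers, start=None, end=None):
    """Build the URL for a Prometheus delete_series request (sent as POST or PUT)."""
    # Each series selector becomes its own match[] parameter.
    params = [("match[]", matcher) for matcher in matchers]
    if start is not None:
        params.append(("start", start))
    if end is not None:
        params.append(("end", end))
    path = "/api/v1/admin/tsdb/delete_series"
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


# Placeholder server address and metric name, for illustration only.
print(delete_series_url(
    "http://localhost:9090",
    ['{__name__="bad_metric"}'],
    start="2023-01-01T00:00:00Z",
))
```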

We could make use of this if it existed.

jcdauchy-moodys commented 10 months ago

@ampabhi-aws Do you know when this awesome feature will be available in AMP? We all miss it. Thanks

hexionas commented 10 months ago

Yes, this feature would help us greatly. Sometimes our developers introduce a high-cardinality issue in one of their metric labels, which uses up a large part of our limit. It could be easily fixed with an API call rather than waiting until our metric retention period has passed.
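A small sketch of how the offending label might be spotted before any deletion. It assumes the series are available as a list of label dictionaries, for example the `data` array returned by Prometheus's `/api/v1/series` endpoint; the function name and sample labels are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name, highest count first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    counts = [(name, len(seen)) for name, seen in values.items()]
    # Sort by descending count, then by name for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Illustrative data: user_id takes a new value on every series.
series = [
    {"job": "api", "user_id": "1"},
    {"job": "api", "user_id": "2"},
    {"job": "api", "user_id": "3"},
]
for name, count in label_cardinality(series):
    print(name, count)
```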

jeromeinsf commented 10 months ago

That adds a requirement that makes this feature even harder to deliver: a delete API that is faster than simply waiting for the extra cardinality to become inactive. May I recommend opening another feature request along the lines of marking specific time series to be ignored or discarded with respect to cardinality limits?

hexionas commented 10 months ago

I should add that in some cases the extra cardinality also contains sensitive data, so the high cardinality is a side effect of the main issue, which is having data in the metrics that should not be there.