elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Enable More Flexible SLM Retention Policies? #65826

Open original-brownbear opened 3 years ago

original-brownbear commented 3 years ago

There has been a recent request for longer snapshot retention via SLM in ECE/ECS. This is understandable, since the default of 100 retained snapshots taken at 30-minute intervals only gives the user a ~2 day window (potentially less than a weekend ;)) to notice a problem before the last snapshot containing a healthy cluster state ages out.

The currently available solutions for increasing the retention time are obvious but sub-optimal:

A possible solution would be to make the retention interval dynamic, such that older snapshots are retained with a larger interval between them before being phased out completely. Concretely, we could for example keep the first 10 snapshots at intervals of 30 min, then the next 10 at intervals of 1h, then the next 10 at intervals of 8h, and then keep the remaining 70 at intervals of 1d or so (i.e. deletes would remove snapshots in between existing ones and not just delete oldest-first, FIFO).
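To make the "delete in between" idea concrete, here is a minimal Python sketch of how such a tiered selection could work. It is not how SLM works today; `TIERS` and `select_retained` are hypothetical names, and the tier values are just the example figures from above. It also evaluates a full snapshot history in one pass and does not model repeated runs after earlier deletions.

```python
from datetime import datetime, timedelta

# Hypothetical tiers from the example above: (snapshots to keep, minimum spacing).
TIERS = [
    (10, timedelta(minutes=30)),
    (10, timedelta(hours=1)),
    (10, timedelta(hours=8)),
    (70, timedelta(days=1)),
]


def select_retained(snapshot_times, tiers=TIERS):
    """Return the snapshot times to keep, newest first.

    Walks from the newest snapshot to the oldest. Within each tier, a
    snapshot is kept only if it is at least `spacing` older than the
    previously kept one; everything else would be deleted.
    """
    remaining = sorted(snapshot_times, reverse=True)  # newest first
    kept = []
    idx = 0
    for count, spacing in tiers:
        taken = 0
        while taken < count and idx < len(remaining):
            candidate = remaining[idx]
            idx += 1
            if not kept or kept[-1] - candidate >= spacing:
                kept.append(candidate)
                taken += 1
    return kept


# Usage: 100 days of snapshots taken every 30 minutes.
start = datetime(2021, 1, 1)
history = [start + timedelta(minutes=30 * i) for i in range(4800)]
kept = select_retained(history)
print(len(kept), kept[0] - kept[-1])
```

Anything not in `kept` would be a deletion candidate, which is where this differs from simply dropping the oldest snapshot once the count limit is reached.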

Under the assumption that time-wise resolution loses value the older a snapshot gets (which seems very reasonable to me), this would allow keeping roughly 3 months of history in 100 snapshots, compared to 2 days with the current model. This keeps the cost in terms of resources for managing the repository constant relative to the current approach. It does, however, increase storage use due to the incremental nature of snapshots. Take the concrete numbers suggested for the intervals with a grain of salt, obviously; there are many options here, though I believe we should keep it simple. Technically speaking, I believe one could already achieve this kind of retention period by running multiple SLM policies in parallel, but that's fairly cumbersome.
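For reference, the window arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the example tier values; the variable names are made up for illustration. With exactly those values the tiered window comes out at about 74 days, so reaching a full three months would need a somewhat longer interval in the last tier.

```python
from datetime import timedelta

SNAPSHOT_LIMIT = 100

# Current model: 100 snapshots, one every 30 minutes.
flat_window = SNAPSHOT_LIMIT * timedelta(minutes=30)

# Example tiers from the proposal: (number of snapshots, interval between them).
tiers = [
    (10, timedelta(minutes=30)),
    (10, timedelta(hours=1)),
    (10, timedelta(hours=8)),
    (70, timedelta(days=1)),
]
tiered_count = sum(count for count, _ in tiers)
tiered_window = sum((count * interval for count, interval in tiers), timedelta())

print(tiered_count)   # 100
print(flat_window)    # 2 days, 2:00:00
print(tiered_window)  # 73 days, 23:00:00
```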

-> WDYT about adding functionality for such a more dynamic snapshot retention interval?

elasticmachine commented 3 years ago

Pinging @elastic/es-core-features (Team:Core/Features)

parmsib commented 3 years ago

This would be useful for us.

We are currently running multiple policies specifically for this purpose. On top of being a little cumbersome, it also results in some redundant snapshots when the policies' schedules "collide" (e.g. once a day for a daily and an hourly policy) and multiple snapshots are taken in close succession, for the sole purpose of having different retention configurations.

Maybe this could already be avoided by using more involved cron expressions, but that doesn't feel like a great solution either.
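To illustrate the collision, here is a small Python simulation. It makes simplifying assumptions: both policies fire exactly on the hour, the daily one at midnight, and `fire_times` is a hypothetical helper, not anything SLM provides. The second hourly schedule skips midnight, which is roughly what a more involved cron expression would have to express (something like an hours field of `1-23`, assuming the cron syntax in use accepts ranges).

```python
from datetime import datetime, timedelta


def fire_times(start, days, hours_of_day):
    """Times at which a policy fires: on the hour, for the given hours of each day."""
    return {
        start + timedelta(days=d, hours=h)
        for d in range(days)
        for h in hours_of_day
    }


start = datetime(2021, 1, 1)
week = 7

hourly = fire_times(start, week, range(24))  # every hour
daily = fire_times(start, week, [0])         # once a day at midnight

# Both policies snapshot at the same moment once per day.
collisions = hourly & daily
print(len(collisions))

# Assumed workaround: the hourly policy skips the hour the daily policy covers.
hourly_skipping_midnight = fire_times(start, week, range(1, 24))
print(len(hourly_skipping_midnight & daily))
```

Over one week this gives seven colliding snapshot pairs with the naive schedules and none with the adjusted one, at the cost of the two schedules having to know about each other.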

joegallo commented 1 year ago

I'm removing the team-discuss label from some older Team:Data Management issues -- we've had plenty of time to discuss them, but we haven't, so the label isn't serving its purpose. Feel free to delete this comment and/or re-add the team-discuss label.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)