Allow Configuring an SLM Interval instead of a CRON Schedule?

original-brownbear commented 3 years ago

We are currently seeing some trouble when working with SLM and multiple deployments sharing resources. This is a result of the fact that configuring SLM cron schedules encourages running all snapshot and all retention jobs at the same time, causing a thundering herd situation. Another issue we're seeing with the current scheduling arises when the scheduled snapshot period is shorter than the time it takes to create a single snapshot. This tends to lead to an ever-increasing number of concurrent snapshots (now that we have concurrent snapshots) piling up. The behavior of these concurrent snapshots can be improved, but the usefulness of having many overlapping snapshots is questionable nonetheless.

One way I think we could fix both problems and improve the usability of SLM is to offer the ability to schedule snapshots at an interval instead of via a CRON schedule. Concretely, a policy could just contain a field:

"interval": "30m"

which would result in:

take snapshots at 30 minute intervals, measured from the success of the last snapshot to the start of the next snapshot (this is how Cloud behaved prior to SLM btw.)
start snapshotting as soon as the policy is created

This would solve both issues. The natural variation in how long snapshots take would resolve the thundering herd problem, and measuring the interval from snapshot end to snapshot start would resolve the overlapping snapshots issue.
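
To make the overlap argument concrete, here is a rough Python sketch that compares the two start rules. It is only an illustration of the idea, not how SLM is implemented: the function names, the fixed 100 minute snapshot duration, and the assumption that every snapshot succeeds are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    start: int  # minutes since the policy was created
    end: int


def fixed_period_schedule(period, durations):
    """Cron-like rule: start a snapshot every `period` minutes,
    whether or not earlier snapshots are still running."""
    return [Snapshot(i * period, i * period + d) for i, d in enumerate(durations)]


def interval_schedule(interval, durations):
    """Proposed rule: first snapshot starts at policy creation, each later one
    starts `interval` minutes after the previous one finished (assumed successful)."""
    snapshots, start = [], 0
    for d in durations:
        end = start + d
        snapshots.append(Snapshot(start, end))
        start = end + interval
    return snapshots


def max_concurrent(snapshots):
    """Peak number of snapshots running at the same time."""
    # At equal timestamps the -1 (end) events sort before the +1 (start) events,
    # so a snapshot that ends exactly when another starts is not counted as overlap.
    events = sorted([(s.start, 1) for s in snapshots] + [(s.end, -1) for s in snapshots])
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical workload: a 30 minute period, but each snapshot takes 100 minutes.
durations = [100] * 4
print(max_concurrent(fixed_period_schedule(30, durations)))  # 4
print(max_concurrent(interval_schedule(30, durations)))      # 1
```

With the fixed period the number of running snapshots keeps growing as long as snapshots take longer than the period, while the end-to-start rule can never have more than one in flight per policy.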

Could we do something like that?

elasticmachine commented 3 years ago

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

jugsofbeer commented 3 years ago

In our ECE environment we suffer from this exact issue, and it's driving us insane with the frequent snapshot failures and the lack of spacing between activity across multiple deployments.

joegallo commented 3 years ago

One point we brought up in discussion here: imagine a user who specifies "interval": "1d", but whose snapshots take 8 hours to complete. They might be surprised that "interval": "1d" doesn't mean "schedule a snapshot every day" but that it instead means "allow a day to pass between the end of one 8 hour snapshot and the start of the next". In this scenario, instead of getting a daily snapshot, seven per week, they'd get fewer than that, with most weeks having only five snapshots.
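
The arithmetic behind "only five snapshots" can be checked with a few lines of Python. This is just a back-of-the-envelope sketch under the end-to-start reading of the interval, and `snapshots_per_week` is a hypothetical helper, not anything in SLM.

```python
from datetime import timedelta


def snapshots_per_week(interval, duration):
    """Average snapshots per week when each cycle is one snapshot
    (`duration`) followed by a wait of `interval` before the next start."""
    cycle = interval + duration
    return timedelta(weeks=1) / cycle


# 24 hour interval plus 8 hour snapshot gives a 32 hour cycle: 168 / 32 = 5.25
print(snapshots_per_week(timedelta(days=1), timedelta(hours=8)))  # 5.25
```

So the user gets a little over five snapshots per week on average, and the time of day at which each snapshot starts shifts by 8 hours from one snapshot to the next.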

I'm +100 to interval schedules, I think they'd be quite a bit simpler for users to interact with than cron schedules. Likewise, I agree that we should base the start of the interval on when the policy was created, though I slightly prefer scheduling for the next interval rather than starting the first snapshot immediately.

For ESS/ECE, then, the default schedule could be changed to "30m" rather than the current "0 0/30 [...]", and that would fix the thundering herd effect of many clusters snapshotting simultaneously. Each cluster would be snapshotting every 30 minutes counted from when it was created, rather than at exactly the top and bottom of the hour.
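
As a rough illustration of why anchoring on creation time spreads the load, the sketch below compares the first scheduled start of a handful of clusters under both rules. The creation times are invented, times are plain seconds measured from an hour boundary, and both helper functions are hypothetical stand-ins rather than SLM code.

```python
from collections import Counter

PERIOD = 30 * 60  # 30 minutes, in seconds


def next_cron_start(created_at):
    """Clock-aligned rule: first top or bottom of the hour at or after creation."""
    return -(-created_at // PERIOD) * PERIOD  # ceiling to a multiple of PERIOD


def next_interval_start(created_at):
    """Creation-anchored rule: one full interval after the policy was created."""
    return created_at + PERIOD


def peak_simultaneous(starts):
    """Largest number of clusters that start in the same second."""
    return max(Counter(starts).values())


# Made-up creation times (seconds past the hour) for five clusters.
created = [17, 403, 950, 1212, 1799]

print(peak_simultaneous([next_cron_start(t) for t in created]))      # 5
print(peak_simultaneous([next_interval_start(t) for t in created]))  # 1
```

Under the clock-aligned rule all five clusters hit the shared storage in the same second, whereas the creation-anchored rule keeps whatever spread the creation times already had.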

joegallo commented 1 year ago

I'm removing the team-discuss label from some older Team:Data Management issues -- we've had plenty of time to discuss them, but we haven't, so the label isn't serving its purpose. Feel free to delete this comment and/or re-add the team-discuss label.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

joegallo commented 10 months ago

Whatever we do here, we should do the same for the SLM retention process, too. It's controlled via the slm.retention_schedule cluster setting. The same thundering herd effect applies -- if a lot of clusters are all using the same underlying cloud object storage resource, and they're all using the same slm.retention_schedule, then they'll be synchronizing their snapshot deletes to the second.

parkertimmins commented 2 weeks ago

Fixed by https://github.com/elastic/elasticsearch/pull/110847

Allow Configuring an SLM Interval instead of a CRON Schedule? #64035