gardener / etcd-backup-restore

Collection of components to backup and restore the etcd of a Kubernetes cluster.
Apache License 2.0

[Enhancement] Backup-restore should calculate previous cron schedule of full snapshot #587

Open ishan16696 opened 1 year ago

ishan16696 commented 1 year ago

Enhancement (What you would like to be added): Currently, while deciding whether or not to take a full snapshot during startup of backup-restore, the previous cron schedule is not considered (check here). It would be better to make the decision about taking a full snapshot during startup by calculating the previous cron occurrence of the full snapshot schedule.

Motivation (Why is this needed?): In PR https://github.com/gardener/etcd-backup-restore/pull/574 we tried to calculate the previous full snapshot schedule, but we are still missing many permutations of the cron schedule configured for the full snapshot. Please see this comment: https://github.com/gardener/etcd-backup-restore/pull/574#discussion_r1100994411

Approach/Hint to implement the solution (optional): https://github.com/robfig/cron/issues/224
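As a rough illustration of the idea, here is a minimal sketch of how the previous occurrence of a cron schedule could be computed on top of robfig/cron, which only exposes `Next`. Everything below is an assumption for illustration: `previousOccurrence` and `fullSnapshotRequired` are hypothetical helpers, not part of the backup-restore codebase or of the robfig/cron API.

```go
package main

import (
	"fmt"
	"time"

	cron "github.com/robfig/cron/v3"
)

// previousOccurrence returns the last activation of sched strictly before t.
// robfig/cron only exposes Next, so we look progressively further back until
// a window contains an activation, then walk Next forward to the last
// activation that is still before t. Hypothetical helper.
func previousOccurrence(sched cron.Schedule, t time.Time) (time.Time, bool) {
	const maxLookback = 366 * 24 * time.Hour // give up after roughly a year
	for back := time.Hour; back <= maxLookback; back *= 2 {
		candidate := sched.Next(t.Add(-back))
		if candidate.IsZero() || !candidate.Before(t) {
			continue // no activation before t in this window; widen it
		}
		for {
			next := sched.Next(candidate)
			if next.IsZero() || !next.Before(t) {
				return candidate, true
			}
			candidate = next
		}
	}
	return time.Time{}, false
}

// fullSnapshotRequired reports whether a full snapshot should be taken on
// startup: true when the last full snapshot predates the most recent
// scheduled occurrence, i.e. a scheduled run was missed while down.
func fullSnapshotRequired(sched cron.Schedule, lastFullSnapshot, now time.Time) bool {
	prev, ok := previousOccurrence(sched, now)
	return ok && lastFullSnapshot.Before(prev)
}

func main() {
	sched, err := cron.ParseStandard("0 0 * * *") // daily at midnight
	if err != nil {
		panic(err)
	}
	now := time.Now()
	lastFull := now.Add(-36 * time.Hour)
	fmt.Println(fullSnapshotRequired(sched, lastFull, now)) // true: last night's run was missed
}
```

Scanning backwards this way sidesteps having to invert arbitrary cron expressions, which is what makes the permutations mentioned in the PR discussion hard to enumerate by hand.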

vlerenc commented 1 year ago

Maybe I don't understand the full extent, but here are some thoughts:

cc @abdasgupta @unmarshall

ishan16696 commented 1 year ago

> What we originally wanted is to make sure we take full snapshots in between the incremental snapshots, for performance and safety reasons. Instead of that field, we could have defined a period like 24h or 3d. This way, whenever the sidecar awakes, it checks when the last full snapshot was taken and takes one immediately if the timespan has lapsed, or schedules the next full snapshot (last full snapshot timestamp plus expected full snapshot period) process-internally, e.g. last taken 16:43 yesterday, period 24h, next one 16:43 today.

This is exactly what we were doing before backup-restore:v0.20.0: we had a hardcoded window of 24h, and backup-restore used it to trigger a full snapshot during its startup (which could be a container restart). But we saw Prometheus alerts raised for some clusters because a full snapshot had not been taken for more than 24 hours. I hope you are already aware of this issue; I have described what was happening in https://github.com/gardener/etcd-backup-restore/issues/570
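For reference, the period-based startup decision described in the quote above could look like the following minimal sketch (`decideOnStartup` is a hypothetical name; only the standard library is assumed):

```go
package main

import (
	"fmt"
	"time"
)

// decideOnStartup sketches the period-based policy quoted above: take a
// full snapshot immediately if the period has lapsed since the last one,
// otherwise schedule the next one at lastFull + period.
func decideOnStartup(lastFull time.Time, period time.Duration, now time.Time) (bool, time.Time) {
	nextAt := lastFull.Add(period)
	if !nextAt.After(now) {
		// Period lapsed while we were down: take one now, next due a full period later.
		return true, now.Add(period)
	}
	return false, nextAt
}

func main() {
	now := time.Now()
	takeNow, nextAt := decideOnStartup(now.Add(-30*time.Hour), 24*time.Hour, now)
	fmt.Println(takeNow, nextAt) // true: more than 24h since the last full snapshot
}
```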

> Some problems seem to stem from hibernation, but why don't we always take a full snapshot when hibernating a cluster? Wouldn't that be generally the best/safest/sensible option, work best with clusters that are manually hibernated (and do not have a wake-up time), and also fit the bill perfectly later when we want to get rid of that costly volume/the volumes for ETCD that sit unused on our credit card?

Yes, I second that, but it comes with its own complexities:

  1. How does backup-restore know that the cluster is being hibernated and that it is time to take a full snapshot?
  2. If we are deleting the PVCs of the etcd cluster, then we can't scale 0->3 directly; we have to scale 0->1 (restoration from the full snapshot) and then scale up (1->3).
  3. Deleting the PVCs of etcd to save cost and then waking up the cluster with a scale-up might also increase the cluster wake-up time.

    I would still like a way of taking a full snapshot before hibernation so that the PVCs can be deleted to save cost, but we have to analyse and design things first.

> And finally, isn't it so that with auto-compaction, full snapshots are no longer necessary and don't have to be taken? There is no point anymore in those then (yes/no?), and the alert is then what's critical (like with not taking full snapshots regularly).

With compaction running in the background, yes, we can think of removing the scheduled full snapshot, but not the full snapshot taken before hibernation: we need a full snapshot (which has the full data up to the last seconds) to reduce the wake-up time of clusters, since we don't want to apply delta snapshots while waking up clusters.

vlerenc commented 1 year ago

Meeting minutes from an out-of-band meeting:

shreyas-s-rao commented 1 year ago

Relates to https://github.com/gardener/etcd-druid/issues/231, where we already discussed disabling regular full snapshots in favour of compacted snapshots.