gardener / etcd-backup-restore

Collection of components to backup and restore the etcd of a Kubernetes cluster.
Apache License 2.0

[Enhancement] Backup-restore should calculate previous cron schedule of full snapshot #587

Open ishan16696 opened 1 year ago

ishan16696 commented 1 year ago

Enhancement (What you would like to be added): Currently, while deciding whether or not to take a full snapshot during startup of backup-restore, the previous cron schedule is not considered (check here). It would be better to make the decision about taking a full snapshot during startup by calculating the previous cron occurrence of the full snapshot schedule.

Motivation (Why is this needed?): In PR https://github.com/gardener/etcd-backup-restore/pull/574 we tried to calculate the previous full snapshot schedule, but we are still missing many permutations of the cron schedule configured for the full snapshot. Please see this comment: https://github.com/gardener/etcd-backup-restore/pull/574#discussion_r1100994411

Approach/Hint to implement the solution (optional): https://github.com/robfig/cron/issues/224
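As a rough illustration of the idea, here is a minimal sketch of how the previous occurrence of a cron schedule could be computed on top of robfig/cron, which only exposes `Next`. Everything below is an assumption for illustration: `previousOccurrence` and `fullSnapshotRequired` are hypothetical helpers, not part of the backup-restore codebase or of the robfig/cron API.

```go
package main

import (
	"fmt"
	"time"

	cron "github.com/robfig/cron/v3"
)

// previousOccurrence returns the last activation of sched strictly before t.
// robfig/cron only exposes Next, so we look progressively further back until
// a window contains an activation, then walk Next forward to the last
// activation that is still before t. Hypothetical helper.
func previousOccurrence(sched cron.Schedule, t time.Time) (time.Time, bool) {
	const maxLookback = 366 * 24 * time.Hour // give up after roughly a year
	for back := time.Hour; back <= maxLookback; back *= 2 {
		candidate := sched.Next(t.Add(-back))
		if candidate.IsZero() || !candidate.Before(t) {
			continue // no activation before t in this window; widen it
		}
		for {
			next := sched.Next(candidate)
			if next.IsZero() || !next.Before(t) {
				return candidate, true
			}
			candidate = next
		}
	}
	return time.Time{}, false
}

// fullSnapshotRequired reports whether a full snapshot should be taken on
// startup: true when the last full snapshot predates the most recent
// scheduled occurrence, i.e. a scheduled run was missed while down.
func fullSnapshotRequired(sched cron.Schedule, lastFullSnapshot, now time.Time) bool {
	prev, ok := previousOccurrence(sched, now)
	return ok && lastFullSnapshot.Before(prev)
}

func main() {
	sched, err := cron.ParseStandard("0 0 * * *") // daily at midnight
	if err != nil {
		panic(err)
	}
	now := time.Now()
	lastFull := now.Add(-36 * time.Hour)
	fmt.Println(fullSnapshotRequired(sched, lastFull, now)) // true: last night's run was missed
}
```

Scanning backwards this way sidesteps having to invert arbitrary cron expressions, which is what makes the permutations mentioned in the PR discussion hard to enumerate by hand.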

vlerenc commented 1 year ago

Maybe I don't understand the full extent, but here are some thoughts:

cc @abdasgupta @unmarshall

ishan16696 commented 1 year ago

> What we originally wanted is to make sure we take full snapshots in between the incremental snapshots, for performance and safety reasons. Instead of that field, we could have defined a period like 24h or 3d. This way, whenever the sidecar awakes, it checks when the last full snapshot was taken and takes one immediately if the timespan has lapsed, or schedules the next full snapshot (last full snapshot timestamp plus expected full snapshot period) process-internally, e.g. last taken 16:43 yesterday, period 24h, next one 16:43 today.

This is exactly what we were doing before backup-restore:v0.20.0: we had a hardcoded window of 24h, and backup-restore used it to trigger a full snapshot during its startup (which could be a container restart). But we saw Prometheus alerts raised for some clusters because a full snapshot had not been taken for more than 24 hours. I hope you are already aware of this issue; I have described what was happening in https://github.com/gardener/etcd-backup-restore/issues/570
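For reference, the period-based startup decision described in the quote above could look like the following minimal sketch (`decideOnStartup` is a hypothetical name; only the standard library is assumed):

```go
package main

import (
	"fmt"
	"time"
)

// decideOnStartup sketches the period-based policy quoted above: take a
// full snapshot immediately if the period has lapsed since the last one,
// otherwise schedule the next one at lastFull + period.
func decideOnStartup(lastFull time.Time, period time.Duration, now time.Time) (bool, time.Time) {
	nextAt := lastFull.Add(period)
	if !nextAt.After(now) {
		// Period lapsed while we were down: take one now, next due a full period later.
		return true, now.Add(period)
	}
	return false, nextAt
}

func main() {
	now := time.Now()
	takeNow, nextAt := decideOnStartup(now.Add(-30*time.Hour), 24*time.Hour, now)
	fmt.Println(takeNow, nextAt) // true: more than 24h since the last full snapshot
}
```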

> Some problems seem to stem from hibernation, but why don't we always take a full snapshot when hibernating a cluster? Wouldn't that be generally the best/safest/sensible option, work best with clusters that are manually hibernated (and do not have a wake-up time), and also fit the bill perfectly later when we want to get rid of that costly volume/the volumes for ETCD that sit unused on our credit card?

Yes, I second that, but it comes with its own complexities:

  1. How does backup-restore know that the cluster is being hibernated and that it is time to take a full snapshot?
  2. If we are deleting the PVCs of the etcd cluster, then we can't scale 0->3 directly; we have to scale 0->1 (restoration from the full snapshot) and then scale up (1->3).
  3. Deleting the PVCs of etcd to save cost and then waking up the cluster with a scale-up might also increase the cluster wake-up time.

    I would still like a way of taking a full snapshot before hibernation so that the PVCs can be deleted to save cost, but we have to analyse and design things first.

> And finally, isn't it so that with auto-compaction, full snapshots are no longer necessary and don't have to be taken? There is no point anymore in those then (yes/no?), and the alert is then what's critical (like with not taking full snapshots regularly).

With compaction running in the background, yes, we can think of removing the scheduled full snapshot, but not the full snapshot taken before hibernation: we need a full snapshot (which has the full data up to the last seconds) to reduce the wake-up time of clusters, since we don't want to apply delta snapshots while waking up clusters.

vlerenc commented 1 year ago

Meeting minutes from an out-of-band meeting:

shreyas-s-rao commented 1 year ago

Relates to https://github.com/gardener/etcd-druid/issues/231, where we already discussed disabling regular full snapshots in favour of compacted snapshots.