Open abdasgupta opened 3 years ago
As mentioned by @vlerenc in https://github.com/gardener/etcd-backup-restore/issues/587#issuecomment-1447946804, we need to make snapshot compaction much smarter than it is today if it is to replace scheduled full snapshots.
Since etcd-druid handles snapshot compaction, I am paraphrasing here the points from Vedran's comment that relate to it:
We need to improve the conditions that trigger a snapshot compaction job.
We also need to improve the alerts:
Additionally, we need to check the cost difference between the current and the proposed approach to see whether costs improve and, if not, whether the added cost is acceptable. My gut feeling is that, since the proposed approach plans to utilize cluster runtime rather than wall-clock time, we might see a cost reduction for the average cluster by avoiding "unnecessary" full snapshots. Larger clusters will definitely see more frequent full snapshots due to their higher rate and size of events, but that is acceptable and necessary to avoid slow restorations after potential data corruption.
Feature (What you would like to be added): We recently added a compaction subcommand to ETCD Backup Restore here. This subcommand compacts and defragments the ETCD database and then takes a full snapshot of it. We can run this subcommand in parallel with ETCD BR at regular intervals instead of taking full snapshots. However, the first full snapshot and a full snapshot every 24 hours are still needed. The first snapshot is needed because compaction cannot run in parallel unless at least one full snapshot already exists in backup storage. We also need a full snapshot every 24 hours because, for some clusters, it may happen that not a single compacted snapshot is taken within 24 hours. Going 24 hours without a single full snapshot would be critical for those clusters.
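The scheduling rules described above could be sketched roughly as follows. This is only an illustration of the decision logic, not code from etcd-backup-restore or etcd-druid: the names `Action`, `next_action`, and `COMPACTION_INTERVAL` are hypothetical, and the 6-hour compaction interval is an assumed value. Only the 24-hour full-snapshot period comes from the text.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    FULL_SNAPSHOT = "full-snapshot"
    COMPACTION = "compaction"
    NONE = "none"


# Taken from the issue text: a full snapshot is still needed every 24 hours.
FULL_SNAPSHOT_INTERVAL = timedelta(hours=24)
# Assumption for illustration only; the issue just says "regular intervals".
COMPACTION_INTERVAL = timedelta(hours=6)


def next_action(
    now: datetime,
    last_full_snapshot: Optional[datetime],
    last_compaction: Optional[datetime],
) -> Action:
    """Decide what the backup sidecar should do at time `now`."""
    # Compaction needs at least one full snapshot in backup storage,
    # so the very first action is always a full snapshot.
    if last_full_snapshot is None:
        return Action.FULL_SNAPSHOT

    # Safety net: take a scheduled full snapshot every 24 hours,
    # regardless of whether any compaction succeeded in the meantime.
    if now - last_full_snapshot >= FULL_SNAPSHOT_INTERVAL:
        return Action.FULL_SNAPSHOT

    # Otherwise run compaction once the compaction interval has elapsed
    # since the last compaction (or since the last full snapshot if no
    # compaction has run yet).
    reference = last_compaction if last_compaction is not None else last_full_snapshot
    if now - reference >= COMPACTION_INTERVAL:
        return Action.COMPACTION

    return Action.NONE
```

Under these assumptions, a cluster with an empty backup store gets a full snapshot first, then a compaction every 6 hours, and a scheduled full snapshot again once 24 hours have passed since the previous one, even if every compaction in between failed.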
Motivation (Why is this needed?): We need this because we want our snapshots to take up less space in backup storage. An ETCD DB restored from our compacted snapshots will also take up less space in main memory. Moreover, regular compacted snapshots will keep the number of events in the delta snapshots limited. Please check this
Approach/Hint to implement the solution (optional):