[Feature] Replace full snapshots at regular interval with compacted snapshots

Feature (What you would like to be added): We recently added a compaction subcommand in ETCD Backup Restore here. This subcommand compacts and defragments the ETCD database and then takes a full snapshot of it. We can run this subcommand in parallel with ETCD BR at regular intervals instead of taking full snapshots. However, the first full snapshot and a full snapshot every 24 hours are still needed. The first snapshot is needed because compaction can't run in parallel unless at least one full snapshot is already in backup storage. We also need full snapshots every 24 hours because, for some clusters, there may be situations in which not a single compacted snapshot is taken in 24 hours. It would be really critical for those clusters to go without a single full snapshot for 24 hours.

Motivation (Why is this needed?): We need this because we want our snapshots to take less space in backup storage. An ETCD DB restored from our compacted snapshots will also take less space in main memory. Moreover, regular compacted snapshots will keep the number of events in delta snapshots limited. Please check this

Approach/Hint to implement the solution (optional):

As mentioned by @vlerenc in https://github.com/gardener/etcd-backup-restore/issues/587#issuecomment-1447946804, we need to make snapshot compaction much smarter than it is today if it is to replace scheduled full snapshots.

Paraphrasing the points from Vedran's comment that relate to snapshot compaction here, since etcd-druid handles snapshot compaction:

We need to improve the conditions that trigger a snapshot compaction job.

Today, a threshold-based trigger is used, based on the number of events accumulated in the latest set of delta snapshots in the snapstore.
This needs to be enhanced to also account for the size of the accumulated events, which is required for clusters that write and update huge resources even though the number of events may be relatively small.
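
To make the combined trigger concrete, here is a minimal sketch in Go. It is not the existing etcd-druid logic: the type names, field names, and threshold values (`DeltaStats`, `CompactionThresholds`, `ShouldCompact`) are all hypothetical and only illustrate "compact when either the event count or the accumulated size crosses its limit".

```go
package main

import "fmt"

// DeltaStats summarizes the delta snapshots accumulated since the last full
// (or compacted) snapshot. Hypothetical type, for illustration only.
type DeltaStats struct {
	Events int64 // number of events across the delta snapshots
	Bytes  int64 // total size of those delta snapshots, in bytes
}

// CompactionThresholds holds the limits that trigger a compaction job.
// A zero value disables the corresponding check (an assumption of this sketch).
type CompactionThresholds struct {
	MaxEvents int64
	MaxBytes  int64
}

// ShouldCompact reports whether either enabled threshold has been reached.
func ShouldCompact(s DeltaStats, t CompactionThresholds) bool {
	if t.MaxEvents > 0 && s.Events >= t.MaxEvents {
		return true
	}
	if t.MaxBytes > 0 && s.Bytes >= t.MaxBytes {
		return true
	}
	return false
}

func main() {
	// Example limits only; real values would need to be tuned per landscape.
	t := CompactionThresholds{MaxEvents: 1_000_000, MaxBytes: 512 << 20}

	// Few events, but large ones: the size threshold is what fires here.
	s := DeltaStats{Events: 2_000, Bytes: 600 << 20}
	fmt.Println(ShouldCompact(s, t))
}
```

Treating the two thresholds as an OR keeps today's event-count behaviour intact while covering clusters with few but very large updates.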

We also need to improve the alerts:

If we plan to compact or take full snapshots every 1M revisions, fire the alert if 2M revisions have accumulated since the last full snapshot (compacted or explicitly taken)
If we plan to compact or take full snapshots every 24h of cluster runtime, fire the alert if 48h have passed (do not use wall-clock time for the condition and/or alert)…
If we plan to compact or take full snapshots every 200 delta snapshots, fire the alert if 400 delta snapshots have accumulated…

Additionally, I would like to add that we need to compare the cost of the current and proposed approaches and check whether there is a cost improvement and, if not, whether the added cost is acceptable. My gut feeling is that, since the proposed approach plans to use cluster runtime rather than wall-clock time, we might see a cost reduction for the average cluster by avoiding "unnecessary" full snapshots. For larger clusters, we will definitely see more frequent full snapshots due to the higher rate/size of events, but that is acceptable and necessary to avoid slow restorations after potential data corruption.

gardener / etcd-druid

[Feature] Replace full snapshots at regular interval with compacted snapshots #231