gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
Apache License 2.0

[Feature] Defragment during restoration for compaction with ETCD BR #232

Open abdasgupta opened 3 years ago

abdasgupta commented 3 years ago

Feature (What you would like to be added): ETCD BR supports a compaction sub-command, see here. The compaction sub-command compacts the ETCD delta snapshots and turns them into a single full snapshot. To do so, a temporary ETCD instance is first restored from a full snapshot and the subsequent delta snapshots in the backup; then the ETCD DB is compacted, then the DB is defragmented, and finally a snapshot of the DB is taken. We want to defragment the ETCD DB a few times while the temporary ETCD is being restored from the delta snapshots.

Motivation (Why is this needed?): The delta snapshots used for restoration may contain a large number of events. During restoration, those events may consume a lot of memory, as they may fragment the memory space. To keep memory usage bounded during compaction, a few defragmentation operations are needed at interim points of the restoration.

Approach/Hint to the implement solution (optional):

shreyas-s-rao commented 1 year ago

As discussed recently as part of https://github.com/gardener/etcd-backup-restore/issues/604, the task of repeatedly defragmenting the embedded etcd during restoration can be taken up by etcd-backup-restore itself, based on certain criteria such as the number of deltas/events applied, or the DB size. This applies to both regular restoration and restoration as part of snapshot compaction.

Although it would still make sense to allow druid to orchestrate periodic or threshold-based defragmentations on the etcd members in the cluster, it also makes sense for etcd-backup-restore to defragment its own embedded etcd used for restorations, since that instance is not visible to druid (and should not be), and etcdbr knows best how many deltas/events have been applied and how much the embedded etcd DB has grown.

I would vote to move this issue to the scope of etcd-backup-restore (with some rephrasing, of course), and create a separate issue in druid to track the task of druid-orchestrated defragmentations of the etcd members. WDYT @unmarshall @abdasgupta ?

unmarshall commented 1 year ago

Defragmentation should always be done by backup-restore for its peer etcd container. With the EtcdMemberState CR, it would be possible for druid to know when to trigger threshold-based defragmentation. It is still important for druid to do this because it has an overview of the entire cluster. For example, it knows that it would not be prudent to trigger a defragmentation when there are currently only 2/3 members: even though quorum is established, it cannot start defragmentation on a live member, as that would bring down the quorum. Etcd-backup-restore can also check its own DB size and start defragmentation for the same reasons.
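The 2/3 case above comes down to simple quorum arithmetic, sketched below under the assumption that a member is unavailable to the cluster while it defragments. The function names are hypothetical and this is not druid code.

```go
package main

import "fmt"

// quorumSize returns the number of members needed for a majority.
func quorumSize(clusterSize int) int {
	return clusterSize/2 + 1
}

// canDefragment reports whether one ready member can be taken out for
// defragmentation while the remaining ready members still form a quorum.
// Assumption: the defragmenting member does not count towards quorum
// for the duration of the operation.
func canDefragment(clusterSize, readyMembers int) bool {
	return readyMembers-1 >= quorumSize(clusterSize)
}

func main() {
	fmt.Println(canDefragment(3, 3)) // all members ready: 2 remain, quorum is 2
	fmt.Println(canDefragment(3, 2)) // only 2/3 ready: 1 remains, quorum is lost
}
```

Note that this check always returns false for a single-member cluster, so a real policy would need to treat that case separately.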

shreyas-s-rao commented 1 year ago

/assign @ishan16696