
Galasa needs to periodically defrag etcd #1887

Open techcobweb opened 4 weeks ago

techcobweb commented 4 weeks ago

Story

As a Galasa administrator, I want to be able to defrag my etcd instance without taking Galasa down, so that users are not affected.

Background

Galasa uses etcd to hold runtime cache storage and CPS properties, which need to be shared across the ecosystem.

etcd uses disk space in a Kubernetes PVC, and that usage grows until the volume runs out of space. To keep this in check, you need to compact and defragment the etcd storage using etcdctl commands.

Issue the following to compact and defrag:

```console
$ rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away old revisions of the keyspace, up to the current revision
$ ETCDCTL_API=3 etcdctl compact $rev
compacted revision 1516
# then defragment to release the freed space back to the filesystem
$ ETCDCTL_API=3 etcdctl defrag
```
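
In a Galasa ecosystem these commands have to be run inside the cluster, for example by exec-ing into the etcd pod. A minimal sketch, where the namespace and pod name are assumptions rather than what the Galasa chart actually creates:

```bash
# Namespace and pod name are assumptions; adjust to your installation.
# The DB SIZE column shows how much space the member is holding in its PVC;
# it only shrinks after a defrag.
kubectl -n galasa exec galasa-etcd-0 -- \
  etcdctl --endpoints=localhost:2379 endpoint status --write-out=table
```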

But there is a problem with this: etcd blocks calls while the defrag happens. The etcd maintenance docs are here: https://etcd.io/docs/v3.5/op-guide/maintenance/

They say this:

> Defragmentation
>
> After compacting the keyspace, the backend database may exhibit internal fragmentation. Any internal fragmentation is space that is free to use by the backend but still consumes storage space. Compacting old revisions internally fragments etcd by leaving gaps in backend database. Fragmented space is available for use by etcd but unavailable to the host filesystem. In other words, deleting application data does not reclaim the space on disk.
>
> The process of defragmentation releases this storage space back to the file system. Defragmentation is issued on a per-member basis so that cluster-wide latency spikes may be avoided.
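
Acting on that per-member advice would mean defragmenting the members one at a time rather than passing every endpoint in a single call. A hedged sketch, with endpoint names that are assumptions rather than real Galasa configuration:

```bash
# Defragment one member at a time so that only one member is blocked at any
# moment, and check it is healthy again before moving on to the next one.
# The endpoint names here are hypothetical.
for ep in etcd-0:2379 etcd-1:2379 etcd-2:2379; do
  ETCDCTL_API=3 etcdctl --endpoints="$ep" defrag
  ETCDCTL_API=3 etcdctl --endpoints="$ep" endpoint health
done
```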

And then this (which is the concern I have):

> Note that defragmentation to a live member blocks the system from reading and writing data while rebuilding its states.

It's not clear what 'blocks the system' means in that context. For how long? Presumably it depends on how much work the defrag has to do.
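
One way to get a feel for the duration on a given ecosystem (a hedged sketch, not something the etcd docs prescribe) is to time a defrag of a single member, raising etcdctl's command timeout because a large, fragmented database can easily exceed the 5 second default:

```bash
# Endpoint is hypothetical. `time` reports roughly how long the member was
# blocked; --command-timeout is raised from its 5s default so that a slow
# defrag is not cut short on the client side.
time ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 \
  --command-timeout=60s defrag
```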

Seems that Galasa should have a cron job to schedule such defragmentation automagically at frequent, regular intervals, so that a) future defrags don't take a long time, and b) etcd doesn't run out of disk in its PVC.

There are some comments on this topic here: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/ which mention a cron job here: https://github.com/ahrtr/etcd-defrag/blob/main/doc/etcd-defrag-cronjob.yaml

FYI: that cron job uses a schedule of `14 9 * * 1-5`, which means "At 09:14 AM, Monday through Friday".
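
For Galasa, the body of such a job could be little more than the compact-then-defrag sequence above wrapped in a script. A hedged sketch, where the endpoint name and timeouts are assumptions rather than anything the Galasa helm chart defines today:

```bash
#!/bin/bash
# Hypothetical scheduled defrag script for a Galasa ecosystem's etcd.
set -e
EP="galasa-etcd:2379"

# Compact away old revisions first so the defrag has space to reclaim.
# (If nothing was written since the last compaction, compact fails with
# "required revision has been compacted", which is harmless here.)
rev=$(ETCDCTL_API=3 etcdctl --endpoints="$EP" endpoint status --write-out="json" \
      | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
ETCDCTL_API=3 etcdctl --endpoints="$EP" compact "$rev" || true

# Defragment with a generous timeout, since the member blocks reads and
# writes for the duration, then confirm the member is healthy again.
ETCDCTL_API=3 etcdctl --endpoints="$EP" --command-timeout=120s defrag
ETCDCTL_API=3 etcdctl --endpoints="$EP" endpoint health
```

Packaged as a Kubernetes CronJob on a schedule like the one above, or by reusing the etcd-defrag image linked above, this would make the maintenance automatic.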

Tasks

techcobweb commented 3 weeks ago

This is relevant: https://github.com/etcd-io/etcd/issues/15477