
Galasa needs to periodically defrag etcd #1887

Open techcobweb opened 4 weeks ago

techcobweb commented 4 weeks ago

Story

As a Galasa administrator, I want to be able to defrag my etcd instance without taking Galasa down, so that users are not affected.

Background

Galasa uses etcd to hold runtime cache storage and CPS properties, which need to be shared across the ecosystem.

etcd uses disk space in a Kubernetes PVC, and that usage grows until the volume runs out of space. To keep this in check, you need to compact and defragment the etcd storage using etcdctl commands.

Issue the following to compact and defrag:

```console
$ rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away old revisions of the keyspace, up to the current revision
$ ETCDCTL_API=3 etcdctl compact $rev
compacted revision 1516
# then defragment to release the freed space back to the filesystem
$ ETCDCTL_API=3 etcdctl defrag
```
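
In a Galasa ecosystem these commands have to be run inside the cluster, for example by exec-ing into the etcd pod. A minimal sketch, where the namespace and pod name are assumptions rather than what the Galasa chart actually creates:

```bash
# Namespace and pod name are assumptions; adjust to your installation.
# The DB SIZE column shows how much space the member is holding in its PVC;
# it only shrinks after a defrag.
kubectl -n galasa exec galasa-etcd-0 -- \
  etcdctl --endpoints=localhost:2379 endpoint status --write-out=table
```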

But there is a problem with this: etcd blocks calls while the defrag happens. The etcd maintenance docs are here: https://etcd.io/docs/v3.5/op-guide/maintenance/

They say this:

> Defragmentation
>
> After compacting the keyspace, the backend database may exhibit internal fragmentation. Any internal fragmentation is space that is free to use by the backend but still consumes storage space. Compacting old revisions internally fragments etcd by leaving gaps in backend database. Fragmented space is available for use by etcd but unavailable to the host filesystem. In other words, deleting application data does not reclaim the space on disk.
>
> The process of defragmentation releases this storage space back to the file system. Defragmentation is issued on a per-member basis so that cluster-wide latency spikes may be avoided.
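
Acting on that per-member advice would mean defragmenting the members one at a time rather than passing every endpoint in a single call. A hedged sketch, with endpoint names that are assumptions rather than real Galasa configuration:

```bash
# Defragment one member at a time so that only one member is blocked at any
# moment, and check it is healthy again before moving on to the next one.
# The endpoint names here are hypothetical.
for ep in etcd-0:2379 etcd-1:2379 etcd-2:2379; do
  ETCDCTL_API=3 etcdctl --endpoints="$ep" defrag
  ETCDCTL_API=3 etcdctl --endpoints="$ep" endpoint health
done
```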

And then this (which is the concern I have):

> Note that defragmentation to a live member blocks the system from reading and writing data while rebuilding its states.

It's not clear what 'blocks the system' means in that context. For how long? Presumably it depends on how much work the defrag has to do.
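
One way to get a feel for the duration on a given ecosystem (a hedged sketch, not something the etcd docs prescribe) is to time a defrag of a single member, raising etcdctl's command timeout because a large, fragmented database can easily exceed the 5 second default:

```bash
# Endpoint is hypothetical. `time` reports roughly how long the member was
# blocked; --command-timeout is raised from its 5s default so that a slow
# defrag is not cut short on the client side.
time ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 \
  --command-timeout=60s defrag
```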

Seems that Galasa should have a cron job to schedule such defragmentation automagically at frequent, regular intervals, so that a) future defrags don't take a long time, and b) etcd doesn't run out of disk in its PVC.

There are some comments on this topic here: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/ which mention a cron job here: https://github.com/ahrtr/etcd-defrag/blob/main/doc/etcd-defrag-cronjob.yaml

FYI: that cron job uses a schedule of `14 9 * * 1-5`, which means "At 09:14 AM, Monday through Friday".
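
For Galasa, the body of such a job could be little more than the compact-then-defrag sequence above wrapped in a script. A hedged sketch, where the endpoint name and timeouts are assumptions rather than anything the Galasa helm chart defines today:

```bash
#!/bin/bash
# Hypothetical scheduled defrag script for a Galasa ecosystem's etcd.
set -e
EP="galasa-etcd:2379"

# Compact away old revisions first so the defrag has space to reclaim.
# (If nothing was written since the last compaction, compact fails with
# "required revision has been compacted", which is harmless here.)
rev=$(ETCDCTL_API=3 etcdctl --endpoints="$EP" endpoint status --write-out="json" \
      | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
ETCDCTL_API=3 etcdctl --endpoints="$EP" compact "$rev" || true

# Defragment with a generous timeout, since the member blocks reads and
# writes for the duration, then confirm the member is healthy again.
ETCDCTL_API=3 etcdctl --endpoints="$EP" --command-timeout=120s defrag
ETCDCTL_API=3 etcdctl --endpoints="$EP" endpoint health
```

Packaged as a Kubernetes CronJob on a schedule like the one above, or by reusing the etcd-defrag image linked above, this would make the maintenance automatic.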

Tasks

techcobweb commented 3 weeks ago

This is relevant: https://github.com/etcd-io/etcd/issues/15477