Closed: guettli closed this issue 1 year ago.
Here is one possible solution: https://github.com/ugur99/etcd-defrag-cronjob
Thanks for raising the discussion on this. One complication is that, while Kubernetes is the largest user of etcd, it is not the only one.
With this in mind, I think we would need to consider what would be best suited to sit under the etcd operations guide docs versus the Kubernetes etcd operations docs.
It may make more sense for this issue, or a tandem issue, to be raised against the Kubernetes etcd operations docs?
I would like to have a solution (or documentation) for etcd.io first. I got bitten by outdated etcd docs on kubernetes.io once, and I think having docs in two places is confusing.
defragmenting the leader can lead to performance degradation and should be avoided.
Hi @guettli
I think defragging the leader is equivalent to defragging a follower. Generally speaking, Raft is not blocked by rewriting the db file.
For example, etcdctl endpoint status will show that the Raft Index (committed index) keeps incrementing while the Raft Applied Index does not during a defrag on the leader.
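For illustration, something along these lines can be used to watch both indexes while a defrag is running (the endpoint is a placeholder, and the raftIndex/raftAppliedIndex field names follow etcdctl's JSON status output in recent 3.5 releases and may differ by version):

# Poll one member every second and print committed vs. applied raft index.
while true; do
  etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json \
    | jq '.[0].Status | {raftIndex, raftAppliedIndex}'
  sleep 1
done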
@chaochn47 thank you for your answer. What is your advice for defragmenting etcd? How do you handle it?
Hi @guettli, here is how I would suggest approaching it:
1. Every couple of minutes, evaluate whether etcd should run defrag.
2. It will run defrag if more than 500 MB of space can be freed AND the DB size breaches a high water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; cut the time off at local timezone midnight).
3. It is guaranteed that defrag won't occur on more than one node at any given time.
A rough shell sketch of this evaluation is below.
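For illustration only; the endpoint, the quota, the timestamp file, and the use of dbSize/dbSizeInUse from etcdctl's JSON status to estimate freeable space are assumptions, and the cross-node "only one member defrags at a time" coordination is left out:

#!/usr/bin/env bash
# Hypothetical per-node defrag check, run every couple of minutes (e.g. from cron).
ENDPOINT="https://127.0.0.1:2379"
QUOTA_BYTES=$((8 * 1024 * 1024 * 1024))       # assumed quota: 8 GiB
HIGH_WATER=$((QUOTA_BYTES * 80 / 100))        # 80% high water mark
MIN_FREEABLE=$((500 * 1024 * 1024))           # 500 MB reclaimable
STAMP=/var/run/etcd-last-defrag               # records the last defrag time (assumed writable)

status=$(etcdctl --endpoints="$ENDPOINT" endpoint status -w json)
db_size=$(echo "$status" | jq '.[0].Status.dbSize')
in_use=$(echo "$status" | jq '.[0].Status.dbSizeInUse')
freeable=$((db_size - in_use))

last=$(cat "$STAMP" 2>/dev/null || echo 0)
age=$(( $(date +%s) - last ))

if { [ "$freeable" -gt "$MIN_FREEABLE" ] && [ "$db_size" -gt "$HIGH_WATER" ]; } \
   || [ "$age" -gt 86400 ]; then
  # NOTE: a real setup must also ensure no other member is defragging right now.
  etcdctl --endpoints="$ENDPOINT" --command-timeout=10m defrag \
    && date +%s > "$STAMP"
fi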
Generally speaking, Raft is not blocked by rewriting the db file.
That's true, in OpenShift we recommend doing the leader last, because the additional IO/memory/cache churn can impact the performance negatively. If a defrag takes down the leader, the other nodes are at least safely defrag'd already and can continue with the next election. We also do not defrag if any member is unhealthy.
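For example, a wrapper can gate on that with something like the following (assuming etcdctl endpoint health returns a non-zero exit code when any member is unhealthy, which recent versions do):

# Skip this defrag run entirely if any member reports unhealthy.
if ! etcdctl endpoint health --cluster; then
  echo "at least one etcd member is unhealthy; skipping defrag" >&2
  exit 1
fi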
@guettli Are you looking for a simple bash script in etcd/contrib or something more official as part of the CLI?
@tjungblu I don't have a preference about what the solution looks like. It could be a shell script, something added to etcdutil or maybe just some docs.
@chaochn47 explained the steps, but I am not familiar enough with etcd to write a corresponding script to implement this. I hope that someone with more knowledge of etcd can provide an executable solution.
Taking a quick look at how etcdctl defrag works currently, I'm wondering if we should make func defragCommandFunc more opinionated so that if it is passed the --cluster flag it would complete the defrag on all non-leader members first and then the leader.
This would simplify any downstream implementations of defrag functionality, as each implementation would not have to reinvent how to prioritize the cluster-wide defrag, provided they were built on top of etcdctl.
We could then update the website docs or add a contrib/defrag reference implementation for Kubernetes.
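Until something like that lands, the leader-last ordering can be approximated in a wrapper, roughly as follows (the jq field names match etcdctl's JSON status output, and jq may lose precision on 64-bit member IDs, so treat this purely as a sketch):

# Defrag all followers first, then the leader.
status=$(etcdctl endpoint status --cluster -w json)
leader_ep=$(echo "$status" \
  | jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')

for ep in $(echo "$status" | jq -r '.[].Endpoint'); do
  [ "$ep" = "$leader_ep" ] && continue                        # save the leader for last
  etcdctl --endpoints="$ep" --command-timeout=10m defrag
done
etcdctl --endpoints="$leader_ep" --command-timeout=10m defrag  # leader last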
@jmhbnz would it be possible to get this into etcdctl:
It will run defrag if more than 500 MB of space can be freed AND the DB size breaches a high water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; cut the time off at local timezone midnight). It is guaranteed that defrag won't occur on more than one node at any given time.
Then calling defragmentation does not need to be wrapped in a "dirty" shell.
Hey @guettli - I don't think we can build all of that into etcdctl, as for example etcdctl doesn't do any cron-style scheduling currently, to my knowledge.
As mentioned, I do think we can solve the issue of completing defrag for members one at a time before the leader as a built-in approach in etcdctl, provided the --cluster flag is used.
For some of the other requirements we have been working out in this issue, like scheduling or perhaps some of the monitoring-based checks, I think those will need to be handled as either documentation or additional resources in etcd/contrib, for example a Kubernetes cronjob or shell script example implementation.
@ahrtr, @serathius - Keen for maintainer input on this. If what I have suggested makes sense feel free to assign to me and I can work on it.
Apologies, removing my assignment for this as I am about to be traveling for several weeks and attending Kubecon so I likely won't have much capacity for a while. If anyone else has capacity they are welcome to pick it up.
I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag all together, instead of adding another feature/subproject that increases maintenance cost.
https://github.com/etcd-io/etcd/issues/9222 looks related. Seems like reducing bbolt fragmentation will be a third option in addition to option 1 and option 2 mentioned here - https://github.com/etcd-io/etcd/issues/9222#issuecomment-363194672. Has this been discussed before and was there any conclusion on preferred design approach? Should contributors interested in solving this start from scratch or build upon prior guidance? @serathius @ptabor /cc @chaochn47 @cenkalti
To me, it makes more sense to fix this at the BoltDB layer. By design (based on LMDB), the database should not require any maintenance operation. BoltDB has a FillPercent parameter to control page utilization when adding items to a page, but no control when removing items from a page. Related: https://github.com/etcd-io/bbolt/issues/422
From the K8s perspective, most fragmentation we see comes from Events; OpenShift also suffers from Images (CRDs for container image builds) on build-heavy clusters.
On larger clusters we advise sharding those to another etcd instance within the cluster, but maybe we can offer some "ephemeral keys" that have more relaxed storage and consistency guarantees? Or ones that use a storage engine other than bbolt, e.g. rocksdb/leveldb (or anything LSM-based)...
I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag all together, instead of adding another feature/subproject that increases maintenance cost.
@serathius this would be great. Having a cron-job which defragments the non-leaders first, then the leader, is extra overhead. Especially since there is no official version of such a script, and people solve the same task again and again.
Let me know if I can help somehow.
cc @ptabor who mentioned some ideas to limit bbolt fragmentation.
Note that I don't expect bbolt side change, at least in the near future, because we are still struggling to reproduce https://github.com/etcd-io/bbolt/issues/402 and https://github.com/etcd-io/bbolt/issues/446.
I think it makes sense to provide an official reference (just a reference!) on how to perform defragmentation. The rough idea (on top of all the inputs in this thread, e.g. from @tjungblu, @chaochn47, etc.) is:
Please also see Compaction & Defragmentation
I might spend some time to provide such a script for reference.
@ahrtr "I might spend some time to provide a such script for reference."
An official script would realy help here. The topic is too hot to let everybody re-solve this on its own.
I've been writing my own script for this. I guess my biggest question is:
If I get all the members of my cluster and then loop through them in bash, and execute etcdctl --user root: --endpoints="$my_endpoint" defrag, will it wait for the defrag to finish before moving on to the next member?
execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish
It does not. Please take a look at the discussion in https://github.com/etcd-io/etcd/discussions/15664.
You could use curl -sL http://localhost:2379/metrics | grep "etcd_disk_defrag_inflight" in your script to determine whether the defrag has completed.
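For example, a small wait helper along those lines (assuming the member serves Prometheus metrics at that URL and that etcd_disk_defrag_inflight reports 1 while a defrag is running and 0 otherwise):

# Block until the member no longer reports a defrag in flight.
wait_for_defrag() {
  local metrics_url="$1"        # e.g. http://localhost:2379/metrics
  while curl -sL "$metrics_url" | grep -q '^etcd_disk_defrag_inflight 1'; do
    sleep 5
  done
}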
execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish
It does not.
That isn't correct. It waits for the defrag to finish before moving on to the next member.
FYI. I am implementing a tool to do defragmentation. Hopefully the first version can be ready next week.
It does wait, but it times out after some duration. IIRC it's 30 seconds.
It does wait, but it times out after some duration
Yes, that's another story. The default command timeout is 5s. It's recommended to set a bigger value (e.g. 1m) for defragmentation, because it may take a long time to defragment a large DB. I don't have performance data for now on how much time it may need for different DB sizes.
The pattern I see is usually 10s per GB. @bradjones1320 you can set a larger timeout with the --command-timeout=60s flag.
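So for an 8 GB DB, something like the following leaves some headroom (the exact value is only illustrative):

# ~10s per GB plus headroom.
etcdctl --user root: --endpoints="$my_endpoint" --command-timeout=3m defrag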
FYI. https://github.com/ahrtr/etcd-defrag
Just as I mentioned in https://github.com/etcd-io/etcd/issues/15477#issuecomment-1506050843, the tool etcd-defrag,
@ahrtr
It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);
When saying "stop-the-world", are you only referring to the following check:
or are there other reasons that might stop the world?
When etcdserver is processing the defragmentation, it can't serve any client requests, see https://github.com/etcd-io/etcd/blob/63c9fe1d000842ed8b50edf0b41d1101456fa1f7/server/storage/backend/backend.go#L456-L465
The main functionality of https://github.com/ahrtr/etcd-defrag is ready; the remaining work is to add more utilities (e.g. a Dockerfile, manifests for K8s, etc.). Anyone, please feel free to let me know if you have any suggestions or questions.
@ahrtr
It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);
Could you share why defragmentation will cause leadership transfers? In my understanding, when the leader is processing the defragmentation, it blocks the system from reading and writing data. However, raft is not blocked, so defrag will not cause a leadership transfer.
FYI, I did a test on a 3-node cluster. While defragging, the etcd leader node's health check failed, but there was no leader election.
1. Feed 8 GiB of data into the etcd cluster.
2. Set up clients that continuously send reads/writes to all nodes.
3. Start defrag on the leader.
4. Check cluster health.
5. Check whether a leader election occurs.
% etcdctl endpoint status --cluster
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://127.0.0.1:22379 | 91bc3c398fb3c146 | 3.5.7 | 8.0 GB | true | 7 | 1394738 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
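A rough way to reproduce steps 1 and 3-5 with plain etcdctl (the value size and key count are illustrative, the backend quota is assumed to have been raised to allow 8 GiB, and the benchmark tool would be much faster for loading data):

# 1. Feed data into the cluster (slow but dependency-free).
payload=$(head -c 10000 /dev/urandom | base64 | tr -d '\n')
for i in $(seq 1 100000); do
  etcdctl put "key-$i" "$payload" > /dev/null
done

# 3. Start the defrag on the leader endpoint in the background.
etcdctl --endpoints=http://127.0.0.1:22379 --command-timeout=15m defrag &

# 4./5. While it runs, check health and watch whether RAFT TERM changes.
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table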
While defragging on the leader, the leader is unhealthy; clients connecting to the leader are blocked or receive a "too many requests" error:
status = StatusCode.UNKNOWN
details = "etcdserver: too many requests"
The defragmentation takes 6m17s:
{
"msg": "finished defragmenting directory",
"current-db-size": "8.0 GB",
"took": "6m17.969308853s"
}
After the defragmentation, the leader becomes healthy again and the raft term is still 7, which means there was no leader transfer.
% etcdctl endpoint status --cluster -w table
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://127.0.0.1:2379 | 8211f1d0f64f3269 | 3.5.7 | 8.1 GB | false | 7 | 1808612 |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 | 3.5.7 | 8.0 GB | true | 7 | 1808612 |
| http://127.0.0.1:32379 | fd422379fda50e48 | 3.5.7 | 8.1 GB | false | 7 | 1808612 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
It turned out that the leader doesn't stop the world while processing defragmentation, because the apply workflow is executed asynchronously (https://github.com/etcd-io/etcd/blob/63c9fe1d000842ed8b50edf0b41d1101456fa1f7/server/etcdserver/server.go#L847), so defrag will not cause a leadership transfer.
Confirmed that it doesn't cause a leadership transfer no matter how long the leader is blocked on processing defragmentation. That is actually an issue in itself, and we should fix it. I have a pending PR https://github.com/etcd-io/etcd/pull/15440; let me think about how to resolve them together.
Again, it's still recommended to run defragmentation on the leader last, because the leader has more responsibilities (e.g. sending snapshots) than followers; once it's blocked for a long time, all the responsibilities dedicated to the leader stop working.
Please also read https://github.com/ahrtr/etcd-defrag
Since we already have https://github.com/ahrtr/etcd-defrag, can we close this ticket? @guettli
FYI. I might formally release etcd-defrag v0.1.0 in the next 1-2 weeks.
Could you share why defragmentation will cause leadership transfers? In my understanding, when the leader is processing the defragmentation, it blocks the system from reading and writing data. However, raft is not blocked, so defrag will not cause a leadership transfer.
A defrag call will not cause leadership transfer, but the resulting IO+CPU load might cause this. Try again on a machine with a very slow disk or limited CPU. We've definitely seen this happening on loaded control planes.
Closing this, since https://github.com/ahrtr/etcd-defrag exists.
It would still be awesome to get built-in / official support for this, yeah?
Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?
@TechDufus good question. If you know the answer, please write it here into this issue. Thank you.
Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?
What would you like to be added?
I would like to see an official solution for how to defragment etcd.
AFAIK a one-line cron-job is not enough, since you should not defragment the current leader.
Related: https://github.com/etcd-io/etcd/discussions/14975
Maybe it is enough to add a simple example script to the docs.
Why is this needed?
defragmenting the leader can lead to performance degradation and should be avoided.
I don't think it makes sense that every company running etcd invents its own way to solve this.