Closed: guettli closed this issue 1 year ago.
Here is one possible solution: https://github.com/ugur99/etcd-defrag-cronjob
Thanks for raising the discussion on this. One complication is that, while Kubernetes is the largest user of etcd, it is not the only one.
With this in mind, I think we would need to consider what would be best suited to sit under the etcd operations guide docs versus the Kubernetes etcd operations docs.
It may make more sense for this issue, or a tandem issue, to be raised against the Kubernetes etcd operations docs?
I would like to have a solution (or documentation) for etcd.io first. I got bitten by outdated etcd docs on kubernetes.io once, and I think having docs in two places is confusing.
defragmenting the leader can lead to performance degradation and should be avoided.
Hi @guettli
I think defragging the leader is equivalent to defragging a follower. Generally speaking, Raft is not blocked by rewriting the db file.
For example, etcdctl endpoint status will show that the Raft Index (committed index) keeps incrementing while the Raft Applied Index does not during a defrag on the leader.
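For illustration, something along these lines can be used to watch both indexes while a defrag is running (the endpoint is a placeholder, and the raftIndex/raftAppliedIndex field names follow etcdctl's JSON status output in recent 3.5 releases and may differ by version):

# Poll one member every second and print committed vs. applied raft index.
while true; do
  etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json \
    | jq '.[0].Status | {raftIndex, raftAppliedIndex}'
  sleep 1
done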
@chaochn47 thank you for your answer. What is your advice for defragmenting etcd? How do you handle it?
Hi @guettli, here is how I would suggest approaching it:
1. Every couple of minutes, evaluate whether etcd should run defrag.
2. It will run defrag if more than 500 MB of space can be freed AND the DB size breaches a high water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; cut the time off at local timezone midnight).
3. It is guaranteed that defrag won't occur on more than one node at any given time.
A rough shell sketch of this evaluation is below.
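For illustration only; the endpoint, the quota, the timestamp file, and the use of dbSize/dbSizeInUse from etcdctl's JSON status to estimate freeable space are assumptions, and the cross-node "only one member defrags at a time" coordination is left out:

#!/usr/bin/env bash
# Hypothetical per-node defrag check, run every couple of minutes (e.g. from cron).
ENDPOINT="https://127.0.0.1:2379"
QUOTA_BYTES=$((8 * 1024 * 1024 * 1024))       # assumed quota: 8 GiB
HIGH_WATER=$((QUOTA_BYTES * 80 / 100))        # 80% high water mark
MIN_FREEABLE=$((500 * 1024 * 1024))           # 500 MB reclaimable
STAMP=/var/run/etcd-last-defrag               # records the last defrag time (assumed writable)

status=$(etcdctl --endpoints="$ENDPOINT" endpoint status -w json)
db_size=$(echo "$status" | jq '.[0].Status.dbSize')
in_use=$(echo "$status" | jq '.[0].Status.dbSizeInUse')
freeable=$((db_size - in_use))

last=$(cat "$STAMP" 2>/dev/null || echo 0)
age=$(( $(date +%s) - last ))

if { [ "$freeable" -gt "$MIN_FREEABLE" ] && [ "$db_size" -gt "$HIGH_WATER" ]; } \
   || [ "$age" -gt 86400 ]; then
  # NOTE: a real setup must also ensure no other member is defragging right now.
  etcdctl --endpoints="$ENDPOINT" --command-timeout=10m defrag \
    && date +%s > "$STAMP"
fi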
Generally speaking, Raft is not blocked by rewriting the db file.
That's true, in OpenShift we recommend doing the leader last, because the additional IO/memory/cache churn can impact the performance negatively. If a defrag takes down the leader, the other nodes are at least safely defrag'd already and can continue with the next election. We also do not defrag if any member is unhealthy.
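For example, a wrapper can gate on that with something like the following (assuming etcdctl endpoint health returns a non-zero exit code when any member is unhealthy, which recent versions do):

# Skip this defrag run entirely if any member reports unhealthy.
if ! etcdctl endpoint health --cluster; then
  echo "at least one etcd member is unhealthy; skipping defrag" >&2
  exit 1
fi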
@guettli Are you looking for a simple bash script in etcd/contrib or something more official as part of the CLI?
@tjungblu I don't have a preference about what the solution looks like. It could be a shell script, something added to etcdutil or maybe just some docs.
@chaochn47 explained the steps, but I am not familiar enough with etcd to write a corresponding script to implement this. I hope that someone with more knowledge of etcd can provide an executable solution.
Taking a quick look at how etcdctl defrag works currently, I'm wondering if we should make func defragCommandFunc more opinionated so that if it is passed the --cluster flag it would complete the defrag on all non-leader members first and then the leader.
This would simplify any downstream implementations of defrag functionality, as each implementation would not have to reinvent how to prioritize the cluster-wide defrag, provided they were built on top of etcdctl.
We could then update the website docs or add a contrib/defrag reference implementation for Kubernetes.
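Until something like that lands, the leader-last ordering can be approximated in a wrapper, roughly as follows (the jq field names match etcdctl's JSON status output, and jq may lose precision on 64-bit member IDs, so treat this purely as a sketch):

# Defrag all followers first, then the leader.
status=$(etcdctl endpoint status --cluster -w json)
leader_ep=$(echo "$status" \
  | jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')

for ep in $(echo "$status" | jq -r '.[].Endpoint'); do
  [ "$ep" = "$leader_ep" ] && continue                        # save the leader for last
  etcdctl --endpoints="$ep" --command-timeout=10m defrag
done
etcdctl --endpoints="$leader_ep" --command-timeout=10m defrag  # leader last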
@jmhbnz would it be possible to get this into etcdctl:
It will run defrag if more than 500 MB of space can be freed AND the DB size breaches a high water mark of the quota (80%), OR it has been 24 hours since the last defrag on the node (each cluster runs 3 etcd nodes; cut the time off at local timezone midnight). It is guaranteed that defrag won't occur on more than one node at any given time.
Then calling defragmentation does not need to be wrapped in a "dirty" shell.
Hey @guettli - I don't think we can build all of that into etcdctl, as for example etcdctl doesn't do any cron-style scheduling currently, to my knowledge.
As mentioned, I do think we can solve the issue of completing defrag for members one at a time before the leader as a built-in approach in etcdctl, provided the --cluster flag is used.
For some of the other requirements we have been working out in this issue, like scheduling or perhaps some of the monitoring-based checks, I think those will need to be handled as either documentation or additional resources in etcd/contrib, for example a Kubernetes cronjob or shell script example implementation.
@ahrtr, @serathius - Keen for maintainer input on this. If what I have suggested makes sense feel free to assign to me and I can work on it.
Apologies, removing my assignment for this as I am about to be traveling for several weeks and attending Kubecon so I likely won't have much capacity for a while. If anyone else has capacity they are welcome to pick it up.
I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag all together, instead of adding another feature/subproject that increases maintenance cost.
https://github.com/etcd-io/etcd/issues/9222 looks related. Seems like reducing bbolt fragmentation will be a third option in addition to option 1 and option 2 mentioned here - https://github.com/etcd-io/etcd/issues/9222#issuecomment-363194672. Has this been discussed before and was there any conclusion on preferred design approach? Should contributors interested in solving this start from scratch or build upon prior guidance? @serathius @ptabor /cc @chaochn47 @cenkalti
To me, it makes more sense to fix this at the BoltDB layer. By design (based on LMDB), the database should not require any maintenance operation. BoltDB has a FillPercent parameter to control page utilization when adding items to a page, but no control when removing items from a page. Related: https://github.com/etcd-io/bbolt/issues/422
From the K8s perspective, most fragmentation we see comes from Events; OpenShift also suffers from Images (CRDs for container image builds) on build-heavy clusters.
On larger clusters we advise sharding those to another etcd instance within the cluster, but maybe we can offer some "ephemeral keys" that have more relaxed storage and consistency guarantees? Or ones that use a storage engine other than bbolt, e.g. rocksdb/leveldb (or anything LSM-based)...
I would recommend looking into reducing bbolt fragmentation so we can get rid of defrag all together, instead of adding another feature/subproject that increases maintenance cost.
@serathius this would be great. Having a cron-job which defragments the non-leaders first, then the leader, is extra overhead. Especially since there is no official version of such a script, and people solve the same task again and again.
Let me know if I can help somehow.
cc @ptabor who mentioned some ideas to limit bbolt fragmentation.
Note that I don't expect bbolt side change, at least in the near future, because we are still struggling to reproduce https://github.com/etcd-io/bbolt/issues/402 and https://github.com/etcd-io/bbolt/issues/446.
I think it makes sense to provide an official reference (just a reference!) on how to perform defragmentation. The rough idea (on top of all the inputs in this thread, e.g. from @tjungblu, @chaochn47, etc.) is:
Please also see Compaction & Defragmentation
I might spend some time to provide such a script for reference.
@ahrtr "I might spend some time to provide a such script for reference."
An official script would realy help here. The topic is too hot to let everybody re-solve this on its own.
I've been writing my own script for this. I guess my biggest question is:
If I get all the members of my cluster and then loop through them in bash, and execute etcdctl --user root: --endpoints="$my_endpoint" defrag, will it wait for the defrag to finish before moving on to the next member?
execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish
It does not. Please take a look at the discussion in https://github.com/etcd-io/etcd/discussions/15664.
You could use curl -sL http://localhost:2379/metrics | grep "etcd_disk_defrag_inflight" in your script to determine whether the defrag has completed.
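For example, a small wait helper along those lines (assuming the member serves Prometheus metrics at that URL and that etcd_disk_defrag_inflight reports 1 while a defrag is running and 0 otherwise):

# Block until the member no longer reports a defrag in flight.
wait_for_defrag() {
  local metrics_url="$1"        # e.g. http://localhost:2379/metrics
  while curl -sL "$metrics_url" | grep -q '^etcd_disk_defrag_inflight 1'; do
    sleep 5
  done
}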
execute etcdctl --user root: --endpoints="$my_endpoint" defrag will it wait for the defrag to finish
It does not.
That isn't correct. It waits for the defrag to finish before moving on to the next member.
FYI. I am implementing a tool to do defragmentation. Hopefully the first version can be ready next week.
It does wait, but it times out after some duration. IIRC it's 30 seconds.
It does wait, but it times out after some duration
Yes, that's another story. The default command timeout is 5s. It's recommended to set a bigger value (e.g. 1m) for defragmentation, because it may take a long time to defragment a large DB. I don't have performance data for now on how much time it may need for different DB sizes.
The pattern I see is usually 10s per GB. @bradjones1320 you can set a larger timeout with the --command-timeout=60s flag.
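So for an 8 GB DB, something like the following leaves some headroom (the exact value is only illustrative):

# ~10s per GB plus headroom.
etcdctl --user root: --endpoints="$my_endpoint" --command-timeout=3m defrag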
FYI. https://github.com/ahrtr/etcd-defrag
Just as I mentioned in https://github.com/etcd-io/etcd/issues/15477#issuecomment-1506050843, the tool etcd-defrag,
@ahrtr
It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);
When saying "stop-the-world", are you only referring to the following check:
or are there other reasons that might stop the world?
When etcdserver is processing the defragmentation, it can't serve any client requests, see https://github.com/etcd-io/etcd/blob/63c9fe1d000842ed8b50edf0b41d1101456fa1f7/server/storage/backend/backend.go#L456-L465
The main functionality of https://github.com/ahrtr/etcd-defrag is ready; the remaining work is to add more utilities (e.g. a Dockerfile, manifests for K8s, etc.). Anyone, please feel free to let me know if you have any suggestions or questions.
@ahrtr
It's recommended to defragment the leader last, because it might stop-the-world & cause transferring leadership multiple times, and cause additional performance impact (although usually it isn't a big deal);
Could you share why defragmentation will cause leadership transfers? In my understanding, when the leader is processing the defragmentation, it blocks the system from reading and writing data. However, raft is not blocked, so defrag will not cause a leadership transfer.
FYI, I did a test on a 3-node cluster. While defragging, the etcd leader node's health check failed, but there was no leader election.
1. Feed 8 GiB of data into the etcd cluster.
2. Set up clients that continuously send reads/writes to all nodes.
3. Start defrag on the leader.
4. Check cluster health.
5. Check whether a leader election occurs.
% etcdctl endpoint status --cluster
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://127.0.0.1:22379 | 91bc3c398fb3c146 | 3.5.7 | 8.0 GB | true | 7 | 1394738 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
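A rough way to reproduce steps 1 and 3-5 with plain etcdctl (the value size and key count are illustrative, the backend quota is assumed to have been raised to allow 8 GiB, and the benchmark tool would be much faster for loading data):

# 1. Feed data into the cluster (slow but dependency-free).
payload=$(head -c 10000 /dev/urandom | base64 | tr -d '\n')
for i in $(seq 1 100000); do
  etcdctl put "key-$i" "$payload" > /dev/null
done

# 3. Start the defrag on the leader endpoint in the background.
etcdctl --endpoints=http://127.0.0.1:22379 --command-timeout=15m defrag &

# 4./5. While it runs, check health and watch whether RAFT TERM changes.
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table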
While defragging on the leader, the leader is unhealthy; clients connecting to the leader are blocked or receive a "too many requests" error:
status = StatusCode.UNKNOWN
details = "etcdserver: too many requests"
The defragmentation takes 6m17s:
{
"msg": "finished defragmenting directory",
"current-db-size": "8.0 GB",
"took": "6m17.969308853s"
}
After the defragmentation, the leader becomes healthy again and the raft term is still 7, which means there was no leader transfer.
% etcdctl endpoint status --cluster -w table
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| http://127.0.0.1:2379 | 8211f1d0f64f3269 | 3.5.7 | 8.1 GB | false | 7 | 1808612 |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 | 3.5.7 | 8.0 GB | true | 7 | 1808612 |
| http://127.0.0.1:32379 | fd422379fda50e48 | 3.5.7 | 8.1 GB | false | 7 | 1808612 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
It turned out that the leader doesn't stop the world while processing defragmentation, because the apply workflow is executed asynchronously (https://github.com/etcd-io/etcd/blob/63c9fe1d000842ed8b50edf0b41d1101456fa1f7/server/etcdserver/server.go#L847), so defrag will not cause a leadership transfer.
Confirmed that it doesn't cause a leadership transfer no matter how long the leader is blocked on processing defragmentation. That is actually an issue in itself, and we should fix it. I have a pending PR https://github.com/etcd-io/etcd/pull/15440; let me think about how to resolve them together.
Again, it's still recommended to run defragmentation on the leader last, because the leader has more responsibilities (e.g. sending snapshots) than followers; once it's blocked for a long time, all the responsibilities dedicated to the leader stop working.
Please also read https://github.com/ahrtr/etcd-defrag
Since we already have https://github.com/ahrtr/etcd-defrag, can we close this ticket? @guettli
FYI. I might formally release etcd-defrag v0.1.0 in the next 1-2 weeks.
Could you share why defragmentation will cause leadership transfers? In my understanding, when the leader is processing the defragmentation, it blocks the system from reading and writing data. However, raft is not blocked, so defrag will not cause a leadership transfer.
A defrag call will not cause leadership transfer, but the resulting IO+CPU load might cause this. Try again on a machine with a very slow disk or limited CPU. We've definitely seen this happening on loaded control planes.
Closing this, since https://github.com/ahrtr/etcd-defrag exists.
It would still be awesome to get built-in / official support for this, yeah?
Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?
@TechDufus good question. If you know the answer, please write it here into this issue. Thank you.
Or will https://github.com/ahrtr/etcd-defrag be the official and supported cluster defrag tool?
What would you like to be added?
I would like to see an official solution for how to defragment etcd.
AFAIK a one-line cron-job is not enough, since you should not defragment the current leader.
Related: https://github.com/etcd-io/etcd/discussions/14975
Maybe it is enough to add a simple example script to the docs.
Why is this needed?
defragmenting the leader can lead to performance degradation and should be avoided.
I don't think it makes sense that every company running etcd invents its own way to solve this.