serathius opened 9 months ago
cc @ahrtr
I am interested in helping. But I am not sure I know exactly what needs to be done. Could I shadow someone or get some guidance if I were to attempt this? @serathius
/assign
There are 4 different changes proposed. Looking at the first one: *Compact raft.InMemoryStorage after every apply instead of once every snapshot.* High level, what we would want to do is:
- Move the RaftLog compaction logic https://github.com/etcd-io/etcd/blob/dfdffe48f9e7c622f1b863d013754c2a824f6d35/server/etcdserver/server.go#L2097-L2125 to a separate function `compactRaftLog` (a hedged sketch follows after this list)
- Use `etcdProgress.appliedi` instead of `snapi` to decide `compacti`
- Call the function from applyAll https://github.com/etcd-io/etcd/blob/dfdffe48f9e7c622f1b863d013754c2a824f6d35/server/etcdserver/server.go#L900-L920 after `triggerSnapshot`
- Benchmark the result to confirm a performance improvement
- Possibly limit the calls to `compactRaftLog` to once per 100 or 1000 entries to avoid copying the raft log too frequently
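A minimal sketch of what such a `compactRaftLog` helper might look like, modeled on the existing compaction logic linked above. It assumes the current `EtcdServer` fields (`s.Cfg.SnapshotCatchUpEntries`, `s.r.raftStorage`, `s.lg`) and the `go.etcd.io/raft/v3` and `go.uber.org/zap` packages; treat it as a starting point, not the final implementation:

```go
// compactRaftLog compacts raft.InMemoryStorage up to the applied index,
// keeping SnapshotCatchUpEntries entries so slow followers can still
// catch up from the in-memory log instead of needing a full snapshot.
func (s *EtcdServer) compactRaftLog(appliedi uint64) {
	if appliedi <= s.Cfg.SnapshotCatchUpEntries {
		return
	}
	compacti := appliedi - s.Cfg.SnapshotCatchUpEntries
	if err := s.r.raftStorage.Compact(compacti); err != nil {
		// Compaction runs asynchronously with raft's own progress, so the
		// log may already be compacted past this index; that is not an error.
		if err == raft.ErrCompacted {
			return
		}
		s.lg.Panic("failed to compact raft log", zap.Error(err))
	}
}
```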
@serathius this `snapi` is already `etcdProgress.appliedi` from `triggerSnapshot(ep *etcdProgress)`:
https://github.com/etcd-io/etcd/blob/main/server/etcdserver/server.go#L1168-L1185
@tangwz Are you still planning to work on this?
Hi @serathius , I could give this a shot, but I would like to understand the proposed changes a little better. Are all 4 changes necessary to address the problem? It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower (see https://github.com/etcd-io/etcd/blob/557e7f09df7f71e586e64b031043d57246842138/server/etcdserver/server.go#L1191). The code appears to already compact the log right after snapshotting, as you pasted above.
It sounds like change 2 by itself would address the problem in the common case where followers are able to keep up.
Together, changes 3 and 4 sound like they would also address the problem in a different way. If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind?
> Are all 4 changes necessary to address the problem?
I don't have full context as some time has passed since I created the issue, but I think we need all the changes: the first to make compaction more frequent, the second to improve cases where all members are healthy, the third to pick the best memory/time-to-recovery tradeoff, and the fourth to better handle cases where one member is fully down. Feel free to add your own suggestions or ask more questions. The best way to reach me is on K8s Slack: https://github.com/etcd-io/etcd?tab=readme-ov-file#contact
> It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower.
The first change is not about having it lower; it's about making InMemoryStorage compaction independent from snapshots.
> If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind?
Yes
/assign @clement2026
@serathius: GitHub didn't allow me to assign the following users: clement2026.
Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
/assign @clement2026
Hey @serathius, I'm still on it and will be for a bit. Could you reassign this to me? Thanks!
/assign
I can assign myself. Awesome😎
I’ve run a few experiments to observe the heap size of an etcd instance. Below is a table I put together from my observations, showing how the heap size changes when benchmarking etcd.
`putSize`: average size of `put` requests

| putSize | --snapshot-count | --experimental-snapshot-catchup-entries | heap size v3.5.16 | heap size v3.6.0-alpha.0 |
|---|---|---|---|---|
| 1 KB | 10000 | 5000 | 6 MB ~ 28 MB | 13 MB ~ 31.7 MB |
| 10 KB | 10000 | 5000 | 64 MB ~ 180 MB | |
| 100 KB | 10000 | 5000 | 569 MB ~ 1.5 GB | 536 MB ~ 1.62 GB |
| 1 MB | 10000 | 5000 | 5 GB ~ 14.2 GB | |
| 1 KB | 100000 | 5000 | 15 MB ~ 143 MB | 15 MB ~ 152 MB |
| 10 KB | 100000 | 5000 | 67 MB ~ 1.1 GB | |
| 100 KB | 100000 | 5000 | 900 MB ~ 10.6 GB | 690 MB ~ 10.4 GB |
| 1 MB | 500 | 500 | 550 MB ~ 1 GB | |
Both v3.5 and v3.6 use 5000 as the default value for `--experimental-snapshot-catchup-entries`; however, the default value for `--snapshot-count` is set much lower in v3.6 at 10,000, compared to 100,000 in v3.5.
The etcd member catch-up mechanism maintains a list of entries to keep the leader and followers in sync. When etcd receives a put request, it appends the request data to the entries. These entries significantly impact etcd’s heap size.
As `put` requests keep appending to the entries, `--snapshot-count` and `--experimental-snapshot-catchup-entries` control when and how the entries are shrunk/compacted.
Once we know the average size of `put` requests (let's call it `putSize`), we can estimate the heap size of these entries. It ranges from `experimental-snapshot-catchup-entries * putSize` to `(experimental-snapshot-catchup-entries + snapshot-count) * putSize`.
The heap size of these entries, plus some overhead, is roughly the heap size and RSS of etcd.
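As a quick sanity check of this estimate, here is a small self-contained Go snippet; the 100 KB `putSize` and v3.5 defaults are just one example row from the table:

```go
package main

import "fmt"

func main() {
	const (
		putSize        = 100 * 1024 // 100 KB per put request
		snapshotCount  = 100_000    // --snapshot-count (v3.5 default)
		catchupEntries = 5_000      // --experimental-snapshot-catchup-entries
	)
	low := catchupEntries * putSize
	high := (catchupEntries + snapshotCount) * putSize
	fmt.Printf("entries heap: %.0f MB ~ %.1f GB\n",
		float64(low)/(1<<20), float64(high)/(1<<30))
	// Prints "entries heap: 488 MB ~ 10.0 GB", which lines up with the
	// 900 MB ~ 10.6 GB observed for the 100 KB row above once overhead
	// is added.
}
```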
With this in mind, we can try to answer some questions.
If `putSize` is small, like 1 KB, the heap size should be under 200 MB for v3.5 and under 50 MB for v3.6. With such low memory usage, there is no need to manually set `--snapshot-count` and `--experimental-snapshot-catchup-entries`; the default settings work fine.
If `putSize` is big, you can estimate the heap size of etcd according to the table and calculations we discussed earlier. You can also set custom values for `--snapshot-count` and `--experimental-snapshot-catchup-entries` to control the heap size.
What if we set a low `--snapshot-count`?

Setting a low value for `--snapshot-count` makes etcd create snapshots more often. This can cause CPU spikes and isn't ideal for latency-sensitive situations. Here's an example of the spikes:
What if we set a low `--experimental-snapshot-catchup-entries`?

If `--experimental-snapshot-catchup-entries` is set too low, slow followers might need to use snapshots to catch up with the leader. This is less efficient and puts more pressure on the leader compared to just using the entries. This often occurs when the network connection between the leader and followers is bad.

However, it's fine to set `--experimental-snapshot-catchup-entries` to as low as `1` if you only have a single instance of etcd.
The analysis above focuses solely on the heap size of etcd. It doesn't include memory allocated through `mmap` (used by bbolt) and `cgo`. To determine the total physical memory requirements for etcd, memory allocated through `mmap` must also be taken into account.
Thank you @clement2026 for the analysis, which makes sense. Great work!
A couple of thoughts/points:

- We may not need `--experimental-snapshot-catchup-entries` at all. We can just compact all entries prior to the appliedIndex: `min('--experimental-snapshot-catchup-entries', smallest_member_appliedIndex)`. (A hedged sketch of this compaction-index rule follows below.)
- For `--snapshot-count`, it's 10K in 3.6 and 100K in 3.5. Probably we can change it to 10K in 3.5 as well. It's open to any discussion.

In the long run, we don't actually need the v2 snapshot since it only contains membership data. However, removing it would have a significant impact, so let's hold off until we've thoroughly discussed and understood the implications to ensure we're confident in the decision.
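One way to read the first point as code (a sketch only; `smallestMemberAppliedIndex` is a hypothetical input here, not an existing etcd field):

```go
// compactIndex picks how far the in-memory raft log can be compacted:
// keep at most catchUpEntries behind the local applied index, but never
// compact past what the slowest member has applied.
func compactIndex(appliedIndex, catchUpEntries, smallestMemberAppliedIndex uint64) uint64 {
	compacti := uint64(0)
	if appliedIndex > catchUpEntries {
		compacti = appliedIndex - catchUpEntries
	}
	if smallestMemberAppliedIndex < compacti {
		compacti = smallestMemberAppliedIndex
	}
	return compacti
}
```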
@ahrtr Thanks for sharing your thoughts, it’s really helpful!
> You can get the member's progress using `Node.Status()`:

```go
// Status contains information about this Raft peer and its view of the system.
// The Progress is only populated on the leader.
type Status struct {
	BasicStatus
	Config   tracker.Config
	Progress map[uint64]tracker.Progress
}
```
I checked out `Node.Status()` and noticed that `Progress` is only populated on the leader. For followers, we can also compact all entries prior to the appliedIndex. It might lead to issues if a follower becomes the leader, but the existing tests should reveal any risks.
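For the leader-side computation, a hedged sketch of deriving the slowest member's match index from `raft.Status` (the `Progress` and `Match` fields exist in `go.etcd.io/raft/v3`; the helper itself is illustrative):

```go
import (
	"math"

	"go.etcd.io/raft/v3"
)

// smallestMemberMatchIndex returns the lowest Match index across all
// members, or false if Progress is empty (e.g. on a follower).
func smallestMemberMatchIndex(st raft.Status) (uint64, bool) {
	if len(st.Progress) == 0 {
		return 0, false
	}
	min := uint64(math.MaxUint64)
	for _, pr := range st.Progress {
		if pr.Match < min {
			min = pr.Match
		}
	}
	return min, true
}
```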
> For `--snapshot-count`, it's 10K in 3.6 and 100K in 3.5. Probably we can change it to 10K in 3.5 as well. It's open to any discussion.
I can start a PR to identify any risks and discuss it further.
I ran some benchmarks for PR #18589, which changes `DefaultSnapshotCount` from 100,000 to 10,000 in etcd v3.5, and they show higher throughput.
The results should be reliable, as I ran the benchmark twice on the `release-3.5` branch and rebooted the OS between each run.
etcd-benchmark-20240917-07-58-13.zip
I analyzed the pprof profile data. It appears that `mvcc.(*keyIndex).get` is the main factor. I'm still trying to understand how this relates to `DefaultSnapshotCount`.
pprof profile data and benchmark script.zip
pprof profiling was run several times with different `VALUE_SIZE` and `CONN_CLI_COUNT` settings, and the results were consistent.
Based on the benchmarks from #18589 and #18459, we can see that smaller raft log entries lead to lower heap usage and higher throughput. I'm sharing the benchmark results here, hoping it boosts our confidence and motivation to keep pushing forward.
What would you like to be added?
All requests made to etcd are serialized into raft entry protos and persisted in the WAL on disk. That's good, but to allow slow/disconnected members to catch up, etcd also stores the last 10,000 entries in raft.InMemoryStorage, all loaded into memory. In some cases this can cause huge memory bloat of etcd. Imagine you have a sequence of large put requests (for example 1 MB configmaps in Kubernetes): etcd will keep all 10 GB in memory, doing nothing.
This can be reproduced by running `./bin/tools/benchmark put --total=1000 --val-size=1000000` and collecting an inuse_space heap profile (see the note below). The mechanism is really dumb and could benefit from the following improvements:
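For reference, assuming etcd was started with `--enable-pprof` on the default client URL, the inuse_space heap profile can be collected with `go tool pprof -sample_index=inuse_space http://localhost:2379/debug/pprof/heap`.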
Why is this needed?
Prevent etcd memory bloating and make memory usage more predictable.