etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Reduce memory usage of etcd member catchup mechanism #17098

Open serathius opened 9 months ago

serathius commented 9 months ago

What would you like to be added?

All requests made to etcd are serialized into raft entry protos and persisted to the WAL on disk. That's good, but to allow slow or disconnected members to catch up, etcd also keeps the last 10,000 entries in raft.InMemoryStorage, all loaded into memory. In some cases this can cause huge memory bloat in etcd. Imagine a sequence of large put requests (for example, 1MB configmaps in Kubernetes): etcd will keep all 10GB in memory, doing nothing.

This can be reproduced by running `./bin/tools/benchmark put --total=1000 --val-size=1000000` and collecting an inuse_space heap profile.

[screenshot: inuse_space heap profile (Selection_012)]

The mechanism is really dumb and could benefit from the following improvements:

Why is this needed?

Prevent etcd memory bloating and make memory usage more predictable.

serathius commented 9 months ago

cc @ahrtr

moficodes commented 9 months ago

I am interested in helping. But I am not sure I know exactly what needs to be done. Could I shadow someone or get some guidance if I were to attempt this? @serathius

serathius commented 9 months ago

There are 4 different changes proposed. Looking at the first one, "Compact raft.InMemoryStorage after every apply instead of once every snapshot", at a high level what we would want to do is:
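For illustration only, here is a minimal Go sketch of what per-apply compaction could look like. It uses raft's concrete `MemoryStorage` type (the issue calls it `raft.InMemoryStorage`) and a hypothetical `keep` parameter standing in for `SnapshotCatchUpEntries`; this is not etcd's actual implementation.

```go
package sketch

import "go.etcd.io/raft/v3"

// compactAfterApply is a hypothetical helper: called after every apply, it
// drops all in-memory raft entries except the last `keep` ones, instead of
// waiting for the next snapshot to trigger compaction.
func compactAfterApply(rs *raft.MemoryStorage, appliedi, keep uint64) {
	if appliedi <= keep {
		return // not enough applied entries yet to compact anything
	}
	compacti := appliedi - keep // retain `keep` entries for slow followers
	if err := rs.Compact(compacti); err != nil && err != raft.ErrCompacted {
		// ErrCompacted only means this prefix was already compacted away.
		panic(err)
	}
}
```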

tangwz commented 8 months ago

/assign

tangwz commented 8 months ago

> There are 4 different changes proposed. Looking at the first one, "Compact raft.InMemoryStorage after every apply instead of once every snapshot", at a high level what we would want to do is:

@serathius this `snapi` is already `etcdProgress.appliedi` from `triggerSnapshot(ep *etcdProgress)`:

https://github.com/etcd-io/etcd/blob/main/server/etcdserver/server.go#L1168-L1185

serathius commented 5 months ago

@tangwz Are you still planning to work on this?

yipal commented 5 months ago

Hi @serathius, I could give this a shot, but I would like to understand the proposed changes a little better. Are all 4 changes necessary to address the problem? It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower (see https://github.com/etcd-io/etcd/blob/557e7f09df7f71e586e64b031043d57246842138/server/etcdserver/server.go#L1191). The code appears to already compact the log right after snapshotting, as you pasted above.

It sounds like change 2 by itself would address the problem in the common case where followers are able to keep up.

Together, changes 3 and 4 sound like they would also address the problem in a different way. If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind?

serathius commented 5 months ago

> Are all 4 changes necessary to address the problem?

I don't have full context since some time has passed since I created the issue, but I still think we need all the changes: the first to make compaction more frequent, the second to improve cases where all members are healthy, the third to pick the best memory/time-to-recovery tradeoff, and the fourth to better handle cases where one member is fully down. Feel free to add your own suggestions or ask more questions. The best way to reach me is on K8s slack: https://github.com/etcd-io/etcd?tab=readme-ov-file#contact

> It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower.

The first change is not about having it lower, it's about making InMemoryStorage compaction independent from snapshots.

> If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind?

Yes
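For context, this fallback can be demonstrated directly against the raft library. A minimal sketch, assuming the standalone `go.etcd.io/raft/v3` module: once entries are compacted out of `MemoryStorage`, reading them back returns `raft.ErrCompacted`, which is what forces the leader to send a snapshot (MsgSnap) instead of entries.

```go
package main

import (
	"errors"
	"fmt"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()
	ms.Append([]raftpb.Entry{
		{Index: 1, Term: 1}, {Index: 2, Term: 1}, {Index: 3, Term: 1},
	})
	ms.Compact(2) // drop entries up to and including index 2

	// A follower that still needs entry 1 can no longer be served from the
	// in-memory log; the leader has to fall back to sending a snapshot.
	_, err := ms.Entries(1, 4, 1<<20)
	fmt.Println(errors.Is(err, raft.ErrCompacted)) // true
}
```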

serathius commented 2 months ago

/assign @clement2026

k8s-ci-robot commented 2 months ago

@serathius: GitHub didn't allow me to assign the following users: clement2026.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/etcd-io/etcd/issues/17098#issuecomment-2208814210):

> /assign @clement2026

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
clement2026 commented 3 weeks ago

> /assign @clement2026

Hey @serathius, I'm still on it and will be for a bit. Could you reassign this to me? Thanks!

clement2026 commented 2 weeks ago

/assign

clement2026 commented 2 weeks ago

I can assign myself. Awesome😎

clement2026 commented 6 days ago

#18382 raises the issue of high memory usage related to the etcd member catch-up mechanism. I’ve been working on it for a while and have some findings to share.

Experiments

I’ve run a few experiments to observe the heap size of an etcd instance. Below is a table I put together from my observations, showing how the heap size changes when benchmarking etcd.

| putSize | --snapshot-count | --experimental-snapshot-catchup-entries | heap size v3.5.16 | heap size v3.6.0-alpha.0 |
|---|---|---|---|---|
| 1 KB | 10,000 | 5,000 | 6 MB ~ 28 MB | 13 MB ~ 31.7 MB |
| 10 KB | 10,000 | 5,000 | 64 MB ~ 180 MB | |
| 100 KB | 10,000 | 5,000 | 569 MB ~ 1.5 GB | 536 MB ~ 1.62 GB |
| 1 MB | 10,000 | 5,000 | 5 GB ~ 14.2 GB | |
| 1 KB | 100,000 | 5,000 | 15 MB ~ 143 MB | 15 MB ~ 152 MB |
| 10 KB | 100,000 | 5,000 | 67 MB ~ 1.1 GB | |
| 100 KB | 100,000 | 5,000 | 900 MB ~ 10.6 GB | 690 MB ~ 10.4 GB |
| 1 MB | 500 | 500 | 550 MB ~ 1 GB | |

Both v3.5 and v3.6 use 5000 as the default value for --experimental-snapshot-catchup-entries; however, the default value for --snapshot-count is set much lower in v3.6 at 10,000, compared to 100,000 in v3.5.

v3.5.16 f20bbadd4

**putSize 1 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --snapshot-count=10000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000

# Monitor heap size using live-pprof (https://github.com/moderato-app/live-pprof)
live-pprof 2379
```

[screenshot: 1K-10K-5000]

**putSize 10 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=10000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=10000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 10K-10K-5000]

**putSize 100 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=10000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 100K-10K-5000]

**putSize 1 MB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=10000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 1000K-10K-5000]

**putSize 1 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --snapshot-count=100000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 1K-100K-5000]

**putSize 10 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=100000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=10000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 10K-100K-5000]

**putSize 100 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=100000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 100K-100K-5000]

**putSize 1 MB, snapshot-count 500, experimental-snapshot-catchup-entries 500**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=500 \
  --experimental-snapshot-catchup-entries=500

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 1000K-500-500]

v3.6.0-alpha.0 981061a49

**putSize 1 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --snapshot-count=10000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 1K-10K-5000]

**putSize 100 KB, snapshot-count 10,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=10000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 100K-10K-5000]

**putSize 1 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --snapshot-count=100000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=1000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 1K-100K-5000]

**putSize 100 KB, snapshot-count 100,000, experimental-snapshot-catchup-entries 5,000**

```bash
# Run etcd
rm -rf tmp.etcd; etcd --data-dir tmp.etcd \
  --enable-pprof=true \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=5s \
  --snapshot-count=100000 \
  --experimental-snapshot-catchup-entries=5000

# Benchmark
./bin/tools/benchmark txn-mixed --total=99999999999 --val-size=100000

# Monitor heap size using live-pprof
live-pprof 2379
```

[screenshot: 100K-100K-5000]

How to estimate the heap size of etcd

The etcd member catch-up mechanism maintains a list of entries to keep the leader and followers in sync. When etcd receives a put request, it appends the request data to the entries. These entries significantly impact etcd’s heap size.

As put requests keep appending to the entries, --snapshot-count and --experimental-snapshot-catchup-entries control when and how to shrink/compact the entries.
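To make that interplay concrete, here is a minimal Go sketch of the snapshot-triggered compaction, modeled loosely on etcdserver's `triggerSnapshot` linked earlier in the thread; the constants and the helper name are illustrative, not etcd's actual code.

```go
package sketch

import "go.etcd.io/raft/v3"

// Illustrative values: the v3.6 defaults mentioned below.
const (
	snapshotCount          uint64 = 10000 // --snapshot-count
	snapshotCatchUpEntries uint64 = 5000  // --experimental-snapshot-catchup-entries
)

// maybeSnapshotAndCompact returns the new snapshot index. Entries accumulate
// in memory until snapshot-count entries have been applied since the last
// snapshot; only then are all but the last snapshot-catchup-entries entries
// compacted away.
func maybeSnapshotAndCompact(appliedi, snapi uint64, rs *raft.MemoryStorage) uint64 {
	if appliedi-snapi <= snapshotCount {
		return snapi // keep accumulating; nothing is freed yet
	}
	if appliedi > snapshotCatchUpEntries {
		compacti := appliedi - snapshotCatchUpEntries
		if err := rs.Compact(compacti); err != nil && err != raft.ErrCompacted {
			panic(err)
		}
	}
	return appliedi // snapi advances to appliedi
}
```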

Once we know the average size of put requests (let’s call it putSize), we can estimate the heap size of these entries. It ranges from:

`experimental-snapshot-catchup-entries * putSize`

to:

`(experimental-snapshot-catchup-entries + snapshot-count) * putSize`

The heap size of these entries, plus some overhead, is roughly the heap size and RSS of etcd.
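For example, take the 100 KB row of the table above (snapshot-count 10,000, catchup entries 5,000): the estimate ranges from 5,000 × 100 KB = 500 MB up to (5,000 + 10,000) × 100 KB = 1.5 GB, which lines up with the observed 569 MB ~ 1.5 GB.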

With this in mind, we can try to answer some questions.


Q1: Do I need to worry about the heap size of etcd?

If putSize is small, like 1KB, the heap size should be under 200 MB for v3.5 and under 50 MB for v3.6. With such low memory usage, there is no need to manually set --snapshot-count and --experimental-snapshot-catchup-entries. The default settings work fine.

If putSize is big, you can estimate the heap size of etcd according to the table and calculations we discussed earlier. You can also set custom values for --snapshot-count and --experimental-snapshot-catchup-entries to control the heap size.

Q2: Is it okay to set a really low value for --snapshot-count?

Setting a low value for --snapshot-count makes etcd create snapshots more often. This can cause CPU spikes and isn't ideal for latency-sensitive situations. Here’s an example of the spikes:

[screenshot: CPU spikes during frequent snapshots (1000K-10K-5000)]

Q3: Is it okay to set a really low value for --experimental-snapshot-catchup-entries?

If --experimental-snapshot-catchup-entries is set too low, slow followers might need to use snapshots to catch up with the leader. This is less efficient and puts more pressure on the leader compared to just using the entries. This often occurs when the network connection between the leader and followers is bad.

However, it’s fine to set --experimental-snapshot-catchup-entries to as low as 1 if you only have a single instance of etcd.


The analysis above focuses solely on the heap size of etcd. It doesn’t include memory allocated through mmap (used by bbolt) and cgo. To determine the total physical memory requirements for etcd, memory allocated through mmap must also be taken into account.

ahrtr commented 6 days ago

Thank you @clement2026 for the analysis, which makes sense. Great work!

A couple of thoughts/points:

In the long run, we don't actually need the v2 snapshot since it only contains membership data. However, removing it would have a significant impact, so let's hold off until we've thoroughly discussed and understood the implications to ensure we're confident in the decision.

clement2026 commented 5 days ago

@ahrtr Thanks for sharing your thoughts, it’s really helpful!

```go
// Status contains information about this Raft peer and its view of the system.
// The Progress is only populated on the leader.
type Status struct {
	BasicStatus
	Config   tracker.Config
	Progress map[uint64]tracker.Progress
}
```

I checked out Node.Status() and noticed that Progress is only there for the leader. For followers, we can also compact all entries prior to the appliedIndex. It might lead to issues if a follower becomes the leader, but the existing tests should reveal any risks.
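For illustration, a rough Go sketch of that idea using the `Status` fields above; the helper and its wiring are assumptions, not a proposed patch:

```go
package sketch

import "go.etcd.io/raft/v3"

// safeCompactIndex picks how far the in-memory raft log could be compacted.
// On the leader, Progress is populated, so we must keep anything a follower
// may still need (the smallest Match among peers). On a follower, Progress
// is empty, so everything up to its own applied index is fair game.
func safeCompactIndex(st raft.Status, appliedIndex uint64) uint64 {
	compacti := appliedIndex
	for _, pr := range st.Progress { // empty map on followers
		if pr.Match < compacti {
			compacti = pr.Match
		}
	}
	return compacti
}
```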

> For --snapshot-count, it's 10K in 3.6 and 100K in 3.5. Probably we can change it to 10K in 3.5 as well. It's open to any discussion.

I can start a PR to identify any risks and discuss it further.

ahrtr commented 2 days ago

https://github.com/etcd-io/etcd/pull/18588#issuecomment-2352831211

clement2026 commented 8 hours ago

I ran some benchmarks for PR #18589, which changes DefaultSnapshotCount from 100,000 to 10,000 in etcd v3.5, and they show higher throughput.

The results should be reliable, as I ran the benchmark twice on the release-3.5 branch and rebooted the OS between each run.

**release-3.5 vs PR**
![release-3 5 vs backport](https://github.com/user-attachments/assets/5f27261c-7c4d-44f0-a57b-05949adb51d3)

**release-3.5 vs release-3.5**
![release-3 5 vs release-3 5](https://github.com/user-attachments/assets/6cc0a368-2d48-4653-a33d-ce5cc32d4226)

etcd-benchmark-20240917-07-58-13.zip

What causes the throughput increase?

I analyzed the pprof profile data. It appears that mvcc.(*keyIndex).get is the main factor. I’m still trying to understand how this relates to DefaultSnapshotCount.

**release-3.5 (as base) vs PR**

```bash
go tool pprof -http=: -diff_base release-3.5.pb.gz pr.pb.gz
```

![release-3 5 vs release-3 5](https://github.com/user-attachments/assets/20b4efa1-8b15-4163-ae9b-3dfe72882e09)

**release-3.5 vs release-3.5**

```bash
go tool pprof -http=: -diff_base release-3.5.pb.gz release-3.5-again.pb.gz
```

![release-3 5 vs backport](https://github.com/user-attachments/assets/3cd89365-38b7-4c49-979e-bcca632573f0)

pprof profile data and benchmark script.zip

pprof profiling was run several times with different VALUE_SIZE and CONN_CLI_COUNT settings, and the results were consistent.


Based on the benchmarks from #18589 and #18459, we can see that keeping fewer raft log entries in memory leads to lower heap usage and higher throughput. I'm sharing the benchmark results here, hoping it boosts our confidence and motivation to keep pushing forward.