vitalif opened 1 month ago
Looks like https://github.com/etcd-io/etcd/issues/17098 to me, but I'd like to confirm. Could you attach the generated pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.*.pb.gz file?
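For reference, one way to produce that file is via etcd's pprof endpoint. This is a sketch assuming etcd was started with --enable-pprof=true and serves clients on http://localhost:2379 (adjust the endpoint to your setup):

```shell
# Fetch the in-use heap profile from the running etcd; `go tool pprof`
# saves the downloaded profile as
# ~/pprof/pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.NNN.pb.gz
go tool pprof -inuse_space http://localhost:2379/debug/pprof/heap
```

The saved .pb.gz file under ~/pprof/ is the one to attach.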
The database size is < 2 MB (`etcdctl get --write-out=json --prefix /` gives me a 1966139-byte dump). The database takes 1.5 GB on disk (`/var/lib/etcd0.etcd`).
Please also perform a compaction + defragmentation, retry, and let us know the result. Thanks!
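For completeness, a compaction + defragmentation can be done with etcdctl roughly like this (a sketch assuming a single local endpoint; adjust --endpoints to your deployment):

```shell
ENDPOINT=http://localhost:2379   # assumed local etcd endpoint

# Compact the key-value history up to the current revision
rev=$(etcdctl --endpoints="$ENDPOINT" endpoint status --write-out=json \
      | grep -o '"revision":[0-9]*' | grep -o '[0-9]*$')
etcdctl --endpoints="$ENDPOINT" compact "$rev"

# Defragment to reclaim the freed pages and shrink the db file on disk
etcdctl --endpoints="$ENDPOINT" defrag
```

Note that defragmentation blocks reads and writes on the member while it runs, so on a multi-member cluster it is usually done one member at a time.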
@ahrtr The topic starter is concerned about RAM consumption, not disk usage. If they are somehow connected, then yes, it might help, but disk usage itself is not the issue. Note that RAM usage is about 10 times bigger than the disk usage.
> If they are connected somehow

They are related: bbolt mmaps the file into memory directly.
Here's the profile pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
The percentage of PUT requests was, I think, close to 100% )). There were puts + 6 watchers, that's all. The RPC rate graph shows 0.85 rps and 2.5 MB/s of traffic.
Memory usage actually dropped the next day along with update traffic, when I restarted my software and it stopped updating... something. I think it was old statistics, but I'm not sure. 2.5 MB/s is quite a lot given that the total amount of data in etcd is 2 MB :-). My software updates statistics every 5 seconds in that installation, so 2.5 MB/s is 12.5 MB every 5 seconds, still more than 2 MB. So I'm not sure what it was. Maybe these were some unapplied compare failures and retries... I'll monitor this instance for more time.
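The traffic math above can be sanity-checked with a one-liner (figures are the ones quoted in this thread; they are observations, not guarantees):

```shell
# How much data moves per update interval vs. the total logical dataset
awk 'BEGIN {
  rate_mb_s  = 2.5   # observed client traffic, MB/s
  interval_s = 5     # stats update interval of the client software, seconds
  dataset_mb = 2     # total logical data stored in etcd, MB
  per_update = rate_mb_s * interval_s
  printf "per 5s update: %.1f MB vs %d MB dataset\n", per_update, dataset_mb
}'
# prints: per 5s update: 12.5 MB vs 2 MB dataset
```

So each 5-second interval moves roughly 6x the entire dataset, which is why the traffic looks suspicious.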
Here's the screenshot from grafana:
@ahrtr Okay, but I'm not an expert in Go. Is there any memory mapper? I mean, a tool that shows what each memory block is used for.
@socketpair I get that you’re concerned about memory size. You might want to try setting the --experimental-snapshot-catchup-entries flag. The default value is 5000, but you can lower it to, say, 500, which should bring your memory usage down to around 3 GB.
The downside of setting a low value for --experimental-snapshot-catchup-entries is that slow followers might need to catch up with the leader using a snapshot, which is less efficient. But if you’re running a single instance of etcd, you can even set it to 1 and keep your memory usage under 2 GB.
Based on your posts, it seems the average size of a put request is around 2 MB. Your etcd instance’s heap size is roughly between (--experimental-snapshot-catchup-entries) * 2MB and (--experimental-snapshot-catchup-entries + --snapshot-count) * 2MB. In your case, that’s about 10 GB to 12 GB.
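Plugging in the numbers from this thread (~2 MB per put, catchup entries 5000, snapshot count 1000 as in the test setup below), the rule of thumb can be checked with quick shell arithmetic:

```shell
# Rough heap-size estimate; all figures come from this thread and
# are illustrative, not measured limits.
VAL_SIZE_MB=2     # average put request size
CATCHUP=5000      # --experimental-snapshot-catchup-entries
SNAP_COUNT=1000   # --snapshot-count
echo "lower bound: $(( CATCHUP * VAL_SIZE_MB / 1000 )) GB"
echo "upper bound: $(( (CATCHUP + SNAP_COUNT) * VAL_SIZE_MB / 1000 )) GB"
# prints: lower bound: 10 GB
#         upper bound: 12 GB
```

Lowering --experimental-snapshot-catchup-entries shrinks both bounds proportionally, which matches the 3 GB and 2 GB figures mentioned above.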
I’m sharing some tests here for your reference.
--experimental-snapshot-catchup-entries 5000
Setup:
rm -rf etcd0.etcd; etcd -name etcd0 \
--data-dir etcd0.etcd \
--enable-pprof=true \
--snapshot-count=1000 \
--advertise-client-urls http://localhost:2379 \
--listen-client-urls http://localhost:2379 \
--initial-advertise-peer-urls http://localhost:2380 \
--listen-peer-urls http://localhost:2380 \
--initial-cluster-token vitastor-etcd-1 \
--initial-cluster etcd0=http://localhost:2380 \
--initial-cluster-state new \
--max-txn-ops=100000 \
--max-request-bytes=104857600 \
--auto-compaction-mode=periodic \
--auto-compaction-retention=5s \
--experimental-snapshot-catchup-entries 5000
Benchmark script:
./bin/tools/benchmark txn-mixed --total=50000000 --val-size=2000000 --key-space-size=1 --rw-ratio=1
The setup and benchmark script are the same, except --experimental-snapshot-catchup-entries 500.
--experimental-snapshot-catchup-entries 1
Hope it helps.
Awesome analysis, @clement2026; it makes me very excited about your work in https://github.com/etcd-io/etcd/issues/17098. It would be awesome to paste those findings there too.
I’m glad you found the analysis helpful. I’ll include these findings in #17098.
Bug report criteria
What happened?
My etcd 3.5.8 ate 12 GB of RAM... The database size is < 2 MB (`etcdctl get --write-out=json --prefix /` gives me a 1966139-byte dump). The database takes 1.5 GB on disk (`/var/lib/etcd0.etcd`). There are 6 active streams. Client traffic shows 2.2 MB/s in and 2.5 MB/s out on the monitoring dashboard. There are 128+5 keys which are bumped continuously through leases. There are another 139 keys which are updated every 30 seconds. As I've already experienced such memory usage with this etcd, it was started with --enable-pprof=true; pprof shows the following:
Is it a memory leak or what?...
What did you expect to happen?
I expected etcd to consume at most 1 GB of memory with a 2 MB database.
How can we reproduce it (as minimally and precisely as possible)?
Install Vitastor on 1 node. :-) but I'm not sure it will be reproduced easily.
Anything else we need to know?
It's still running, I can collect other information or connect with debugger if you tell me what I should collect. :-)
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response