etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcd memory rises rapidly and never declines after executing get --prefix/range #18087

Open lengbingbin opened 5 months ago

lengbingbin commented 5 months ago

Bug report criteria

What happened?

When I run the etcdctl get --prefix command in a script to simulate the business workload of range queries over a large number of KVs, etcd's memory does not decrease after the results have been returned. Every time etcdctl get --prefix is executed, the memory keeps growing and never declines.

[screenshot: etcd memory usage]

What did you expect to happen?

After executing etcdctl get --prefix, the memory of etcd should decline.

How can we reproduce it (as minimally and precisely as possible)?

Execute the etcdctl get --prefix command multiple times, and make sure the returned data is more than 20 MB.
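A minimal reproduction sketch along these lines (the endpoint, key prefix, key count, and value size are illustrative placeholders; TLS flags are omitted, and the only requirement is that the range read returns well over 20 MB):

```console
# Write roughly 25 MB under a test prefix (illustrative values).
for i in $(seq 1 25000); do
  etcdctl --endpoints=https://<endpoint>:2379 put /test/key-$i "$(head -c 1024 /dev/zero | tr '\0' 'x')"
done

# Range-read the whole prefix repeatedly and watch etcd's RSS between runs.
for run in $(seq 1 10); do
  etcdctl --endpoints=https://<endpoint>:2379 get --prefix /test/ > /dev/null
done
```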

Anything else we need to know?

When I used the pprof endpoint that comes with etcd, I found that most of the memory was consumed in the Unmarshal function.

[screenshots: pprof heap profile dominated by Unmarshal]
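For reference, with ETCD_ENABLE_PPROF=true the heap profile can be pulled directly from the client URL and summarized with go tool pprof (the address below is a placeholder, and any client TLS setup is omitted):

```console
# Fetch the in-use heap profile from etcd's pprof endpoint and print the top allocators.
$ go tool pprof -top https://<client-url>:2379/debug/pprof/heap
```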

Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.5.11
Git SHA: GitNotFound
Go Version: go1.22.1
Go OS/Arch: linux/amd64
```

Etcd configuration (command line flags or environment variables)


(managekvs is the name of this etcd deployment)

```console
export ETCD_NAME=managekvs-2
export ETCD_DATA_DIR=/opt/etcd/managekvs
export ETCD_WAL_DIR=/opt/etcd/managekvs/member/wal
export ETCD_AUTO_COMPACTION_MODE=periodic
export ETCD_AUTO_COMPACTION_RETENTION=5m
export ETCD_CONTAINER_MODE=true
export ETCD_SNAPSHOT_COUNT=10000
export ETCD_HEARTBEAT_INTERVAL=200
export ETCD_ELECTION_TIMEOUT=2000
export ETCD_QUOTA_BACKEND_BYTES=0
export ETCD_MAX_SNAPSHOTS=10
export ETCD_MAX_WALS=10
export ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
export ETCD_STRICT_RECONFIG_CHECK=true
export ETCD_ENABLE_V2=true
export ETCD_ENABLE_PPROF=true
export ETCD_PROXY=off
export ETCD_FORCE_NEW_CLUSTER=false
export ETCD_INITIAL_CLUSTER=managekvs-1=https://managekvs-1.managekvs.manager.svc.cluster.local:2480,managekvs-0=https://managekvs-0.managekvs.manager.svc.cluster.local:2480,managekvs-2=https://managekvs-2.managekvs.manager.svc.cluster.local:2480
export ETCD_INITIAL_CLUSTER_STATE=existing
export ETCD_INITIAL_ADVERTISE_PEER_URLS=https://managekvs-2.managekvs.manager.svc.cluster.local:2480
export ETCD_LISTEN_PEER_URLS=https://[172.18.137.104]:2480
export ETCD_LISTEN_CLIENT_URLS=https://[172.18.137.104]:2379
export ETCD_ADVERTISE_CLIENT_URLS=https://managekvs-2.managekvs.manager.svc.cluster.local:2379
export ETCD_LOG_LEVEL=debug
export ETCD_ENABLE_LOG_ROTATION=true
export ETCD_LOGGER=zap
export ETCD_LOG_OUTPUTS=/opt/log/textlog/logs/etcd-server.log
export ETCD_LOG_ROTATION_CONFIG_JSON='{"maxsize":10,"maxage":0,"maxbackups":2,"localtime":true,"compress":true}'
export CONNECT_LIMIT_SIZE=900
export ETCD_ENABLE_HUAWEI_CBB=false
export ETCD_EXPERIMENTAL_WARNING_APPLY_DURATION=1000ms
export ETCD_TRUSTED_CA_FILE=/opt/etcd/cert/trust.cer
export ETCD_CERT_FILE=/opt/etcd/cert/server.cer
export ETCD_KEY_FILE=/opt/etcd/cert/server_key_crypto.pem
export ETCD_CLIENT_CERT_AUTH=true
export ETCD_PEER_TRUSTED_CA_FILE=/opt/etcd/cert/trust.cer
export ETCD_PEER_CERT_FILE=/opt/etcd/cert/server.cer
export ETCD_PEER_KEY_FILE=/opt/etcd/cert/server_key_crypto.pem
export ETCD_PEER_CLIENT_CERT_AUTH=true
export ETCD_AUTO_TLS=false
export ETCD_CIPHER_SUITES=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
export ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
export ETCDCTL_CSE_VERIFY_PEER=
```

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints= endpoint status -w table
# paste output here
```

Relevant log output

No response

tjungblu commented 5 months ago

We had several customers of OpenShift run into this behavior, which is effectively caused by the Go GC. Check out this blog post: https://tip.golang.org/doc/gc-guide

I would recommend you try to tune the GOGC and GOMEMLIMIT settings and observe the memory allocation.
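For example, something along these lines in the etcd process environment (the values are only illustrative starting points, not recommendations):

```console
# Illustrative starting points only -- tune to the node's available memory.
export GOGC=75           # run the GC more often than the default of 100
export GOMEMLIMIT=1GiB   # soft memory limit for the Go runtime (Go 1.19+)
```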

vivekpatani commented 5 months ago

@tjungblu curious if you could share your findings. Also, what version of Go were you running? Thanks.

lengbingbin commented 5 months ago

> We had several customers of OpenShift run into this behavior, which is effectively caused by the Go GC. Check out this blog post: https://tip.golang.org/doc/gc-guide
>
> I would recommend you try to tune the GOGC and GOMEMLIMIT settings and observe the memory allocation.

@tjungblu Thank you for your answer. But when I set GOMEMLIMIT=800MiB or 1GiB and GOGC=80, after executing get --prefix to fetch about 500 MB of data, the memory still rises to 1.23 GB and never declines.
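One way to see whether that 1.23 GB is live heap or memory the Go runtime has already freed but not yet returned to the OS is to compare the runtime memstats that etcd exposes on its /metrics endpoint (the address below is a placeholder and client-cert flags are omitted):

```console
$ curl -sk https://<client-url>:2379/metrics | \
    grep -E 'go_memstats_(alloc|heap_inuse|heap_idle|heap_released)_bytes'
```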

lengbingbin commented 5 months ago

The Go version is 1.22.1.