Open Uzay-G opened 2 years ago
The OOM is a known issue; the workarounds, AFAIK, are:
With respect to the WAL files, it does look like an issue: the max WAL file count is 5, but there are lots of WAL files, which means the old WAL files failed to be purged. I checked the log file you attached but did not see anything useful; the likely reason is that what you attached isn't the complete log. Please try to reproduce the issue and attach the complete log if possible.
Can I just delete the WAL files?
It isn't recommended to manually delete the WAL files; otherwise the WAL files may no longer match the snap files. Please try to reproduce the issue and provide the complete log. If you are interested, please try to figure out why etcd failed to purge the WAL files automatically.
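As a quick first check before digging into the purge logic, you can count the accumulated WAL files and compare against the configured limit. This is a sketch: the `/etcd` data dir is an assumption based on the `--data-dir=/etcd` flag visible in the logs in this thread, so override `WAL_DIR` to match your deployment (e.g. run it via `docker exec` inside the container).

```shell
# Sketch: count the WAL files etcd has accumulated.
# WAL_DIR is an assumption taken from --data-dir=/etcd in the logs;
# override it to match your deployment.
WAL_DIR="${WAL_DIR:-/etcd/member/wal}"
wal_count=$(ls "$WAL_DIR"/*.wal 2>/dev/null | wc -l)
echo "WAL files in $WAL_DIR: $wal_count"
# With ETCD_MAX_WALS=5, a count much larger than 5 confirms
# that automatic purging is not keeping up.
```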
When I try getting debug info:
halcyon@espial:~/milvus$ sudo docker exec milvus-etcd etcdctl member list -w table
{"level":"warn","ts":"2022-08-27T05:34:10.842Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002f2000/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
When I try to get the logs:
{"level":"info","ts":1661578407.7117872,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"}
{"level":"info","ts":1661578407.722684,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"1000"}
{"level":"info","ts":1661578407.7227707,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ENABLE_PPROF","variable-value":"true"}
{"level":"info","ts":1661578407.7228498,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_SNAPSHOTS","variable-value":"2"}
{"level":"info","ts":1661578407.7228956,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_WALS","variable-value":"5"}
{"level":"info","ts":1661578407.7229593,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"4294967296"}
{"level":"info","ts":1661578407.7230117,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"50000"}
{"level":"info","ts":"2022-08-27T05:33:27.723Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","-advertise-client-urls=http://127.0.0.1:2379","-listen-client-urls","http://0.0.0.0:2379","--data-dir","/etcd"]}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/etcd","dir-type":"member"}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:598","msg":"pprof is enabled","path":"/debug/pprof"}
{"level":"info","ts":"2022-08-27T05:33:27.740Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"default","data-dir":"/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":50000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":4294967296,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"warn","ts":1661578407.740217,"caller":"fileutil/fileutil.go:57","msg":"check file permission","error":"directory \"/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"info","ts":"2022-08-27T05:33:30.958Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/etcd/member/snap/db","took":"3.207592575s"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":300003,"snapshot-size":"7.1 kB"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:518","msg":"recovered v3 backend from snapshot","backend-size-bytes":123691008,"backend-size":"124 MB","backend-size-in-use-bytes":68882432,"backend-size-in-use":"69 MB"}
I'm trying to get more info but etcd hogs memory and I can't even use my system properly. Does this help at all? Is there any way I can just reset it so it keeps existing data and ignores the problematic stuff?
The huge memory usage might be caused by the db file size.
What's the size of the db file, located at ${DATA_DIR}/member/snap/db? Please try to perform compaction + defragmentation per the guide: https://etcd.io/docs/v3.5/op-guide/maintenance/#space-quota
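For reference, the compaction + defragmentation from that guide looks roughly like this. It's a sketch, not a definitive procedure: the endpoint is the one from this thread, the revision is pulled out of the status JSON with a quick grep rather than a proper JSON parser, and the whole thing is guarded so it is a no-op where `etcdctl` isn't installed.

```shell
# Sketch of compaction + defragmentation, following the etcd
# maintenance guide; run inside the etcd container (e.g. via
# `docker exec -it milvus-etcd sh`).
ENDPOINT="127.0.0.1:2379"
if command -v etcdctl >/dev/null 2>&1; then
  # Grab the current revision from the endpoint status JSON.
  rev=$(etcdctl --endpoints="$ENDPOINT" endpoint status --write-out=json \
        | grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+')
  # Compact away key history older than that revision...
  etcdctl --endpoints="$ENDPOINT" compact "$rev"
  # ...then defragment so the freed space is actually released on disk.
  etcdctl --endpoints="$ENDPOINT" defrag
fi
```

Note that defragmentation blocks the server while it runs, and if client requests are already timing out (as in the `member list` attempt above), these commands may hit the same deadline.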
I just added more logging to help debug why etcd fails to purge WAL files.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
What happened?
Whenever I run my etcd container, its memory usage slowly climbs to the Docker limit on my machine (~7.6 GB) and then it OOMs. I've investigated and believe this is because etcd has created a large number of WAL files, well above the configured limit (ETCD_MAX_WALS=5), and it breaks down when it has to process them.
I've attached etcd docker logs.
etcdlogs.txt
What did you expect to happen?
etcd would run normally.
How can we reproduce it (as minimally and precisely as possible)?
I had inserted lots of data, and now when I launch etcd it seems it simply cannot handle the saved .wal files.
Anything else we need to know?
ls -has on the etcd wal directory:
Etcd version (please run commands below)
etcd v3.5.0 from docker
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
I cannot run these commands because etcd hangs and crashes.
Relevant log output
No response