etcd memory usage spikes to unsustainable levels and OOMs

Uzay-G commented 2 years ago

What happened?

Whenever I run my etcd container, its memory usage slowly goes up to meet the docker limit on my machine (~7.6 GB) and then it OOMs. I've investigated and believe this is because it has created a large amount of wal files, above the limit (5), and when it has to process these it breaks down.

I've attached etcd docker logs.

etcdlogs.txt

What did you expect to happen?

etcd would run normally.

How can we reproduce it (as minimally and precisely as possible)?

I had inserted lots of data and now when I launch etcd it seems it can simply not handle the saved .wal files.

Anything else we need to know?

ls -has on the etcd wal directory:

total 4.1G
4.0K .                       62M 000000000000007b-0000000000050618.wal   62M 0000000000000092-0000000000056f14.wal
4.0K ..                      62M 000000000000007c-0000000000050b41.wal   62M 0000000000000093-00000000000573a7.wal
 62M 0.tmp                   62M 000000000000007d-0000000000051024.wal   62M 0000000000000094-0000000000057988.wal
 62M 0000000000000067-0000000000049181.wal   62M 000000000000007e-000000000005160d.wal   62M 0000000000000095-0000000000057d93.wal
 62M 0000000000000068-00000000000497f2.wal   62M 000000000000007f-0000000000051baf.wal   62M 0000000000000096-0000000000058198.wal
 62M 0000000000000069-0000000000049deb.wal   62M 0000000000000080-00000000000520fa.wal   62M 0000000000000097-00000000000585b8.wal
 62M 000000000000006a-000000000004a37d.wal   62M 0000000000000081-0000000000052873.wal   62M 0000000000000098-00000000000589c5.wal
 62M 000000000000006b-000000000004aa3e.wal   62M 0000000000000082-0000000000052e7d.wal   62M 0000000000000099-0000000000058d89.wal
 62M 000000000000006c-000000000004af7c.wal   62M 0000000000000083-000000000005333f.wal   62M 000000000000009a-000000000005913b.wal
 62M 000000000000006d-000000000004b58c.wal   62M 0000000000000084-000000000005380b.wal   62M 000000000000009b-000000000005948f.wal
 62M 000000000000006e-000000000004bae9.wal   62M 0000000000000085-0000000000053c20.wal   62M 000000000000009c-0000000000059815.wal
 62M 000000000000006f-000000000004c0d3.wal   62M 0000000000000086-0000000000054004.wal   62M 000000000000009d-0000000000059c02.wal
 62M 0000000000000070-000000000004c6af.wal   62M 0000000000000087-00000000000543ef.wal   62M 000000000000009e-000000000005a010.wal
 62M 0000000000000071-000000000004ccb6.wal   62M 0000000000000088-00000000000547fe.wal   62M 000000000000009f-000000000005a3c1.wal
 62M 0000000000000072-000000000004d2e7.wal   62M 0000000000000089-0000000000054b58.wal   62M 00000000000000a0-000000000005a7a5.wal
 62M 0000000000000073-000000000004d82a.wal   62M 000000000000008a-0000000000054f52.wal   62M 00000000000000a1-000000000005ab92.wal
 62M 0000000000000074-000000000004de82.wal   62M 000000000000008b-00000000000552a1.wal   62M 00000000000000a2-000000000005b000.wal
 62M 0000000000000075-000000000004e429.wal   62M 000000000000008c-00000000000555ed.wal   62M 00000000000000a3-000000000005b37c.wal
 62M 0000000000000076-000000000004e9d9.wal   62M 000000000000008d-00000000000559a6.wal   62M 00000000000000a4-000000000005b744.wal
 62M 0000000000000077-000000000004ef8e.wal   62M 000000000000008e-0000000000055d8f.wal   62M 00000000000000a5-000000000005bb88.wal
 62M 0000000000000078-000000000004f5db.wal   62M 000000000000008f-0000000000056252.wal   62M 00000000000000a6-000000000005bf4b.wal
 62M 0000000000000079-000000000004fb26.wal   62M 0000000000000090-000000000005666c.wal   62M 00000000000000a7-000000000005c32d.wal
 62M 000000000000007a-00000000000500bc.wal   62M 0000000000000091-0000000000056a5b.wal   62M 00000000000000a8-000000000005c6d5.wal

Etcd version (please run commands below)

etcd v3.5.0 from docker

Etcd configuration (command line flags or environment variables)

      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_ENABLE_PPROF=true
      - ETCD_SNAPSHOT_COUNT=50000
      - ETCD_MAX_SNAPSHOTS=2
      - ETCD_MAX_WALS=5

    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

I cannot run these commands because etcd hangs and crashes.

Relevant log output

No response

ahrtr commented 2 years ago

The OOM is a known issue, the workaroud AFAIK are:

perform compaction & defragmentation to reclaim the db file space;
avoid big range operation on the client side.

With respect to the WAL files, it looks like an issue because the max WAL file is 5, but there are lots of WAL files. It means that the old WAL files failed to be purged. I checked the log file you attached, but did not see anything useful, the reason should be you attachd isn't the complete log. Please try to reproduce the issue and attached the complete log if possible.

Uzay-G commented 2 years ago

Can I just delete the WAL files?

ahrtr commented 2 years ago

Can I just delete the WAL files?

It isn't recommended to manually delete the WAL files, otherwise the WAL files may not be matching the snap files. Please try to reproduce the issue and provide complete log. If you are interested, please try to figure out why etcd failed to purge the WAL files automatically.

Uzay-G commented 2 years ago

When I try getting debug info:

halcyon@espial:~/milvus$ sudo docker exec milvus-etcd etcdctl member list -w table
{"level":"warn","ts":"2022-08-27T05:34:10.842Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002f2000/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

When I try to get the logs:

{"level":"info","ts":1661578407.7117872,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"}
{"level":"info","ts":1661578407.722684,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"1000"}
{"level":"info","ts":1661578407.7227707,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ENABLE_PPROF","variable-value":"true"}
{"level":"info","ts":1661578407.7228498,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_SNAPSHOTS","variable-value":"2"}
{"level":"info","ts":1661578407.7228956,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_MAX_WALS","variable-value":"5"}
{"level":"info","ts":1661578407.7229593,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"4294967296"}
{"level":"info","ts":1661578407.7230117,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"50000"}
{"level":"info","ts":"2022-08-27T05:33:27.723Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","-advertise-client-urls=http://127.0.0.1:2379","-listen-client-urls","http://0.0.0.0:2379","--data-dir","/etcd"]}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/etcd","dir-type":"member"}
{"level":"info","ts":"2022-08-27T05:33:27.727Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2022-08-27T05:33:27.738Z","caller":"embed/etcd.go:598","msg":"pprof is enabled","path":"/debug/pprof"}
{"level":"info","ts":"2022-08-27T05:33:27.740Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"default","data-dir":"/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":50000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":4294967296,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"warn","ts":1661578407.740217,"caller":"fileutil/fileutil.go:57","msg":"check file permission","error":"directory \"/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"info","ts":"2022-08-27T05:33:30.958Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/etcd/member/snap/db","took":"3.207592575s"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":300003,"snapshot-size":"7.1 kB"}
{"level":"info","ts":"2022-08-27T05:34:47.258Z","caller":"etcdserver/server.go:518","msg":"recovered v3 backend from snapshot","backend-size-bytes":123691008,"backend-size":"124 MB","backend-size-in-use-bytes":68882432,"backend-size-in-use":"69 MB"}

I'm trying to get more info but etcd hogs memory and I can't even use my system properly. Does this help at all? Is there any way I can just reset it so it keeps existing data and ignores the problematic stuff?

ahrtr commented 2 years ago

The huge memory usage might be caused by the db file size.

What's the size of the db file, which locates in ${DATA_DIR}/member/snap/db? Please try to perform compaction + defragmentation per guide https://etcd.io/docs/v3.5/op-guide/maintenance/#space-quota

ahrtr commented 2 years ago

I just added more log for debugging the reason why etcd fails to purge WAL file.

https://github.com/etcd-io/etcd/pull/14576

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

sxllwx commented 2 months ago

/sub

No5stranger commented 2 months ago

/mark

etcd-io / etcd