etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.32k stars 9.71k forks source link

panic due to race between wal rotation and snapshot #14252

Open jasondobryk opened 2 years ago

jasondobryk commented 2 years ago

Hi, there is a problem on several etc clusters on windows server at the customer. process есв.ехе managed by the windows service. It looks like this: the cluster is working, but if you restart the etc service on one of the servers, we will get an error:

{"level":"fatal","ts":"2022-07-21T13:17:36.221+0300","caller":"etcdserver/storage.go:95","msg":"failed to open WAL","error":"wal: file not found","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.readWAL\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/storage.go:95\ngo.etcd.io/etcd/server/v3/etcdserver.restartNode\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/raft.go:481\ngo.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:533\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

What did you expect to happen? I expect a normal restart of the etcd process

How can we reproduce it? so far we have not been able to repeat the error on our test servers

Etcd version etcd.exe --version etcd Version: 3.5.4

etcdctl.exe version etcdctl version: 3.5.4

Etcd configuration of one of the servers: name: ck11-apache2.test.local initial-advertise-peer-urls: http://10.X.X.178:2380 listen-peer-urls: http://10.X.X.178:2380 listen-client-urls: http://10.X.X.178:2379,http://127.0.0.1:2379 advertise-client-urls: http://10.X.X.178:2379 initial-cluster-token: etcd-cluster-123456789 initial-cluster: ck11-apache1.test.local=http://10.X.X.177:2380,ck11-apache2.test.local=http://10.X.X.178:2380,ck11-apache-nlb.test.local=http://10.X.X.179:2381 initial-cluster-state: 'new' max-txn-ops: 12800 heartbeat-interval: 250 election-timeout: 1250 logger: zap log-level: info

Etcd debug information

C:\Program Files\etcd>etcdctl -w table endpoint --cluster status +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | http://10.X.X.179:2379 | 90f53ed17cd0b92f | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | | | http://10.X.X.178:2379 | 97a661acdde4fc12 | 3.5.4 | 1.7 GB | true | false | 2 | 2048771 | 2048771 | | | http://10.X.X.177:2379 | a82db5fe4c233dac | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | | +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

I really need help with this problem. It is still unclear how to reproduce it, what to focus on

serathius commented 2 years ago

I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but provided most of the needed information. Please use it next time.

jasondobryk commented 2 years ago

I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but provided most of the needed information. Please use it next time.

According to the logs, it is unclear what specific file is missing, there are files with wal in the folder. Perhaps some specific one is missing or damaged, is it possible to understand it somehow?

jasondobryk commented 2 years ago

I found the problem. If the rotation of wal files occurs faster than the server takes a snapshot, then the next restart of the server leads to an error.

jasondobryk commented 2 years ago

It would be nice to solve this problem in the next version

serathius commented 2 years ago

I found the problem. If the rotation of wal files occurs faster than the server takes a snapshot, then the next restart of the server leads to an error.

This seems like a bug, can you provide a minimal reproduction steps?

MIVerTFT commented 2 years ago

Yes, a very strange situation. Reproduced like this. There is one node etcd 3.5.4 running on Windows 10 config file below.

name: node1
listen-peer-urls: http://localhost:2002
listen-client-urls: http://localhost:2001
initial-advertise-peer-urls: http://localhost:2002
advertise-client-urls: http://localhost:2001
initial-cluster-token: cluster_1'
initial-cluster: node1=http://localhost:2002
initial-cluster-state: 'new'
heartbeat-interval: 250
election-timeout: 1250
max-txn-ops: 12800
auto-compaction-mode: periodic
auto-compaction-retention: 1m
max-wals: 5
snapshot-count: 10000

I run a task that continuously writes records of 200,000 bytes each. As a result, there is a fast rotation of the wal files and after 3000-4000 transaction, the node cannot restart due to the error that the wal files are deleted and there is no snapshot.

On Windows Server 2019, the situation is a little different. wal files are not immediately deleted, they continue to accumulate until the snapshot is created, but after that the node still does not restart due to the same error.

On debian, i couldn't reproduce this situation.

serathius commented 2 years ago

Windows is currently in Tier 3 support. https://etcd.io/docs/v3.5/op-guide/supported-platform/#current-support Meaning we just build windows binary, but we never tested if it works.

Feel free to look into the issue, however it's not unexpected that there might be problems.

ahrtr commented 2 years ago

Things to check:

  1. etcd has permission to access the WAL directory and files.
  2. Whether there are any WAL files?
MIVerTFT commented 2 years ago
  1. Yes, I specifically run etcd with administrator rights
  2. Yes, there are other wall files in the data folder
MIVerTFT commented 2 years ago

I wrote a very simple test.

  1. Create file.
  2. Put a lock on file.
  3. Delete file.
  4. Close process.

On windows 10, the file is deleted after step 3, but on windows server 2019 only after step 4, while there are no errors at step 3. This explains the difference in behavior about which I wrote above. On windows server Etcd sends a command to delete files, but they remain until the lock is remove, or until the process is killed.

ahrtr commented 2 years ago

The file being deleting after step 4 is expected behavior on Linux as well. I suspect the issue is caused by TryLockFile. It actually failed to lock the file, but the lockFile returns nil, accodingly the purge goroutine regards it as a successful lock, then it deletes the file.

Could you double check it?

MIVerTFT commented 2 years ago

Yes, that's right, LockFileEx cannot lock the file, but the lockFile method returns nil due to an incorrect condition, well, as it is written here.

xrl commented 2 years ago

I am seeing this issue as well, 3.5.4 on linux/kubernetes using the official images. We are shipping our etcd using the bitnami helm chart.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.