Open jasondobryk opened 2 years ago
I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but you provided most of the needed information. Please use it next time.
According to the logs, it is unclear which specific file is missing; there are files with "wal" in their names in the folder. Perhaps one specific file is missing or damaged. Is there a way to tell which one?
I found the problem. If the rotation of WAL files happens faster than the server takes a snapshot, the next restart of the server fails with this error.
It would be nice to fix this in the next version.
This seems like a bug. Can you provide minimal reproduction steps?
Yes, a very strange situation. I reproduced it like this: a single etcd 3.5.4 node running on Windows 10, with the config file below.
name: node1
listen-peer-urls: http://localhost:2002
listen-client-urls: http://localhost:2001
initial-advertise-peer-urls: http://localhost:2002
advertise-client-urls: http://localhost:2001
initial-cluster-token: 'cluster_1'
initial-cluster: node1=http://localhost:2002
initial-cluster-state: 'new'
heartbeat-interval: 250
election-timeout: 1250
max-txn-ops: 12800
auto-compaction-mode: periodic
auto-compaction-retention: 1m
max-wals: 5
snapshot-count: 10000
I run a task that continuously writes records of 200,000 bytes each. As a result, the WAL files rotate quickly, and after 3,000-4,000 transactions the node can no longer restart: it fails with an error because the WAL files have been deleted and there is no snapshot.
On Windows Server 2019, the situation is slightly different: the WAL files are not deleted immediately, and they keep accumulating until the snapshot is created. But after that the node still fails to restart with the same error.
On Debian, I couldn't reproduce this situation.
Windows is currently in Tier 3 support (https://etcd.io/docs/v3.5/op-guide/supported-platform/#current-support), meaning we only build the Windows binary; we have never tested whether it works.
Feel free to look into the issue, but it's not unexpected that there are problems.
Things to check:
I wrote a very simple test.
On Windows 10 the file is deleted after step 3, but on Windows Server 2019 only after step 4, with no errors at step 3. This explains the difference in behavior I described above: on Windows Server, etcd issues the command to delete the files, but they remain until the lock is removed or the process is killed.
The file being deleted only after step 4 is expected behavior on Linux as well. I suspect the issue is caused by TryLockFile: it actually fails to lock the file, but lockFile returns nil, so the purge goroutine regards it as a successful lock and then deletes the file.
Could you double-check it?
Yes, that's right: LockFileEx cannot lock the file, but the lockFile method returns nil due to an incorrect condition, exactly as written here.
I am seeing this issue as well, 3.5.4 on linux/kubernetes using the official images. We are shipping our etcd using the bitnami helm chart.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hi, there is a problem on several etcd clusters on Windows Server at a customer site. The etcd.exe process is managed by a Windows service. It looks like this: the cluster is working, but if you restart the etcd service on one of the servers, we get an error:
{"level":"fatal","ts":"2022-07-21T13:17:36.221+0300","caller":"etcdserver/storage.go:95","msg":"failed to open WAL","error":"wal: file not found","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.readWAL\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/storage.go:95\ngo.etcd.io/etcd/server/v3/etcdserver.restartNode\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/raft.go:481\ngo.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:533\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
What did you expect to happen? I expect a normal restart of the etcd process
How can we reproduce it? So far we have not been able to reproduce the error on our test servers.
Etcd version etcd.exe --version etcd Version: 3.5.4
etcdctl.exe version etcdctl version: 3.5.4
Etcd configuration of one of the servers:
name: ck11-apache2.test.local
initial-advertise-peer-urls: http://10.X.X.178:2380
listen-peer-urls: http://10.X.X.178:2380
listen-client-urls: http://10.X.X.178:2379,http://127.0.0.1:2379
advertise-client-urls: http://10.X.X.178:2379
initial-cluster-token: etcd-cluster-123456789
initial-cluster: ck11-apache1.test.local=http://10.X.X.177:2380,ck11-apache2.test.local=http://10.X.X.178:2380,ck11-apache-nlb.test.local=http://10.X.X.179:2381
initial-cluster-state: 'new'
max-txn-ops: 12800
heartbeat-interval: 250
election-timeout: 1250
logger: zap
log-level: info
Etcd debug information
C:\Program Files\etcd>etcdctl -w table endpoint --cluster status
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.X.X.179:2379 | 90f53ed17cd0b92f |   3.5.4 |  1.7 GB |     false |      false |         2 |    2048771 |            2048771 |        |
| http://10.X.X.178:2379 | 97a661acdde4fc12 |   3.5.4 |  1.7 GB |      true |      false |         2 |    2048771 |            2048771 |        |
| http://10.X.X.177:2379 | a82db5fe4c233dac |   3.5.4 |  1.7 GB |     false |      false |         2 |    2048771 |            2048771 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
I really need help with this problem. It is still unclear how to reproduce it or what to focus on.