Closed harinath001 closed 6 months ago
Enabling PreVote should resolve the issue. Note that raftexample is out of maintenance. We are planning to implement a new example under this repo: https://github.com/etcd-io/raft/issues/2
Please raise a new issue in etcd if you see any issue on etcd. Thanks.
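For readers wondering what enabling PreVote looks like in code, here is a minimal sketch, not the raftexample source. It assumes the go.etcd.io/raft/v3 module is available. The tick and size values are illustrative placeholders rather than recommendations. With PreVote set, a node first asks its peers whether they would grant it a vote before it bumps its term, so a node that has fallen behind is less likely to disrupt a healthy leader.

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
)

func main() {
	// In-memory storage keeps the sketch self-contained; a real node
	// would back this with its WAL and snapshot state.
	storage := raft.NewMemoryStorage()

	cfg := &raft.Config{
		ID:                        1,           // illustrative node ID
		ElectionTick:              10,          // illustrative value
		HeartbeatTick:             1,           // illustrative value
		Storage:                   storage,
		MaxSizePerMsg:             1024 * 1024, // illustrative value
		MaxInflightMsgs:           256,         // illustrative value
		MaxUncommittedEntriesSize: 1 << 30,     // illustrative value
		// The setting suggested in the reply above.
		PreVote: true,
	}

	fmt.Println("PreVote enabled:", cfg.PreVote)
}
```

In raftexample the equivalent change would presumably be one extra field in the raft.Config literal that the node builds at startup. Where that literal lives is an assumption here, not something stated in this thread.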
Context
Bug pattern
In a 4-node cluster, the 4th node cannot join the cluster after some time (during which leadership changes and log compactions have happened).
Changes made in the example code to take frequent snapshots
(See the comment with the patch containing the changes I made for testing: https://github.com/etcd-io/raft/issues/174#issuecomment-1975389013)
Set defaultSnapshotCount = 1
Set snapshotCatchUpEntriesN = 1
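To show why these two values matter, the following sketch models the snapshot and compaction arithmetic as I understand it from raftexample. The helper names shouldSnapshot and compactIndex are hypothetical, and the exact conditions are an assumption rather than a copy of the example's code. With both values set to 1, almost every applied entry triggers a snapshot and the log keeps only about one trailing entry, so a follower that falls behind can no longer catch up from the log alone and needs a snapshot from the leader.

```go
package main

import "fmt"

// shouldSnapshot reports whether more than snapCount entries have been
// applied since the last snapshot (assumed trigger condition).
func shouldSnapshot(appliedIndex, snapshotIndex, snapCount uint64) bool {
	return appliedIndex-snapshotIndex > snapCount
}

// compactIndex returns the index passed to log compaction: entries before
// it are discarded, leaving roughly catchUpEntries entries for slow followers.
func compactIndex(appliedIndex, catchUpEntries uint64) uint64 {
	if appliedIndex > catchUpEntries {
		return appliedIndex - catchUpEntries
	}
	return 1
}

func main() {
	// The values used in this bug report.
	const snapCount, catchUp = 1, 1

	var snapshotIndex uint64
	for applied := uint64(1); applied <= 12; applied++ {
		if !shouldSnapshot(applied, snapshotIndex, snapCount) {
			continue
		}
		fmt.Printf("applied=%d: snapshot taken, log compacted up to index %d\n",
			applied, compactIndex(applied, catchUp))
		snapshotIndex = applied
	}
}
```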
Detailed steps for reproducing the bug.
Compile steps
go build -o raftexample
Create a cluster of 3 raft servers
./raftexample --id 1 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9001
./raftexample --id 2 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9002
./raftexample --id 3 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003 --port 9003
Add 4th server
curl -L http://127.0.0.1:9003/4 -XPOST -d http://127.0.0.1:8004
Starting 4th server
./raftexample --id 4 --cluster http://127.0.0.1:8001,http://127.0.0.1:8002,http://127.0.0.1:8003,http://127.0.0.1:8004 --port 9004 --join
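The start commands above differ only in the node ID, the API port, the size of the peer list, and the --join flag. As a convenience for repeating the reproduction, a small generator could print them. The helpers clusterFlag and startCommand are hypothetical names. The port scheme (800N for raft peers, 900N for the HTTP API) is taken from the commands in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterFlag builds the comma-separated peer URL list for nodes 1..n,
// using the 800N peer ports from this report.
func clusterFlag(n int) string {
	urls := make([]string, 0, n)
	for id := 1; id <= n; id++ {
		urls = append(urls, fmt.Sprintf("http://127.0.0.1:%d", 8000+id))
	}
	return strings.Join(urls, ",")
}

// startCommand returns the raftexample invocation for node id in a cluster
// of n nodes; join appends --join for a node added to an existing cluster.
func startCommand(id, n int, join bool) string {
	cmd := fmt.Sprintf("./raftexample --id %d --cluster %s --port %d",
		id, clusterFlag(n), 9000+id)
	if join {
		cmd += " --join"
	}
	return cmd
}

func main() {
	// Initial 3-node cluster.
	for id := 1; id <= 3; id++ {
		fmt.Println(startCommand(id, 3, false))
	}
	// 4th node, started after it has been added through the HTTP API.
	fmt.Println(startCommand(4, 4, true))
}
```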
Make some transactions to increase the applied index
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
curl -L http://127.0.0.1:9001/my-key -XPUT -d bar1
logs from server-1, which is the leader now.
All the nodes apply the commits up to index 12.
stop the 4th server
<killed the 4th server>
Trigger multiple leader failures, which lead to leadership changes
current leader: server 1