etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Prevent disruptive rejoining node #9333

Closed: gyuho closed this issue 6 years ago

gyuho commented 6 years ago

Original discussion https://groups.google.com/d/msg/etcd-dev/82bPTmzGVM0/EiTINb6dBQAJ

https://github.com/coreos/etcd/blob/b03fd4cbc30eaa9f66faca5df655eaaca56990a0/etcdserver/raft.go#L416-L421

https://github.com/coreos/etcd/blob/b03fd4cbc30eaa9f66faca5df655eaaca56990a0/etcdserver/raft.go#L450-L455

  1. Restart a follower in an existing cluster with an active leader.
  2. The follower advances its election ticks, leaving only one tick before it campaigns.
  3. The follower's last tick elapses, and it becomes a candidate.
  4. Once it is a candidate, it drops all incoming proposals until a leader is elected (so it is unavailable).

Ideally, the acting leader sends a heartbeat message to this follower before step 3, so that the follower resets its election ticks and stays available without becoming a candidate.

advanceTicksForElection helps cross-datacenter deployments with larger election timeouts by speeding up the bootstrap process. However, some deployments prefer higher availability, and disabling advanceTicksForElection can increase the availability of a restarting follower node.
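
For reference, the linked lines boil down to something like this (a simplified sketch; the permalinks above are authoritative): the restarting node's logical clock is fast-forwarded so that only about one election tick remains, during which it must hear from a leader or it campaigns.

```go
import "github.com/coreos/etcd/raft"

// Simplified sketch of advanceTicksForElection: fast-forward the raft logical
// clock so a fresh member does not wait a full election timeout before its
// first campaign. Only one tick is left, so the member must receive a leader
// heartbeat within that window or it becomes a candidate.
func advanceTicksForElection(n raft.Node, electionTicks int) {
	for i := 0; i < electionTicks-1; i++ {
		n.Tick()
	}
}
```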

/cc @jpbetz

xiang90 commented 6 years ago

We expect the follower to receive a heartbeat from the leader within the heartbeat timeout. This is a reasonable assumption to make. If restarting nodes keep disrupting the cluster, the user probably set the heartbeat timeout to some low value. Also, we will have pre-vote enabled in the future.

I do not think we should make what the etcd server does today configurable, unless I missed something.

gyuho commented 6 years ago

Also, we will have pre-vote enabled in the future.

Agree. This should reduce disruptive elections and resolve most issues.

wojtek-t commented 6 years ago

Copying some things from the email thread here - I think we are either misunderstanding some logs or there is a bug in etcd:

Hmm... maybe I'm misunderstanding the log entries, but in my case the leader is dropping leadership:

2018-02-11 23:38:16.450210 I | raft: f813a994322b2b40 [term: 9] received a MsgAppResp message with higher term from 9937f2a5f3634020 [term: 10]
2018-02-11 23:38:16.450261 I | raft: f813a994322b2b40 became follower at term 10
2018-02-11 23:38:16.450274 I | raft: raft.node: f813a994322b2b40 changed leader from f813a994322b2b40 to 9937f2a5f3634020 at term 10

where f813a994322b2b40 is the current leader and 9937f2a5f3634020 is a member that restarted. Is it a bug?

Interestingly, the third member ignores such vote requests:

2018-02-11 23:38:16.430232 I | raft: ad782d6b7abde5c3 [logterm: 9, index: 2853, vote: f813a994322b2b40] ignored MsgVote from 9937f2a5f3634020 [logterm: 9, index: 2364] at term 9: lease is not expired (remaining ticks: 10)

I believe this is what happens - see log entries above.

One more interesting thing I found is that, in the logs of the restarted member, the election starts before the 'peer X became active' and 'established a TCP streaming connection with peer X' log entries:

2018-02-11 23:38:15.892374 I | etcdmain: etcd Version: 3.1.11
(...)
2018-02-11 23:38:16.163995 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (writer)
2018-02-11 23:38:16.164045 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (writer)
2018-02-11 23:38:16.164142 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (stream MsgApp v2 reader)
2018-02-11 23:38:16.192182 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (stream Message reader)
2018-02-11 23:38:16.220700 I | rafthttp: started streaming with peer f813a994322b2b40 (writer)
2018-02-11 23:38:16.220760 I | rafthttp: started streaming with peer f813a994322b2b40 (writer)
2018-02-11 23:38:16.220806 I | rafthttp: started streaming with peer f813a994322b2b40 (stream MsgApp v2 reader)
2018-02-11 23:38:16.249231 I | rafthttp: started streaming with peer f813a994322b2b40 (stream Message reader)
2018-02-11 23:38:16.311931 I | raft: 9937f2a5f3634020 is starting a new election at term 9
2018-02-11 23:38:16.312034 I | raft: 9937f2a5f3634020 became candidate at term 10
2018-02-11 23:38:16.312155 I | raft: 9937f2a5f3634020 received MsgVoteResp from 9937f2a5f3634020 at term 10
2018-02-11 23:38:16.312176 I | raft: 9937f2a5f3634020 [logterm: 9, index: 2364] sent MsgVote request to f813a994322b2b40 at term 10
2018-02-11 23:38:16.312191 I | raft: 9937f2a5f3634020 [logterm: 9, index: 2364] sent MsgVote request to ad782d6b7abde5c3 at term 10
2018-02-11 23:38:16.366186 I | rafthttp: peer f813a994322b2b40 became active
2018-02-11 23:38:16.366257 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream Message writer)
2018-02-11 23:38:16.366418 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream MsgApp v2 writer)
2018-02-11 23:38:16.369265 I | rafthttp: peer ad782d6b7abde5c3 became active
2018-02-11 23:38:16.369301 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream Message writer)
2018-02-11 23:38:16.369448 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream MsgApp v2 writer)
2018-02-11 23:38:16.425179 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream MsgApp v2 reader)
2018-02-11 23:38:16.434434 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream Message reader)
2018-02-11 23:38:16.453593 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream MsgApp v2 reader)
2018-02-11 23:38:16.496026 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream Message reader)

Does it mean that the election started before we even initialized connections to peers? Shouldn't we wait for the initialization of connections (which ones?) before we start measuring the heartbeat timeout?

gyuho commented 6 years ago

We are hitting this issue from https://github.com/coreos/etcd/pull/5468.

https://github.com/coreos/etcd/blob/df4aafbbdfe5a96699f01bee66b4ddb04d24444f/raft/raft.go#L806-L822

UPDATE: On second thought, https://github.com/coreos/etcd/pull/5468 makes sense for the case where there is an isolated candidate. Responding with a pb.MsgAppResp of a higher term frees the stuck candidate, at the cost of disrupting the current leader. This is expected; pre-vote should help.

Will enable pre-vote in etcd.
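
For reference, pre-vote is exposed as a field on raft.Config; below is a minimal sketch of turning it on when starting a raft node (the etcd server would wire this through its own configuration):

```go
import "github.com/coreos/etcd/raft"

func startNodeWithPreVote(storage *raft.MemoryStorage) raft.Node {
	cfg := &raft.Config{
		ID:              0x01,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
		// With PreVote, a would-be candidate first asks peers whether it could
		// win an election, without bumping its term. A rejoining member that
		// cannot win no longer forces the current leader to step down.
		PreVote: true,
	}
	return raft.StartNode(cfg, []raft.Peer{{ID: 0x01}})
}
```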

xiang90 commented 6 years ago

@gyuho even without pre-vote, the restarted node should not ALWAYS disrupt the leadership. If that is the case, we need to figure out why. Enabling pre-vote might mask the bug.

xiang90 commented 6 years ago

@gyuho

@wojtek-t is probably right. We start the timer too early.

/cc @jpbetz

gyuho commented 6 years ago

@xiang90 Right, loosening the advanced election-timeout ticks should help, in addition to pre-vote. Otherwise, we are just hoping the restarted follower receives a heartbeat before the last tick elapses.

raft: f813a994322b2b40 [term: 9] received a MsgAppResp message with higher term

This is clearly a follower (or candidate) with a higher term responding to the leader's heartbeat, thus forcing the leader to step down, which is expected in Raft to prevent an isolated follower from being stuck repeating elections.
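
In other words, the Raft rule at work here is simply "higher term wins". A paraphrased, self-contained illustration (not the actual etcd/raft code):

```go
// Paraphrased illustration of the Raft rule involved: any message carrying a
// higher term makes the receiver adopt that term and revert to follower, even
// if it is currently the leader. Pre-vote avoids this by letting a node probe
// for votes without actually incrementing its term first.
type member struct {
	term  uint64
	state string // "leader", "follower", or "candidate"
}

func (m *member) observeMessageTerm(msgTerm uint64) {
	if msgTerm > m.term {
		m.term = msgTerm
		m.state = "follower" // a leader steps down when it sees a higher term
	}
}
```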

Maybe add more logs around this?

xiang90 commented 6 years ago

Right, loosening the advanced election-timeout ticks should help

Not really. What we should do is account for the connection/initialization time when setting the initial election timeout to a low value on restart. Right now, we set it as low as the heartbeat timeout to shorten the initial leader election, so the timeout can be even shorter than the initialization time, as @wojtek-t pointed out.
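
One way to do that, sketched below with assumed helper names (this is not existing etcd code): only fast-forward the election ticks once the transport reports active peer connections, and leave a couple of ticks of slack so the leader's heartbeat can still arrive first.

```go
// Sketch only; tickAdvancer and its methods are assumed for illustration, not
// existing etcd types. The point is to defer the fast-forward until peer
// connections are up, so the remaining ticks genuinely give the leader a
// chance to reach the restarted member.
type tickAdvancer interface {
	ActivePeers() int // peers we currently have live connections to
	Tick()            // advance the raft logical clock by one tick
}

func maybeFastForwardTicks(n tickAdvancer, electionTicks int) {
	if n.ActivePeers() == 0 {
		return // connections not initialized yet; keep the full election timeout
	}
	for i := 0; i < electionTicks-2; i++ { // leave 2 ticks of slack, not 1
		n.Tick()
	}
}
```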

mborsz commented 6 years ago

Hi,

I was able to reproduce this issue (rejoining node triggers leader election) in version 3.1.13:

  1. 8cc587412352bc16 is elected as the leader of the cluster
  2. f34ba41f615b4eff is a follower
  3. 2fae52714728f955 rejoins the cluster

2fae52714728f955's logs:

A  2018-04-12 22:33:12.130522 I | etcdserver: 2fae52714728f955 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s) 
A  2018-04-12 22:33:12.260686 I | rafthttp: established a TCP streaming connection with peer 8cc587412352bc16 (stream Message reader) 
A  2018-04-12 22:33:12.260771 I | rafthttp: established a TCP streaming connection with peer 8cc587412352bc16 (stream MsgApp v2 reader) 
A  2018-04-12 22:33:12.260998 I | rafthttp: established a TCP streaming connection with peer f34ba41f615b4eff (stream MsgApp v2 reader) 
A  2018-04-12 22:33:12.261115 I | rafthttp: established a TCP streaming connection with peer f34ba41f615b4eff (stream Message reader) 
A  2018-04-12 22:33:12.384623 I | raft: 2fae52714728f955 is starting a new election at term 8 
A  2018-04-12 22:33:12.384687 I | raft: 2fae52714728f955 became candidate at term 9 
A  2018-04-12 22:33:12.384704 I | raft: 2fae52714728f955 received MsgVoteResp from 2fae52714728f955 at term 9 
A  2018-04-12 22:33:12.384717 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to f34ba41f615b4eff at term 9 
A  2018-04-12 22:33:12.384730 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to 8cc587412352bc16 at term 9 
A  2018-04-12 22:33:12.384741 I | raft: raft.node: 2fae52714728f955 lost leader 8cc587412352bc16 at term 9 
A  2018-04-12 22:33:13.886168 I | raft: 2fae52714728f955 is starting a new election at term 9 
A  2018-04-12 22:33:13.886218 I | raft: 2fae52714728f955 became candidate at term 10 
A  2018-04-12 22:33:13.886235 I | raft: 2fae52714728f955 received MsgVoteResp from 2fae52714728f955 at term 10 
A  2018-04-12 22:33:13.886247 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to f34ba41f615b4eff at term 10 
A  2018-04-12 22:33:13.886259 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to 8cc587412352bc16 at term 10 
A  2018-04-12 22:33:13.896923 I | raft: 2fae52714728f955 received MsgVoteResp rejection from 8cc587412352bc16 at term 10 
A  2018-04-12 22:33:13.896955 I | raft: 2fae52714728f955 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections 
A  2018-04-12 22:33:13.897041 I | raft: 2fae52714728f955 received MsgVoteResp rejection from f34ba41f615b4eff at term 10 
A  2018-04-12 22:33:13.897066 I | raft: 2fae52714728f955 [quorum:2] has received 1 MsgVoteResp votes and 2 vote rejections 
A  2018-04-12 22:33:13.897080 I | raft: 2fae52714728f955 became follower at term 10 
A  2018-04-12 22:33:15.023927 I | raft: 2fae52714728f955 [term: 10] received a MsgVote message with higher term from 8cc587412352bc16 [term: 11] 
A  2018-04-12 22:33:15.023972 I | raft: 2fae52714728f955 became follower at term 11 
A  2018-04-12 22:33:15.023986 I | raft: 2fae52714728f955 [logterm: 8, index: 2307, vote: 0] cast MsgVote for 8cc587412352bc16 [logterm: 8, index: 2748] at term 11 
A  2018-04-12 22:33:15.026218 I | raft: raft.node: 2fae52714728f955 elected leader 8cc587412352bc16 at term 11 
A  2018-04-12 22:33:15.075479 I | etcdserver: published {Name:etcd-10.127.240.58 ClientURLs:[http://127.0.0.1:2379]} to cluster e7a652e35e44715f 
A  2018-04-12 22:33:15.075640 I | embed: ready to serve client requests 

f34ba41f615b4eff's logs:

A  2018-04-12 22:33:10.810396 W | rafthttp: health check for peer 2fae52714728f955 could not connect: dial tcp 10.127.240.58:2380: getsockopt: connection refused 
A  2018-04-12 22:33:11.868355 I | rafthttp: peer 2fae52714728f955 became active 
A  2018-04-12 22:33:11.868510 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 reader) 
A  2018-04-12 22:33:11.870258 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message reader) 
A  2018-04-12 22:33:12.258999 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message writer) 
A  2018-04-12 22:33:12.259318 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 writer) 
A  2018-04-12 22:33:12.555504 I | raft: f34ba41f615b4eff [logterm: 8, index: 2748, vote: 8cc587412352bc16] ignored MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 8: lease is not expired (remaining ticks: 9) 
A  2018-04-12 22:33:13.893774 I | raft: f34ba41f615b4eff [term: 8] received a MsgVote message with higher term from 2fae52714728f955 [term: 10] 
A  2018-04-12 22:33:13.893823 I | raft: f34ba41f615b4eff became follower at term 10 
A  2018-04-12 22:33:13.893855 I | raft: f34ba41f615b4eff [logterm: 8, index: 2748, vote: 0] rejected MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 10 
A  2018-04-12 22:33:13.893867 I | raft: raft.node: f34ba41f615b4eff lost leader 8cc587412352bc16 at term 10 
A  2018-04-12 22:33:15.026885 I | raft: f34ba41f615b4eff [term: 10] received a MsgVote message with higher term from 8cc587412352bc16 [term: 11] 
A  2018-04-12 22:33:15.026968 I | raft: f34ba41f615b4eff became follower at term 11 
A  2018-04-12 22:33:15.026991 I | raft: f34ba41f615b4eff [logterm: 8, index: 2748, vote: 0] cast MsgVote for 8cc587412352bc16 [logterm: 8, index: 2748] at term 11 
A  2018-04-12 22:33:15.029122 I | raft: raft.node: f34ba41f615b4eff elected leader 8cc587412352bc16 at term 11 

8cc587412352bc16's logs:

A  2018-04-12 22:33:10.603211 W | rafthttp: health check for peer 2fae52714728f955 could not connect: dial tcp 10.127.240.58:2380: getsockopt: connection refused 
A  2018-04-12 22:33:11.769421 I | rafthttp: peer 2fae52714728f955 became active 
A  2018-04-12 22:33:11.860876 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 reader) 
A  2018-04-12 22:33:11.864928 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message reader) 
A  2018-04-12 22:33:12.243043 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 writer) 
A  2018-04-12 22:33:12.243180 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message writer) 
A  2018-04-12 22:33:12.539016 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748, vote: 8cc587412352bc16] ignored MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 8: lease is not expired (remaining ticks: 6) 
A  2018-04-12 22:33:12.539133 I | raft: 8cc587412352bc16 [term: 8] received a MsgAppResp message with higher term from 2fae52714728f955 [term: 9] 
A  2018-04-12 22:33:12.539162 I | raft: 8cc587412352bc16 became follower at term 9 
A  2018-04-12 22:33:12.539173 I | raft: raft.node: 8cc587412352bc16 changed leader from 8cc587412352bc16 to 2fae52714728f955 at term 9 
A  2018-04-12 22:33:13.877402 I | raft: 8cc587412352bc16 [term: 9] received a MsgVote message with higher term from 2fae52714728f955 [term: 10] 
A  2018-04-12 22:33:13.877469 I | raft: 8cc587412352bc16 became follower at term 10 
A  2018-04-12 22:33:13.877484 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748, vote: 0] rejected MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 10 
A  2018-04-12 22:33:13.877498 I | raft: raft.node: 8cc587412352bc16 lost leader 2fae52714728f955 at term 10 
A  2018-04-12 22:33:15.008423 I | raft: 8cc587412352bc16 is starting a new election at term 10 
A  2018-04-12 22:33:15.008490 I | raft: 8cc587412352bc16 became candidate at term 11 
A  2018-04-12 22:33:15.008506 I | raft: 8cc587412352bc16 received MsgVoteResp from 8cc587412352bc16 at term 11 
A  2018-04-12 22:33:15.008518 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748] sent MsgVote request to 2fae52714728f955 at term 11 
A  2018-04-12 22:33:15.008529 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748] sent MsgVote request to f34ba41f615b4eff at term 11 
A  2018-04-12 22:33:15.011935 I | raft: 8cc587412352bc16 received MsgVoteResp from 2fae52714728f955 at term 11 
A  2018-04-12 22:33:15.011986 I | raft: 8cc587412352bc16 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections 
A  2018-04-12 22:33:15.012007 I | raft: 8cc587412352bc16 became leader at term 11 
A  2018-04-12 22:33:15.012035 I | raft: raft.node: 8cc587412352bc16 elected leader 8cc587412352bc16 at term 11 

Questions:

  1. Why does 2fae52714728f955 say it has two active peers at 22:33:12.130522, while the 'established a TCP streaming connection' log entries only start at 22:33:12.260686?
  2. What is the origin of the 8 in 'fast-forwarding 8 ticks (election ticks 10)'? Why "election ticks - 2"?
  3. Can we fix this? :)
  4. Can we make it possible to disable the fast-forwarding mechanism completely, e.g. via a flag? I believe in our use case it doesn't provide any useful value, and as we can see it degrades the availability of the cluster.

gyuho commented 6 years ago

@wojtek-t @mborsz We will re-investigate this.

Why "election ticks - 2"?

Previously, we fast-forwarded to election ticks - 1, leaving only one tick. The patch added "one more" tick of slack to give the rejoining follower more time to receive a leader heartbeat and thus avoid starting a disruptive election. Our assumption was that just one more tick would be enough (it is 100 ms more when the heartbeat is 100 ms and the election timeout is 1 s).

Why does 2fae52714728f955 say it has two active peers at 22:33:12.130522, while the 'established a TCP streaming connection' log entries only start at 22:33:12.260686?

It's because the rebooting follower 2fae52714728f955 rejoins a cluster with 2 active members and was able to establish streams to them.

etcdserver: 2fae52714728f955 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)

This is where the rejoining follower finds the two other peers of the 3-node cluster and fast-forwards 8 ticks, leaving only 2 ticks before it triggers an election. Our expectation was that, within those last 2 ticks, the rejoining follower would receive a leader heartbeat and therefore not start a disruptive election. This is useful for cross-datacenter deployments with larger heartbeat intervals: when one tick is 1 second, without fast-forwarding you would wait the full 10 seconds to find out there is no leader, while with fast-forwarding you only wait 2 seconds. But this is still a theoretical scenario, and if it happens often in your use case, it might make sense to make this configurable.
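
To make the arithmetic concrete (a plain illustration using the numbers in this thread, not etcd code):

```go
import "time"

const (
	electionTicks  = 10
	fastForwarded  = electionTicks - 2             // the "8 ticks" in the log line
	ticksRemaining = electionTicks - fastForwarded // 2 ticks left to hear from a leader
)

// waitBeforeCampaign returns how long a restarted follower waits for a leader
// heartbeat before starting an election.
func waitBeforeCampaign(tickInterval time.Duration, fastForward bool) time.Duration {
	if fastForward {
		return ticksRemaining * tickInterval // 200ms with 100ms ticks, 2s with 1s ticks
	}
	return electionTicks * tickInterval // 1s with 100ms ticks, 10s with 1s ticks
}
```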

The logs show that the rejoining follower expected a leader heartbeat within 100 ms x 2 (200 ms) of 2018-04-12 22:33:12.130522 but did not get one, so it started an election at 2018-04-12 22:33:12.384623. What were your heartbeat and leader-election configuration? Does this happen often?

Before we make this configurable for low network latency environments, I want to be sure that it won't mask any other bugs.
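
If we do make it configurable, the knob could look roughly like this (a hypothetical sketch; the field name is illustrative only, not an existing option in this discussion):

```go
// Hypothetical sketch of a per-member setting; not an existing etcd option
// here. Disabling the advance trades a slower initial election for less risk
// of disrupting an existing leader on restart.
type memberConfig struct {
	ElectionTicks              int
	InitialElectionTickAdvance bool
}

func initialTicksToAdvance(cfg memberConfig) int {
	if !cfg.InitialElectionTickAdvance {
		return 0 // wait the full election timeout before campaigning
	}
	return cfg.ElectionTicks - 2 // current fast-forward behavior
}
```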

wojtek-t commented 6 years ago

The logs show that the rejoining follower expected a leader heartbeat within 100 ms x 2 (200 ms) of 2018-04-12 22:33:12.130522 but did not get one, so it started an election at 2018-04-12 22:33:12.384623. What were your heartbeat and leader-election configuration? Does this happen often?

Defaults (heartbeat - 100ms, leader election - 1s). We see it happening from time to time - it's definitely NOT a one-off.