Leader Election in some rare cases take longer to complete

bnamasivayam commented 4 years ago

There was an issue in production where it took 6 seconds for the leader election process to elect a proper leader. The crux of the issue is co-ordinators could elect a leader that had already exited and hence need to wait for the next interval to elect a new one. A plausible set of events that could have lead to the behavior is

Let's say id1<id2.

id1 became the leader but slow to send LeaderHeartBeat
id2 got elected since id1 is slow to send heart beat
id2 is regularly sending heartbeat.
id1 heartbeat reached the co-ordinators. It will be put into the list of availableLeaders but false reply will be send to id1. Hence id1 will exit as it gets a majority false
Next window, the co-ordinators will select id1 as the leader as id1<id2
id2 will send heartbeat and it will be put in availableLeaders set but the coordinators will reply false and id2 will also die
Next window, id2 will be elected again as it is in the availableLeaders set.
Next window, some new id will be elected and the recovery progresses.

Evan suggested the following tweaks that could prevent this situation but also wants to be carefully vetted.

Do not reply false to heartbeat request if availableLeaders list is empty. This will prevent a leader from exiting prematurely.
In the nomination phase, if there are multiple leaders in availableLeaders list, then give preference to the one recently nominated.

xumengpanda commented 4 years ago

Can we reproduce this in simulation and test the leader election latency in simulation?

In simulation, we know which node is available and the artificial network latency, we can calculate an expected latency threshold to check against. (It very likely will be harder than what I described. But I feel it is doable.)

dongxinEric commented 4 years ago

Since we are talking about leader election here, do we have any write ups/documentation about the leader election algorithm we use? Ideally the doc should have a sort of formal proof of its correctness and liveness and etc.. @ajbeamon @etschannen

EDIT: I think it all boils down to discuss what consensus algorithm FDB used for this purpose.

apple / foundationdb

Leader Election in some rare cases take longer to complete #3378