Open bnamasivayam opened 4 years ago
Can we reproduce this in simulation and test the leader election latency there?
In simulation, we know which nodes are available and what the artificial network latency is, so we can calculate an expected latency threshold to check against. (It will very likely be harder than what I described, but I feel it is doable.)
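As a rough illustration of the threshold idea, here is a minimal sketch. It assumes an election costs a fixed number of round trips to a majority of coordinators plus at most one polling interval. The names `expected_election_latency`, `round_trips`, `poll_interval`, and `slack` are hypothetical and do not come from the FDB code base.

```python
def expected_election_latency(one_way_latencies, round_trips, poll_interval):
    """Estimate an upper bound on election latency, in seconds.

    one_way_latencies: simulated one-way latency to each coordinator.
    round_trips: assumed number of request/reply exchanges per election.
    poll_interval: assumed wait before coordinators re-evaluate nominees.
    """
    # Only a majority of coordinators has to answer, so the relevant link
    # is the slowest one inside the fastest majority.
    quorum = len(one_way_latencies) // 2 + 1
    quorum_latency = sorted(one_way_latencies)[quorum - 1]
    return round_trips * 2 * quorum_latency + poll_interval


def latency_within_threshold(observed, expected, slack=1.5):
    """Return True if the observed latency stays under expected * slack."""
    return observed <= expected * slack


if __name__ == "__main__":
    expected = expected_election_latency([0.010, 0.050, 0.020], 2, 0.5)
    print(f"expected bound: {expected:.3f}s")
    print("6s election acceptable?", latency_within_threshold(6.0, expected))
```

A simulation test could compute the bound from the injected latencies and fail whenever an observed election exceeds it.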
Since we are talking about leader election here, do we have any write-ups or documentation about the leader election algorithm we use? Ideally the doc should include some sort of formal proof of its correctness, liveness, etc. @ajbeamon @etschannen
EDIT: I think it all boils down to discussing which consensus algorithm FDB uses for this purpose.
There was an issue in production where it took 6 seconds for the leader election process to elect a proper leader. The crux of the issue is that coordinators could elect a leader that had already exited, and then had to wait for the next interval to elect a new one. A plausible set of events that could have led to this behavior is:
Let's say id1<id2.
false reply will be sent to id1. Hence id1 will exit, as it gets a majority of false replies.
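The exit condition mentioned above (a leader exits once a majority of coordinators answer false) can be sketched as follows. This is a toy model, not the FDB implementation; `leader_should_exit` and its parameters are hypothetical names.

```python
def leader_should_exit(replies, num_coordinators):
    """Decide whether a leader gives up after one heartbeat round.

    replies: booleans received so far from coordinators (True = still leader).
    num_coordinators: total number of coordinators, including silent ones.
    """
    false_count = sum(1 for reply in replies if reply is False)
    # A strict majority of all coordinators must have said false;
    # coordinators that have not answered do not count toward it.
    return false_count > num_coordinators // 2


if __name__ == "__main__":
    print(leader_should_exit([False, False, True], 3))
    print(leader_should_exit([False, True, True], 3))
```

In this model a single stale false reply is harmless, but two out of three are enough for the leader to exit, even if the coordinators nominate that same leader shortly afterwards.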
Evan suggested the following tweaks, which could prevent this situation but would need to be carefully vetted.
false to a heartbeat request if the availableLeaders list is empty. This will prevent a leader from exiting prematurely.
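One possible reading of this tweak is that a coordinator should not answer false to a heartbeat while its availableLeaders list is empty. The sketch below models that reading only. The baseline rule (true only for the current nominee), the function name `heartbeat_reply`, and the `apply_tweak` flag are assumptions made for illustration, not the actual coordinator code.

```python
def heartbeat_reply(sender_id, current_nominee, available_leaders, apply_tweak):
    """Toy coordinator reply to a leader heartbeat.

    Assumed baseline: reply True only if the sender is the current nominee.
    Assumed tweak: never reply False while no other leader is available.
    """
    if sender_id == current_nominee:
        return True
    if apply_tweak and not available_leaders:
        # Nobody else could take over, so do not push the sender to exit.
        return True
    return False


if __name__ == "__main__":
    # id1 = 1 sends a heartbeat while the coordinator's nominee is id2 = 2.
    print(heartbeat_reply(1, 2, set(), apply_tweak=False))
    print(heartbeat_reply(1, 2, set(), apply_tweak=True))
```

Under these assumptions the tweak only changes the reply when the list is empty, so a coordinator that knows of another available leader still answers false.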