kishorenc opened 3 years ago
This is because when the follower restarts, the leader can still send heartbeats to it, but the log-mismatch situation is not handled. The follower sets success to false in its response when the logs mismatch: https://github.com/baidu/braft/blob/154d805bd1eb88df97121fe1c73a5b469df88056/src/braft/node.cpp#L2430 But the leader does not handle that case: https://github.com/baidu/braft/blob/154d805bd1eb88df97121fe1c73a5b469df88056/src/braft/replicator.cpp#L266 So a possible solution is to handle this situation:
```cpp
if (!response->success() && !readonly) {
    r->_send_empty_entries(false);
    return;
}
```
When the leader receives a new log request, sending entries is triggered, so the follower catches up on the log and everything works again.
Deleting all data from a follower and re-joining it to the cluster is a very dangerous operation. The empty follower may vote for the wrong leader, since any of the other nodes contains more logs than it does. This leads to a split-brain issue and the cluster becomes unavailable, or, even worse, committed data is silently purged.
The right order of operations to clean a follower is:
In this case, as far as I know:
@PFZheng @Edward-xk Consider adding a persistent UUID to all nodes and including it in peer_id, so that the cluster is aware when some nodes have been reset unexpectedly.
> or even worse - purge committed data silently.

@chenzhangyi how does this happen? For performance reasons, is it safe to use the MemoryLogStorage option for the cluster?
A very simple case:

L1: (1, 1) (1, 2)
L2: (1, 1) (1, 2)
L3: (1, 1)

Log (1, 2) has been committed. At this point, reset L2 to empty:

L1: (1, 1) (1, 2)
L2:
L3: (1, 1)

L3 issues a vote request for itself and L2 grants it. Now L3 writes its first log entry:

L1: (1, 1) (1, 2)
L2: (1, 1) (2, 2)
L3: (1, 1) (2, 2)

After L3 sends a log request to L1, the committed entry (1, 2) is removed permanently.
Consider this sequence:
reject term_unmatched AppendEntries
Sample error log (this keeps getting repeated) on the follower:
As long as the cluster gets no writes, the follower is unable to join the cluster at all. When I make a write to the leader, the errors stop immediately and the "stuck" follower joins the cluster and resumes normal operation. A similar issue was also previously discussed here (a Java port of this project).
Is there any workaround for this issue? Otherwise, a cluster that is mostly read-only will exhibit this problem in any environment where the storage is ephemeral (like EC2).