baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.

Follower restarting with empty state directory does not join idle cluster until first write happens #292

Open kishorenc opened 3 years ago

kishorenc commented 3 years ago

Consider this sequence:

  1. On a 3 node braft cluster, do some writes
  2. On a follower, stop the process, delete the state/data directory and start the process again
  3. The follower will NOT be able to rejoin ("reject term_unmatched AppendEntries" errors keep occurring)

Sample error log (this keeps getting repeated) on the follower:

W20210527 17:07:16.129559 179408896 node.cpp:2308] node default_group:127.0.0.1:8107:8108 reject term_unmatched AppendEntries from 127.0.0.1:6107:6108 in term 2 prev_log_index 2 prev_log_term 2 local_prev_log_term 0 last_log_index 0 entries_size 0 from_append_entries_cache: 0

As long as the cluster gets no writes, the follower is unable to join the cluster at all. When I make a write to the leader, the errors stop immediately and the "stuck" follower joins the cluster and resumes normal operation. A similar issue was also previously discussed here (a Java port of this project).

Is there any workaround for this issue? Without one, a mostly read-only cluster will exhibit this problem in environments where storage is ephemeral (like EC2).
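
The only mitigation I can think of is to keep the cluster from ever being completely write-idle, e.g. by periodically applying a no-op entry on the leader. A rough sketch using braft's Task/apply API (the node pointer is assumed to be the leader's braft::Node; the payload is arbitrary and the state machine would have to treat it as a no-op):

    #include <braft/raft.h>
    #include <butil/iobuf.h>

    // Rough sketch: periodically apply a no-op entry on the leader so that a
    // mostly read-only cluster still generates writes, which (per the behavior
    // described above) lets a reset follower catch up.
    void apply_noop(braft::Node* node) {
        butil::IOBuf buf;
        buf.append("noop");  // arbitrary payload; on_apply must ignore it
        braft::Task task;
        task.data = &buf;
        task.done = NULL;  // fire-and-forget; pass a Closure to observe the result
        node->apply(task);
    }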

ehds commented 3 years ago

This is because when the follower restarts, the leader keeps sending heartbeats to it, but it doesn't handle the log-mismatch situation. The follower sets success to false in its response when the logs mismatch: https://github.com/baidu/braft/blob/154d805bd1eb88df97121fe1c73a5b469df88056/src/braft/node.cpp#L2430 But the leader doesn't act on that response: https://github.com/baidu/braft/blob/154d805bd1eb88df97121fe1c73a5b469df88056/src/braft/replicator.cpp#L266 So a possible fix is to handle this situation:

    // When the heartbeat response reports failure (log mismatch) and the node
    // is not in readonly mode, probe the follower with an empty AppendEntries
    // so replication can resume without waiting for a new write.
    if (!response->success() && !readonly) {
        r->_send_empty_entries(false);
        return;
    }

When the leader receives a new log request, sending entries is triggered, so the follower catches up on the log; that's why the cluster recovers after a write.

chenzhangyi commented 3 years ago

Deleting all data from a follower and re-joining it to the cluster is a very dangerous operation. The empty follower may vote for a wrong leader, since any of the other nodes contains more logs than it does. This leads to a split-brain issue where the cluster becomes unavailable, or even worse, committed data is purged silently.

The right order of operations to clean a follower is (see the sketch after the list):

  1. Remove it from the cluster with remove_peer.
  2. Clean all data and restart the process.
  3. Add it back with add_peer.
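
A sketch of that sequence using the control helpers in braft/cli.h (the group name, addresses, and timeout are made up for illustration; step 2 happens out of band):

    #include <braft/cli.h>
    #include <butil/status.h>

    int recycle_follower() {
        // Current membership of the group (example addresses).
        braft::Configuration conf;
        conf.parse_from("127.0.0.1:6107:6108,127.0.0.1:7107:7108,127.0.0.1:8107:8108");
        braft::PeerId follower("127.0.0.1:8107:8108");
        braft::cli::CliOptions opts;
        opts.timeout_ms = 10000;

        // 1. Remove the follower from the group.
        butil::Status st = braft::cli::remove_peer("default_group", conf, follower, opts);
        if (!st.ok()) { return -1; }

        // 2. (Out of band) stop the process, wipe its data directory, restart it.

        // 3. Add it back; the leader brings it up to date via snapshot/log replay.
        conf.remove_peer(follower);  // pass the membership as it is after step 1
        st = braft::cli::add_peer("default_group", conf, follower, opts);
        return st.ok() ? 0 : -1;
    }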

In this case, as far as I know:

  1. It's correct that the follower doesn't re-join the cluster before any write operation.
  2. It's buggy that a write operation makes the follower re-join the cluster.

@PFZheng @Edward-xk Consider adding a persistent uuid to every node and including it in peer_id, so that clusters are aware when some nodes have been reset unexpectedly.
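
Purely as an illustration of the idea (nothing like this exists in braft today; generate_uuid is a hypothetical helper):

    #include <fstream>
    #include <string>

    std::string generate_uuid();  // hypothetical helper, e.g. backed by libuuid

    // Sketch of the proposal: persist a per-node uuid beside the raft data.
    // A node restarting with its data intact keeps its identity; one that
    // restarts with an empty directory mints a new uuid, so peers that still
    // remember the old uuid can tell the node was reset unexpectedly.
    std::string load_or_create_node_uuid(const std::string& data_dir) {
        const std::string path = data_dir + "/node_uuid";
        std::ifstream in(path.c_str());
        std::string uuid;
        if (in >> uuid) {
            return uuid;  // data dir survived the restart: same identity
        }
        uuid = generate_uuid();
        std::ofstream out(path.c_str());
        out << uuid;
        return uuid;
    }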

dongdongwcpp commented 3 years ago

> or even worse - purge committed data silently.

@chenzhangyi how does this happen? For performance reasons, is it safe to use the MemoryLogStorage option for the cluster?

chenzhangyi commented 3 years ago

> or even worse - purge committed data silently. @chenzhangyi how does this happen? For performance reasons, is it safe to use the MemoryLogStorage option for the cluster?

A very simple case (each log entry is written as (term, index)):

L1: (1, 1) (1, 2)
L2: (1, 1) (1, 2)
L3: (1, 1)

Log (1, 2) has been committed. At this point, reset L2 to empty:

L1: (1, 1) (1, 2)
L2:
L3: (1, 1)

L3 issues a voting request for itself and L2 grants it (L2's log is empty, so any candidate looks at least as up-to-date). Now L3 writes its first log:

L1: (1, 1) (1, 2)
L2: (1, 1) (2, 2)
L3: (1, 1) (2, 2)

After L3 sends a log request to L1, the committed log entry (1, 2) is overwritten and removed permanently.