etcd-io / raft

Raft library for maintaining a replicated state machine
Apache License 2.0

Better handling of detected invariant violations #18

Open tbg opened 1 year ago

tbg commented 1 year ago

Reincarnation of https://github.com/etcd-io/etcd/issues/10166#issue-368252159.

While it's good that raft tries to detect invariant violations (for example, a follower regressing on an acked log position), more flexibility in how to react to these events would be desirable.

This isn't something we should do lightly, but the current failure modes are unnecessarily catastrophic. The point of raft is to provide high availability, so a faulty follower, even if it technically broke the rules, shouldn't crash the leader.

halegreen commented 1 year ago

Is it a valid solution to have an option to disable this panic and recover a node in-place when it reaches this state? `panic: tocommit(265625) is out of range [lastIndex(265624)]. Was the raft log corrupted, truncated, or lost?`
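
For context, the quoted panic fires when a node is told to advance its commit index past the last entry in its own log. Below is a simplified paraphrase of that check as a self-contained sketch (not the library's actual code; `raftLog` here is a minimal stand-in):

```go
package main

import "log"

// Minimal stand-in for the raft log; just enough state to show the check.
type raftLog struct {
	committed uint64
	entries   []uint64 // indexes of entries present in the local log
}

func (l *raftLog) lastIndex() uint64 {
	if len(l.entries) == 0 {
		return 0
	}
	return l.entries[len(l.entries)-1]
}

// commitTo paraphrases the check behind the quoted panic: the node is told to
// commit up to an index it does not have in its local log, which should be
// impossible unless the log was corrupted, truncated, or lost.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}

func main() {
	l := &raftLog{committed: 2, entries: []uint64{1, 2, 3}}
	l.commitTo(4) // panics, mirroring the message quoted above
}
```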

halegreen commented 1 year ago

Hi, I'd like to take on this challenge. Raft is new to me, so finding a solution may take me some time.

zouyonghao commented 9 months ago

Hi, as discussed in https://github.com/etcd-io/etcd/issues/13493, an rr recording that reproduces this issue is also available via Docker:

docker run -it --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined zouyonghao/etcd-10166 bash
# rr replay /root/10166/rr_rec_2_0/ -a

pav-kv commented 9 months ago

Currently, a node that notices a safety property violation crashes itself. In many cases, though, the violation is not necessarily this node's fault. If any Byzantine effect happens, any node could be the culprit, and we can't / shouldn't try to figure out which. If we assume the raft implementation has few or no bugs, then all safety property violations are Byzantine.

An example scenario:

  1. 3/5 followers are misconfigured, and writes on them are not durable (e.g. storage writes do not fsync before proceeding; we've seen this happen in prod).
  2. The leader commits an entry based on votes/accepts from these followers.
  3. The cluster restarts.
  4. Now, the cluster will evolve in different directions depending on who becomes the leader.
  5. In particular, if one of the faulty 3/5 nodes is elected, it will try to override committed entries on the former leader node. The former leader will notice this as a violation. But it would make little sense to crash this node: it behaved correctly, and the offenders were the followers (see the sketch after this list).
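
A toy illustration of this scenario (plain structs, not the raft library; all names are made up), showing why the node that detects the violation is not the one at fault:

```go
package main

import "fmt"

// node is a toy replica: just a log and a commit index.
type node struct {
	name      string
	log       []string
	committed int
}

func main() {
	// The former leader committed entry 3 because 3/5 followers acked it...
	former := node{name: "former leader", log: []string{"e1", "e2", "e3"}, committed: 3}
	// ...but those followers never fsynced, so after the restart their logs regressed.
	faulty := node{name: "faulty follower", log: []string{"e1", "e2"}, committed: 2}

	// If the faulty follower now wins the election (with votes from the other
	// faulty nodes), it asks peers to truncate their logs to match its own,
	// i.e. to drop the already-committed entry e3.
	if len(faulty.log) < former.committed {
		fmt.Printf("%s detects regression of a committed entry caused by %s;\n",
			former.name, faulty.name)
		fmt.Println("today the detecting node crashes, even though the fault is elsewhere")
	}
}
```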

I think the correct solution to this issue would be to "brick" the raft group entirely when safety properties are violated. The specific mechanisms could vary, but a simple solution would be to have a special "brick" state, in addition to the existing valid ones like Leader/Follower/Candidate. In this state, the RawNode ceases to make any progress, and makes a best effort to propagate the information about the corruption to other nodes in the group (e.g. replies with a killpill to any message), so that they enter this state too.
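
A rough sketch of what this could look like (hypothetical; `StateBricked` and `MsgKillPill` are invented names, and nothing like them exists in the library today):

```go
package main

import "fmt"

// Hypothetical sketch of the proposed "brick" state.
type StateType int

const (
	StateFollower StateType = iota
	StateCandidate
	StateLeader
	StateBricked // proposed: entered when a safety violation is detected
)

type MessageType int

const (
	MsgApp      MessageType = iota
	MsgKillPill // proposed: tells the receiver to brick itself too
)

type Message struct {
	Type     MessageType
	From, To uint64
}

type node struct {
	id     uint64
	state  StateType
	outbox []Message
}

// brick stops normal operation instead of panicking the process.
func (n *node) brick() { n.state = StateBricked }

// step sketches the proposed behaviour: a bricked node makes no progress and
// answers every message with a killpill, so the corruption propagates and the
// whole group eventually stops.
func (n *node) step(m Message) {
	switch {
	case m.Type == MsgKillPill:
		n.brick()
	case n.state == StateBricked:
		n.outbox = append(n.outbox, Message{Type: MsgKillPill, From: n.id, To: m.From})
	default:
		// normal Leader/Follower/Candidate handling would go here
	}
}

func main() {
	n := &node{id: 1}
	n.brick() // a safety violation was detected locally
	n.step(Message{Type: MsgApp, From: 2, To: 1})
	fmt.Println(n.state == StateBricked, len(n.outbox)) // true 1
}
```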

When a node enters the "brick" state, the application layer will use an appropriate application-specific mechanism to bring this information to the operator: crashing the node, raising alerts / monitoring, sending a bug report, etc.
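
Continuing the hypothetical sketch above, the application side could be as simple as checking for the bricked state after each processing step and applying its own policy:

```go
// Hypothetical application-side hook, reusing the node/StateBricked types from
// the sketch above; the policy (crash, alert, file a report) stays with the app.
func handleBricked(n *node, alert func(msg string)) {
	if n.state == StateBricked {
		alert("raft group detected a safety violation; manual intervention required")
		// alternatively: os.Exit(1), bump a metric, dump state for a bug report, ...
	}
}
```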

Some tooling / introspection would be needed in order to recover from this state, should that ever be necessary. Doing so would generally be unsafe. For example, the operator would choose which of the conflicting replicas is declared the "truth", and the raft group would be bootstrapped from this replica.