atomix / copycat

A novel implementation of the Raft consensus algorithm
http://atomix.io/copycat
Apache License 2.0
581 stars 155 forks source link

State Divergence with long Snapshot times #242

Open vyzo opened 8 years ago

vyzo commented 8 years ago

We have run into an issue that eventually leads to state divergence in a cluster, as snapshot times get longer and longer.

We are using a snapshottable state machine implementation.

The state machine maintains a growing map which is part of the snapshot, and we have noticed that as the snapshot time increases, the cluster eventually hits a sequence of events that ends with the divergence of a cluster member. We have observed that all long runs in our tests eventually lead to this state.

The sequence of events extrapolated from our logs is as following (in a cluster with 3 copycats):

Our distilled logs from from runs causing divergence have the following form around the time of divergence:

# server-1
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.149:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.FollowerState - /10.10.1.149:10000 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.71:10000, clientAddress=/10.10.1.71:10000], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.72:10000, clientAddress=/10.10.1.72:10000]]
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Transitioning to CANDIDATE
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.CandidateState - /10.10.1.149:10000 - Starting election
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.CandidateState - /10.10.1.149:10000 - Requesting votes from [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.71:10000, clientAddress=/10.10.1.71:10000], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.72:10000, clientAddress=/10.10.1.72:10000]]
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Transitioning to LEADER
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.149:10000 - Taking snapshot 2677034
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - Failed to configure /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.72:10000 failed. Reason: request timed out
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [catalyst-event-loop-2] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [catalyst-event-loop-3] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.72:10000
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.71:10000
[14/07/2016:15:55:57 +0000] [catalyst-event-loop-1] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [catalyst-event-loop-2] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmVh1nQnR9DaufoDF5dVD7fHCNPUSAGievHxLA5uHiv4Ts)

# server-2
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.71:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.71:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.71:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.71:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.71:10000 - Taking snapshot 2677034
[14/07/2016:15:56:02 +0000] [copycat-server-/10.10.1.71:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmVh1nQnR9DaufoDF5dVD7fHCNPUSAGievHxLA5uHiv4Ts)

# server-3 [divergent]
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.72:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:47 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.72:10000 - Taking snapshot 2677034
[14/07/2016:15:55:52 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.72:10000 - Transitioning to FOLLOWER
[14/07/2016:15:55:52 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.72:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.storage.compaction.Compactor - Compacting log with compaction: MINOR
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.72:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmQc3YAUgYHk5LacfTypUWvFhMEqC8ZpBjZ2AsydLFfvi7)

The only mitigation we have against this problem is to increase the election timeout, so that it is longer than the snapshot time, and try to reduce that time with custom serialization. Indeed, we have observed that when snapshot times don't exceed the timeout, the state remains identical in all members of the cluster. This is not a long term solution however, because snapshot time will eventually exceed the election timeout again as our system grows.

madjam commented 8 years ago

While I'm not arguing against the the fact that system invariants should be maintained irrespective of how long snapshots take, I'm wondering how sustainable it is to have a system where snapshot times can grow indefinitely. Even if the underlying issue is fixed you might still have a system that is unstable. As you are probably aware larger election timeouts have other negative side effects such as the system taking too long to recover from legitimate leader failures.

AFAIK, snapshotting is more appropriate when when the underlying state does not grow indefinitely. For example: a counter.