State Divergence with long Snapshot times

We have run into an issue that eventually leads to state divergence in a cluster, as snapshot times get longer and longer.

We are using a snapshottable state machine implementation.

The state machine maintains a growing map which is part of the snapshot, and we have noticed that as the snapshot time increases, the cluster eventually hits a sequence of events that ends with the divergence of a cluster member. We have observed that all long runs in our tests eventually lead to this state.

The sequence of events extrapolated from our logs is as following (in a cluster with 3 copycats):

the leader starts a snapshot which takes longer than the election timeout
a follower opens an election and becomes the new leader and then both it and the other follower take a new snapshot
the former leader finishes its snapshot and becomes a follower, with an eventual state divergence at this point.

Our distilled logs from from runs causing divergence have the following form around the time of divergence:

# server-1
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.149:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.FollowerState - /10.10.1.149:10000 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.71:10000, clientAddress=/10.10.1.71:10000], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.72:10000, clientAddress=/10.10.1.72:10000]]
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Transitioning to CANDIDATE
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.CandidateState - /10.10.1.149:10000 - Starting election
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.CandidateState - /10.10.1.149:10000 - Requesting votes from [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.71:10000, clientAddress=/10.10.1.71:10000], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.72:10000, clientAddress=/10.10.1.72:10000]]
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Transitioning to LEADER
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.149:10000 - Taking snapshot 2677034
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - Failed to configure /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.72:10000 failed. Reason: request timed out
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [catalyst-event-loop-2] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [catalyst-event-loop-3] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.72:10000
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.71:10000
[14/07/2016:15:55:57 +0000] [catalyst-event-loop-1] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [catalyst-event-loop-2] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmVh1nQnR9DaufoDF5dVD7fHCNPUSAGievHxLA5uHiv4Ts)

# server-2
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.71:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.71:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.71:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.71:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.71:10000 - Taking snapshot 2677034
[14/07/2016:15:56:02 +0000] [copycat-server-/10.10.1.71:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmVh1nQnR9DaufoDF5dVD7fHCNPUSAGievHxLA5uHiv4Ts)

# server-3 [divergent]
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.72:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:47 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.72:10000 - Taking snapshot 2677034
[14/07/2016:15:55:52 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.72:10000 - Transitioning to FOLLOWER
[14/07/2016:15:55:52 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.72:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.storage.compaction.Compactor - Compacting log with compaction: MINOR
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.72:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmQc3YAUgYHk5LacfTypUWvFhMEqC8ZpBjZ2AsydLFfvi7)

The only mitigation we have against this problem is to increase the election timeout, so that it is longer than the snapshot time, and try to reduce that time with custom serialization. Indeed, we have observed that when snapshot times don't exceed the timeout, the state remains identical in all members of the cluster. This is not a long term solution however, because snapshot time will eventually exceed the election timeout again as our system grows.

atomix / copycat

State Divergence with long Snapshot times #242