camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Duplicate snapshot replication causes Raft follower to fail #14367

Closed Zelldon closed 1 year ago

Zelldon commented 1 year ago

Describe the bug

Error group https://console.cloud.google.com/errors/detail/CLXT_uWi3KaNSQ;service=zeebe;time=P7D?project=camunda-saas-int-chaos

We recently reworked how we store replicated snapshots and added some error cases that should fail when, for example, we receive a duplicate snapshot.

This seems to have happened in one of our recent chaos tests. The impact is quite high: because of the exception thrown in this case, the node transitions to inactive and will no longer take part in the corresponding partition.

To Reproduce

Not clear yet.

Expected behavior

No duplicate snapshot replication

Log/Stacktrace

Full Stacktrace

```
io.camunda.zeebe.snapshots.SnapshotException$SnapshotAlreadyExistsException: Expected to receive snapshot with id 225-1-225-240, but was already persisted. This shouldn't happen.
    at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:381) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:51) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.roles.PassiveRole.onInstall(PassiveRole.java:190) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.roles.FollowerRole.onInstall(FollowerRole.java:83) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.impl.RaftContext.lambda$registerHandlers$8(RaftContext.java:303) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.impl.RaftContext.lambda$runOnContext$27(RaftContext.java:322) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) [zeebe-atomix-utils-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
    at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```

Environment:

deepthidevaki commented 1 year ago

The threads are not running at the same time. Thread 40 logs are until 01:37 and Thread 39 logs are from 01:43. Was there a restart in between?

Zelldon commented 1 year ago

@deepthidevaki I think you are right. Here is the restart.

Somehow I overlooked the timestamp gaps 👍🏼

Zelldon commented 1 year ago

Keeping it as critical for now, since it is a regression and causes the node to transition to inactive when it receives a duplicate snapshot.

npepinpe commented 1 year ago

From the logs, it looks like the leader tried to send the follower the same snapshot twice in a row.

2023-09-20 03:37:52.583 CEST RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '0') and replace with received snapshot (index '225'). First entry in the log will be at index 226

2023-09-20 03:37:55.560 CEST Committed new snapshot 225-1-225-240

2023-09-20 03:37:55.573 CEST RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/225-1-225-240, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/225-1-225-240.checksum, checksum=1740324580, snapshotId=FileBasedSnapshotId{index=225, term=1, processedPosition=225, exporterPosition=240}, metadata=FileBasedSnapshotMetadata[version=1, processedPosition=225, exportedPosition=240, lastFollowupEventPosition=240]}

2023-09-20 03:37:55.576 CEST RaftServer{raft-partition-partition-2} - An uncaught exception occurred, transition to inactive role

Now, the exception that occurs is the following:

io.camunda.zeebe.snapshots.SnapshotException$SnapshotAlreadyExistsException: Expected to receive snapshot with id 225-1-225-240, but was already persisted. This shouldn't happen.
    at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:381) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:51) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.roles.PassiveRole.onInstall(PassiveRole.java:190) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.roles.FollowerRole.onInstall(FollowerRole.java:83) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.impl.RaftContext.lambda$registerHandlers$8(RaftContext.java:303) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.raft.impl.RaftContext.lambda$runOnContext$27(RaftContext.java:322) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) [zeebe-atomix-utils-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
    at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]

As we can see, it occurs because it received an install request from the leader. So nothing too weird is happening here - or rather, nothing that would lead to data loss or corruption or something :smile:

The question now is: did it make sense for the leader to send the snapshot install request again? If yes, then we should log a warning when this happens, but not fail, and return an appropriate response to the leader (i.e. we have this snapshot). If not, then we should fix it on the leader side.
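For illustration, the tolerant path could look roughly like the sketch below. It reuses the response-building calls quoted later in this thread (`abortPendingSnapshots()`, `logResponse(...)`, the `InstallResponse` builder) and the `newReceivedSnapshot` / `SnapshotAlreadyExistsException` names from the stack trace; the `snapshotStore`/`snapshotId` locals, the `ReceivedSnapshot` declaration, and the exact integration point in `PassiveRole.onInstall` are assumptions, not the actual fix:

```java
// Sketch only, not the actual change. The catch block acknowledges a duplicate
// install request instead of letting the exception escalate and force the role
// to inactive. snapshotStore, snapshotId and the logger are assumed to be in
// scope in PassiveRole.onInstall.
final ReceivedSnapshot pendingSnapshot;
try {
  pendingSnapshot = snapshotStore.newReceivedSnapshot(snapshotId);
} catch (final SnapshotAlreadyExistsException alreadyPersisted) {
  // The snapshot is already persisted locally, most likely because the leader
  // retried the install request. Acknowledge it instead of failing.
  log.warn("Ignoring duplicate install request for snapshot {}", snapshotId, alreadyPersisted);
  abortPendingSnapshots();
  return CompletableFuture.completedFuture(
      logResponse(InstallResponse.builder().withStatus(RaftResponse.Status.OK).build()));
}
```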

My hypothesis is that this is a race condition. Since the snapshot listener is invoked asynchronously after persisting, the follower's snapshot reference may not be updated yet when the leader pushes out an append request (after it thinks the install operation is finished), which leads to it trying to send the same snapshot again.

npepinpe commented 1 year ago

OK let's work through the hypothesis, since it's difficult to replicate this in a controlled way.

The follower's onInstall method will persist the snapshot, meaning the Raft thread is blocked, and until we reply, the leader will not send another request to us. Now, in onInstall, we call the persist() method of the pending snapshot and join on the resulting future. By the time this returns, the snapshot listeners have been invoked. The listener itself is asynchronous, so it won't execute immediately, but it will enqueue a job on the Raft thread. Then we reply, and only then can the leader send us another request.

Since all operations on the Raft thread are sequenced, the snapshot listener will be executed before the next leader request is processed. So the snapshot reference will be updated before the next request is processed, and thus the leader shouldn't be sending us the same snapshot again.
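As an illustration of that argument, here is a minimal, standalone sketch with a plain single-threaded executor standing in for the Raft thread context (plain Java, not Zeebe code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SequencingSketch {
  public static void main(final String[] args) throws Exception {
    // Stand-in for the single Raft thread context: jobs run sequentially, FIFO.
    final ExecutorService raftThread = Executors.newSingleThreadExecutor();

    // "onInstall": persist the snapshot, which enqueues the snapshot listener
    // back onto the same thread, and only then reply to the leader.
    raftThread
        .submit(
            () -> {
              System.out.println("onInstall: snapshot persisted");
              raftThread.submit(
                  () -> System.out.println("listener: snapshot reference updated"));
              System.out.println("onInstall: reply sent to leader");
            })
        .get();

    // The leader's next request can only arrive after the reply, so it is
    // enqueued behind the listener job and processed after it.
    raftThread.submit(() -> System.out.println("leader's next request processed")).get();

    raftThread.shutdown();
  }
}
```

Under that model, the listener job always sits ahead of the leader's next request in the queue.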

Of course, this is all the expected behavior, and maybe there's a bug somewhere in the code :smile: But if that reasoning holds, my hypothesis is wrong.

npepinpe commented 1 year ago

Hm, I think the analysis above is wrong. In onInstall, we check whether our current snapshot index is >= the request's index. So clearly, in this case, we decided this was false...

    // If the snapshot already exists locally, do not overwrite it with a replicated snapshot.
    // Simply reply to the request successfully.
    final var latestIndex = raft.getCurrentSnapshotIndex();
    if (latestIndex >= request.index()) {
      abortPendingSnapshots();

      return CompletableFuture.completedFuture(
          logResponse(InstallResponse.builder().withStatus(RaftResponse.Status.OK).build()));
    }

So what might have happened is that, for some reason, the leader decided to send another install request before the snapshot listener was handled. Are we possibly retrying install requests and ended up sending one too many? The default install request timeout is 2.5 seconds; if there are any I/O stalls, for example, and we take longer, we'll retry the request.

Indeed, on the leader, zeebe-1, we see the install request timed out.

2023-09-20 03:37:54.894 CEST RaftServer{raft-partition-partition-2} - InstallRequest{currentTerm=1, leader=1, index=225, term=1, version=1, chunkId=HeapByteBuffer{position=0, remaining=14, limit=14, capacity=14, mark=java.nio.HeapByteBuffer[pos=0 lim=14 cap=14], hash=-1428579103}, nextChunkId=null, data=HeapByteBuffer{position=0, remaining=159, limit=159, capacity=159, mark=java.nio.HeapByteBuffer[pos=0 lim=159 cap=159], hash=559971236}, initial=false, complete=true} to 0 failed: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request ProtocolRequest{id=6017, subject=raft-partition-partition-2-install, sender=zeebe-1.zeebe-broker-service.2042c610-7a76-4f45-bfb3-b5bd2e8a9083-zeebe.svc.cluster.local:26502, payload=byte[]{length=308, hash=-186090835}} to zeebe-0.zeebe-broker-service.2042c610-7a76-4f45-bfb3-b5bd2e8a9083-zeebe.svc.cluster.local:26502 timed out in PT2.5S

So the request was simply retried, and it was then received and queued before the snapshot listener was processed.
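Piecing the timestamps together: the original install request must have gone out around 03:37:52.4 (it timed out 2.5 s later, at 03:37:54.894, and was retried), the follower only finished persisting the snapshot at 03:37:55.560, and the retried request then hit the new SnapshotAlreadyExistsException path at 03:37:55.576.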

mic drop