Closed: Zelldon closed this issue 1 year ago.
The threads are not running at the same time. Thread 40's logs run until 01:37 and Thread 39's logs start at 01:43. Was there a restart in between?
@deepthidevaki I think you are right, here is the restart.
I somehow overlooked the timestamp gaps 👍🏼
Keeping it as critical for now, since it is a regression and causes the node to transition to inactive when it receives a duplicate snapshot.
From the logs, it looks like the leader tried sending it the same snapshot twice in a row.
2023-09-20 03:37:52.583 CEST RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '0') and replace with received snapshot (index '225'). First entry in the log will be at index 226
2023-09-20 03:37:55.560 CEST Committed new snapshot 225-1-225-240
2023-09-20 03:37:55.573 CEST RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/225-1-225-240, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/225-1-225-240.checksum, checksum=1740324580, snapshotId=FileBasedSnapshotId{index=225, term=1, processedPosition=225, exporterPosition=240}, metadata=FileBasedSnapshotMetadata[version=1, processedPosition=225, exportedPosition=240, lastFollowupEventPosition=240]}
2023-09-20 03:37:55.576 CEST RaftServer{raft-partition-partition-2} - An uncaught exception occurred, transition to inactive role
Now, the exception that occurs is the following:
io.camunda.zeebe.snapshots.SnapshotException$SnapshotAlreadyExistsException: Expected to receive snapshot with id 225-1-225-240, but was already persisted. This shouldn't happen.
at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:381) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:51) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at io.atomix.raft.roles.PassiveRole.onInstall(PassiveRole.java:190) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at io.atomix.raft.roles.FollowerRole.onInstall(FollowerRole.java:83) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at io.atomix.raft.impl.RaftContext.lambda$registerHandlers$8(RaftContext.java:303) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at io.atomix.raft.impl.RaftContext.lambda$runOnContext$27(RaftContext.java:322) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) [zeebe-atomix-utils-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
As we can see, it occurs because it received an install request from the leader. So nothing too weird is happening here - or rather, nothing that would lead to data loss or corruption or something :smile:
The question now is: did it make sense for the leader to send the snapshot install request again? If yes, then we should log a warning when this happens, but not fail, and return an appropriate response to the leader (i.e. we have this snapshot). If not, then we should fix it on the leader side.
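For illustration, here is a minimal standalone sketch of the "log a warning but don't fail" option. This is a toy class with made-up names, not the real `PassiveRole`/`FileBasedSnapshotStore` code; it only shows the idea of acknowledging a duplicate install instead of throwing.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: a follower that has already persisted the requested snapshot logs a warning
// and acknowledges the duplicate install request instead of failing.
public final class DuplicateInstallSketch {
  enum Status { OK }

  private final Set<String> persistedSnapshotIds = new HashSet<>();

  Status onInstall(final String snapshotId) {
    if (!persistedSnapshotIds.add(snapshotId)) {
      // Duplicate install request: warn and acknowledge rather than throwing
      // SnapshotAlreadyExistsException and transitioning to inactive.
      System.err.println(
          "WARN: snapshot " + snapshotId + " already persisted, acknowledging duplicate install");
      return Status.OK;
    }
    // ... the real code would receive chunks and persist the snapshot here ...
    return Status.OK;
  }

  public static void main(final String[] args) {
    final DuplicateInstallSketch follower = new DuplicateInstallSketch();
    System.out.println(follower.onInstall("225-1-225-240")); // first install, persists
    System.out.println(follower.onInstall("225-1-225-240")); // duplicate, tolerated with a warning
  }
}
```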
My hypothesis is that this is a race condition. Since the snapshot listener is invoked asynchronously after persisting, the snapshot reference of the follower may not be updated yet when the leader pushes out an append request (after it thinks the install operation is finished), which leads to it trying to send the same snapshot again.
OK let's work through the hypothesis, since it's difficult to replicate this in a controlled way.
The follower's `onInstall` method will persist the snapshot, meaning the Raft thread is blocked, and until we reply, the leader will not send another request to us. Now, in `onInstall`, we call the `persist()` method of the pending snapshot and join on the resulting future. By the time this returns, the snapshot listeners have been invoked. This listener is asynchronous, so it won't execute immediately, but it will enqueue a job on the Raft thread. Then we reply, and then the leader can send us another request.
Since all operations on the Raft thread are sequenced, the snapshot listener will be executed before the next leader request is processed. So the snapshot reference will be updated before the next request is processed, and thus the leader shouldn't be sending us the same snapshot again.
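As a minimal standalone sketch of that sequencing argument, with a plain single-thread executor standing in for the Raft context (the class and printouts are illustrative only, not the real `SingleThreadContext`):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the expected ordering: the listener job is enqueued before we reply, and the
// leader's next request can only arrive after the reply, so the listener always runs first.
public final class SequencingSketch {
  public static void main(final String[] args) throws InterruptedException {
    final ExecutorService raftThread = Executors.newSingleThreadExecutor();
    final CompletableFuture<Void> replied = new CompletableFuture<>();

    raftThread.execute(
        () -> {
          System.out.println("1: install request handled, snapshot persisted (Raft thread blocked)");
          // persist() completing enqueues the snapshot listener before the reply goes out:
          raftThread.execute(() -> System.out.println("2: listener updates the snapshot reference"));
          replied.complete(null); // reply OK to the leader
        });

    replied.join(); // the leader only sends its next request after receiving the reply...
    raftThread.execute(() -> System.out.println("3: next leader request sees the updated reference"));

    raftThread.shutdown();
    raftThread.awaitTermination(1, TimeUnit.SECONDS);
  }
}
```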
Of course, this is all expected behavior, and maybe there's a bug somewhere in the code :smile: But according to that, my hypothesis is then wrong.
Hm, I think the analysis above is wrong. In `onInstall`, we check if our current snapshot index is `>=` the request's own index. So clearly, in this case, we decided this was false...
```java
// If the snapshot already exists locally, do not overwrite it with a replicated snapshot.
// Simply reply to the request successfully.
final var latestIndex = raft.getCurrentSnapshotIndex();
if (latestIndex >= request.index()) {
  abortPendingSnapshots();
  return CompletableFuture.completedFuture(
      logResponse(InstallResponse.builder().withStatus(RaftResponse.Status.OK).build()));
}
```
So what might have happened is that, for some reason, the leader decided to send another install request before the snapshot listener was handled. Could we be retrying install requests and end up sending one too many? The default install request timeout is 2.5 seconds. If there are any I/O stalls, for example, and we take longer, then we'll retry the request.
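As a rough standalone sketch of that retry behaviour (the 2.5 second value comes from the log below; everything else, including the method names, is illustrative and not the Atomix messaging code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// If the follower takes longer than the request timeout to persist and reply, the leader
// retries, and the same snapshot install request is sent twice.
public final class InstallRetrySketch {
  static CompletableFuture<String> sendInstallRequest(final long followerProcessingMillis) {
    // Simulates the follower taking this long to persist the snapshot before replying OK.
    return CompletableFuture.supplyAsync(
        () -> {
          try {
            Thread.sleep(followerProcessingMillis);
          } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
          }
          return "OK";
        });
  }

  public static void main(final String[] args) {
    // The follower needs ~3s (e.g. an I/O stall), but the leader only waits 2.5s before retrying.
    sendInstallRequest(3_000)
        .orTimeout(2_500, TimeUnit.MILLISECONDS)
        .exceptionally(
            timeout -> {
              System.out.println("install request timed out, retrying: " + timeout);
              // The retry reaches the follower while the first attempt is still in flight,
              // so the follower ends up seeing the same snapshot twice.
              return sendInstallRequest(10).join();
            })
        .thenAccept(response -> System.out.println("leader received: " + response))
        .join();
  }
}
```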
Indeed, on the leader, zeebe-1, we see the install request timed out.
2023-09-20 03:37:54.894 CEST RaftServer{raft-partition-partition-2} - InstallRequest{currentTerm=1, leader=1, index=225, term=1, version=1, chunkId=HeapByteBuffer{position=0, remaining=14, limit=14, capacity=14, mark=java.nio.HeapByteBuffer[pos=0 lim=14 cap=14], hash=-1428579103}, nextChunkId=null, data=HeapByteBuffer{position=0, remaining=159, limit=159, capacity=159, mark=java.nio.HeapByteBuffer[pos=0 lim=159 cap=159], hash=559971236}, initial=false, complete=true} to 0 failed: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request ProtocolRequest{id=6017, subject=raft-partition-partition-2-install, sender=zeebe-1.zeebe-broker-service.2042c610-7a76-4f45-bfb3-b5bd2e8a9083-zeebe.svc.cluster.local:26502, payload=byte[]{length=308, hash=-186090835}} to zeebe-0.zeebe-broker-service.2042c610-7a76-4f45-bfb3-b5bd2e8a9083-zeebe.svc.cluster.local:26502 timed out in PT2.5S
So the request was simply retried, and it was then received and queued before the snapshot listener was processed.
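Sketched as a standalone toy (again with a plain single-thread executor standing in for the Raft context, so the names and printouts are illustrative only), the interleaving would look like this:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Because the retried install request arrives while the first one is still persisting, it is
// queued on the Raft thread before the snapshot listener job and therefore still observes the
// stale snapshot reference.
public final class RetryInterleavingSketch {
  public static void main(final String[] args) throws InterruptedException {
    final ExecutorService raftThread = Executors.newSingleThreadExecutor();
    final CountDownLatch listenerRan = new CountDownLatch(1);

    raftThread.execute(
        () -> {
          System.out.println("1: first install request, persisting snapshot (slow, > 2.5s)");
          sleepQuietly(200); // the I/O stall; the leader's retry arrives during this window
          raftThread.execute(
              () -> {
                System.out.println("3: snapshot listener finally updates the snapshot reference");
                listenerRan.countDown();
              });
        });

    // Retry sent by the leader after its 2.5s timeout, i.e. before the follower ever replied:
    raftThread.execute(
        () ->
            System.out.println(
                "2: retried install request, reference still stale -> snapshot 'already persisted'"));

    listenerRan.await();
    raftThread.shutdown();
    raftThread.awaitTermination(1, TimeUnit.SECONDS);
  }

  private static void sleepQuietly(final long millis) {
    try {
      Thread.sleep(millis);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Here the retried request (2) is handled before the listener (3), so the index check shown earlier still sees the old snapshot and the duplicate install blows up with `SnapshotAlreadyExistsException`.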
Describe the bug
Error group https://console.cloud.google.com/errors/detail/CLXT_uWi3KaNSQ;service=zeebe;time=P7D?project=camunda-saas-int-chaos
We recently reworked how we store replicated snapshots and added some error cases that fail when we, for example, receive a duplicate snapshot.
This seems to have happened in one of our recent chaos tests. The impact is quite high: due to the exception thrown in this case, the node transitions to inactive and will no longer take part in the corresponding partition.
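For context, a toy model of that error case (not the actual `FileBasedSnapshotStore` implementation, which does considerably more than this) looks roughly like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: receiving a snapshot id that is already persisted fails hard instead of
// being treated as a no-op.
final class SnapshotStoreSketch {
  static final class SnapshotAlreadyExistsException extends RuntimeException {
    SnapshotAlreadyExistsException(final String message) {
      super(message);
    }
  }

  private final Map<String, byte[]> persistedSnapshots = new ConcurrentHashMap<>();

  void newReceivedSnapshot(final String snapshotId) {
    if (persistedSnapshots.containsKey(snapshotId)) {
      throw new SnapshotAlreadyExistsException(
          "Expected to receive snapshot with id " + snapshotId + ", but was already persisted.");
    }
    persistedSnapshots.put(snapshotId, new byte[0]); // stand-in for the real receive/persist flow
  }
}
```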
To Reproduce
Not clear yet.
Expected behavior
No duplicate snapshot replication
Log/Stacktrace
Full Stacktrace
```
io.camunda.zeebe.snapshots.SnapshotException$SnapshotAlreadyExistsException: Expected to receive snapshot with id 225-1-225-240, but was already persisted. This shouldn't happen.
	at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:381) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newReceivedSnapshot(FileBasedSnapshotStore.java:51) ~[zeebe-snapshots-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at io.atomix.raft.roles.PassiveRole.onInstall(PassiveRole.java:190) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at io.atomix.raft.roles.FollowerRole.onInstall(FollowerRole.java:83) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at io.atomix.raft.impl.RaftContext.lambda$registerHandlers$8(RaftContext.java:303) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at io.atomix.raft.impl.RaftContext.lambda$runOnContext$27(RaftContext.java:322) ~[zeebe-atomix-cluster-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) [zeebe-atomix-utils-8.3.0-SNAPSHOT.jar:8.3.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
	at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
```
Environment: