copenhagenatomics / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

stability: improve recovery after a crash of the majority of servers #1

Closed freddyrios closed 2 years ago

freddyrios commented 2 years ago

This relates to the scenario mentioned at https://github.com/dotnet/dotNext/issues/89#issuecomment-1008726787.

Scenario: running the modified example on 3 nodes, kill 2 nodes (Ctrl+C) and then restart one of them.
Expected: the cluster consistently resumes normal operation when at least 2 out of 3 nodes are running again.
Actual: the cluster often does not recover consistently.

Notes:

sakno commented 2 years ago

but it was something like an index being out of range or not found in the log.

That may be an issue with the filesystem setup. For instance, when an SSD/NVMe is in place, Linux allows you to configure buffer caches for disk I/O. It means that even if you write some bytes to FileStream/SafeFileHandle, they may not yet be written to the disk. The persistent WAL has a WriteThrough flag to skip any intermediate OS-level buffers when writing to the disk. However, it's always up to the OS I/O layer.
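
A minimal sketch of what write-through means at the .NET BCL level (this is not dotNext's internal code; the file name is hypothetical). The WAL's WriteThrough flag maps to this kind of OS-level behavior:

    using System;
    using System.IO;

    // FileOptions.WriteThrough asks the OS to bypass intermediate write caches so a
    // completed write is expected to reach the device, not only an in-memory buffer.
    byte[] payload = { 1, 2, 3 };

    var options = new FileStreamOptions
    {
        Mode = FileMode.Append,
        Access = FileAccess.Write,
        Options = FileOptions.WriteThrough
    };

    using var stream = new FileStream("wal.partition", options); // hypothetical file name
    stream.Write(payload);
    stream.Flush(flushToDisk: true); // belt and braces; the OS and drive firmware still have the final word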

sakno commented 2 years ago

@freddyrios , look at https://github.com/dotnet/dotNext/discussions/94#discussioncomment-1946231. It looks like the more accurate calculation of the heartbeat timeout on the leader node (it now subtracts the replication duration) leads to a more stable cluster. Anyway, this is a step forward to our goal.
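
A hypothetical sketch of the idea behind that change (not the actual RaftCluster code): the time already spent on replication is subtracted from the heartbeat period, so the leader keeps an approximately fixed heartbeat cadence even when AppendEntries to a slow follower takes a noticeable amount of time.

    using System;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    static async Task HeartbeatLoopAsync(TimeSpan heartbeatPeriod,
                                         Func<CancellationToken, Task> replicateAsync,
                                         CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var stopwatch = Stopwatch.StartNew();
            await replicateAsync(token);            // send AppendEntries to the followers

            var remaining = heartbeatPeriod - stopwatch.Elapsed;
            if (remaining > TimeSpan.Zero)
                await Task.Delay(remaining, token); // sleep only for what is left of the period
        }
    }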

sakno commented 2 years ago

Here is a new branch where I'm working on new protocol on top of TCP: https://github.com/dotnet/dotNext/tree/feature/new-raft-tcp

freddyrios commented 2 years ago

The deployment had default Raspbian, so it makes sense there could be something like that at play for that log issue.


freddyrios commented 2 years ago

I am on a phone/train, so had trouble fully following.

However, what I understood is that the response time when sending heartbeats/entries to one follower prevented the leader from sending heartbeats/entries at the expected rate to a different follower.

I could see addressing that helping the stability for some of the scenarios we have hit.

One thing I noticed when playing with the Raft simulator is that heartbeats do not always align across different nodes. So depending on what was going on, it looked a lot like the leader could keep the heartbeat frequency independently for each follower. I suspect that decoupling is an alternate way to deal with it. However, it does make sense that it would keep trying at the right frequency, even if one of the appends happened to be a bit heavy due to the entries being sent.

For healthy followers the time to append entries is supposed to be an order of magnitude less than the election timeout. If appends are slow enough to matter for healthy nodes, then I guess it could make some of the Raft paper's assumptions problematic.

Another independent but related stability area I have thought about: what happens when appending entries to a follower takes long (due to the size and number of entries, for example)? Can this result in the follower becoming a candidate even though the leader is actively communicating with it? (Not sure what the right behavior for such a case is, as the problem appending those entries could be on either side.)


sakno commented 2 years ago

what happens when appending entries to a follower takes long (due to the size and number of entries, for example)? Can this result in the follower becoming a candidate even though the leader is actively communicating with it?

Yes, it can. We can mitigate this by suspending the election timer on the follower node during execution of AppendEntriesAsync.
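
A hypothetical sketch of that mitigation (not the actual RaftCluster code): the follower suspends its election countdown while it is busy processing an AppendEntries request from the current leader, then restarts it afterwards.

    using System;
    using System.Threading;

    sealed class ElectionTimer
    {
        private readonly TimeSpan timeout;
        private CancellationTokenSource? countdown;

        public ElectionTimer(TimeSpan timeout) => this.timeout = timeout;

        // Called when a heartbeat/AppendEntries is received: arm a fresh countdown
        // whose expiration would normally turn the follower into a candidate.
        public void Restart()
        {
            countdown?.Cancel();
            countdown = new CancellationTokenSource(timeout);
        }

        // Called before applying a (potentially large) batch of entries so that a
        // slow local apply does not look like a dead leader.
        public void Suspend() => countdown?.Cancel();
    }

    // Usage inside the follower's AppendEntries handler (pseudocode):
    //   electionTimer.Suspend();
    //   await ApplyEntriesAsync(entries, token);   // may be slow for large entries
    //   electionTimer.Restart();                   // leader is alive, resume the countdown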

sakno commented 2 years ago

@freddyrios , done: https://github.com/dotnet/dotNext/commit/a6ab90237433ae4504947dfe4193f742cfcc8f1f

freddyrios commented 2 years ago

@sakno I have created a new branch with the same example but on top of the latest develop changes https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-example-rebased

There are 2 more branches with some exploration I did on stability (already rebased on the latest develop changes), especially given the fix you shared related to heartbeats:

https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-slowspeed
https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-slowspeed-frequentsnapshots

It seems:

What I find particularly odd is that it can push entries at such a high rate in the first configuration, yet run into timeout trouble recovering a node in the second configuration. It makes me wonder whether it is more than a timeout, i.e. something getting stuck (but I have no evidence of this).

I am not sure what could explain the 2-out-of-3 scenario being much more reliable in configuration 3 vs. 1. Even though configuration 1 is more of a stress test, I can't see why that would be special when taking out 2 nodes vs. 1 node compared to configuration 3.

sakno commented 2 years ago

@freddyrios , I'm still working on the new TCP transport. I expect to finish it over the holidays. After that, I'll start testing and stabilizing the release using the use cases you mentioned previously. All patches and the new transport will be released as version 4.2.0.

sakno commented 2 years ago

Alright, the TCP transport is re-implemented now. Of course, I'll do the necessary tests using your examples. Now I think it's reasonable to implement a recovery procedure for the persistent Write-Ahead Log. Here is my plan:

  1. Implement a checksum for the node's state file: node.state
  2. Basic functionality for the recovery procedure:
     2.1. If the node's state file is corrupted, the entire WAL can be considered invalid and we can clean it up.
     2.2. Add a ClearAsync method to PersistentState.
     2.3. Add the ability to override the InitializeAsync method of the persistent WAL. For each log entry, we can embed a checksum or even duplicate the data. Therefore, at startup we will be able to validate all committed log entries. To be more precise, we will be able to do that with existing functionality through ReplayAsync, which allows warming up the underlying state machine. During initialization, we can easily verify the checksum and throw an exception. In the worst case, we will still be able to run the node even with an empty WAL.

P.S.: Regarding our protection from bit flips: I think SHA is too heavyweight for our purposes. It's a cryptographically strong hash, not typically used for data consistency checks. FNV-1a or CRC is much better suited and more performant.
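
A minimal sketch of the CRC idea, assuming the checksum is simply appended as the last 4 bytes of node.state (this layout is an assumption, not the actual dotNext format). It uses the Crc32 type from the System.IO.Hashing NuGet package:

    using System;
    using System.IO;
    using System.IO.Hashing; // NuGet package: System.IO.Hashing

    // On startup, recompute the CRC-32 of the payload and compare it with the stored value.
    static bool VerifyStateFile(string path)
    {
        var bytes = File.ReadAllBytes(path);
        if (bytes.Length < sizeof(uint))
            return false;

        var payload = bytes.AsSpan(0, bytes.Length - sizeof(uint));
        var storedCrc = BitConverter.ToUInt32(bytes, bytes.Length - sizeof(uint));
        var actualCrc = BitConverter.ToUInt32(Crc32.Hash(payload));
        return storedCrc == actualCrc; // mismatch => treat the entire WAL as invalid
    }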

sakno commented 2 years ago

Verification of node.state could protect against accidental power-off and incorrect indexes within the log.

sakno commented 2 years ago

My investigation shows that storing a checksum for node.state makes little sense. In general, we can't handle an accidental power-off properly:

  1. Disk writes must be transactional and the underlying file system must support copy-on-write semantics
  2. Or the WAL must support reliable writes via insertion of checkpoints and a checksum for each checkpoint

Anyway, a fault-tolerant WAL is another field of theoretical and practical research. I didn't have a goal of writing a WAL for such cases. The current implementation just works, at least under normal circumstances.

However, we can have a workaround to simulate a simple checkpoint (see the sketch below). Before every append operation, we can save a WriteInProgress indicator to an external file. The indicator can have a size of 1 byte. When the operation is finished, we just reverse this indicator. In case of accidental shutdown, the file will keep the WriteInProgress indicator. That's enough to decide that the entire WAL is probably broken and to skip all the data in it to have a fresh setup. All these things are out of scope at the moment, so I'll focus on the stability of the cluster itself.
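
An illustrative sketch of that checkpoint workaround; the helper name and file layout are hypothetical and not part of the dotNext API:

    using System;
    using System.IO;
    using System.Threading.Tasks;

    static class WalCheckpoint
    {
        private const string MarkerFile = "write-in-progress"; // the 1-byte indicator

        // On startup: if the marker survived, the last append did not complete, so
        // the WAL should be treated as suspect and cleared for a fresh setup.
        public static bool HasIncompleteWrite(string walDirectory)
            => File.Exists(Path.Combine(walDirectory, MarkerFile));

        public static async Task AppendAsync(string walDirectory, Func<Task> appendAsync)
        {
            var marker = Path.Combine(walDirectory, MarkerFile);
            await File.WriteAllBytesAsync(marker, new byte[] { 1 }); // set WriteInProgress
            await appendAsync();                                     // the actual append operation
            File.Delete(marker);                                     // reverse the indicator
        }
    }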

sakno commented 2 years ago

When recovering it, its log was broken, which looked clearly different from the failures above.

Found the root cause of this issue, which occurs even in case of a graceful shutdown of the node. The problem is in snapshot installation: when the snapshot is installed at an index greater than the first index in the partition, the partition becomes completely invalid because the binary offset within the partition file is calculated incorrectly.
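
A purely illustrative sketch, not the actual partition file format: if byte offsets are computed relative to the partition's first index, installing a snapshot at an index greater than that first index without rebasing it (or rewriting the file) makes every subsequent offset land on the wrong bytes.

    static long GetSlotOffset(long absoluteIndex, long partitionFirstIndex, long slotSize)
        => (absoluteIndex - partitionFirstIndex) * slotSize; // stale firstIndex => wrong offset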

sakno commented 2 years ago

https://github.com/dotnet/dotNext/commit/ebd3b96c010050fea5bd60baf312bb83f28c2142:

freddyrios commented 2 years ago

  • Fixed aligned pointer access on ARM devices (on ARM devices, dereferencing an unaligned pointer can cause a SEGFAULT)

Nice catch. One I hit on our side in the past with ARM was trying to use https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.cast?view=net-6.0.

sakno commented 2 years ago

Fortunately, there is only a small amount of such code. It is used on the hot path of program execution, where serialization/deserialization of log entry metadata is required.
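
A small sketch (not the actual dotNext code) of reading a 64-bit metadata field from a buffer that may not be 8-byte aligned, as an alternative to dereferencing a raw pointer or reinterpreting a span. On ARM, dereferencing an unaligned long* can fault, while these BCL helpers perform alignment-safe reads:

    using System;
    using System.Buffers.Binary;
    using System.Runtime.CompilerServices;

    static long ReadIndex(ReadOnlySpan<byte> metadata)
        => BinaryPrimitives.ReadInt64LittleEndian(metadata);  // endian-explicit, alignment-safe

    static long ReadIndexUnaligned(ref byte start)
        => Unsafe.ReadUnaligned<long>(ref start);             // raw unaligned read, native endianness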

freddyrios commented 2 years ago

Test branches have been rebased onto the latest changes (if updating local copies, you might want to delete them locally and pull them fresh):

The first example uses only default settings, except for BufferSize, which is now 4096*2, slightly larger than one of the big entries (8000). Some short testing showed better stability, including in the scenario where 2 nodes crash.

The second example adds a 1-second delay after writing 16 entries. Note this is equivalent to the old slow + frequent snapshots configuration, as the 50 records per partition was not changed (based on input that this setting is bad on its own). Short testing ran into 2 of the below crashes during re-elections. Restarting the crashed node recovered successfully.

Process terminated. Assertion failed.
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.LogEntry..ctor(MemoryOwner`1& cachedContent, LogEntryMetadata& metadata, Int64 index) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.LogEntry.cs:line 44
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Partition.Read(Int32 sessionId, Int64 absoluteIndex, LogEntryReadOptimizationHint hint) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 218
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 455
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 479
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.<>c__DisplayClass20_0.<<CommitAsync>g__CommitAndCompactSequentiallyAsync|0>d.MoveNext() in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 311
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.<>c__DisplayClass20_0.<CommitAsync>g__CommitAndCompactSequentiallyAsync|0()
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.CommitAsync(Nullable`1 endIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 293
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.CommitAsync(Int64 endIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 718
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 465
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster.DotNext.Net.Cluster.Consensus.Raft.TransportServices.ILocalMember.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.DefaultImpl.cs:line 272
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 246
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.ProcessRequestAsync(MessageType type, ProtocolStream protocol, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 177
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 134
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)       
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetExistingTaskResult(Task`1 task, TResult result)
   at System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1.SetResult(TResult result)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.ProtocolStream.ReadMessageTypeAsync(CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ConnectionOriented\ProtocolStream.ReadOperations.cs:line 93
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.ThreadPool.<>c.<.cctor>b__86_0(Object state)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
   at System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32 bytesTransferred, SocketFlags flags)
   at System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32 bytesTransferred, Byte[] socketAddress, Int32 socketAddressSize, SocketFlags receivedFlags, SocketError socketError)
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
Aborted

sakno commented 2 years ago

I caught another exception:

System.IO.InternalBufferOverflowException: System error.
         at DotNext.Buffers.SpanReader`1.Read(Int32 count) in /home/roman/projects/dotnext/src/DotNext/Buffers/SpanReader.cs:line 201
         at DotNext.IO.FileReader.Read[T]() in /home/roman/projects/dotnext/src/DotNext.IO/IO/FileReader.Binary.cs:line 22
         at DotNext.IO.FileReader.ReadAsync[T](CancellationToken token) in /home/roman/projects/dotnext/src/DotNext.IO/IO/FileReader.Binary.cs:line 42
         at RaftNode.SimplePersistentState.UpdateValue(LogEntry entry) in /home/roman/projects/dotnext/src/examples/RaftNode/SimplePersistentState.cs:line 75
         at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/MemoryBasedStateMachine.cs:line 456
         at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.<>c__DisplayClass20_0.<<CommitAsync>g__CommitAndCompactSequentiallyAsync|0>d.MoveNext() in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/MemoryBasedStateMachine.cs:line 311
      --- End of stack trace from previous location ---
         at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 465
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 246
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 251
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134

I'll fix that shortly.

sakno commented 2 years ago

The previous issue has been fixed.

sakno commented 2 years ago

Found a new one:

DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer[74028]
      Failed to process request from 127.0.0.1:42824
      System.InvalidOperationException: Unknown Raft message type 251
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134

I'm working on it.

sakno commented 2 years ago

The root cause was an incorrect skip of the last log entry when it is not consumed by the follower. Fixed and pushed to the develop branch.

sakno commented 2 years ago

Found the root cause of the failed assertion. Here are the steps to reproduce:

  1. Append a non-empty entry at index X with the addToCache=true argument, but do not commit it
  2. Rewrite the log entry at index X with an entry of size 0
  3. Request the entry from the WAL

Raft allows rewriting uncommitted log entries only; this is by design and described in the Raft paper. In the described scenario, the WAL uses its internal cache when the read operation is called, but the actual content is stored on disk with a different length.

How to fix: invalidate the cache at the specific index when a rewrite happens.
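
An illustrative sketch of that fix (not the actual PersistentState code): the in-memory cache of entry payloads must be dropped for an index that gets overwritten, otherwise a later read returns the cached old content whose length no longer matches what is stored on disk.

    using System.Collections.Generic;

    sealed class EntryCache
    {
        private readonly Dictionary<long, byte[]> cache = new();

        public void OnAppend(long index, byte[] content, bool addToCache)
        {
            if (addToCache)
                cache[index] = content;
        }

        public void OnRewrite(long index, byte[] newContent)
        {
            cache.Remove(index); // invalidate the stale cached payload for this index
            // optionally re-cache the new content: cache[index] = newContent;
        }

        public bool TryRead(long index, out byte[]? content)
            => cache.TryGetValue(index, out content);
    }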

freddyrios commented 2 years ago

All the previously shared scenarios are working now.

Additional testing shows these new scenarios:

  1. (low priority) a node starting again seems to always cause an election, which makes me wonder whether prevoting is happening or whether this is some other unintended consequence of starting that node
  2. (critical) cutting power to 1 out of 3 nodes: the 2 alive nodes did not resume for 13+ minutes. Even then, the cluster only recovered after the device was powered again, even though the raft node program was not started on the device (the OS seems to explicitly reset the connection in this case)
  3. broken log when cutting power, regardless of the WriteThrough setting. Even if some alternative explanations might appear, the ease of reproduction could hint that a less unavoidable issue is at play, so it's at least worth exploring.

sakno commented 2 years ago

The 2nd issue has been fixed.

sakno commented 2 years ago

Regarding the 3rd issue: I've investigated WAL dumps from the ARM devices. The file system (I think it was ext4) was unable to restore some of the partition files. That's why the index of the last log entry stored in the node.state file is larger than what the partition file actually contains. There is no silver bullet to solve this issue. However, there are a few practices that we can use:

  1. The persistent WAL supports live backups, see the PersistentState.CreateBackupAsync and PersistentState.RestoreFromBackupAsync methods
  2. On initialization, if the WAL is broken, we can just call ClearAsync and join as a follower. The latest cluster state will be replicated as soon as possible.

In the worst case, the app can implement incremental backups. But I think option 2 is enough for our purposes.
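
A sketch of option 2. The method names follow those referenced above (InitializeAsync, ClearAsync), but the exact signatures and the surrounding names (wal, logger, IsCorruptedWal, token) are assumptions for the sake of the example, not a definitive implementation:

    try
    {
        await wal.InitializeAsync(token);             // replay/validate the on-disk log
    }
    catch (Exception e) when (IsCorruptedWal(e))      // hypothetical corruption check
    {
        logger.LogWarning(e, "WAL is corrupted, starting with an empty log");
        await wal.ClearAsync(token);                  // drop all local entries
        // the node then starts as a follower and catches up via replication
    }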

sakno commented 2 years ago

@freddyrios , you mentioned the performance of read/write operations in the WAL. Unfortunately, Linux historically did not have a good syscall for async disk I/O. There were many attempts to address this without any great success. However, in recent kernel versions, the io_uring syscall has been added. It is not yet supported by .NET: https://github.com/dotnet/runtime/issues/51985 This means that all asynchronous calls are transformed into synchronous calls scheduled via the ThreadPool.

freddyrios commented 2 years ago

@sakno closing this issue as done, as all the raised scenarios have been addressed. The only thing remaining is around node restarts usually triggering an election, but this is not a priority at the moment.