dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

DotNext.Net.Cluster: System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'length') #244

Open sakno opened 1 week ago

sakno commented 1 week ago

Discussed in https://github.com/dotnet/dotNext/discussions/243

Originally posted by **LarsWithCA** June 25, 2024

Hi @sakno,

Once in a while we get a series of this exception during startup (possibly after restart/power-cycle), hindering the cluster from getting fully up and running:

```
2024-06-24 11:36:24.9866|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:52480|System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'length')
   at System.IO.RandomAccess.SetLength(SafeFileHandle handle, Int64 length)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.UnsealIfNeededAsync(Int64 truncatePosition, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 647
--- End of stack trace from previous location ---
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 605
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 481
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 520
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 550
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 620
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 142
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134
```

Our setup:

- Linux ARM 32-bit (little endian)
- .NET 8 + DotNext.Net.Cluster 5.7.0
- Cluster with 6 nodes
- 10 Hz communication frequency
- `MemoryBasedStateMachine.Options`:
  - recordsPerPartition = 50
  - BufferSize = 8192 * 64 * 10
  - InitialPartitionSize = 50 * 8192 * 10
  - CompactionMode = CompactionMode.Sequential
  - WriteMode = WriteMode.AutoFlush
  - CacheEvictionPolicy = LogEntryCacheEvictionPolicy.OnSnapshot

Here are the values of various arguments/variables/fields inside `PersistentState.Table.WriteThroughAsync` when the exception happens:

- index=1
- fileOffset=512
- FirstIndex=150
- LastIndex=199
- PartitionNumber=3
- footer.Length=2000
- entry.Length=1425
- GetMetadataBuffer(index - 1).Length=40
- **writeAddress** = LogEntryMetadata.GetEndOfLogEntry(GetMetadataBuffer(index - 1).Span) = **-7426796073504137408**

Are there any other values I should try and capture that might help you investigate/solve this?

In the previous version we were running (5.5.0), apart from this exception we also saw "System.ArgumentOutOfRangeException: Specified file length was too large for the file system." in the same area of the code; we *might* not be seeing that in the newest version.
sakno commented 1 week ago

Related #242.

LarsWithCA commented 6 days ago

Hi @sakno

I've attached a zip of the node.state file of the problematic node202: node.zip

The scenario is as follows: I restart the 6 Raspberry Pis in our cluster, and our software, including Raft, is automatically started on boot. They all start from a clean slate. Here's what made it into their respective log files (the call stack for the exceptions in node202 has been modified by my re-throwing in order to capture various values):

node1:
2024-06-26 12:23:17.1807|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:01.7110000

node149:
2024-06-26 12:23:05.2737|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:02.7880000

node154:
2024-06-26 12:23:05.0097|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Cluster member 192.168.100.202:4250 is unavailable
2024-06-26 12:23:05.0954|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Cluster member 192.168.100.1:4250 is unavailable
2024-06-26 12:23:05.1880|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:02.0870000
2024-06-26 12:23:05.5372|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.232:4250 is failed. Retry replication from entry 1
2024-06-26 12:23:05.5372|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.31:4250 is failed. Retry replication from entry 1
2024-06-26 12:23:05.5395|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.149:4250 is failed. Retry replication from entry 1

node232:
2024-06-26 12:23:05.3018|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:01.7320000

node286:
2024-06-26 12:23:05.0852|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:01.9890000

node202:
2024-06-26 12:23:15.7147|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:02.4300000
2024-06-26 12:23:16.4185|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:38352|System.Exception: ### IsLittleEndian=True 64bit=False footer.Span.Length=2000 footer.Length=2000 index=1 writeAddress=59392164992516328 GetMetadataBuffer(index - 1).Length=40 LogEntryMetadata.GetEndOfLogEntry(GetMetadataBuffer(index - 1).Span)=59392164992516328 entry.Length=1425 fileOffset=512 FirstIndex=50 LastIndex=99 PartitionNumber=1
 ---> System.ArgumentOutOfRangeException: Specified file length was too large for the file system. (Parameter 'value')
   at System.IO.RandomAccess.WriteGatherAtOffset(SafeFileHandle handle, IReadOnlyList`1 buffers, Int64 fileOffset)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.ExecuteInternal()
--- End of stack trace from previous location ---
   at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 616
   --- End of inner exception stack trace ---
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 625
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 481
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 520
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 550
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 620
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ConnectionOriented\Server.cs:line 142
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 134
2024-06-26 12:23:16.5328|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:38364|System.Exception: ### IsLittleEndian=True 64bit=False footer.Span.Length=2000 footer.Length=2000 index=1 writeAddress=59392164992516328 GetMetadataBuffer(index - 1).Length=40 LogEntryMetadata.GetEndOfLogEntry(GetMetadataBuffer(index - 1).Span)=59392164992516328 entry.Length=1425 fileOffset=512 FirstIndex=50 LastIndex=99 PartitionNumber=1
 ---> System.ArgumentOutOfRangeException: Specified file length was too large for the file system. (Parameter 'value')
   at System.IO.RandomAccess.WriteGatherAtOffset(SafeFileHandle handle, IReadOnlyList`1 buffers, Int64 fileOffset)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.ExecuteInternal()
--- End of stack trace from previous location ---
   at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 616
   --- End of inner exception stack trace ---
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 625
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 481
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 520
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 550
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 620
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ConnectionOriented\Server.cs:line 142
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 134
...

Is this the data you were asking for?

sakno commented 6 days ago

node.zip contains only node.state, but not the partition files.

LarsWithCA commented 6 days ago

Do you mean these? node.zip

sakno commented 6 days ago

Do you mean these?

Yep, thanks! The WAL you've posted is in the legacy binary format, but you're trying to open it as the new format, which is not binary compatible with the previous one. See the Release Notes. There are a few ways to fix that:

  1. If you don't need your existing WAL, just erase it on all nodes and start from scratch. The new WAL will use the new binary format.
  2. If you prefer to keep the data from the existing WAL, set `UseLegacyBinaryFormat` to `true` as mentioned in the release notes.

LarsWithCA commented 6 days ago

That surprises me since we store the node.state and partition files in a RAM disk folder, and I know the cluster was both power cycled and rebooted multiple times yesterday. But thanks a lot for looking into it - I'll clear all the partition files and continue testing tomorrow. Is there a way for me to see if a WAL is in the legacy binary format or not?

sakno commented 6 days ago

Is there a way for me to see if a WAL is in the legacy binary format or not?

Yep, the easiest way is to look at the first 512 bytes. In the new format, all of them are zeroes (except the first byte, which can be 1 or 0). The old format doesn't have such a header, so its first 40 bytes are mostly non-zero.
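For reference, that check can be applied directly to a partition file without a debugger. This is only an illustrative sketch in Python based on the description above (a 512-byte header that is all zeroes except for a leading 0/1 byte in the new format); it is not part of the library:

```python
# Heuristic check of a WAL partition file's binary format, per the description
# in this thread: the new format starts with a 512-byte header whose bytes are
# all zero (the first byte may be 0 or 1); the legacy format has no header.
HEADER_SIZE = 512

def looks_like_new_format(path: str) -> bool:
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        return False  # too small to contain the new-format header
    # Byte 0 may be a 0/1 flag; all remaining header bytes must be zero.
    return header[0] in (0, 1) and not any(header[1:])
```

A legacy-format file starts directly with log entry metadata, so the check fails on its mostly non-zero leading bytes.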

sakno commented 6 days ago

I've published release 5.7.1 with a patch for new binary format that improves reliability of WAL in case of power loss or process crash.

LarsWithCA commented 5 days ago

Hi @sakno I've been testing today with 5.7.1 and I still get these errors sometimes. On each startup we always clear the folder containing the node.state and partition-files, so there's no chance of leftover files using the legacy binary format. We don't set UseLegacyBinaryFormat to true. Is there any other way the legacy binary format might get enforced by mistake? Could the partition file you looked at be corrupted (instead of being in the legacy binary format)?

sakno commented 4 days ago

Could the partition file you looked at be corrupted (instead of being in the legacy binary format)?

Definitely not. You can check it yourself: instantiate your WAL class inside a test and open those files with it. For instance, for the log entry with index 0 in partition number 96, I see the following log entry metadata in the debugger:

{DotNext.Net.Cluster.Consensus.Raft.PersistentState.LogEntryMetadata}
    End: 3409
    Id: null
    Length: 1409
    Offset: 2000
    Term: 1
    Timestamp: 638423899776894644

The offset for the first log entry matches the size of the legacy-format header (50 entries per partition × 40 bytes of metadata = 2000 bytes). The length is meaningful, as is the timestamp (which is 02/01/2024 13:12:57 +00:00).
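As an aside, the Timestamp shown above is a raw .NET `DateTime.Ticks` value (100-nanosecond intervals counted from 0001-01-01). A small Python sketch for decoding such a value when .NET isn't at hand; the epoch constant is the well-known tick count at the Unix epoch:

```python
from datetime import datetime, timezone

# Number of 100-ns ticks between 0001-01-01 and the Unix epoch (1970-01-01).
TICKS_AT_UNIX_EPOCH = 621_355_968_000_000_000
TICKS_PER_SECOND = 10_000_000

def dotnet_ticks_to_utc(ticks: int) -> datetime:
    seconds, rem = divmod(ticks - TICKS_AT_UNIX_EPOCH, TICKS_PER_SECOND)
    # One tick is 100 ns, so `rem // 10` gives microseconds.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=rem // 10)

# The metadata timestamp from the debugger dump above:
print(dotnet_ticks_to_utc(638423899776894644))  # 2024-02-01 13:12:57.689464+00:00
```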

sakno commented 4 days ago

Is there any other way the legacy binary format might get enforced by mistake?

There are two places inside the PersistentState code; both rely on the maxLogEntrySize private field. If possible, you can dump the value of this field.

sakno commented 4 days ago

You mentioned ramfs. Is there any chance that the file content is shared between deletions of the file? Could you reset the Linux page cache or reboot the device?

LarsWithCA commented 4 days ago

I'm rebooting all the devices between each test run.

LarsWithCA commented 4 days ago

Is there any other way the legacy binary format might get enforced by mistake?

There are two places in the code inside of PersistentState. All of them rely on maxLogEntrySize private field. If possible, you can dump the value of this field.

maxLogEntrySize is always 0

sakno commented 4 days ago

Another approach to catch the mistake: go to the PersistentState.LegacyPartition class and replace its constructor with a throw statement to see whether it has any callers.

LarsWithCA commented 3 days ago

Another approach to catch the mistake: go to the PersistentState.LegacyPartition class and replace its constructor with a throw statement to see whether it has any callers.

I did this and PersistentState.LegacyPartition is not getting used/constructed.

sakno commented 3 days ago

The new format uses a footer of the same size as the header in the old format. Probably, due to some bug, the footer is placed at position 0. I'll try to reproduce this.

LarsWithCA commented 3 days ago

OK πŸ‘ let me know if you want me to try out something.

sakno commented 3 days ago

I've added some assertions to catch this situation. You can try the develop branch. However, I don't see how it is possible.

LarsWithCA commented 2 days ago

I've added some assertions to catch this situation. You can try the develop branch. However, I don't see how it is possible.

Your assertions are not hit 😕

sakno commented 2 days ago

Table.FlushAndSealAsync is the only place inside the PersistentState class where the footer is written. The assertion Debug.Assert(writer.FilePosition > HeaderSize); ensures that the footer is not written at position 0.

LarsWithCA commented 1 day ago

Does it make sense to add the same assertion in Table.WriteThroughAsync (Debug.Assert(writeAddress >= HeaderSize))? I tried doing that and can see that we try to write at address 0. Is that expected, or the bug we're after?

writeAddress=0 HeaderSize=512
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 600
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Partition.WriteAsync[TEntry](TEntry entry, Int64 absoluteIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 245
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 481
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.UnsafeAppendAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 422
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 520
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 550
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.DotNext.IO.Log.IAuditTrail<DotNext.Net.Cluster.Consensus.Raft.IRaftLogEntry>.AppendAndCommitAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 537
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster.DotNext.Net.Cluster.Consensus.Raft.TransportServices.ILocalMember.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.DefaultImpl.cs:line 300

Btw, are you counting on the footer memory buffer being initialized with all 0s (in the Table constructor)?

sakno commented 1 day ago

writeAddress=0 is an effect of broken binary data within the partition, not the root cause. writeAddress is computed as the offset + length of the previous log entry.
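To make that failure mode concrete: computing offset + length over a metadata record that came back from disk zeroed (or filled with unrelated bytes) yields 0, or an absurd value like the huge negative addresses in the logs above. The 40-byte layout below (four little-endian signed 64-bit fields plus reserved bytes) is a hypothetical simplification for illustration, not the library's actual LogEntryMetadata layout:

```python
import struct

def end_of_log_entry(metadata: bytes) -> int:
    # Hypothetical 40-byte record: term, timestamp, length, offset (int64 each),
    # followed by 8 reserved bytes. End address = offset + length.
    term, timestamp, length, offset = struct.unpack_from("<4q", metadata)
    return offset + length

zeroed = bytes(40)               # metadata that never made it to disk
print(end_of_log_entry(zeroed))  # 0 -> would trip Debug.Assert(writeAddress >= HeaderSize)

garbage = bytes(range(200, 240)) # arbitrary non-metadata bytes
print(end_of_log_entry(garbage)) # a nonsense address far outside any real file
```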

sakno commented 1 day ago

Finally, I've successfully reproduced this bug in a test.

sakno commented 1 day ago

Fixed, I'll publish a new release today.

LarsWithCA commented 1 day ago

That is awesome! What was the issue?

sakno commented 1 day ago

When caching is enabled, the new format doesn't flush log entries correctly. At the same time, the in-memory representation of the metadata table remains correct. While the application is running, the issue is not observable; but when the application tries to read WAL partitions from disk, it gets a corrupted metadata table. That's why we saw writeAddress equal to zero.
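The described failure mode, where corruption is invisible while the process runs and only appears after a restart, can be modeled with a toy sketch (all names here are made up for illustration; this is not the library's API):

```python
# Toy model of the bug: appends update an in-memory metadata table, but a
# buggy flush never persists that table, so the divergence is only visible
# after "restarting" (i.e., re-reading state from disk).
class ToyWal:
    def __init__(self):
        self.memory_metadata = []  # always correct while the process runs
        self.disk_metadata = []    # what actually reached the file

    def append(self, offset, length):
        self.memory_metadata.append((offset, length))

    def flush(self, buggy=True):
        if not buggy:
            self.disk_metadata = list(self.memory_metadata)
        # buggy flush: entry payloads reach disk, metadata does not

    def end_of_last_entry(self, after_restart):
        table = self.disk_metadata if after_restart else self.memory_metadata
        offset, length = table[-1] if table else (0, 0)
        return offset + length

wal = ToyWal()
wal.append(offset=512, length=1425)
wal.flush()
print(wal.end_of_last_entry(after_restart=False))  # 1937: looks fine while running
print(wal.end_of_last_entry(after_restart=True))   # 0: corrupted view after reload
```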

LarsWithCA commented 23 hours ago

Unfortunately this hasn't fixed our issue...

Here I catch the exception and print out some properties of the log entry:

2024-07-02 07:33:05.2939|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:56940|
System.Exception: ### IsLittleEndian=True 64bit=False logentry.Term=-4857076768968013814 logentry.Timestamp=240602054703 logentry.Length=5458362748373041152 logentry.Offset=5423414815264646208 logentry.End=-7564966510071864256
 ---> System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'length')
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.UnsealIfNeededAsync(Int64 truncatePosition, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 590
--- End of stack trace from previous location ---
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 547
   --- End of inner exception stack trace ---
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 565
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 481
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 520
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 550
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 620
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 142
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134

If I add the assertion again to Table.WriteThroughAsync (Debug.Assert(writeAddress >= HeaderSize)), I get a hit with writeAddress=0.

I guess both these things show there's still broken binary data within the partition?

sakno commented 19 hours ago

Please check the dependency version:

at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.UnsealIfNeededAsync(Int64 truncatePosition, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 590

Line 590 points to a different location in the develop branch: https://github.com/dotnet/dotNext/blob/2edc792f69537154b96fdc3ea3b466646d3ea018/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs#L590-L597

LarsWithCA commented 19 hours ago

The line numbering is shifted a bit due to my added assertions and catching/re-throwing exceptions to get more data. I've tried using both the latest develop branch and the DotNext.Net.Cluster 5.7.2 NuGet package, with the same result.