Open sakno opened 1 week ago
Related #242.
Hi @sakno
I've attached a zip of the node.state-file of the problematic node202: node.zip
The scenario is as follows: I restart the 6 Raspberry Pis in our cluster, and our software including Raft is automatically started on startup. They all start from a clean slate. Here's what made it into their respective log files (the call stack for the exceptions in node202 has been modified by my re-throwing in order to get various values):
node1:
2024-06-26 12:23:17.1807|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:01.7110000
node149:
2024-06-26 12:23:05.2737|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:02.7880000
node154:
2024-06-26 12:23:05.0097|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Cluster member 192.168.100.202:4250 is unavailable
2024-06-26 12:23:05.0954|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Cluster member 192.168.100.1:4250 is unavailable
2024-06-26 12:23:05.1880|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:02.0870000
2024-06-26 12:23:05.5372|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.232:4250 is failed. Retry replication from entry 1
2024-06-26 12:23:05.5372|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.31:4250 is failed. Retry replication from entry 1
2024-06-26 12:23:05.5395|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.149:4250 is failed. Retry replication from entry 1
node232:
2024-06-26 12:23:05.3018|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:01.7320000
node286:
2024-06-26 12:23:05.0852|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:01.9890000
node202:
2024-06-26 12:23:15.7147|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.154:4250 - Term 1 - Election timeout 00:00:02.4300000
2024-06-26 12:23:16.4185|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:38352|System.Exception: ### IsLittleEndian=True 64bit=False footer.Span.Length=2000 footer.Length=2000 index=1 writeAddress=59392164992516328 GetMetadataBuffer(index - 1).Length=40 LogEntryMetadata.GetEndOfLogEntry(GetMetadataBuffer(index - 1).Span)=59392164992516328 entry.Length=1425 fileOffset=512 FirstIndex=50 LastIndex=99 PartitionNumber=1
---> System.ArgumentOutOfRangeException: Specified file length was too large for the file system. (Parameter 'value')
at System.IO.RandomAccess.WriteGatherAtOffset(SafeFileHandle handle, IReadOnlyList`1 buffers, Int64 fileOffset)
at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.ExecuteInternal()
--- End of stack trace from previous location ---
at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 616
--- End of inner exception stack trace ---
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 625
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 481
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 520
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 550
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 620
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ConnectionOriented\Server.cs:line 142
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 134
2024-06-26 12:23:16.5328|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:38364|System.Exception: ### IsLittleEndian=True 64bit=False footer.Span.Length=2000 footer.Length=2000 index=1 writeAddress=59392164992516328 GetMetadataBuffer(index - 1).Length=40 LogEntryMetadata.GetEndOfLogEntry(GetMetadataBuffer(index - 1).Span)=59392164992516328 entry.Length=1425 fileOffset=512 FirstIndex=50 LastIndex=99 PartitionNumber=1
---> System.ArgumentOutOfRangeException: Specified file length was too large for the file system. (Parameter 'value')
at System.IO.RandomAccess.WriteGatherAtOffset(SafeFileHandle handle, IReadOnlyList`1 buffers, Int64 fileOffset)
at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.ExecuteInternal()
--- End of stack trace from previous location ---
at Microsoft.Win32.SafeHandles.SafeFileHandle.ThreadPoolValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 616
--- End of inner exception stack trace ---
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 625
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 481
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 520
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 550
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 620
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ConnectionOriented\Server.cs:line 142
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in C:\Users\Dell\source\repos\dotNext-master\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 134
...
Is this the data you were asking for?
node.zip has only node.state but not partition files.

Do you mean these? node.zip
> Do you mean these?
Yep, thanks! You've posted a WAL in the legacy binary format, but you're trying to open it as the new format, which is not binary compatible with the previous one. See the Release Notes. There are a few ways to fix that, e.g. set UseLegacyBinaryFormat to true as mentioned in the release notes.

That surprises me since we store the node.state and partition files in a RAM disk folder, and I know the cluster was both power cycled and rebooted multiple times yesterday. But thanks a lot for looking into it - I'll clear all the partition files and continue testing tomorrow. Is there a way for me to see if a WAL is in the legacy binary format or not?
> Is there a way for me to see if a WAL is in the legacy binary format or not?
Yep, the easiest way is to look at the first 512 bytes. In the new format, all of them are zeroes (except the first byte, which can be 1 or 0). The old format doesn't have such a header, so its first 40 bytes are mostly non-zero.
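For anyone who wants to automate that check, here is a minimal heuristic sketch in Python based purely on the description above (it is not an official dotNext tool, and `looks_like_new_format` is a hypothetical helper name):

```python
# Heuristic sketch: guess whether a WAL partition file uses the new
# binary format. Per the description above, the new format starts with
# a 512-byte header that is all zeroes except possibly the first byte
# (0 or 1), while the legacy format starts directly with mostly
# non-zero 40-byte metadata records.
def looks_like_new_format(path: str, header_size: int = 512) -> bool:
    with open(path, "rb") as f:
        header = f.read(header_size)
    if len(header) < header_size:
        return False  # too small to contain a new-format header
    # First byte may be 0 or 1; every remaining header byte must be zero.
    return header[0] in (0, 1) and all(b == 0 for b in header[1:])
```

This is only a sanity check for triage; it cannot distinguish a new-format file from a legacy file that happens to begin with 512 zero bytes.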
I've published release 5.7.1 with a patch for the new binary format that improves the reliability of the WAL in case of power loss or process crash.
Hi @sakno I've been testing today with 5.7.1 and I still get these errors sometimes. On each startup we always clear the folder containing the node.state and partition-files, so there's no chance of leftover files using the legacy binary format. We don't set UseLegacyBinaryFormat to true. Is there any other way the legacy binary format might get enforced by mistake? Could the partition file you looked at be corrupted (instead of being in the legacy binary format)?
> Could the partition file you looked at be corrupted (instead of being in the legacy binary format)?
Definitely not. You can check it yourself: instantiate your WAL class inside a test and open those files with it. For instance, for the log entry with index 0 and partition number 96, I see the following log entry metadata in the debugger:
{DotNext.Net.Cluster.Consensus.Raft.PersistentState.LogEntryMetadata}
End: 3409
Id: null
Length: 1409
Offset: 2000
Term: 1
Timestamp: 638423899776894644
The offset for the first log entry matches the size of the legacy format header (50 entries per partition × 40 bytes of metadata = 2000 bytes). The length is meaningful, as well as the timestamp (which is 02/01/2024 13:12:57 +00:00).
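The arithmetic behind that conclusion is worth making explicit (a sketch; the 50-entries-per-partition and 40-byte-metadata figures come from this thread, not from inspecting dotNext's source):

```python
# In the legacy format the partition header is one 40-byte metadata
# record per log entry, so the first entry's payload begins right
# after the metadata table.
ENTRIES_PER_PARTITION = 50  # this cluster's configured partition size
METADATA_SIZE = 40          # bytes per legacy metadata record

legacy_header_size = ENTRIES_PER_PARTITION * METADATA_SIZE
# Matches "Offset: 2000" observed in the debugger above, which is why
# the file is evidently laid out in the legacy format.
print(legacy_header_size)
```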
> Is there any other way the legacy binary format might get enforced by mistake?
There are two places in the code inside of PersistentState. Both of them rely on the maxLogEntrySize private field. If possible, you can dump the value of this field.
You mentioned ramfs. Is there any chance that the file content is shared between deletions of the file? Could you reset the Linux page cache or reboot the device?
I'm rebooting all the devices between each test run.
> There are two places in the code inside of PersistentState. Both of them rely on the maxLogEntrySize private field. If possible, you can dump the value of this field.
maxLogEntrySize is always 0.
Another approach to catch the mistake: go to the PersistentState.LegacyPartition class and replace its constructor with a throw statement to see if there is any caller of it.
> Another approach to catch the mistake: go to the PersistentState.LegacyPartition class and replace its constructor with a throw statement to see if there is any caller of it.
I did this, and PersistentState.LegacyPartition is not getting used/constructed.
The new format uses a footer of the same size as the header in the old format. Probably, due to some bug, the footer is placed at position 0. I'll try to reproduce this.
OK, let me know if you want me to try something out.
I've added some assertions to catch this situation. You can try the develop branch. However, I don't see how it is possible.
> I've added some assertions to catch this situation. You can try the develop branch. However, I don't see how it is possible.
Your assertions are not hit.
Table.FlushAndSealAsync is the only place inside of the PersistentState class where the footer is written. The assertion Debug.Assert(writer.FilePosition > HeaderSize); means that the footer is not written at position 0.
Does it make sense to add the same assertion in Table.WriteThroughAsync (Debug.Assert(writeAddress >= HeaderSize))?
I tried doing that, and I can see that we try to write at address 0. Is that expected, or the bug we're after?
writeAddress=0 HeaderSize=512
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 600
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Partition.WriteAsync[TEntry](TEntry entry, Int64 absoluteIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 245
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 481
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.UnsafeAppendAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 422
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 520
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 550
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.DotNext.IO.Log.IAuditTrail<DotNext.Net.Cluster.Consensus.Raft.IRaftLogEntry>.AppendAndCommitAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 537
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster.DotNext.Net.Cluster.Consensus.Raft.TransportServices.ILocalMember.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\Dell\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.DefaultImpl.cs:line 300
Btw, are you counting on the footer memory buffer being initialized with all 0s (in the Table constructor)?
writeAddress=0 is an effect of broken binary data within the partition, not the root cause. writeAddress is computed as offset + length of the previous log entry.
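That derivation can be sketched in a few lines (hypothetical names, not dotNext's actual code): if the previous entry's on-disk metadata record has been zeroed out, the derived write address collapses to 0.

```python
# Sketch of the write-address derivation described above: the write
# address of entry N is taken from entry N-1's metadata as offset + length.
def end_of_log_entry(offset: int, length: int) -> int:
    return offset + length

# Healthy previous entry (values from the debugger output earlier in
# this thread): payload at offset 2000, 1409 bytes long.
print(end_of_log_entry(2000, 1409))  # 3409, matching "End: 3409"

# A zeroed (corrupted) metadata record yields writeAddress = 0, which
# is below HeaderSize = 512 and trips the added assertion.
print(end_of_log_entry(0, 0))  # 0
```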
Finally, I've reproduced this bug successfully in a test.
Fixed, I'll publish a new release today.
That is awesome! What was the issue?
When caching is enabled, the new format doesn't fill log entries correctly on flush. At the same time, the in-memory representation of the metadata table remains correct. While the application is running, the issue is not observable, but when the application tries to read WAL partitions from disk, it gets a corrupted metadata table. That's why we saw writeAddress equal to zero.
Unfortunately this hasn't fixed our issue...
Here I catch the exception and print out some properties of the logentry:
2024-07-02 07:33:05.2939|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 192.168.100.154:56940|
System.Exception: ### IsLittleEndian=True 64bit=False logentry.Term=-4857076768968013814 logentry.Timestamp=240602054703 logentry.Length=5458362748373041152 logentry.Offset=5423414815264646208 logentry.End=-7564966510071864256
---> System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'length')
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.UnsealIfNeededAsync(Int64 truncatePosition, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 590
--- End of stack trace from previous location ---
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 547
--- End of inner exception stack trace ---
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.WriteThroughAsync(CachedLogEntry entry, Int32 index, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 565
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendUncachedAsync[TEntry](ILogEntryProducer`1 supplier, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 481
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 520
at DotNext.Net.Cluster.Consensus.Raft.PersistentState.AppendAndCommitSlowAsync[TEntry](ILogEntryProducer`1 entries, Int64 startIndex, Boolean skipCommitted, Int64 commitIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 550
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 620
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 142
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134
If I add the assertion again to Table.WriteThroughAsync (Debug.Assert(writeAddress >= HeaderSize)), I get a hit with writeAddress=0.
I guess both these things show there's still broken binary data within the partition?
Please check the dependency version.

> at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Table.UnsealIfNeededAsync(Int64 truncatePosition, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 590

Line 590 points to a different location in the develop branch:
https://github.com/dotnet/dotNext/blob/2edc792f69537154b96fdc3ea3b466646d3ea018/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs#L590-L597
The line numbering is shifted a bit due to my added assertions and catching/re-throwing the exception to get more data.
I've tried both the latest develop branch and the DotNext.Net.Cluster 5.7.2 NuGet package, with the same result.
Discussed in https://github.com/dotnet/dotNext/discussions/243