Currently, the only reason for `OperationCanceledException` is token cancellation. According to the sources of `DataModifier`, we have:
```csharp
var source = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken, leadershipToken);
```
`stoppingToken` is not the reason, I think, because that token represents the lifetime of the hosted service. Inside the `ReplicateAsync` implementation, the passed token is linked with another token associated with the `RaftCluster` lifetime. Moreover, the node doesn't print `Saving value {0} generated by the leader node`. According to the sources, this is possible only if the node is no longer the leader.
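To make the distinction concrete, here is a minimal sketch (not the actual `DataModifier` source; `replicateAsync` stands in for the real call) of how the linked-token pattern lets the caller tell which token fired when the replication call throws:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only: links the two tokens as the sample
// does, then inspects them after an OperationCanceledException to see which fired.
static async Task RunAsync(Func<CancellationToken, Task> replicateAsync,
                           CancellationToken stoppingToken,
                           CancellationToken leadershipToken)
{
    using var source = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken, leadershipToken);
    try
    {
        await replicateAsync(source.Token);
    }
    catch (OperationCanceledException)
    {
        if (stoppingToken.IsCancellationRequested)
        {
            // the hosted service is shutting down
        }
        else if (leadershipToken.IsCancellationRequested)
        {
            // the local node lost leadership: the case suspected here
        }
    }
}
```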
Ok, looking at the 4.4.1 output again, I can see that all this failure repetition is actually leader changes, as you can see it reporting a different leader IP. Since the leadership token would be signaled, that explains the canceled exceptions.
Is there a chance the difference in logging output between 4.4.1 and later versions is related to the fixes for #168? After all, those changes could cause fewer intermediate leader changes to be reported.
I find it likely that those earlier leader changes are just the usual behavior, and only the last change, when it gets stuck, hits some different case; that is the bit that might be the same in all versions.
If we focus instead on the last outputs:
So the last logged actions are not exactly the same. Should I capture a run on the latest version with more detailed logging? Maybe that could clarify what goes on when it stops.
> Is there a chance the difference in logging output of 4.4.1 vs later versions
It was a minor change to the `ForceReplicationAsync` method: `Task` was replaced by `ValueTask` to remove some memory allocations on the hot path (`ReplicateAsync` is called very frequently, for every change). So the logging itself is unchanged, but the stack trace is different.
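For illustration, the shape of such a change (hypothetical declarations, not the actual DotNext source) looks like this:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Returning ValueTask instead of Task avoids a heap allocation when the call
// completes synchronously, which matters on a path invoked for every change.
public interface IForceReplication
{
    // old shape: allocates a Task object per call
    // Task ForceReplicationAsync(CancellationToken token);

    // new shape: allocation-free on the synchronous completion path
    ValueTask ForceReplicationAsync(CancellationToken token);
}
```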
> Should I do a capture on the latest version with more detailed logging
Yes, please. It would be good to know exactly what is happening in the Raft internals at that moment.
Attached are the log files. It's easy to spot when it stops making progress, as it is the last time the Accepting/Saving messages show up in all the logs.
In short, after the last leader election the followers show the following (the last "Member is downgrading" through "Incoming configuration with" sequence keeps repeating):
```
2023-08-03 11:12:54.7453|INFO|RaftNode.ClusterConfigurator|New cluster leader is elected. Leader address is 192.168.100.10:3262
2023-08-03 11:12:54.7453|INFO|RaftNode.ClusterConfigurator|Term of local cluster member is 82. Election timeout 00:00:00.2470000
2023-08-03 11:12:54.8398|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgrading to follower state with term 82
2023-08-03 11:12:54.8398|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Election timeout is refreshed
2023-08-03 11:12:54.8398|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgraded to follower state with term 82
2023-08-03 11:12:54.8612|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Incoming configuration with 484292758073552492 fingerprint from leader. Local fingerprint is 484292758073552492, apply False
2023-08-03 11:12:54.8686|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgrading to follower state with term 82
2023-08-03 11:12:54.8686|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Election timeout is refreshed
2023-08-03 11:12:54.8686|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgraded to follower state with term 82
2023-08-03 11:12:54.8803|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Incoming configuration with 484292758073552492 fingerprint from leader. Local fingerprint is 484292758073552492, apply True
2023-08-03 11:12:54.9611|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgrading to follower state with term 82
2023-08-03 11:12:54.9611|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Election timeout is refreshed
2023-08-03 11:12:54.9611|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgraded to follower state with term 82
2023-08-03 11:12:54.9611|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Incoming configuration with 484292758073552492 fingerprint from leader. Local fingerprint is 484292758073552492, apply True
```
while the leader shows this (the last "Replication of member ... started" through "Local changes are not committed" sequence keeps repeating):
```
2023-08-03 11:12:54.7414|INFO|RaftNode.ClusterConfigurator|New cluster leader is elected. Leader address is 192.168.100.10:3262
2023-08-03 11:12:54.7414|INFO|RaftNode.ClusterConfigurator|Term of local cluster member is 82. Election timeout 00:00:00.1870000
2023-08-03 11:12:54.7414|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Transition to Leader state has completed with term 82
2023-08-03 11:12:54.7414|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.1:3262 started at log index 12075
2023-08-03 11:12:54.7414|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.1:3262 contains 0 entries. Preceding entry has index 12074 and term 82. Local fingerprint is 0, remote is 484292758073552492, apply False
2023-08-03 11:12:54.7414|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.8:3262 started at log index 12075
2023-08-03 11:12:54.7414|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.8:3262 contains 0 entries. Preceding entry has index 12074 and term 82. Local fingerprint is 0, remote is 484292758073552492, apply False
2023-08-03 11:12:54.7414|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.8:3262 is failed. Retry replication from entry 12074
2023-08-03 11:12:54.7414|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.1:3262 is failed. Retry replication from entry 12074
2023-08-03 11:12:54.7414|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Local changes are not committed. Quorum is 3, last committed entry is 12072
2023-08-03 11:12:54.8366|INFO|RaftNode.DataModifier|Saving value 5996500 generated by the leader node
2023-08-03 11:12:54.8366|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.8:3262 started at log index 12074
2023-08-03 11:12:54.8366|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.1:3262 started at log index 12074
2023-08-03 11:12:54.8366|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.1:3262 contains 2 entries. Preceding entry has index 12073 and term 80. Local fingerprint is 0, remote is 484292758073552492, apply False
2023-08-03 11:12:54.8366|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.8:3262 contains 2 entries. Preceding entry has index 12073 and term 80. Local fingerprint is 0, remote is 484292758073552492, apply False
2023-08-03 11:12:54.8461|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member 192.168.100.8:3262 is replicated successfully starting from index 12074
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member 192.168.100.1:3262 is replicated successfully starting from index 12074
2023-08-03 11:12:54.8613|INFO|RaftNode.SimplePersistentState|Accepting value 5996500
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|All changes at index 12073 are committed. The number of committed entries is 2
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.8:3262 started at log index 12076
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.8:3262 contains 0 entries. Preceding entry has index 12075 and term 82. Local fingerprint is 484292758073552492, remote is 484292758073552492, apply True
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.1:3262 started at log index 12076
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.1:3262 contains 0 entries. Preceding entry has index 12075 and term 82. Local fingerprint is 484292758073552492, remote is 484292758073552492, apply True
2023-08-03 11:12:54.8613|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member 192.168.100.8:3262 is replicated successfully starting from index 12076
2023-08-03 11:12:54.8804|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member 192.168.100.1:3262 is replicated successfully starting from index 12076
2023-08-03 11:12:54.8804|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Local changes are not committed. Quorum is 3, last committed entry is 12074
2023-08-03 11:12:54.9590|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.1:3262 started at log index 12076
2023-08-03 11:12:54.9590|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of member 192.168.100.8:3262 started at log index 12076
2023-08-03 11:12:54.9590|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.1:3262 contains 0 entries. Preceding entry has index 12075 and term 82. Local fingerprint is 484292758073552492, remote is 484292758073552492, apply True
2023-08-03 11:12:54.9590|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.8:3262 contains 0 entries. Preceding entry has index 12075 and term 82. Local fingerprint is 484292758073552492, remote is 484292758073552492, apply True
2023-08-03 11:12:54.9624|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member 192.168.100.8:3262 is replicated successfully starting from index 12076
2023-08-03 11:12:54.9624|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member 192.168.100.1:3262 is replicated successfully starting from index 12076
2023-08-03 11:12:54.9624|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Local changes are not committed. Quorum is 3, last committed entry is 12074
```
The root cause is a race between the thread running the replication method and the `DataModifier` thread appending a new entry. It is now fixed (I hope so). Also, I removed some memory allocations where possible, at an acceptable cost in code readability.
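As a generic illustration of this class of race (not the library's actual code or its fix), consider a replicator that reads the last log index while another thread appends:

```csharp
using System.Collections.Generic;

// Hypothetical append-only log: if LastIndex() did not take the same lock the
// writer uses, the replication thread could start from a stale or torn index.
class AppendOnlyLog
{
    private readonly object syncRoot = new();
    private readonly List<long> entries = new();

    // called by the DataModifier-like writer thread
    public void Append(long value)
    {
        lock (syncRoot) { entries.Add(value); }
    }

    // called by the replication thread; the shared lock yields a consistent view
    public int LastIndex()
    {
        lock (syncRoot) { return entries.Count - 1; }
    }
}
```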
A few thoughts about memory growth: the heartbeat loop uses the `SustainedLowLatency` GC latency mode during a heartbeat, but not between them:
https://github.com/dotnet/dotNext/blob/e29b972ecddd2610a45ecd402f78755c79adc977/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/LeaderState.cs#L218-L224
This is necessary to prevent GC intrusion during a heartbeat, which could cause false-positive timeouts. Between heartbeats, the GC is free to reclaim unused objects. The implementation automatically reduces the heartbeat period if a GC happened in between.
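A minimal sketch of that pattern (names are illustrative; see the linked LeaderState.cs for the real code):

```csharp
using System;
using System.Runtime;
using System.Threading;
using System.Threading.Tasks;

// Pin the process-wide GC latency mode to SustainedLowLatency for the duration
// of a heartbeat round so a collection cannot stall it into a false-positive
// timeout, then restore the previous mode so memory can be reclaimed between rounds.
static async Task DoHeartbeatRoundAsync(Func<CancellationToken, Task> sendHeartbeats,
                                        CancellationToken token)
{
    var previousMode = GCSettings.LatencyMode;
    GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
    try
    {
        await sendHeartbeats(token); // broadcast AppendEntries to all followers
    }
    finally
    {
        GCSettings.LatencyMode = previousMode; // GC may run freely until the next round
    }
}
```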
I noticed you added it to the latest release, but I am doing the run with the latest from the develop branch, given there was another commit that mentioned a concurrency fix (an hour and a half in so far).
I will let it run for at least 1 day and report back.
> there was another commit that mentioned a concurrency fix
It was not related to the current issue. When a commit relates directly to an issue, I prefer to add the issue number to the commit message.
Looks good: the run is still going strong, well past the point of all the previous failures when running the example in that environment.
Reopening: I got confused and mistakenly thought the fix was already in a release.
I'll publish a new release this week.
This is about the issue being discussed here: https://github.com/dotnet/dotNext/issues/184#issuecomment-1660673297.
The issue has been reproduced both on 32-bit Linux ARM under Raspberry Pi OS (Debian-based) and on 64-bit Debian Linux under WSL. At least one reproduction had all 3 nodes running, but due to the investigation of a different issue, the cluster was often running with 2 of the 3 nodes.
There are two observations surrounding the issue:

- Presumably the frequent leader changes make the issue easier to reproduce, as there is a chance it only happens during leader changes.
- The issue has been reproduced in 4.13, 4.12.3, and 4.4.1, and in all cases it took under a day to reach the failure. The specific settings + entries + behavior + even the console output shown over ssh could play a part in making the issue more likely to reproduce, as we have systems that run for months on 4.4.1 writing entries to Raft 10 times per second.
In 4.13 and 4.12.3 the issue was observed with output similar to https://github.com/dotnet/dotNext/issues/184#issuecomment-1660235577.
I only made one run on 4.4.1 where progress stopped, and the output was different. While in the later versions things suddenly stop after the leader change, with potentially a single TaskCanceledException (not sure it always makes it onto the screen), in the 4.4.1 run there were various TaskCanceledExceptions before the stop, plus some "fail" log entries from the leader with "EndOfStreamException: Attempted to read past the end of the stream".
I'm sharing the detailed 4.4.1 logs in case the issue is still the same and the old version logs extra info. Of course, there is a risk that the old version's failure was a different type of failure and is irrelevant to the issue present in the latest versions.
Leader:
Follower: