dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/

raft leader loses leadership and the node gets stuck #221

Closed. freddyrios closed this issue 3 months ago.

freddyrios commented 4 months ago

Found in version 4.14.1 and observed at least 5-10 times on a 16-node cluster.

The issue can be seen in the attached logs.zip, where a cluster is first started and node 153 is elected leader. However, shortly after this, node 153 and other nodes (e.g. 179 and 155) report "Consensus cannot be reached". When other nodes later send vote requests, 153 consistently times out as shown below. Not only that, but once a new leader is elected, all AppendEntries coming from the leader (and later the install snapshot) time out in the very same way.

2024-02-06 14:14:20.1952|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.158:40802|System.OperationCanceledException: The operation was canceled.
   at DotNext.Result`1.Validate() in /_/src/DotNext/Result.cs:line 186
   at DotNext.Result`1.get_Value() in /_/src/DotNext/Result.cs:line 181
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 230
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 237
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.VoteAsync(ClusterMemberId sender, Int64 senderTerm, Int64 lastLogIndex, Int64 lastLogTerm, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 809
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.VoteAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 56
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/T

In the log for node 153, I replaced a long section of the repeated timeout error with '...'. There is nothing different in there, just the same timeout repeated until we stopped the system.

sakno commented 4 months ago

Does it have a stable repro? My guess is that there is a deadlock caused by transitionLock. It can be proven by measuring some metrics: using dotnet-counters, you can monitor the DotNext.Threading.AsyncLock group of counters and then capture the suspension-duration counter.
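
For reference, a typical dotnet-counters invocation might look like the following (the counter group and counter names are those mentioned above; the process ID is that of the stuck Raft node):

    dotnet-counters monitor --process-id <raft node PID> --counters DotNext.Threading.AsyncLock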

freddyrios commented 4 months ago

It has happened in about 1 of 12 clean starts on that system (we use clean starts when running with new configurations). However, the system is in use, so I am not certain I will be able to gather this type of data soon (I also need to check whether dotnet-counters plays well with the single app deployment, as I had trouble earlier with other diagnostic tools).

That said, the way I interpreted those log entries was also that transitionLock.AcquireAsync is timing out. Of course that does not explain the reason, so my interpretation was that either something is still holding the lock (stuck somewhere) or the lock somehow failed to be released.

I do wonder if it is something that goes wrong in the first actions the node takes after it has been elected leader.

sakno commented 4 months ago

Could you provide debug logs?

sakno commented 4 months ago

I have some assumptions, but debug logs are needed to confirm them. According to the presented logs for node 153, I see that it's elected leader:

2024-02-06 14:14:16.6664|WARN|leaderchange|New cluster leader is elected - Leader address 192.168.100.153:4250 - Term 1 - Election timeout 00:00:01.5340000

then it tries to replicate log entries to some unavailable member (with port 150, which is not present in the log, so I can assume it's really unavailable):

2024-02-06 14:14:17.2924|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.166:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.2899|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.164:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3090|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.145:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3113|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.179:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3212|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.174:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3212|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.155:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3378|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.175:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3449|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.144:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3508|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.158:4250 is failed. Retry replication from entry 1
2024-02-06 14:14:17.3508|WARN|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Replication of 192.168.100.176:4250 is failed. Retry replication from entry 1

then the leader probably downgrades to follower:

2024-02-06 14:14:19.2067|WARN|leaderchange|Consensus cannot be reached - Term 1 - Election timeout 00:00:01.5340000

and then we see the timeout:

2024-02-06 14:14:20.1952|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.158:40802

At the moment, I would assume that the issue is not caused by AsyncExclusiveLock itself. When the node is in a specific state, it doesn't hold the transition lock; the lock is only needed to protect the transition between states. The leader downgrades to follower and calls MoveToFollowerState. During the transition, the lock is acquired, which prevents VoteAsync from being processed correctly. Probably, some async call inside MoveToFollowerState prevents exit from the transition lock. The only async call is StepDown: https://github.com/dotnet/dotNext/blob/6b36d0b5e3560a4e19860d8b134196b8b9a4b557/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs#L512-L546

UpdateTermAsync is not cancelable and relies on disk I/O. The next call is DisposeAsync on LeaderState.

Debug logs are needed to see what happened:

  1. Does the leader try to downgrade to follower?
  2. Does MoveToFollowerState get stuck?
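
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. This is not dotNext source code: SemaphoreSlim stands in for the internal transitionLock, and the method names are only illustrative. The point is that an uncancelable await held inside the transition lock never completes, so every subsequent VoteAsync call times out while waiting for the same lock, which matches the OperationCanceledException seen in the TcpServer logs above.

    // Illustrative sketch only, not dotNext code. SemaphoreSlim stands in for transitionLock.
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    sealed class TransitionLockSketch
    {
        private readonly SemaphoreSlim transitionLock = new(1, 1);

        // Stand-in for MoveToFollowerState: if disposal of the previous state never
        // completes, the finally block is never reached and the lock stays held.
        public async Task MoveToFollowerStateAsync(Func<ValueTask> disposePreviousState)
        {
            await transitionLock.WaitAsync();
            try
            {
                await disposePreviousState(); // uncancelable await; may hang forever
            }
            finally
            {
                transitionLock.Release();
            }
        }

        // Stand-in for VoteAsync: it needs the same lock, so while the transition above
        // is stuck it times out with OperationCanceledException, as seen in the logs.
        public async Task<bool> VoteAsync(TimeSpan timeout)
        {
            using var cts = new CancellationTokenSource(timeout);
            await transitionLock.WaitAsync(cts.Token); // throws OperationCanceledException on timeout
            try
            {
                return true; // the real implementation would evaluate the vote request here
            }
            finally
            {
                transitionLock.Release();
            }
        }
    }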
freddyrios commented 4 months ago

Unfortunately, I can't make changes or stop/restart the system where this was reproduced for at least the next 2-4 weeks. I have tried reproducing the issue on a dev cluster (11 nodes instead of 17) without any luck.

Regarding the rest, I'm a bit confused by this one:

"then it tries to replicate log entries to some unavailable member (with port 150, which is not present in the log, so I can assume it's really unavailable)"

Were you referring to 192.168.100.1? ("Cluster member 192.168.100.1:4250 is unavailable")

There might be varying delays in the start of each node that change the picture, but the logs show the process was started on all nodes within ~300 ms. The exception is node 1, which started a little over 3 seconds later. This is because a script on node 1 starts all the processes, and only after it has triggered them on all nodes does it start its own process.

I tried adding a 1-3 second start delay to the dev cluster script but failed to reproduce the issue. In the last run I checked, the log even shows the "is unavailable" message, but it did not reproduce the slow replication that comes with a stuck node (in the version I am trying), and I also did not see any of the timeout exceptions in the logs across the cluster.

Regarding the rest of the analysis, that makes sense to me. However, I do wonder about the various early "Consensus cannot be reached" messages and the related election timeouts, i.e. whether the nodes really got the very first heartbeats after the election.

sakno commented 4 months ago

After careful analysis of the source code (without the ability to reproduce), I can conclude that the only suspect is DisposeAsync called on the previous state when a transition is initiated. This is the only uninterruptible asynchronous method called within the transition lock.

sakno commented 4 months ago

Is there any chance you could try the develop branch? I've fixed some aspects related to cancellation tokens.

freddyrios commented 4 months ago

That will be hard to do soon, because I can still only reproduce it on the system that is busy, and I can't deploy this larger update without it going through some testing on other systems first, e.g. as it includes various other changes, like the move to 5.x and .NET 8.

sakno commented 4 months ago

Released in 5.0.3. In case of a stable repro, feel free to reopen it.

freddyrios commented 4 months ago

I was finally able to gather debuglogs.zip (with 4.14.1). In that run node 164 gets stuck.

With some votes against, 164 gets a clear majority and successfully transitions to the leader state. Node 1 was clearly unavailable at this point, and in fact its log shows it does not start until 164 is already stuck.

I find the first round of replication odd because it only shows the message "Replication of member {Member} started at log index {EntryIndex}" for 14 out of the 16 nodes. These are 13, which makes sense because node 1 is not live and was in the first list of "Replication of {Member} is failed. Retry replication from entry {EntryIndex}" messages.

The only thing that follows the apparently incomplete replication is the "member is downgrading" message, which does not have a matching "member is downgraded" message. However, this happens over 2 seconds later, which suggests the first issue that triggers the bug happens before this point.

freddyrios commented 4 months ago

I managed to reproduce it with this test (added to TcpTransportTests), but it took many runs to trigger it:

    [Fact]
    public async Task LeadershipWithUnavailableNodeCore()
    {
        var ports = new[] { 3265, 3266, 3267, 3268, 3269, 3270, 3271, 3272, 3273, 3274, 3275, 3276, 3277, 3278, 3279, 3280 };

        RaftCluster clusterFactory(int port)
        {
            var config = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Loopback, port)) 
            {
                RequestTimeout = TimeSpan.FromMilliseconds(1000),
                ConnectTimeout = TimeSpan.FromMilliseconds(10),
                LowerElectionTimeout = 1500,
                UpperElectionTimeout = 3000,
                TransmissionBlockSize = 8192 * 4 * 10,
                ColdStart = false,
            };
            config.LoggerFactory = LoggerFactory.Create(l => l.AddDebug().SetMinimumLevel(LogLevel.Debug));
            var builder = config.UseInMemoryConfigurationStorage().CreateActiveConfigurationBuilder();
            foreach (var p in ports)
                builder.Add(new IPEndPoint(IPAddress.Loopback, p));
            builder.Build();
            return new(config);
        }

        var hosts = ports.Select(p => clusterFactory(p)).ToList();
        var startedHosts = hosts.Skip(1).Select(async h => {
            await Task.Delay(Random.Shared.Next(300));
            await h.StartAsync();
            }).ToList();
        await Task.WhenAll(startedHosts);

        Debug.WriteLine("----------------");
        Debug.WriteLine("Done starting!");
        Debug.WriteLine("----------------");
        await Task.Delay(8500);

        await hosts[0].StartAsync();
        Debug.WriteLine("----------------");
        Debug.WriteLine("Done starting extra node!");
        Debug.WriteLine("----------------");

        await Task.Delay(30000);

        var leaderPort = ((IPEndPoint)hosts.First().Leader.EndPoint).Port;
        var leader = hosts.Single(h => ((IPEndPoint)h.LocalMemberAddress).Port == leaderPort);

        foreach (var member in leader.Members)
        {
            var status = member.Status;
            if (status is not ClusterMemberStatus.Available)
                Fail($"Member {member.EndPoint} has unexpected status {status}");
        }

        await Task.WhenAll(hosts.Select(h => Task.Run(() => h.StopAsync())));
    }
freddyrios commented 4 months ago

I just ran it again with RunUntilFailure and it failed on the 18th attempt. I used this version to get NLog logs like the ones I attached from the original system (it requires adding the NLog.Extensions.Logging NuGet package and this using directive: using NLog.Extensions.Logging;):

    [Fact]
    public async Task LeadershipWithUnavailableNodeCore()
    {
        var ports = new[] { 3265, 3266, 3267, 3268, 3269, 3270, 3271, 3272, 3273, 3274, 3275, 3276, 3277, 3278, 3279, 3280 };

        RaftCluster clusterFactory(int port)
        {
            var config = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Loopback, port)) 
            {
                RequestTimeout = TimeSpan.FromMilliseconds(1000),
                //low ConnectTimeout is needed to ensure the leader moves to the next append quickly when it attempts to reestablish connections to a disconnected follower
                ConnectTimeout = TimeSpan.FromMilliseconds(10),
                //we are using much larger values compared to the default raft timeouts (150-300), because we have trouble in larger systems with the timeouts. Even before this we had it at 300-600 based on earlier stability tests.
                //Lower is desireable for us as it limits the latency in re-election cases (see raft paper for graphs on worse cases depending on the values)
                //A larger difference between lower and upper timeouts compared to the default is likely needed for larger amount of nodes (16 for example),
                //as it reduces the probability of split elections that at the end increase the time to elect a leader (see raft paper graphs)
                LowerElectionTimeout = 1500,
                UpperElectionTimeout = 3000,
                //see https://github.com/copenhagenatomics/dotNext/issues/2 for more info on tuning,
                //but for now we are using approx. 4 times the size of a full 1k values entry (8000 bytes)
                TransmissionBlockSize = 8192 * 4 * 10,
                ColdStart = false,
            };
            config.LoggerFactory = LoggerFactory.Create(b => b.AddNLog(GetLoggingConfiguration(port.ToString())).SetMinimumLevel(LogLevel.Debug));
            var builder = config.UseInMemoryConfigurationStorage().CreateActiveConfigurationBuilder();
            foreach (var p in ports)
                builder.Add(new IPEndPoint(IPAddress.Loopback, p));
            builder.Build();
            return new(config);
        }

        static NLog.Config.LoggingConfiguration GetLoggingConfiguration(string node)
        {
            var minLevel = NLog.LogLevel.Debug;
            var config = new NLog.Config.LoggingConfiguration(new() { GlobalThreshold = minLevel });
            config.AddRuleForAllLevels(new NLog.Targets.FileTarget("raft" + node) { FileName = $"{node}.log" });
            return config;
        }

        var hosts = ports.Select(p => clusterFactory(p)).ToList();
        var startedHosts = hosts.Skip(1).Select(async h => {
            await Task.Delay(Random.Shared.Next(300));
            await h.StartAsync();
            }).ToList();
        await Task.WhenAll(startedHosts);

        Debug.WriteLine("----------------");
        Debug.WriteLine("Done starting!");
        Debug.WriteLine("----------------");
        await Task.Delay(8500);

        await hosts[0].StartAsync();
        Debug.WriteLine("----------------");
        Debug.WriteLine("Done starting extra node!");
        Debug.WriteLine("----------------");

        await Task.Delay(30000);

        var leaderPort = ((IPEndPoint)hosts.First().Leader.EndPoint).Port;
        var leader = hosts.Single(h => ((IPEndPoint)h.LocalMemberAddress).Port == leaderPort);

        foreach (var member in leader.Members)
        {
            var status = member.Status;
            if (status is not ClusterMemberStatus.Available)
                Fail($"Member {member.EndPoint} has unexpected status {status}");
        }

        await Task.WhenAll(hosts.Select(h => Task.Run(() => h.StopAsync())));
    }
freddyrios commented 3 months ago

@sakno I can't reopen the issue. The test above reproduces the bug in 5.1.0.

The last 2 runs using RunUntilFailure reproduced it on the 5th and 6th attempt, respectively.

(screenshot attached)

In short, I did the following to reproduce it, but I'm not sure how much of it is needed:

sakno commented 3 months ago

Could you share the debug log files from the test? From my side, I'll port LeadershipWithUnavailableNodeCore to XUnit to include the test in the existing collection of tests.

freddyrios commented 3 months ago

I did not keep the ones from those runs, but here are the ones from a new run that failed on the 9th attempt, with the node on port 3276 having status Unavailable (logs.zip)

freddyrios commented 3 months ago

I just reproduced it without most of those settings, leaving only ConnectTimeout. I'm doing another run without it now, since an earlier run without it did not reproduce in 20 attempts.

freddyrios commented 3 months ago

OK, even that is not needed; it reproduced on the first attempt.

freddyrios commented 3 months ago

Simplified test with fewer moving parts and faster attempts.

Changes:

That said, the runs I tried varied wildly in the number of attempts needed to reproduce.

    [Fact]
    public async Task LeadershipWithUnavailableNodeCore()
    {
        var ports = new[] { 3265, 3266, 3267, 3268, 3269, 3270, 3271, 3272, 3273, 3274, 3275, 3276, 3277, 3278, 3279, 3280 };

        RaftCluster clusterFactory(int port)
        {
            var config = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Loopback, port)){ ColdStart = false, };
            config.LoggerFactory = LoggerFactory.Create(b => b.AddNLog(GetLoggingConfiguration(port.ToString())).SetMinimumLevel(LogLevel.Debug));
            var builder = config.UseInMemoryConfigurationStorage().CreateActiveConfigurationBuilder();
            foreach (var p in ports)
                builder.Add(new IPEndPoint(IPAddress.Loopback, p));
            builder.Build();
            return new(config);
        }

        static NLog.Config.LoggingConfiguration GetLoggingConfiguration(string node)
        {
            var minLevel = NLog.LogLevel.Debug;
            var config = new NLog.Config.LoggingConfiguration(new() { GlobalThreshold = minLevel });
            config.AddRuleForAllLevels(new NLog.Targets.FileTarget("raft" + node) { FileName = $"{node}.log" });
            return config;
        }

        var hosts = ports.Select(clusterFactory).ToList();
        var startedHosts = hosts.Skip(1).Select(async h => {
            await Task.Delay(Random.Shared.Next(300));
            await h.StartAsync();
            }).ToList();
        await Task.WhenAll(startedHosts);
        await Task.Delay(6000);

        var firstPort = ports[0];
        var leaderPort = ((IPEndPoint)hosts.Skip(1).First().Leader.EndPoint).Port;
        var leader = hosts.Single(h => ((IPEndPoint)h.LocalMemberAddress).Port == leaderPort);

        foreach (var member in leader.Members)
        {
            if (((IPEndPoint)member.EndPoint).Port == firstPort) 
                continue; // we don't check node 1 because we never started it
            var status = member.Status;
            if (status is not ClusterMemberStatus.Available)
                Fail($"Member {member.EndPoint} has unexpected status {status}");
        }

        await Task.WhenAll(hosts.Skip(1).Select(h => Task.Run(() => h.StopAsync())));
    }
sakno commented 3 months ago

Thanks for your code. I was able to repro the issue with the provided test. At least it proves that the root cause is not inside the persistent WAL, because it's reproducible with the in-memory consensus-only WAL. Currently, I've found that the issue is somewhere within the LeaderState<TMember>.ForkHeartbeats private method.