dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

Starting RaftNodes in sequence gives issues #108

Closed · simontherry closed this issue 2 years ago

simontherry commented 2 years ago

Hi,

I found an issue where, if you start the RaftNodes one after the other too quickly (when the configuration is loaded from disk), the cluster cannot start correctly. The nodes start spamming errors, switching leaders, and reporting that consensus cannot be reached:

fail: Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware[1]
      An unhandled exception has occurred while executing the request.
      Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Unexpected end of request content.
         at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1ContentLengthMessageBody.ReadAsyncInternal(CancellationToken cancellationToken)
         at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
         at DotNext.IO.Pipelines.PipeExtensions.ReadAsync[TResult,TParser](PipeReader reader, TParser parser, CancellationToken token) in C:\Users\SimonTH\Documents\GitHub\dotNext\src\DotNext.IO\IO\Pipelines\PipeExtensions.Readers.cs:line 65
         at DotNext.IO.Pipelines.PipeExtensions.<ReadBlockAsync>g__ReadBlockSlowAsync|26_0(PipeReader reader, Memory`1 output, CancellationToken token) in C:\Users\SimonTH\Documents\GitHub\dotNext\src\DotNext.IO\IO\Pipelines\PipeExtensions.Readers.cs:line 556
         at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster.AppendEntriesAsync(HttpRequest request, HttpResponse response, CancellationToken token) in C:\Users\SimonTH\Documents\GitHub\dotNext\src\cluster\DotNext.AspNetCore.Cluster\Net\Cluster\Consensus\Raft\Http\RaftHttpCluster.Messaging.cs:line 260
         at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.<Invoke>g__Awaited|6_0(ExceptionHandlerMiddleware middleware, HttpContext context, Task task)

I added my proof of concept on a forked branch of develop: https://github.com/simontherry/dotNext/tree/PoC_RaftNode .

To reproduce, follow this scenario:

With a timeout of 2 seconds between launching the nodes, the errors do not occur. Like this:

start RaftNode.exe http 3262 ".\log\3262"
timeout 2
start RaftNode.exe http 3263 ".\log\3263"
timeout 2
start RaftNode.exe http 3264 ".\log\3264"

Can you confirm this is a bug? Or am I using the API incorrectly?

sakno commented 2 years ago

@simontherry , according to your code, the API cannot be used in that way. Direct modification of the configuration is not supported while the node is already running. Moreover, concurrent modification of the configuration leads to unpredictable behavior. The current implementation doesn't allow adding/removing nodes concurrently; in other words, the cluster can add or remove only one node at a time.

Membership changes cannot be applied to the configuration directly. Instead, a new node must pass through the bootstrap process with the help of the IRaftHttpCluster.AddMemberAsync or IRaftHttpCluster.RemoveMemberAsync methods. Note that these methods can be called on the leader node only. The sequence is as follows:

  1. Start the first node in the cluster with coldStart=true
  2. Start another node with coldStart=false and call AddMemberAsync on the leader to add this node. Obviously, you need a programmed HTTP endpoint or some other way to interact with the leader
  3. Wait for the readiness probe (IRaftCluster.Readiness property). If the node is added successfully by the leader, the probe will transition into a completed state
  4. Repeat from step 2 for every new node

Step # 1 is only needed to bootstrap the cluster from scratch.
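The sequence above could look roughly like the following sketch. The exact AddMemberAsync overload and the shape of the triggering HTTP endpoint are assumptions here; consult the actual IRaftHttpCluster API for the real signatures:

```csharp
// Sketch only, assuming dotNext's IRaftHttpCluster API.
// Runs on the leader node in response to some admin request (step 2 above).
using DotNext.Net.Cluster.Consensus.Raft.Http;
using DotNext.Net.Http;

internal static class MembershipSketch
{
    internal static async Task AddNodeAsync(
        IRaftHttpCluster cluster,          // injected cluster view on the leader
        HttpEndPoint newNode,              // address of the node started with coldStart=false
        CancellationToken token)
    {
        // AddMemberAsync must be invoked on the current leader; it starts
        // the bootstrap process for the new member. The boolean-result
        // overload is an assumption.
        if (!await cluster.AddMemberAsync(newNode, token))
            throw new InvalidOperationException("Member was not added; is this node the leader?");

        // Meanwhile, the new node itself waits for the readiness probe
        // (step 3 above), e.g.:
        //   await cluster.Readiness.WaitAsync(token);
    }
}
```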

If you want the same behavior as with the in-memory configuration, you need to write your own code:

  1. Inherit from PersistentClusterConfigurationStorage<HttpEndPoint>
  2. Implement IHostedService on the same class
  3. Implement StartAsync and fill the configuration there
  4. Register the custom storage implementation with the RaftClusterConfiguration.UseConfigurationStorage method
  5. Register the same storage implementation as an IHostedService singleton. Note that you need to expose the same singleton service through both interfaces. This is important; otherwise you will register two separate services and Raft will not be able to distinguish them

In that case, the injection of IClusterConfigurationStorage<HttpEndPoint> into the Raft implementation happens before the initialization of the cluster configuration.
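The five steps above might be sketched like this. The base-class constructor and the exact mutation API used inside StartAsync are assumptions; check PersistentClusterConfigurationStorage and IClusterConfigurationStorage for the real members:

```csharp
// Sketch only, assuming dotNext's membership API: a persistent configuration
// storage that pre-fills the member list when the host starts.
using DotNext.Net.Cluster.Consensus.Raft.Membership;
using DotNext.Net.Http;
using Microsoft.Extensions.Hosting;

internal sealed class PrefilledConfigurationStorage :
    PersistentClusterConfigurationStorage<HttpEndPoint>, IHostedService
{
    public PrefilledConfigurationStorage(string path)
        : base(path)  // constructor arguments are an assumption
    {
    }

    Task IHostedService.StartAsync(CancellationToken token)
    {
        // Fill the configuration with the well-known members here, before
        // the cluster starts. The concrete calls are hypothetical; use the
        // actual IClusterConfigurationStorage methods.
        return Task.CompletedTask;
    }

    Task IHostedService.StopAsync(CancellationToken token) => Task.CompletedTask;
}

// Registration sketch: the SAME singleton must be exposed through both
// interfaces (step 5 above), e.g. in ConfigureServices:
//   services.AddSingleton<PrefilledConfigurationStorage>();
//   services.AddSingleton<IHostedService>(
//       sp => sp.GetRequiredService<PrefilledConfigurationStorage>());
// plus RaftClusterConfiguration.UseConfigurationStorage pointing at it.
```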

Why is this so hard? Because persistent configuration is aimed at production use, where members are added dynamically. For testing purposes, you should use the in-memory configuration storage.

Another simple way is just to create a file with the necessary configuration using the API and then make a copy of the file for each executed member.

simontherry commented 2 years ago

Hi @sakno , thanks for the response. I will try this out and let you know my findings.

simontherry commented 2 years ago

@sakno . Why can I not use services.UsePersistentConfigurationStorage(), but instead have to create my own implementation?

I have tried your suggestion of having one node cold start, then adding the nodes through an HTTP endpoint, and the configuration seems okay. I checked the generated files (active.list) and everything looks good on each node.

When I restart the cluster (all nodes with coldStart=false), it starts giving errors. Is it not possible to reboot the entire cluster once it went down? Do you always need to start at least one node with coldStart=true after all nodes went down?

sakno commented 2 years ago

UsePersistentConfigurationStorage is for production use. This method doesn't give you an opportunity to fill the configuration, because in real life the configuration must be changed using the promotion methods from IRaftHttpCluster. In your example, the membership list is hard-coded in the program itself. In that case there is no need to persist it (because the configuration is already persisted in the code).

coldStart=true is needed only if the configuration is empty. However, if you don't remove the crashed nodes from the configuration, the leader will probe them again and again. If the majority is not present, the cluster will not elect a new leader. That's why a node must be removed gracefully.
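Graceful removal might look like the following sketch, mirroring the add path. The boolean-result RemoveMemberAsync overload is an assumption; the call must reach the leader node:

```csharp
// Sketch only, assuming dotNext's IRaftHttpCluster API: remove a node from
// the membership before shutting it down, so the leader stops probing it
// and the majority calculation excludes it.
using DotNext.Net.Cluster.Consensus.Raft.Http;
using DotNext.Net.Http;

internal static class RemovalSketch
{
    internal static async Task RemoveNodeAsync(
        IRaftHttpCluster cluster,    // injected cluster view on the leader
        HttpEndPoint departingNode,  // address of the node being shut down
        CancellationToken token)
    {
        if (!await cluster.RemoveMemberAsync(departingNode, token))
            throw new InvalidOperationException("Member was not removed; is this node the leader?");
        // Only after this succeeds should the departing process be stopped.
    }
}
```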

simontherry commented 2 years ago

Hi @sakno ,

I tried your suggestions but still no success.

Check my code at https://github.com/simontherry/raft-poc/tree/main/RaftPoc

I followed the following steps:

Everything is fine up till this point.

I don't understand what I am doing wrong here...

sakno commented 2 years ago

@simontherry , I've added a test to verify this situation. Here it is: https://github.com/dotnet/dotNext/blob/a5a1b72b1eca14bbd4caa2e0515414e95fa8f5dd/src/DotNext.Tests/Net/Cluster/Consensus/Raft/Http/RaftHttpClusterTests.cs#L326-L423

It provides absolutely the same scenario as yours:

You can use this test as a starting point for your research. Note that the WaitForLeaderAsync method is expected in the 4.5.0 release and is not currently available in the master branch; it was added for convenience. The same behavior can be achieved with the LeaderChanged event.
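For older versions without WaitForLeaderAsync, a wait built on LeaderChanged could be sketched like this. The event delegate shape (cluster plus nullable new leader) is an assumption; verify it against the actual ICluster definition:

```csharp
// Sketch: waiting for a leader via the LeaderChanged event, as an
// alternative to WaitForLeaderAsync (4.5.0+ only).
using DotNext.Net.Cluster;

internal static class LeaderWaitSketch
{
    internal static Task<IClusterMember> WaitForLeaderAsync(ICluster cluster)
    {
        var tcs = new TaskCompletionSource<IClusterMember>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        cluster.LeaderChanged += OnLeaderChanged;

        // The leader may already be elected before we subscribed.
        if (cluster.Leader is { } leader)
            tcs.TrySetResult(leader);

        return tcs.Task;

        void OnLeaderChanged(ICluster sender, IClusterMember? newLeader)
        {
            // The event also fires with null when leadership is lost;
            // complete the task only on a real leader.
            if (newLeader is not null && tcs.TrySetResult(newLeader))
                sender.LeaderChanged -= OnLeaderChanged;
        }
    }
}
```

Combine with Task.WhenAny and a timeout task if you need the DefaultTimeout behavior from the test.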

simontherry commented 2 years ago

Hi @sakno ,

Is it intentional that you do not await the leader tasks? As soon as I add await to check the leader variable, the test times out, presumably because it fails to elect a leader, as in my demo as well?

    var leader1 = GetLocalClusterView(host1).WaitForLeaderAsync(DefaultTimeout);
    var leader2 = GetLocalClusterView(host2).WaitForLeaderAsync(DefaultTimeout);
    var leader3 = GetLocalClusterView(host3).WaitForLeaderAsync(DefaultTimeout);

Adding the await keyword:

    var leader1 = await GetLocalClusterView(host1).WaitForLeaderAsync(DefaultTimeout);
    var leader2 = await GetLocalClusterView(host2).WaitForLeaderAsync(DefaultTimeout);
    var leader3 = await GetLocalClusterView(host3).WaitForLeaderAsync(DefaultTimeout);

simontherry commented 2 years ago

@sakno

I saw the listener was unused in the second part, so I tried adding

    await listener.Result.WaitAsync(DefaultTimeout);

but this also times out.

sakno commented 2 years ago

@simontherry , I found an issue. It is associated with the persistent storage only. Its CopyToAsync method was not thread-safe, but it should be, because configuration reading happens in parallel for each heartbeat. https://github.com/dotnet/dotNext/blob/8ed5ff59111f3fe5344885489709ceffd4baa704/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/PersistentClusterConfigurationStorage.cs#L79-L89

Now it is safe. The fix is in the develop branch. Let me know if everything is fine; if so, I'll publish a new release this week.

simontherry commented 2 years ago

Hi @sakno ,

It seems to work now in my POC.

However, I ran the RegressionIssue108 test until failure, and it sometimes fails with the following exception:

System.NullReferenceException
Object reference not set to an instance of an object.
   at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpClusterTests.RegressionIssue108() in C:\Users\SimonTH\Documents\GitHub\dotNext\src\DotNext.Tests\Net\Cluster\Consensus\Raft\Http\RaftHttpClusterTests.cs:line 378
   at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass48_1.<<InvokeTestMethodAsync>b__1>d.MoveNext() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\Runners\TestInvoker.cs:line 264
--- End of stack trace from previous location ---
   at Xunit.Sdk.ExecutionTimer.AggregateAsync(Func`1 asyncAction) in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\ExecutionTimer.cs:line 48
   at Xunit.Sdk.ExceptionAggregator.RunAsync(Func`1 code) in C:\Dev\xunit\xunit\src\xunit.core\Sdk\ExceptionAggregator.cs:line 90

Also, adding

    await leader1;
    await listener.Result.WaitAsync(DefaultTimeout);

makes the test time out...

So I am still in doubt whether everything is fixed.

sakno commented 2 years ago

The test is fixed. await listener.Result.WaitAsync is not needed in the second part.

simontherry commented 2 years ago

@sakno Great, thanks for the quick solve of the issue!