dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

Cluster unstable when running tcp example #84

Closed sakno closed 2 years ago

sakno commented 2 years ago

Discussed in https://github.com/dotnet/dotNext/discussions/83

Originally posted by **AntonWerenberg** November 8, 2021

I've been experimenting with the Raft node example code for some time, and one issue keeps coming up: instability in the cluster. The nodes often run fine at first, but sooner or later they go out of sync and fail. One node shows the TCP warning

"warn: DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer[74022] Request has timed out"

and the others start running elections but never reach consensus. The behaviour can be seen in the attached image.

![image](https://user-images.githubusercontent.com/90845471/140749481-28669b16-b126-430f-b872-1d3c252d6946.png)

I'm running with the default election timeout settings from the example.

I'm not a great programmer, and I'm having trouble seeing which direction to take to get to the bottom of this. My main concern is whether this could be due to my own setup, for example something not configured correctly.

I measured the broadcast time using the metrics collector. It shows broadcast times of around 3 ms.

sakno commented 2 years ago

RC1 has been published. A ConnectTimeout configuration option has been added to the TCP transport configuration, and it is now explicitly defined in the Raft example.
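
For anyone landing here later, a rough sketch of what setting the option explicitly might look like. Only the ConnectTimeout name comes from this thread. The other property names, the constructor shape, the port, and every value are assumptions based on my reading of the TCP transport configuration, so check them against the example in the release you are using.

```csharp
using System;
using System.Net;
using DotNext.Net.Cluster.Consensus.Raft;

// Hypothetical local endpoint for a single node; the port is illustrative.
var localEndPoint = new IPEndPoint(IPAddress.Loopback, 3262);

// Assumed shape of the TCP transport configuration.
// All timeout values below are placeholders, not recommendations.
var configuration = new RaftCluster.TcpConfiguration(localEndPoint)
{
    // Maximum time to wait for a response from a peer.
    RequestTimeout = TimeSpan.FromMilliseconds(140),

    // Maximum time to wait while establishing a TCP connection to a peer.
    // Keeping this below the election timeout means an unreachable peer
    // does not stall the leader's heartbeat round for a whole election window.
    ConnectTimeout = TimeSpan.FromMilliseconds(10),

    // Election timeout range in milliseconds (assumed to be integer properties).
    LowerElectionTimeout = 150,
    UpperElectionTimeout = 300,
};
```

The general idea is to keep both the connect and request timeouts comfortably under the lower election timeout, so that one slow or dead peer cannot delay heartbeats to the healthy ones long enough to trigger new elections.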