akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

SBR configuration #6898

Closed PeterHageus closed 1 year ago

PeterHageus commented 1 year ago

Hi. Don't know if this is an Akka.Hosting issue or Akka.Cluster, but we have a problem with the default configuration:

During cluster churn (high CPU load on the servers), our seed nodes are sometimes downed. They then form a new cluster, but only a minority partition. Everything restarted or started after this joins that minority partition, while the majority can remain stable for at least 24h (until IIS recycles), leading to a long-lived partition.

Would setting KeepMajority.Role to the seed node role make SBR take only the seed nodes into account when resolving a partition? Would that be the correct way to configure the cluster?

Arkatufus commented 1 year ago

This is an Akka.Cluster behaviour and the short answer is "it depends".

When you set KeepMajority.Role and a split brain occurs, only the cluster members that have that role are considered when SBR tries to resolve the split. This means that you will need at least 5 seed nodes for this to work properly in production.

But let's take some examples:
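For reference, a minimal HOCON sketch of what restricting keep-majority to a role could look like. The role name `seed` is a hypothetical placeholder for whatever role your seed nodes actually carry, and the `stable-after` value is only illustrative:

```hocon
akka.cluster {
  # Enable the split brain resolver as the downing provider
  # (recent Akka.NET versions may already do this by default).
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"

  split-brain-resolver {
    active-strategy = keep-majority

    # How long membership must be stable before SBR makes a decision.
    # Illustrative value, tune for your environment.
    stable-after = 20s

    keep-majority {
      # Only members with this role are counted when deciding which side
      # of the partition is the majority. "seed" is an assumed role name.
      role = "seed"
    }
  }
}
```

With this in place, the nodes without the `seed` role do not count toward the majority, so the number of role members on each side of the split decides which side survives.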

Cluster settings:

Scenario 1, the happy path: The cluster split into these parts:

Scenario 2, the not-so-happy path: The cluster split into these parts:

Scenario 3, the ugly path:

Arkatufus commented 1 year ago

The only way to fix this problem is to remove the arbiter from inside the cluster, be it a Lighthouse instance or a fixed count of seed nodes. To do this, you will need Akka.Management.Cluster.Bootstrap in combination with Akka.Discovery, which uses an arbiter outside of the cluster and is available for Kubernetes, Azure, and AWS.

Note that Akka.Discovery.Config is not the answer. It is still using the cluster itself as the arbiter, which defeats the purpose.

PeterHageus commented 1 year ago

OK, thanks for your input! Guess our only strategy atm is higher tolerance for heartbeats, to avoid unnecessary disconnects.
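A rough sketch of what a higher heartbeat tolerance could look like in HOCON, assuming the standard `akka.cluster.failure-detector` settings. The values below are illustrative guesses, not tested recommendations for this cluster:

```hocon
akka.cluster.failure-detector {
  # How often heartbeats are sent to monitored nodes.
  heartbeat-interval = 1s

  # Higher threshold = fewer false positives, slower detection of real
  # failures. Illustrative value.
  threshold = 12.0

  # How long missed heartbeats are tolerated before a node is considered
  # unreachable, e.g. during GC pauses or CPU starvation. Illustrative value.
  acceptable-heartbeat-pause = 10s
}
```

The trade-off is that genuinely crashed nodes also take longer to be marked unreachable, so these values would need to be balanced against how quickly the cluster has to react to real failures.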