SBR configuration

PeterHageus commented 1 year ago

Hi. Don't know if this is an Akka.Hosting issue or an Akka.Cluster one, but we have a problem with the default configuration:

During cluster churn (high CPU load on the servers) our seed nodes are sometimes downed. They then form a new cluster of their own, which is only a minority of the original. Everything restarted or started after this connects to the minority part, while the majority can remain stable for at least 24h (until IIS recycles), leading to a long-lived partition.

Would setting KeepMajority.Role to the seed node role mean that only the seed nodes are taken into account when resolving a partition? Would that be the correct way to configure the cluster?

Arkatufus commented 1 year ago

This is an Akka.Cluster behaviour and the short answer is "it depends".

How big is your cluster?
What is your ratio of cluster size to seed nodes?
Are you willing to take the risk that a big part of the cluster will be downed if it is split off from the smaller part that holds all the seed nodes?

When you set KeepMajority.Role, what happens when a split brain occurs is that only the cluster members that have that role are considered when SBR tries to resolve the split. This would mean that you will need at least 5 seed nodes for this to work properly in production.

But let's take some examples:

Cluster settings:

5 seed nodes with role "seed"
100 nodes with no roles
KeepMajority.Role is set to "seed"

Scenario 1, the happy path: The cluster split into these parts:

Part 1: 3 "seed" nodes and 80 non-role nodes
Part 2: 2 "seed" nodes and 20 non-role nodes

SBR Resolution: SBR will down part 2.

Scenario 2, the not-so-happy path: The cluster split into these parts:

Part 1: 3 "seed" nodes and 10 non-role nodes
Part 2: 2 "seed" nodes and 90 non-role nodes

SBR Resolution: SBR will down part 2, even though it has the "majority" of the general node count. This is because SBR only counts nodes that have the declared role when deciding which side holds the majority.
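
To make scenarios 1 and 2 concrete, here is a minimal Python sketch of the counting rule described above. It is not Akka.NET code: `Member`, `keep_majority`, and the `make` helper are hypothetical names, the result strings are made up for illustration, and the tie-break and "no member has the role" branches are assumptions about how the real resolver behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    address: str
    roles: frozenset = frozenset()


def keep_majority(reachable, unreachable, role=None):
    """Decide, from one side's point of view, which side gets downed.

    Returns "down-unreachable" (this side survives), "down-reachable"
    (this side downs itself) or "down-all".
    """
    def counted(members):
        # Only members carrying the configured role take part in the vote.
        return [m for m in members if role is None or role in m.roles]

    mine, theirs = counted(reachable), counted(unreachable)
    if not mine and not theirs:
        # Assumption: with no voting members anywhere, nothing can win.
        return "down-all"
    if len(mine) != len(theirs):
        return "down-unreachable" if len(mine) > len(theirs) else "down-reachable"
    # Assumption: on a tie, the side holding the lowest address survives.
    lowest = min(mine + theirs, key=lambda m: m.address)
    return "down-unreachable" if lowest in mine else "down-reachable"


def make(prefix, count, roles=()):
    return [Member(f"{prefix}-{i:03d}", frozenset(roles)) for i in range(count)]


seeds = make("seed", 5, ["seed"])
workers = make("worker", 100)

# Scenario 2: 3 seeds + 10 workers versus 2 seeds + 90 workers.
part1 = seeds[:3] + workers[:10]
part2 = seeds[3:] + workers[10:]

print(keep_majority(part1, part2, role="seed"))  # down-unreachable (part 1 survives)
print(keep_majority(part2, part1, role="seed"))  # down-reachable (part 2 downs itself)
print(keep_majority(part2, part1))               # down-unreachable (no role: part 2 would win)
```

The last call shows the trade-off: without the role filter the 92-node side wins, with it the 13-node side wins because it holds 3 of the 5 "seed" nodes.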

Scenario 3, the ugly path:

All of the "seed" nodes go down, leaving the 100 non-role nodes stranded.
One "seed" node is restarted and joins itself to form a new cluster.

SBR Resolution: Nothing. You will end up with a permanent split brain, and the two clusters cannot find each other because the non-role cluster does not know about the newly created cluster.

Arkatufus commented 1 year ago

The only way to fix this problem is to remove the arbiter inside the cluster, be it a Lighthouse instance or a fixed count of seed nodes. To do this, you will need Akka.Management.Cluster.Bootstrap in combination with Akka.Discovery which uses an arbiter outside of the cluster and is available for Kubernetes, Azure, and AWS.

Note that Akka.Discovery.Config is not the answer. It is still using the cluster itself as the arbiter, which defeats the purpose.

PeterHageus commented 1 year ago

OK, thanks for your input! Guess our only strategy atm is higher tolerance for heartbeats, to avoid unnecessary disconnects.

akkadotnet / akka.net

SBR configuration #6898