notxcain commented 6 years ago

I want to describe a possible split-brain scenario (even with SBR enabled) when using K8s Deployment for Akka Cluster.

How `Deployment` works

A Deployment assumes its replicas are stateless and that it is safe to run more of them than the replicas parameter specifies.
The Kubernetes controller manager tries to check a node's status nodeStatusUpdateRetry times every --node-monitor-period. After --node-monitor-grace-period it considers the node unhealthy, and after --pod-eviction-timeout it removes the node's pods (more on this here).
After pod eviction, the Deployment reschedules them on reachable nodes.
In case of a network problem, replicas left on unreachable nodes keep running until the connection with the API server is restored, at which point the kubelet on that node stops the evicted containers.
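
To get a feel for the timing described above, here is a minimal sketch that adds the two controller-manager periods together. The numeric values are illustrative assumptions, not values read from any cluster, and `ControllerTimings` and `seconds_until_eviction` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ControllerTimings:
    # Illustrative values in seconds (assumptions, not read from any cluster).
    node_monitor_grace_period: float = 40.0
    pod_eviction_timeout: float = 300.0


def seconds_until_eviction(timings: ControllerTimings) -> float:
    """Approximate delay between a node going silent and its pods being evicted."""
    return timings.node_monitor_grace_period + timings.pod_eviction_timeout


if __name__ == "__main__":
    delay = seconds_until_eviction(ControllerTimings())
    print(f"pods are evicted roughly {delay:.0f}s after the node goes silent")
```

During that whole window, and for as long as the partition lasts afterwards, the original containers on the unreachable node are still running.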

Split-brain scenario

Let's say we run a 3-node Akka Cluster with SBR enabled. 2 nodes become unreachable but are still running. The single-node partition shuts itself down, and K8s also evicts the pods from the unreachable nodes (not sure if the order matters). Eventually K8s starts 3 containers on reachable nodes, and they form a cluster. Now we have 2 clusters: a new 3-node cluster on reachable nodes, and the old 2-node cluster on unreachable ones.

What about `StatefulSet`?

The behavior of StatefulSet differs from Deployment. It WON'T evict pods from unreachable nodes, because one of its guarantees is that at most one instance of each replica is running at any given moment. We run our clusters across 3 AZs (on-prem), using the Pod Anti-Affinity feature; you can read more about it here

Please ask questions or correct me if I'm wrong.

ktoso commented 6 years ago

Hi @notxcain, your writeup is a bit weird... specifically:

Eventually K8s starts 3 containers on reachable nodes, they form cluster. Now we have 2 clusters.

Why 3 new ones? Why would they form a cluster?

What about StatefulSet?

Why we don't use StatefulSet is documented here: https://github.com/akka/akka-management/issues/134 Please give it a look and correct us if you think that analysis is incorrect.

notxcain commented 6 years ago

Why 3 new ones?

One member left on reachable worker will kill itself. K8s will restart it. Also k8s will start evicted pods on reachable nodes. At this moment there will be 5 running members.

Why would they form a cluster?

They would form a cluster because of how cluster-bootstrap works. Members left on unreachable nodes won't be listed in the headless service DNS records, so there is going to be no evidence of their existence.

Please give it a look and correct us if you think that analysis is incorrect.

I'll take a closer look later.

ktoso commented 6 years ago

One member left on reachable worker will kill itself. K8s will restart it. Also k8s will start evicted pods on reachable nodes. At this moment there will be 5 running members.

I have a feeling wording is not quite precise here... Could you please explain the exact steps we are worried about and how it relates to "akka nodes" "pods" "physical nodes" etc?

 They would form a cluster because of how cluster-bootstrap works.

Again, I feel you're making a huge mental leap over all the details in between. Which nodes are visible in discovery via the k8s APIs, and for how long, plays a huge role in these things. I don't see that taken into consideration in this ticket so far.

notxcain commented 6 years ago

I have a feeling wording is not quite precise here... Could you please explain the exact steps we are worried about and how it relates to "akka nodes" "pods" "physical nodes" etc?

akka member == pod
worker == physical node

Workers running 2 of 3 members become unreachable from the point of view of K8s and the 3rd member.
K8s evicts the pods from the unreachable workers and schedules them on reachable ones; it also removes the unreachable pods from the headless service DNS record.
The two evicted members are started, so there are 3 members on reachable workers and 2 still running on unreachable workers.
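
The steps above can be sketched as a small model. This is only an illustration under simplifying assumptions: `Pod` and `evict_and_reschedule` are hypothetical names, all replacements land on a single reachable worker, and the restart of the third member is not modelled separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    member: str   # Akka member running in the pod
    worker: str   # physical node hosting the pod


def evict_and_reschedule(pods, unreachable_workers, target_worker):
    """Model K8s evicting pods from unreachable workers.

    Returns (running, dns_records): everything that is actually running,
    and what the headless service would still list.
    """
    stranded = [p for p in pods if p.worker in unreachable_workers]
    reachable = [p for p in pods if p.worker not in unreachable_workers]
    replacements = [Pod(f"{p.member}-new", target_worker) for p in stranded]
    # Stranded containers keep running: their kubelet cannot be told to stop them.
    running = stranded + reachable + replacements
    dns_records = reachable + replacements
    return running, dns_records


pods = [Pod("a", "A"), Pod("b", "B"), Pod("c", "C")]
running, dns_records = evict_and_reschedule(pods, {"A", "B"}, target_worker="C")
print(len(running), "running members,", len(dns_records), "visible in DNS")
```

The gap between the two counts is the problem: two members are running but invisible to anything that bootstraps from DNS.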

hseeberger commented 6 years ago

I totally agree that this could happen: SBR and k8s don't have the same understanding of which nodes are available or not (they don't even have the same understanding of what a node is, but in the case of a network partition that doesn't matter).

In a three node (physical) cluster k8s might consider nodes A and B unreachable. Assuming member nodes (ActorSystem) a, b and c are each running within a pod on nodes A, B and C respectively, SBR on c detects a and b as unreachable, whereas SBR on a and b detects c as unreachable.

Hence only k8s and SBR on c are in sync. But with quorum or majority, SBR will down c and hence leave a and b up which unfortunately run on A and B which are considered unreachable by k8s.
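
A minimal sketch of that disagreement, assuming a plain majority rule. `keep_majority` is a hypothetical helper, not the SBR implementation, and it ignores the tie-break a real resolver needs for an even split.

```python
def keep_majority(visible: int, total: int) -> bool:
    """True if a side that sees `visible` of `total` members may stay up.

    Simplification: a real resolver also needs a tie-break for an even split.
    """
    return 2 * visible > total


# Members each side of the partition can still see (including itself).
sides = {"a+b": ["a", "b"], "c": ["c"]}
total_members = 3

sbr_survivors = [m for side in sides.values()
                 if keep_majority(len(side), total_members) for m in side]

# K8s view: workers A and B are unreachable, so only c's pod counts as alive.
k8s_alive = ["c"]

print("SBR keeps:", sbr_survivors)
print("K8s keeps:", k8s_alive)
```

The two lists have no member in common, which is exactly the situation described above.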

In the words of Tyrion Lannister: "we're fucked"

longshorej commented 6 years ago

I'd like to draw attention to the "join existing" flag (see reference.conf) which can help here. The initial deployment of your Akka Cluster app needs to have this flag disabled, but then any future rolling updates can enable it. These are the majority of cases of course, so in practice it shouldn't introduce much operator overhead.

hseeberger commented 6 years ago

@longshorej are you referring to akka.management.cluster.bootstrap.form-new-cluster?

longshorej commented 6 years ago

Yeah -- oops, I forgot it was named that. So, you'd need that set to true for the initial deployment. Subsequent deployments can disable it and they'll then only join an existing cluster. You'd need appropriate readiness checks defined and https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable set to 10 or so, depending upon cluster size.

hseeberger commented 6 years ago

@longshorej but wouldn't that lead to a complete shutdown in the scenario I described above? k8s would restart the pods from the unreachable nodes A and B on C, but the newly created actor systems would not be able to form a new cluster (which is good). But the actor system c was shut down by SBR, and k8s will start its pod again on C, too. So we have three new actor systems, none of which can form a cluster. And we also have the two old actor systems a and b on the unreachable nodes A and B.
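
To make that outcome concrete, here is a rough sketch of the decision each new actor system faces. `bootstrap_action` is a hypothetical helper that greatly simplifies what cluster-bootstrap does; the only input taken from this thread is the form-new-cluster setting.

```python
def bootstrap_action(existing_cluster_seen: bool, form_new_cluster: bool) -> str:
    """Simplified decision of one starting member (hypothetical helper)."""
    if existing_cluster_seen:
        return "join existing"
    return "form new" if form_new_cluster else "wait"


# Three fresh actor systems on worker C; a and b are not discoverable from here.
new_members = ["a-new", "b-new", "c-new"]
actions = {m: bootstrap_action(existing_cluster_seen=False, form_new_cluster=False)
           for m in new_members}
print(actions)
```

With the flag disabled every new member ends up waiting, so nothing reachable serves traffic until the partition heals or an operator steps in.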

thomschke commented 6 years ago

Try to poll the current replicas (.status.readyReplicas) from api-server and compute the matching quorum-size incl. recommended stable-after (see table).

AND

Suspend your SBR for as long as:

the api-server is reachable but a deployment update is running (.status.replicas != .status.readyReplicas), or
the api-server is temporarily unreachable (caused by a k8s cluster update or crash)

Sure, there is not a “one size fits all” solution to this problem. But I think this strategy fits the characteristics of a k8s cluster.
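
A minimal sketch of this strategy, assuming the two status fields have already been fetched from the api-server (the polling itself is left out). `quorum_size` and `sbr_may_act` are hypothetical names, and the stable-after value is not modelled.

```python
from typing import Optional


def quorum_size(ready_replicas: int) -> int:
    """Strict majority of the currently ready replicas."""
    return ready_replicas // 2 + 1


def sbr_may_act(api_reachable: bool,
                replicas: Optional[int],
                ready_replicas: Optional[int]) -> bool:
    """Hypothetical gate: only let SBR down members when K8s state is settled."""
    if not api_reachable:
        return False          # api-server temporarily unreachable
    if replicas is None or ready_replicas is None:
        return False          # status not known yet
    return replicas == ready_replicas   # False while a deployment update is running


print(quorum_size(3), sbr_may_act(True, 3, 3), sbr_may_act(True, 4, 3))
```

The gate only says when SBR is allowed to act; what it does once allowed is still decided by the configured strategy and the computed quorum size.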

thomschke commented 6 years ago

Currently all members are returned by contact point probing (see #254). That's why new members try to join unreachable members during k8s rolling updates.

chbatey commented 5 years ago

#254 was merged, but I think we can still improve the docs / configuration recommendations for these scenarios.

akka / akka-management

Possible split-brain of Akka Cluster using K8s Deployment #156
