Open notxcain opened 6 years ago
Hi @notxcain, your writeup is a bit weird... specifically:
Eventually K8s starts 3 containers on reachable nodes, they form cluster. Now we have 2 clusters.
Why 3 new ones? Why would they form a cluster?
What about StatefulSet?
Why not stateful set is documented here: https://github.com/akka/akka-management/issues/134 Please give it a look and correct us if you think that analysis is incorrect.
Why 3 new ones?
One member left on reachable worker will kill itself. K8s will restart it. Also k8s will start evicted pods on reachable nodes. At this moment there will be 5 running members.
Why would they form a cluster?
They would form a cluster because of how cluster-bootstrap works. Members left on unreachable nodes won't be listed in headless service DNS records, so there gonna be no evidence of their existence.
Please give it a look and correct us if you think that analysis is incorrect.
I'll take a closer look later.
One member left on reachable worker will kill itself. K8s will restart it. Also k8s will start evicted pods on reachable nodes. At this moment there will be 5 running members.
I have a feeling wording is not quite precise here... Could you please explain the exact steps we are worried about and how it relates to "akka nodes" "pods" "physical nodes" etc?
They would form a cluster because of how cluster-bootstrap works.
Again I feel you're making a huge mental leap over all the details between. Which nodes and for what time are visible in discovery via k8s APIs etc plays a huge role in these things. I don't see these taken into consideration in this ticket so far?
I have a feeling wording is not quite precise here... Could you please explain the exact steps we are worried about and how it relates to "akka nodes" "pods" "physical nodes" etc?
akka member == pod worker == physical node
I totally agree that this could happen: SBR and k8s don't have the same understanding about which nodes (they even don't have the same understanding what a node is, but in the case of a network partition that doesn't matter) are available or not.
In a three node (physical) cluster k8s might consider nodes A and B unreachable. Assuming member nodes (ActorSystem) a, b and c are running on (within a pod) on each of the nodes A, B and C, SBR on c detects a and b unreachable whereas SBR on a and b detects c as unreachable.
Hence only k8s and SBR on c are in sync. But with quorum or majority, SBR will down c and hence leave a and b up which unfortunately run on A and B which are considered unreachable by k8s.
In the words of Tyrion Lannister: "we're fucked"
I'd like to draw attention to the "join existing" flag (see reference.conf) which can help here. The initial deployment of your Akka Cluster app needs to have this flag disabled, but then any future rolling updates can enable it. These are the majority of cases of course, so in practice it shouldn't introduce much operator overhead.
@longshorej are you referring to akka.management.cluster.bootstrap.form-new-cluster
?
Yeah -- oops, I forgot it was named that. So, you'd need that set to true for the initial deployment. Subsequent deployments can disable it and they'll then only join an existing cluster. You'd need appropriate readiness checks defined and https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable set to 10 or so, depending upon cluster size.
@longshorej but wouldn't that lead to a complete shutdown in the scenario I described above? k8s would start the pods on the unreachable nodes A and B on C, but the newly created actor systems would not be able to form a new cluster (which is good). But the actor system c was shut down by SBR and k8s will start its pod again on C, too. So we have three new actor systems, none of which can form a cluster. And we also have the two old actor systems a and b on the unreachable nodes A and B.
Try to poll the current replicas (.status.readyReplicas
) from api-server and compute the matching quorum-size
incl. recommended stable-after
(see table).
AND
Stop your SBR as long as if:
.status.replicas != .status.readyReplicas
) orSure, there is not a “one size fits all” solution to this problem. But I think this strategy fits the characteristics of a k8s cluster.
Currently all members are returned by contact point probing -> #254 That's why new members try to join unreachable members during k8s rolling updates.
I want to describe a possible split-brain scenario (even with SBR enabled) when using K8s Deployment for Akka Cluster.
How
Deployment
worksDeployment
assumes it's replicas are stateless and that it's safe to have more of them than configured byreplicas
parameternodeStatusUpdateRetry
times every--node-monitor-period
of time. After--node-monitor-grace-period
it will consider node unhealthy. It will remove its pods after--pod-eviction-timeout
(more on this here)Deployment
will reschedule them on reachable nodes.Split-brain scenario
Let's say we run 3 nodes Akka Cluster with SBR enabled. 2 nodes become unreachable, but still running. A single node partition shuts itself down and also k8s evicts pods from unreachable nodes, not sure if orded matters. Eventually K8s starts 3 containers on reachable nodes, they form cluster. Now we have 2 clusters. New 3 node cluster on reachable nodes, and old 2 node cluster on unreachable.
What about
StatefulSet
?The behavior of
StatefulSet
differs fromDeployment
. It WON'T evict pods from unreachable nodes, because one of its guarantee is that exactly one instance of each replica is being run at any given moment. We run our clusters across 3 AZs (on-prem), using Pod Anit-Affinity feature, you can read more about it herePlease ask question or correct me if I'm wrong.