akka / akka-management

Akka Management is a suite of tools for operating Akka Clusters.
https://doc.akka.io/docs/akka-management/

akka-cluster kubernetes discovery should consider each node's label selector #625

Open jsravn opened 4 years ago

jsravn commented 4 years ago

I am using the pod label selection mechanism. It assumes that every pod in the cluster has the same label selector. However, this is not always the case, for example during an upgrade that changes the label selector. As a result, the cluster can get into a weird state and fail in different ways.

I expect the discovery algorithm to be:

  1. Given pod label selector X
  2. Select all pods P that match X
  3. Exclude any pod p in P whose label selector is not X

(3) is missing from the current discovery implementation. This means we could inadvertently include pods that have their own idea of what the cluster looks like, leading to an ill-defined state. A rough sketch of the missing step is shown below.
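To make step (3) concrete, here is a minimal Scala sketch of the proposed filter. It assumes each pod would somehow advertise the selector its own node is configured with, e.g. via a hypothetical `akka.io/pod-label-selector` annotation - neither Kubernetes nor Akka provides that today, which is exactly the objection raised below.

```scala
// Sketch only: assumes each pod advertises the selector its own node uses via a
// hypothetical "akka.io/pod-label-selector" annotation (no such annotation exists today).
final case class Pod(name: String, labels: Map[String, String], annotations: Map[String, String])

object SelectorAwareDiscovery {
  private val SelectorAnnotation = "akka.io/pod-label-selector" // hypothetical

  /** Steps 1-3: select pods matching our selector, then drop pods whose own selector differs. */
  def discover(allPods: Seq[Pod], mySelector: String): Seq[Pod] =
    allPods
      .filter(p => matches(mySelector, p.labels))                            // steps 1 and 2
      .filter(_.annotations.get(SelectorAnnotation).forall(_ == mySelector)) // step 3

  /** Naive equality-based matching for selectors like "app=my-service,env=prod". */
  private def matches(selector: String, labels: Map[String, String]): Boolean =
    selector.split(',').forall { clause =>
      clause.split('=') match {
        case Array(k, v) => labels.get(k.trim).contains(v.trim)
        case _           => false
      }
    }
}
```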

An example of this is (see the config sketch after the list):

  1. Create a pod with a broken label selector - it fails to bootstrap as it can't select itself.
  2. Fix the label selector, and apply the new Deployment
  3. Observe that the new pod cannot form a cluster - because it considers the broken pod part of its cluster, even though that pod has a different label selector.
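For illustration, this is roughly what a "broken" selector looks like in config terms - a sketch, with the config keys written as I recall them from the kubernetes-api method and with made-up service/label names:

```scala
import com.typesafe.config.ConfigFactory

// The Deployment's pod template labels the pods `app: my-service`, but discovery
// is configured with a selector that matches nothing, so the node cannot even
// select itself and bootstrap stalls.
val brokenSelector = ConfigFactory.parseString(
  """akka.discovery.kubernetes-api.pod-label-selector = "app=my-servce" """) // typo: matches no pod

// The fix ships in a new Deployment revision with the corrected selector...
val fixedSelector = ConfigFactory.parseString(
  """akka.discovery.kubernetes-api.pod-label-selector = "app=my-service" """)

// ...but the old pod with the broken selector may still be Running, still carries
// `app: my-service`, and so is discovered by the new pods. Step (3) above would
// exclude it, because its own selector differs from theirs.
```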
jroper commented 4 years ago

So, how would the other nodes know what the label selector is, given that the label selector is not a Kubernetes property but an Akka configuration setting?

There are also multiple misconfigurations that can get a deployment into this state, for example, misconfiguring the required number of replicas (especially if it's set to something unreasonably high so you can't just solve it by scaling up then back down).

Also, what if the reason the first node fails to bootstrap has nothing to do with the label selector, and the label selector doesn't match but is still valid (perhaps the pods carry two labels, and the new deployment has switched which one is used for selection)? I'm worried there would be multiple scenarios like this.

Also, note that even if we did implement this, it typically wouldn't solve the problem. If you have the required number of contact points set to 2 and a Deployment of 3 replicas, you're going to have three nodes unable to bootstrap. When the rolling upgrade is performed, using the defaults, only one new node will start. That node may decide that the 3 other nodes are broken and exclude them from its seed node decision, but then it will only see itself as a contact point, i.e. only one node, which is less than the required 2, so it won't bootstrap either (see the sketch below).
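Spelling out that arithmetic - a sketch assuming the usual cluster bootstrap setting name and illustrative numbers:

```scala
import com.typesafe.config.ConfigFactory

// Bootstrap only forms a cluster once it has discovered at least this many contact
// points (setting name as I recall it from cluster bootstrap's reference.conf).
val bootstrap = ConfigFactory.parseString(
  """akka.management.cluster.bootstrap.contact-point-discovery.required-contact-point-nr = 2""")

// With 3 broken replicas and the default rolling-update surge, only one new pod starts.
// Even if it excluded the 3 broken pods from its seed decision, it would see just itself:
// 1 contact point < 2 required, so it still cannot bootstrap.
val discoveredByNewPod = 1
val required = bootstrap.getInt(
  "akka.management.cluster.bootstrap.contact-point-discovery.required-contact-point-nr")
assert(discoveredByNewPod < required) // stuck: cannot form a cluster
```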

Another thing that could help, though it still wouldn't completely solve the problem, would be using the age of the actor system as the decider for the seed node, with the IP address as the tie breaker. Both Kubernetes and Cluster Bootstrap will fail/restart a node if it doesn't bootstrap or isn't ready in time. Each time that happens, the restarted node's actor system age resets, so at some point the new node will be the oldest actor system, allowing it to select itself as the seed node. This still won't solve the case where there aren't enough new nodes to satisfy the minimum required contact points, but I think it's probably a little more robust. A sketch of that heuristic follows.
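A rough sketch of that decider, assuming each contact point could report its actor system start time - a hypothetical field, contact points don't expose this today:

```scala
// Hypothetical view of a contact point: assumes nodes would report when their actor
// system started, which they do not do today - this is only the proposed heuristic.
final case class ContactPoint(ip: String, systemStartTimeMillis: Long)

object SeedDecider {
  /** Oldest actor system wins; lowest IP address breaks ties. */
  def chooseSeed(points: Seq[ContactPoint]): Option[ContactPoint] =
    points.sortBy(p => (p.systemStartTimeMillis, ipSortKey(p.ip))).headOption

  /** Zero-pad IPv4 octets so string order matches numeric order, e.g. "10.1.2.3" -> "010.001.002.003". */
  private def ipSortKey(ip: String): String =
    ip.split('.').map(o => f"${o.toInt}%03d").mkString(".")
}
```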

Ultimately, I think bootstrapping a cluster for the first time is a special case. Once you've got it configured right, you're generally good to go and won't have problems again. From that point on, I think it's better to be safe than sorry and have more conservative deciders. But when you're deploying for the first time, you may have configuration issues like this. The best thing to do may be to delete and recreate the Deployment, and since you're just setting things up, that generally shouldn't cause any issues. We can document this as something that might need to be done when deploying for the first time if you get the config wrong. It kinda sucks, but bootstrapping an Akka cluster in Kubernetes kinda sucks in general - I think we've done a pretty good job of coming up with the least sucky solution possible in cluster bootstrap, but that doesn't mean it doesn't still suck. This is why things like the AkkaCluster operator and Cloudstate exist :)

jsravn commented 4 years ago

I agree this is more of a UX issue and can be solved manually if you know what's going on. But at least when I hit it, I found it pretty confusing, as there is nothing obvious in the logs or elsewhere about why the cluster can't form despite being set up correctly (because of the stale pod). This kind of problem will happen whenever the label selector changes, I think. Given that changing the label selector is a breaking change, it would be a nice improvement if the bootstrapper could detect this and treat it as a new cluster formation. However, I'm not sure of the best way to surface the label selector - perhaps it should be tied to the cluster identity somehow.

jroper commented 4 years ago

If we have an existing cluster and the label changed - this can be done in a way that works, by the way: you deploy in three phases, one that adds the new label alongside the old, one that switches the selector, and one that removes the old label - but if by accident you didn't do this and changed the label selector in a way that broke cluster formation, then absolutely not: you do not want the bootstrapper treating that as a new cluster formation. Forming a new cluster means game over. You lose all the guarantees Akka Cluster gives you about consistency, you end up with non-singleton singletons, you lose the single-writer guarantees afforded by Cluster Sharding, and you lose all CRDT data. You never, ever want to bootstrap a new cluster if there's even a chance that an existing one is already running. It should fail, and fail hard. The only thing I think could be improved is the error reporting.

jsravn commented 4 years ago

Note that the current mechanism is already destructive - if you change the label selector such that it can't find the old pods, it will create a new cluster while the old one is still running.

I created this issue specifically for the ill-defined behavior around mismatched label selectors, which in this case blocks all forward progress without any information to the user. If this happens, most users would be forced to simply delete the whole cluster to make the re-deploy succeed. I'd argue it's better to detect this brokenness and form a new cluster, which is a form of self-healing - and hopefully everything can recover, assuming state is persisted, etc.

What you seem to be suggesting is that any label selector change should fail at cluster startup, or that some heuristic should detect when it shouldn't fail. In your scenario we could check whether the label selectors intersect, but that is not failsafe either and users could get it wrong - we somehow have to ensure that all nodes in the cluster have a consistent view of the other nodes, which means the bootstrapper would have to simulate each node's discovery, or something like that. Maybe that's the best option, assuming we can build a reliable detector for "safe" label selector changes. A rough sketch of an intersection check is below.
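For what it's worth, here is a rough sketch of such an intersection check for simple equality-based selectors. It deliberately ignores set-based selector syntax (`in`, `notin`, `exists`), which is part of why this kind of heuristic isn't failsafe:

```scala
object SelectorIntersection {
  /** Parse a simple equality-based selector like "app=my-service,env=prod". */
  private def parse(selector: String): Option[Map[String, String]] = {
    val clauses = selector.split(',').map(_.trim).filter(_.nonEmpty).map { clause =>
      clause.split('=') match {
        case Array(k, v) => Some(k.trim -> v.trim)
        case _           => None // set-based clauses (in, notin, exists) not handled
      }
    }
    if (clauses.contains(None)) None else Some(clauses.flatten.toMap)
  }

  /** True if some pod could satisfy both selectors: no key is required to have two different values. */
  def mayIntersect(a: String, b: String): Boolean =
    (parse(a), parse(b)) match {
      case (Some(ma), Some(mb)) =>
        ma.keySet.intersect(mb.keySet).forall(k => ma(k) == mb(k))
      case _ => true // can't tell for selectors we can't parse - again, not failsafe
    }
}
```

For example, `mayIntersect("app=a", "app=b")` is false, while `mayIntersect("app=a", "env=prod")` is true even if the operator never intended those pods to form one cluster, which is the kind of false positive that makes me wary of relying on it.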