Open garethjames-imburse opened 6 months ago
Hi @garethjames-imburse - sorry for the delay on this. Have you run into this problem again since?
Hi @Aaronontheweb, we were unable to get anywhere setting `new-cluster-enabled=off` - very often when killing pods we ended up with no cluster (as we required at least 2 contact points to form one). Instead we have tuned a whole lot of the other settings available with some trial and error, and deployments appear to be stable now.
@garethjames-imburse would you mind sharing some of your configuration settings? I'm very interested in seeing if we can reproduce this issue in our test lab at all, since we rely heavily on K8s service discovery there.
@Aaronontheweb, apologies for the delay - thank you for inviting us to share our configuration. I've reached out to you separately to discuss further but I'll paste any useful information back here.
I’m hoping someone can shed some light on why our Kubernetes deployments are so prone to split brains given our current Helm chart and Akka configuration.
The documentation is fairly brief, not always explaining how the various settings work and when to use them, so it's unclear to us if we're following best practices for our deployment scenario.
We have five applications (alpha, beta, charlie, delta, echo) which are deployed from a single Helm chart as stateful sets to Kubernetes. Each stateful set has three replicas. The pods that are created are as follows:
We are using Akka.Cluster.Sharding and Akka.Management + Akka.Discovery.KubernetesApi to form the cluster. This generally works well, but approximately 3% of the time we end up with a split brain when performing a rolling deployment. This seems like an unusually high percentage and is causing some problems.
The HOCON we were using initially was as follows:
Following the section Deployment Considerations from the Akka.Management repo docs, we made the following changes to the configuration:
After making these changes, deployments behaved as expected in testing (just as they do most of the time). When we were a bit more aggressive and randomly killed a handful of pods, however, we often ended up with none of the nodes being in a cluster (verified with PBM).
The last adjustments we made were as follows:
This seems to have yielded the best results overall, but we're concerned that setting `new-cluster-enabled=off` has not proved very useful and that we're still vulnerable to split brains during deployment.

Does anyone have any experience and/or advice for similar scenarios using these Akka features?
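For readers following along, here is a minimal sketch of where the settings discussed in this thread sit in the Cluster Bootstrap HOCON. It is not the configuration from this issue: the service name and all values are illustrative assumptions, and only `new-cluster-enabled` and the requirement of at least 2 contact points come from the discussion above.

```hocon
# Illustrative sketch only; not the configuration used in this issue.
akka.management.cluster.bootstrap {
  # off: a node may only join an existing cluster and never forms a new one.
  # If every node is down at the same time, no node will bootstrap a cluster,
  # which is consistent with the "no cluster" outcome described above.
  new-cluster-enabled = off

  contact-point-discovery {
    # Hypothetical service name used for the discovery lookup.
    service-name = "my-akka-service"

    # Minimum number of contact points that must be discovered
    # before the bootstrap process continues (2 in this thread).
    required-contact-point-nr = 2

    # How long the discovery results must remain unchanged
    # before they are acted upon (assumed value).
    stable-margin = 5s
  }
}
```

With `new-cluster-enabled = off`, the trade-off is between protection against a second cluster forming during a partition and the ability to recover unattended when all nodes are lost together, so it is generally only switched off once an initial cluster already exists.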