akka / akka-management

Akka Management is a suite of tools for operating Akka Clusters.
https://doc.akka.io/docs/akka-management/

Improve Kubernetes bootstrap resilience #570

Open jroper opened 5 years ago

jroper commented 5 years ago

I think we have an issue at the moment. From my reading of the Kubernetes documentation on node partitions, and my understanding of how everything works, the following can currently happen:

  1. A partition occurs, leaving the entire Akka cluster (or at least a majority of it, if the Kubernetes lease SBR strategy is not used) running on the opposite side of the partition from the Kubernetes api-server (or, at least, from the etcd majority).
  2. After the default 5 minute pod eviction timeout, the api-server deletes all of the Akka pods.
  3. In Kubernetes 1.5+, this results in all of the pods going into a terminating status, and they will stay that way until either the partition is resolved (at which point, the kubelets on the nodes hosting the Akka pods find out from the api-server that the pods are deleted, and delete them) or an administrator manually deletes the node.
  4. While the partition is still under way, the ReplicaSet controller will see that those pods are being deleted, and create new ones to replace the old ones.
  5. The new pods will satisfy the required-contact-point-nr cluster bootstrap configuration, and a new cluster will be bootstrapped, while the old one is still running on the other side of the partition.

The old cluster will eventually be killed, but if both clusters are able to, for example, still access their database (perhaps the database is hosted outside the cluster, and access to it from either side of the partition isn't impacted by the partition), then we have a big problem.

There are two things in cluster bootstrap that currently allow this scenario. Firstly, the Kubernetes API discovery ignores terminating pods, so cluster bootstrap is completely unaware that those pods exist and could still be running. Secondly, required-contact-point-nr is not a sufficient safety mechanism to prevent it. I think there should at least be an option to only boot a new cluster if all contact points have been successfully probed, regardless of the configured required-contact-point-nr.
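For context, here is a minimal sketch of the bootstrap setup this scenario assumes (package names follow recent akka-management releases; the system name and the threshold of 3 are made up for illustration):

```scala
import akka.actor.ActorSystem
import akka.management.cluster.bootstrap.ClusterBootstrap
import akka.management.scaladsl.AkkaManagement
import com.typesafe.config.ConfigFactory

object BootstrapExample {
  def main(args: Array[String]): Unit = {
    // required-contact-point-nr is the only numeric gate on forming a new cluster:
    // three freshly started replacement pods satisfy it even while the original
    // pods are still running (Terminating) on the far side of the partition.
    val config = ConfigFactory.parseString(
      """
      akka.management.cluster.bootstrap.contact-point-discovery {
        discovery-method = kubernetes-api
        required-contact-point-nr = 3
      }
      """).withFallback(ConfigFactory.load())

    val system = ActorSystem("example", config)
    AkkaManagement(system).start()   // exposes the HTTP endpoint used for probing
    ClusterBootstrap(system).start() // starts contact-point discovery and probing
  }
}
```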

Another scenario that I think is a problem is the following. Let's say you have a service that usually only has three nodes, but has been scaled up to 10 to handle a peak in load. A partition occurs, and 3 nodes on one side of the partition are downed. Kubernetes then restarts them, because that's what Kubernetes does when a container crashes; they all come up, successfully probe each other, and form a new cluster, since three nodes satisfy the usual required-contact-point-nr, while the other 7 keep running as the original cluster. Again, the option to only boot a new cluster if all contact points have been probed would prevent this.

I don't think the Kubernetes lease downing strategy helps here - in order to help, it would need to be combined with a bootstrap lease, and that lease would need to be maintained for the life of the cluster. If the lease couldn't be renewed, the cluster would need to down itself, in case the non-renewal was caused by a partition that led to nodes being recreated on the other side.
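To make the idea concrete, a rough sketch of what such a bootstrap lease could look like, built on the akka-coordination Lease API. This is not existing akka-management behaviour; the lease name, owner name, and the decision to terminate the whole system are assumptions:

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster
import akka.coordination.lease.scaladsl.LeaseProvider

// Hypothetical "bootstrap lease": acquired before a new cluster is allowed to
// form and held for the cluster's lifetime. If it cannot be acquired or
// renewed, this side assumes it may have been replaced across the partition
// and downs itself.
class BootstrapLeaseGuard(system: ActorSystem) {
  import system.dispatcher

  private val lease = LeaseProvider(system).getLease(
    s"bootstrap-${system.name}",          // lease name: made-up naming scheme
    "akka.coordination.lease.kubernetes", // config path of the Kubernetes lease backend
    Cluster(system).selfAddress.hostPort) // owner name

  def acquireOrShutDown(): Unit =
    lease
      .acquire { lost =>
        // the lease could not be renewed, possibly because of a partition
        system.log.error("Bootstrap lease lost ({}), downing this side", lost)
        system.terminate()
      }
      .foreach(granted => if (!granted) system.terminate())
}
```

The hard part, as noted in the reply below, is deciding which node keeps renewing such a lease and when it is safe to release it.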

chbatey commented 5 years ago

> Firstly, the Kubernetes API discovery ignores terminating pods, so cluster bootstrap is completely unaware that those pods exist and could still be running

Should we, by default, not bootstrap when there are terminating pods? With a flag to allow it if people prefer availability (e.g. when the cluster is used to distribute work rather than for sharding/singletons)?

> I think there should at least be an option to say only boot a new cluster if all contact points have been successfully probed, regardless of the configured required-contact-point-nr

This is the behaviour currently. It isn't configurable, but it can be overridden via isConfirmedCommunicationWithAllContactPointsRequired if you extend LowestAddressJoinDecider.

This, combined with also returning Terminating pods from the discovery, would help.
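For anyone who wants that stricter behaviour explicitly, a minimal sketch of such an extension (the class name is made up, and the exact method signature may differ between akka-management versions):

```scala
import akka.actor.ActorSystem
import akka.management.cluster.bootstrap.{ ClusterBootstrapSettings, LowestAddressJoinDecider, SeedNodesInformation }

// Require confirmed probes of every discovered contact point before a new
// cluster may be formed, regardless of required-contact-point-nr.
class StrictJoinDecider(system: ActorSystem, settings: ClusterBootstrapSettings)
    extends LowestAddressJoinDecider(system, settings) {

  override protected def isConfirmedCommunicationWithAllContactPointsRequired(
      info: SeedNodesInformation): Boolean = true
}
```

The custom decider can then be plugged in via the akka.management.cluster.bootstrap.join-decider.class setting.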

> I don't think the Kubernetes lease downing strategy helps here - in order to help, it would need to be combined with a bootstrap lease, and that lease would need to be maintained for the life of the cluster. If the lease couldn't be renewed, the cluster would need to down itself in case the non renewal was caused by a partition that led to nodes being recreated on the other side.

This could get complicated: which node maintains the lease, and when is it decided to release it? The last node that sees itself as leaving gracefully? It could be very racy.