cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

Add support for Availability Zones #141

Open jimmykarily opened 4 years ago

jimmykarily commented 4 years ago

As a person installing the kubecf helm chart, I should be able to define multiple AZs that I want my deployment to be deployed on.

jimmykarily commented 4 years ago

@phil9909 @edwardstudy had you done any work on this already?

loewenstein commented 4 years ago

@jimmykarily With the work on https://www.pivotaltracker.com/epic/show/4447428 (see https://github.com/cloudfoundry-incubator/cf-operator/pull/692 for the work-in-progress) there is no need (and for 1.0 no possibility) to configure AZs. On clusters running worker nodes in multiple AZs Kubernetes will schedule the Pods of a StatefulSet evenly across all AZs.

@viovanov Should we block this issue?

f0rmiga commented 4 years ago

https://github.com/SUSE/kubecf/pull/179 touches on AZs. @jimmykarily Check that out, please.

gaktive commented 4 years ago

When @jandubois gets a chance, he should chime in here.

jandubois commented 4 years ago

@loewenstein wrote:

On clusters running worker nodes in multiple AZs Kubernetes will schedule the Pods of a StatefulSet evenly across all AZs.

This is mostly aspirational; in practice you can still end up with quite a distorted distribution, especially if the number of nodes is low.

Kubernetes will automatically spread the Pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.

SelectorSpreadPriority is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.

https://kubernetes.io/docs/reference/kubernetes-api/labels-annotations-taints/#topologykubernetesiozone

But even then, SelectorSpreadPriority is just a single scoring value that goes into pod scheduling: https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring

Each of the scoring values can be assigned a different weight on a per-cluster basis using a scheduler policy configuration.
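To make the weighting point concrete, here is a toy model (not the real scheduler code) in which a node's final score is the weighted sum of its per-priority scores. The node names and all numbers are made up for illustration; `SelectorSpreadPriority` and `LeastRequestedPriority` are used only as example priority names.

```python
# Toy model of weighted scheduler scoring. This is NOT the kube-scheduler
# implementation; node names and scores are illustrative assumptions.

def final_score(scores, weights):
    """Weighted sum of the per-priority scores of one node."""
    return sum(weights[name] * score for name, score in scores.items())


def pick_node(candidates, weights):
    """Return the name of the candidate node with the highest final score."""
    return max(candidates, key=lambda node: final_score(candidates[node], weights))


# node-a sits in a zone that already runs a replica (low spread score) but is
# nearly idle; node-b sits in an empty zone but is heavily loaded.
candidates = {
    "node-a": {"SelectorSpreadPriority": 3, "LeastRequestedPriority": 9},
    "node-b": {"SelectorSpreadPriority": 10, "LeastRequestedPriority": 1},
}

equal_weights = {"SelectorSpreadPriority": 1, "LeastRequestedPriority": 1}
spread_heavy = {"SelectorSpreadPriority": 2, "LeastRequestedPriority": 1}

print(pick_node(candidates, equal_weights))  # resource score outweighs spreading
print(pick_node(candidates, spread_heavy))   # doubled spread weight flips the choice
```

With equal weights the idle node wins even though it worsens the zone spread; only a higher weight on the spread priority changes the outcome, which is why the result depends on cluster-wide scheduler configuration.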

There is an alpha level feature in K8s 1.16 that should improve this situation in the future: Pod Topology Spread Constraints: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
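As a rough sketch of what such a constraint looks like, the snippet below builds a `topologySpreadConstraints` entry as a plain dict (it can be dumped to JSON and merged into a pod spec) and includes a small helper that computes the zone skew the constraint is meant to bound. The `app: example` label, the zone names, and the `zone_skew` helper are assumptions for illustration, not anything kubecf ships.

```python
import json
from collections import Counter


def zone_skew(pod_zones, all_zones):
    """Difference between the most and least populated zone.

    pod_zones: zone name of every matching pod, one entry per pod.
    all_zones: every zone in the cluster, including empty ones.
    """
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


# Pod spec fragment; the label selector is a hypothetical example.
constraint = {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    # "DoNotSchedule" would make this a hard requirement instead.
    "whenUnsatisfiable": "ScheduleAnyway",
    "labelSelector": {"matchLabels": {"app": "example"}},
}

zones = ["zone-a", "zone-b", "zone-c"]
print(json.dumps({"topologySpreadConstraints": [constraint]}, indent=2))
print(zone_skew(["zone-a", "zone-a", "zone-b"], zones))
print(zone_skew(["zone-a", "zone-b", "zone-c"], zones))
```

A placement with two pods in one zone and none in another has a skew of 2, which a `maxSkew` of 1 is meant to avoid.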

But all this being said, I agree that there is no point for kubecf right now to try to influence the scheduler on zone placement; it should be configured via scheduler policy for the whole cluster.

The best we could do right now would be to use podAntiAffinity rules to increase the likelihood of spreading pods between zones. But once each zone contains one instance, anti-affinity does nothing further to keep the distribution even.

Anecdotal evidence: I frequently see both pods of the same statefulset being placed on the same node in a 3 node (single zone) cluster, instead of being put on 2 different nodes (it was "fixed" by using podAntiAffinity).
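For reference, a soft podAntiAffinity rule of the kind described above might look like the following sketch, built as a dict that can be dumped to JSON and merged into a pod template. The `app: example` label and the weights are assumptions; the first term prefers different zones, the second prefers different nodes.

```python
import json


def anti_affinity_term(topology_key, weight, match_labels):
    """One preferred (soft) anti-affinity term for the given topology key."""
    return {
        "weight": weight,  # valid range is 1-100
        "podAffinityTerm": {
            "labelSelector": {"matchLabels": match_labels},
            "topologyKey": topology_key,
        },
    }


labels = {"app": "example"}  # hypothetical pod label

affinity = {
    "podAntiAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            # Prefer zones that do not already run a matching pod.
            anti_affinity_term("topology.kubernetes.io/zone", 100, labels),
            # Within a zone, prefer nodes that do not run a matching pod.
            anti_affinity_term("kubernetes.io/hostname", 50, labels),
        ]
    }
}

print(json.dumps({"affinity": affinity}, indent=2))
```

Because the rule is only a preference, the scheduler can still co-locate pods when no better node is available; a `requiredDuringSchedulingIgnoredDuringExecution` rule would instead leave pods pending.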

jandubois commented 4 years ago

There is a second level of AZ support: spreading the applications deployed by CF between zones.

When using Eirini this situation is identical to the scheduling of the kubecf pods themselves, and should be left to the K8s scheduler.

When using Diego this needs some additional support to set the zone property on diego-cell pods so that the diego scheduler can take these into account. Unfortunately there is no way to expose node labels to a pod via "The Downward API", so it requires some active code in e.g. an init container to query the node labels for the zone information (plus a cluster role binding to allow read access to node labels).
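A minimal sketch of what such init container code could do, assuming the node name is passed in through a hypothetical `NODE_NAME` environment variable (the Downward API does expose `spec.nodeName`) and that the pod's service account is allowed to `get` nodes. It uses only the standard in-cluster service account files and environment variables; how the result would be handed to the diego-cell configuration is left open.

```python
import json
import os
import ssl
import urllib.request

# Standard in-cluster service account mount.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

# Current zone label first, then the older beta label as a fallback.
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
)


def zone_from_node(node):
    """Extract the zone from a Node object (parsed JSON), or None if unlabeled."""
    labels = node.get("metadata", {}).get("labels", {})
    for key in ZONE_LABELS:
        if key in labels:
            return labels[key]
    return None


def fetch_node(node_name):
    """GET /api/v1/nodes/<name> from the API server using the pod's token."""
    host = os.environ["KUBERNETES_SERVICE_HOST"]
    port = os.environ["KUBERNETES_SERVICE_PORT"]
    with open(os.path.join(SA_DIR, "token")) as token_file:
        token = token_file.read().strip()
    context = ssl.create_default_context(cafile=os.path.join(SA_DIR, "ca.crt"))
    request = urllib.request.Request(
        f"https://{host}:{port}/api/v1/nodes/{node_name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


def current_zone():
    """Zone of the node this pod runs on; NODE_NAME is a hypothetical env var."""
    return zone_from_node(fetch_node(os.environ["NODE_NAME"]))
```

Inside the init container, `current_zone()` would be called once and its result written somewhere the diego-cell container can read it, for example a shared `emptyDir` volume.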

I don't know if this level of AZ support was supposed to be included in this issue.

loewenstein commented 4 years ago

I'd add two things:

  1. There is currently still logic in place that creates StatefulSets per zone at https://github.com/cloudfoundry-incubator/cf-operator/blob/dd1814df79582ba2749d5cf337269fc6f3f8afad/pkg/kube/controllers/quarksstatefulset/quarksstatefulset_reconciler.go#L191-L197
  2. We'll have to help the scheduler by introducing a non-default StorageClass configured to provision PVs lazily (see https://www.pivotaltracker.com/story/show/169727962). Otherwise the Pods will simply be scheduled where the PVs were placed.
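For the second point, lazy provisioning corresponds to `volumeBindingMode: WaitForFirstConsumer` on the StorageClass, which delays PV creation until a Pod using the claim has been scheduled. A sketch of such a manifest, emitted as JSON; the class name and the provisioner are placeholders that depend on the cluster:

```python
import json

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    # Hypothetical name; deliberately not annotated as the default class.
    "metadata": {"name": "kubecf-lazy"},
    # Assumption: replace with the provisioner of the target cluster.
    "provisioner": "kubernetes.io/gce-pd",
    # Bind and provision the volume only after the consuming Pod is scheduled,
    # so the PV is created in the zone the scheduler picked for the Pod.
    "volumeBindingMode": "WaitForFirstConsumer",
}

print(json.dumps(storage_class, indent=2))
```

The output can be applied with `kubectl apply -f -`, and the StatefulSet volume claim templates would then reference the class by name through `storageClassName`.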