Open jimmykarily opened 4 years ago
@phil9909 @edwardstudy had you done any work on this already?
@jimmykarily With the work on https://www.pivotaltracker.com/epic/show/4447428 (see https://github.com/cloudfoundry-incubator/cf-operator/pull/692 for the work-in-progress) there is no need (and for 1.0 no possibility) to configure AZs. On clusters running worker nodes in multiple AZs Kubernetes will schedule the Pods of a StatefulSet evenly across all AZs.
@viovanov Should we block this issue?
https://github.com/SUSE/kubecf/pull/179 touches on AZs. @jimmykarily Check that out, please.
When @jandubois gets a chance, he should chime in here.
@loewenstein wrote:
On clusters running worker nodes in multiple AZs Kubernetes will schedule the Pods of a StatefulSet evenly across all AZs.
This is mostly aspirational; in practice you can still end up with quite a distorted distribution, especially if the number of nodes is low.
Kubernetes will automatically spread the Pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.
SelectorSpreadPriority is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.
But even then, SelectorSpreadPriority is just a single scoring value that goes into pod scheduling: https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring
Each of the scoring values can be assigned a different weight on a per-cluster basis using a scheduler policy configuration.
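To illustrate why a single scoring value is not a guarantee, here is a toy model of the weighted sum. It is only a sketch: the scores and weights are made up, the node names are hypothetical, and the real scheduler runs more priority functions and normalizes their scores.

```python
# Toy model of the scoring step: every priority function scores each node,
# and the node with the highest weighted sum wins.
# All scores and weights below are invented for illustration.

def weighted_score(scores, weights):
    """Sum of weight * score over all priority functions for one node."""
    return sum(weights[name] * score for name, score in scores.items())


def pick_node(node_scores, weights):
    """Return the name of the node with the highest weighted total."""
    return max(node_scores, key=lambda node: weighted_score(node_scores[node], weights))


# node-a already runs a pod of the same StatefulSet (low spread score)
# but has plenty of free resources; node-b is in another zone but busier.
node_scores = {
    "node-a": {"SelectorSpreadPriority": 0, "LeastRequestedPriority": 9},
    "node-b": {"SelectorSpreadPriority": 10, "LeastRequestedPriority": 2},
}

equal_weights = {"SelectorSpreadPriority": 1, "LeastRequestedPriority": 1}
resource_heavy = {"SelectorSpreadPriority": 1, "LeastRequestedPriority": 2}

print(pick_node(node_scores, equal_weights))   # node-b (12 vs 9)
print(pick_node(node_scores, resource_heavy))  # node-a (18 vs 14)
```

With the second set of weights the spreading score is simply outvoted, and the pod lands next to its sibling.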
There is an alpha level feature in K8s 1.16 that should improve this situation in the future: Pod Topology Spread Constraints: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
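For reference, a minimal sketch of what such a constraint could look like in a pod spec, built here as a plain Python dict. The `app` label and its value are assumptions for illustration, and the zone label key depends on the cluster (older clusters may label nodes with `failure-domain.beta.kubernetes.io/zone` instead).

```python
import json


def zone_spread_constraint(match_labels, max_skew=1,
                           topology_key="topology.kubernetes.io/zone",
                           hard=False):
    """Build one entry for spec.topologySpreadConstraints.

    hard=False asks the scheduler to prefer an even spread (ScheduleAnyway);
    hard=True refuses to schedule pods that would exceed max_skew.
    """
    return {
        "maxSkew": max_skew,
        "topologyKey": topology_key,
        "whenUnsatisfiable": "DoNotSchedule" if hard else "ScheduleAnyway",
        "labelSelector": {"matchLabels": match_labels},
    }


# Hypothetical label; use whatever labels the pods actually carry.
pod_spec_fragment = {
    "topologySpreadConstraints": [zone_spread_constraint({"app": "diego-cell"})]
}

print(json.dumps(pod_spec_fragment, indent=2))
```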
But all this being said, I agree that there is no point for kubecf right now to try to influence the scheduler on zone placement; it should be configured via scheduler policy for the whole cluster.
The best we could do right now would be to use podAntiAffinity rules to increase the likelihood of spreading pods between zones. But once each zone contains a single instance it will not further help in maintaining an even distribution.
Anecdotal evidence: I frequently see both pods of the same statefulset being placed on the same node in a 3 node (single zone) cluster, instead of being put on 2 different nodes (it was "fixed" by using podAntiAffinity).
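A sketch of the kind of podAntiAffinity rule meant here, again as a plain Python dict that could be placed under `spec.affinity` of the pod template. The label and the choice of a "preferred" (soft) rule are assumptions; swapping the topology key to `kubernetes.io/hostname` would spread across nodes rather than zones.

```python
import json


def soft_anti_affinity(match_labels,
                       topology_key="topology.kubernetes.io/zone",
                       weight=100):
    """Prefer not to co-locate pods matching match_labels within one topology domain.

    This is a soft preference: the scheduler may still place two matching
    pods in the same zone if nothing else fits.
    """
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # valid range is 1-100
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": match_labels},
                        "topologyKey": topology_key,
                    },
                }
            ]
        }
    }


# Hypothetical label for illustration only.
affinity = soft_anti_affinity({"app": "diego-cell"})
print(json.dumps({"affinity": affinity}, indent=2))
```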
There is a second level of AZ support: spreading the applications deployed by CF between zones.
When using Eirini this situation is identical to the scheduling of the kubecf pods themselves, and should be left to the K8s scheduler.
When using Diego this needs some additional support to set the zone property on diego-cell pods so that the diego scheduler can take these into account. Unfortunately there is no way to expose node labels to a pod via "The Downward API", so it requires some active code in e.g. an init container to query the node labels for the zone information (plus a cluster role binding to allow read access to node labels).
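A rough sketch of what that init container code might do, using only the Python standard library. It assumes the node name is passed in by the caller (for example from an environment variable filled from `spec.nodeName` via the Downward API) and that the pod's service account is allowed to `get` nodes; the function names are hypothetical and this is not how kubecf implements it.

```python
import json
import os
import ssl
import urllib.request

# Standard in-cluster service account mount.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

# Well-known zone labels, newest first; which one is set depends on the cluster.
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
)

# Rule a ClusterRole would need so the pod can read its node's labels.
NODE_READER_RULES = [{"apiGroups": [""], "resources": ["nodes"], "verbs": ["get"]}]


def zone_from_node(node):
    """Extract the zone from a Node object (parsed JSON), or None if unlabeled."""
    labels = (node.get("metadata") or {}).get("labels") or {}
    for key in ZONE_LABELS:
        if key in labels:
            return labels[key]
    return None


def fetch_node(node_name):
    """GET /api/v1/nodes/<name> from inside the cluster."""
    host = os.environ["KUBERNETES_SERVICE_HOST"]
    port = os.environ["KUBERNETES_SERVICE_PORT"]
    with open(os.path.join(SA_DIR, "token")) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=os.path.join(SA_DIR, "ca.crt"))
    req = urllib.request.Request(
        f"https://{host}:{port}/api/v1/nodes/{node_name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)
```

Inside the init container one would call something like `zone_from_node(fetch_node(os.environ["NODE_NAME"]))` (with `NODE_NAME` being an assumed variable name) and write the result somewhere the diego-cell container can read it, such as a shared emptyDir volume.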
I don't know if this level of AZ support was supposed to be included in this issue.
I'd add two things:
1. StatefulSets per zone at https://github.com/cloudfoundry-incubator/cf-operator/blob/dd1814df79582ba2749d5cf337269fc6f3f8afad/pkg/kube/controllers/quarksstatefulset/quarksstatefulset_reconciler.go#L191-L197
2. StorageClass configured to provision PVs lazily (see https://www.pivotaltracker.com/story/show/169727962). Otherwise the Pods will simply be scheduled where the PVs were placed.
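One way to read "lazily" is a StorageClass with `volumeBindingMode: WaitForFirstConsumer`, which delays volume binding until a pod using the claim has been scheduled. A minimal sketch as a Python dict follows; the class name and the provisioner are placeholders, not something kubecf prescribes.

```python
import json


def lazy_storage_class(name, provisioner):
    """StorageClass that delays volume binding until a consuming pod is scheduled."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # The default mode is "Immediate", which provisions the volume before
        # the scheduler has picked a node (and therefore a zone) for the pod.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


# Placeholder name and provisioner for illustration.
storage_class = lazy_storage_class("kubecf-lazy", "kubernetes.io/gce-pd")
print(json.dumps(storage_class, indent=2))
```

The printed JSON can be applied as a manifest like any YAML one.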
As a person installing the kubecf helm chart, I should be able to define multiple AZs that I want my deployment to be deployed on.