Simplify zone awareness configuration

elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Other

2.59k stars 704 forks source link

Simplify zone awareness configuration #3933

Open eedugon opened 3 years ago

eedugon commented 3 years ago

Proposal

At the moment the only way to configure nodes with different attributes for zone awareness is to create different nodeSets, knowing in advance the specific values for the attributes. So it's purely a static configuration, as we suggest in this document: https://www.elastic.co/guide/en/cloud-on-k8s/1.2/k8s-advanced-node-scheduling.html#k8s-availability-zone-awareness

It would be great to have, within the same nodeSet, the possibility of having certain parameters retrieved from variables or pod metadata. For example to associate a node.attr setting to an environment variable available in the POD.

If that we already have (configuring settings based on pod's environment variables), then we would need to find a way for a running POD to know the availability zone where it's running :)

For that, apparently there's a feature request on Kubernetes side (https://github.com/kubernetes/kubernetes/issues/62078) to allow node labels to be passed to the pods. If that's implemented then probably this would be easier.

Use case. Why is this important?

Some kubernetes environments are based on Autoscaling groups on multiple AZs with dynamic zones allocation. In such cases it would be great to provide the requested mechanisms.
The Elasticsearch manifest of a 3 node cluster where each node has to run in a different AZ doesn't look very nice when we need to create 3 different nodeSets with 1 replica each (same settings in each nodeSet except of the specific attribute for the AZ). With a single nodeSet (replica 3) + affinity rules the same goal could be achieved from scheduling point of view, but in such case there's no way for the 3 instances of the nodes to know where (in what AZ) they are running. If we are able to implement the requested mechanism then we could easily build this type of configuration with a single nodeSet, based on some extra dynamic parameters.

sebgl commented 3 years ago

Also related (probably the right k8s issue to track): https://github.com/kubernetes/kubernetes/issues/40610 I see Anya already commented there :)

sebgl commented 3 years ago

A very hacky unsupported way to achieve this, through an init container that retrieves the k8s node label and injects it in the Elasticsearch configuration:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.9.2
  nodeSets:
  - name: default
    count: 1
    config:
      cluster.routing.allocation.awareness.attributes: zone
    podTemplate:
      spec:
        serviceAccountName: node-fetcher
        automountServiceAccountToken: true
        containers:
        - name: elasticsearch
        initContainers:
          - name: inject-zone-attribute
            image: bskim45/helm-kubectl-jq
            command:
              - "bash"
              - "-cx"
              - "ZONE=$(kubectl get node $NODE_NAME -o json | jq -r '.metadata.labels[\"failure-domain.beta.kubernetes.io/zone\"]') \
               CONFIG_FILE=/usr/share/elasticsearch/config/elasticsearch.yml \
               CONFIG_FILE_BACKUP=/usr/share/elasticsearch/config/elasticsearch.yml.backup; \
               mv $CONFIG_FILE $CONFIG_FILE_BACKUP && \
               cat $CONFIG_FILE_BACKUP <(echo node.attr.zone: $ZONE) > $CONFIG_FILE"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fetch-node
rules:
  - apiGroups: [ "" ]
    resources: [ "nodes" ]
    verbs: [ "get" ]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-fetcher
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fetch-node-binding
roleRef:
  apiGroup: ""
  kind: ClusterRole
  name: fetch-node
subjects:
  - apiGroup: ""
    kind: ServiceAccount
    name: node-fetcher
    namespace: default

sebgl commented 3 years ago

As of today the recommanded way to setup multi-AZ Elasticsearch clusters is to create one NodeSet per zone, with the right Kubernetes affinity settings, and the right Elasticsearch configuration bits:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.13.2
  nodeSets:
  - name: zone-a
    count: 1
    config:
      node.attr.zone: europe-west3-a
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-a
  - name: zone-b
    count: 1
    config:
      node.attr.zone: europe-west3-b
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-b

On one hand this is very explicit, gives a lot of control to the user, and is entirely transparent to the operator code which does not need to be aware of availability zones at all.

On the other hand it makes the Elasticsearch resource specification quite cumbersome. Even more if we assume any production-grade cluster must have this kind of setup.

We could simplify this setup by delegating availability zone management to the operator, so users just provide the number of zones they want per NodeSet.

Design ideas

I see 2 main paths to explore.

Rely on Pod topology spread constraints

We could let the user configure Pod topology spread constraints if they want to, so the Kubernetes scheduler has full control over the Pod <-> zone mapping and spreads Pods around on its own. However the zone into which a Pod is scheduled isn't easily available from within the Pod itself, making it quite hard to configure Elasticsearch for zone awareness. As of today, this would likely require an init container to grab the node labels and edit the elasticsearch.yml configuration. I'm also afraid this may not play well with grown-and-shrink -like update strategies. If we create the replacing Pod before removing the existing one, it could be hard to ensure they are both scheduled into the same availability zone.

Make the operator responsible for managing zones

We could extend the Elasticsearch CRD with optional zone management settings:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.13.2
  nodeSets:
  - name: my-nodes
    count: 1
    availabilityZones:
      count: 3  # effectively means we get 3*1=3 nodes
      topologyKey: zone  # label key set on the nodes to distinguish zones. Could default to "topology.kubernetes.io/zone" if not provided?
      allowedZones: # TBD - allows restricting to a subset of all available zones so users could for example remove an entire zone
      - us-east1a
      - us-east1b
      - us-east1c

The impact on the operator is quite important:

it needs to has RBAC access to get k8s Nodes resources (may not be a big deal, especially if read-only?)
it needs to manage several StatefulSets per NodeSet. We can't rely on the direct nodeSet name -> statefulSet name association anymore, and likely require a suffix instead (e.g. my-nodes-us-east1a). For each managed StatefulSet, the operator would set the right node affinity rule in the Pod spec, and setup the Elasticsearch configuration accordingly.

shubhaat commented 3 years ago

From a user perspective, we've not seen any asks for it yet. There could be a few reasons

users are not yet setting up across zones (which is scary)
users are setting up across zones, but don't find it painful enough to ask for improvements.
They are complaining, we just don't have evidence

Maybe we should validate how much of a pain it is.

eedugon commented 3 years ago

More than a pain I would say it’s a bit ugly (and risky from maintenance point of view, as a single change might need to be done in N places to without a logical need).

I will review why I haven't raised any Enhancement Request, but I've seen a lot of users, who have configured N nodeSets with 1 node in each just to separate in different zones, mentioning that the YAML doesn’t make much sense or looks awkward, more over when they might want an extra type of nodes and that will require N extra nodesets instead of one.

When looking at those manifests it’s obvious that they would look much nicer if ECK can handle some attributes to simplify them, because the config is the same in all nodeSets except the pod allocation.

I believe we have a mixture of your 1. and 3. statements, but not the second :-D

barkbay commented 3 years ago

I temporarily assigned this one to myself while exploring the idea of using a webhook to catch the calls to the binding subresource (pod/<podname>/bind ). The body on the API call contains the node name, which allows a webhook to inject information such as rack, zone or region into the Pod after it is created but not started yet.

From a high level perspective the workflow is:

Catch API calls to pods/binding using a webhook:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
 name: elastic-webhook.k8s.elastic.co
webhooks:
 - clientConfig:
     service:
       name: elastic-webhook-server
       namespace: elastic-system
       path: /pod-binding-k8s-elastic-co
   name: pod-binding-k8s-elastic-co.k8s.elastic.co
   rules:
     - apiGroups:
         - ""
       apiVersions:
         - v1
       resources:
         - bindings
         - pods/binding
       operations:
         - CREATE

Decode the binding request

{
 "name": "elasticsearch-sample-es-default-0",
 "namespace": "default",
 "operation": "CREATE",
 "userInfo": {
   "username": "system:kube-scheduler",
   "uid": "uid:system:kube-scheduler",
   "groups": ["system:authenticated"]
 },
 "object": {
   "kind": "Binding",
   "apiVersion": "v1",
   "metadata": {
     "name": "elasticsearch-sample-es-default-0",
     "namespace": "default",
     "uid": "3db74e24-524c-45e7-8dc6-cc9728c3d871"
   },
   "target": {
     "kind": "Node",
     "name": "gke-michael-dev1-default-pool-00c61a3c-1290"
   }
 }
}

Inject the Nodes labels into the Pods through some annotations

func (wh *mutatingWebhook) Handle(ctx context.Context, request admission.Request) admission.Response {
  binding := &v1.Binding{}

  // Decode the /bind request
  err := wh.decoder.DecodeRaw(request.Object, binding)
  if err != nil {...}

   // get the topology keys from the K8S Nodes
  kv, err := getTopologyKeys(ctx, wh.client, binding.Target.Name)
  if err != nil {...}

  if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
     // 1. Read the current Pod specification
     key := client.ObjectKey{Namespace: binding.ObjectMeta.Namespace, Name: binding.ObjectMeta.Name}
     pod := &v1.Pod{}
     if err := wh.client.Get(ctx, key, pod); err != nil {...}
     if pod.Annotations == nil {
        pod.Annotations = make(map[string]string)
     }
     // 2. Add topology keys to the Pod's annotations
     for k, v := range kv {
        pod.Annotations[k] = v
     }
     // 3. Update the Pod
     return wh.client.Update(ctx, pod)
  }); err != nil {...}
}

This removes the need of having a NodeSet for each zone:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.15.1
  nodeSets:
  - name: default
    config:
      ##############################
      # Setup allocation awareness #
      node.attr.zone: ${ZONE}
      cluster.routing.allocation.awareness.attributes: zone
      cluster.routing.allocation.awareness.force.zone.values: ${ZONES}
      ##############################
    podTemplate:
      spec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-sample
                elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-default
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 4Gi
              cpu: 1
          env:
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          - name: ZONES
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zones']

The code of a POC is available here. I think the use of a webhook might be an acceptable tradeoff here.

sebgl commented 3 years ago

@barkbay your suggestion made me think about a slightly different way to achieve something similar, in which we would rely on the operator to set the topology.kubernetes.io/zone annotation on the Pods, rather than rely on a mutating webhook:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.15.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
      node.attr.zone: ${ZONE}
      cluster.routing.allocation.awareness.attributes: zone
      cluster.routing.allocation.awareness.force.zone.values: ${ZONES}
    podTemplate:
      spec:
        initContainers:
          - name: wait-for-annotations
            command:
              - sh
              - -c
              - |
                while [[ $(grep -L 'topology.kubernetes.io/zone=' /mnt/elastic-internal/downward-volume/annotations) ]]; do
                echo Annotation topology.kubernetes.io/zone not set yet
                sleep 10
                done
        containers:
          - name: elasticsearch
            env:
              - name: ZONE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['topology.kubernetes.io/zone']
            volumeMounts:
            - name: downward-volume
              mountPath: /mnt/elastic-internal/downward-volume
        volumes:
          - name: downward-volume
            downwardAPI:
              items:
                - path: annotations
                  fieldRef:
                    fieldPath: metadata.annotations

In parallel, watch the logs of that init container:

❯ k logs -f elasticsearch-sample-es-default-0 -c wait-for-annotations
Annotation topology.kubernetes.io/zone not set yet
Annotation topology.kubernetes.io/zone not set yet
Annotation topology.kubernetes.io/zone not set yet

And annotate the Pod (simulating what the operator would do):

kubectl annotate pods elasticsearch-sample-es-default-0 topology.kubernetes.io/zone=us-east-1a

In this example we inherit the $ZONE environment variable from the downward API (Pod annotations). And expect the operator to set the annotation after the Pod has been created. If the main container starts before the operator sets the annotation then the environment variable will be empty, which is not what we want here. So the trick here is to setup an init container that checks the content of a file maintained up to date by the downward API (something we could add to the prepare-fs script, for example). Once the annotation exists in that file, then we assume we can move on and the main container will have the right environment variable populated.

barkbay commented 3 years ago

@sebgl I never realized that the downward API was updating the content of the volume in (almost ?) realtime ! It's indeed more clever to rely on the operator to set the annotations 👍

barkbay commented 2 years ago

I started to think again about this feature and some implementation details. I think that two important points might lead the design:

We want fine grain control regarding which node labels can be injected. There should be a separation of concerns between the operator, who must be set up with the relevant permissions to read the node labels, and the resource's owner. Said differently, I think we don't want to inject all the node's labels as annotations. I share the concerns raised here and here: allowing the operator to read node labels should not allow other users to read them without further control.
Resource's owner (only Elasticsearch for now?) should be able to explicitly control which labels are injected. If a label does not exist, the operator should prevent the Pod from starting. The reason is that having a default value, like an empty string, it might have some consequences on the data topology.

Implementation idea:

Add an operator flag (something like --expose-node-labels) . It would accept a list of node labels, as regular expressions, to be allowed to be injected as annotations into the Pods.
No default value would be set for this flag, feature is disabled by default.
We may however bundle the operator (Helm charts with the appropriate roles and bindings and the Operator Hub) with a default value, something like topology.kubernetes.io/*,failure-domain.beta.kubernetes.io/* (even if the latest is deprecated it might still be used on some clusters)
Labels are injected as annotations on Pods using an annotation on the Elasticsearch resource itself. The annotation contains a comma separated list of the expected labels. To avoid a Pod to be started with a label that would not exist no regular expression would be allowed here, the explicit name of the label is expected (see the edge cases below):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
spec:
...
  env:
  - name: ZONE
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['topology.kubernetes.io/zone']

The feature would be only available for Elasticsearch.
I would create a dedicated init container, for separation of concerns and because it might help the feature to be reused for other resources if we do not agree with the previous point.
What if a label cannot be injected ?
- Permission to read node labels may change while the operator is running. Even if it is unlikely for a topology label to be changed, a new K8S node might be added to the K8S cluster. The operator may not be able to read and inject the labels if permission to read node labels has been revoked. What should be the expected behaviour if the operator is no more able to read the node labels ? My feeling is that we should prevent the Pod from starting, setting empty values might put the cluster at risk because of wrong data location.
- The resource's owner expects a label which does not exist. I would apply the same logic here: having an empty string might lead to some unexpected behaviour.

pebrc commented 2 years ago

No default value would be set for this flag, feature is disabled by default.

What is your reason for disabling by default? I can think of a few like: making it an explicit choice to share node information (even if it is only a topology hint) on the Pod level and maybe consistency with current behaviour (so that users that upgrade the operator don't suddenly get node metadata added on the Pod level).

If we then opt to have a default, as you suggest topology.kubernetes.io/*,failure-domain.beta.kubernetes.io/*through the bundled configuration we ship, then that argument gets weaker? Maybe I am missing another reason? Because it gets awkward to turn it off with ""?

The argument in favour of enabling it by default restricted to topology.kubernetes.io/*... is that it 1. makes feature adoption easier (no need to reconfigure the operator, which might be the responsibility of a different team/admin) 2. it thereby encourages more resilient Elasticsearch cluster architectures with less chance of data loss? 3. topology.kubernetes.io/* should be generally OK to share on the Pod level. 4. users can opt-out via the operator flag if so desired.

I would create a dedicated init container, for separation of concerns and because it might help the feature to be reused for other resources if we do not agree with the previous point.

I generally agree. My only worry is that we are adding more and more init containers. We had the init filesystem container since the very early days. I recently added one to suspend Elasticsearch if the user needs to run debug or disaster recovery tooling. This would be the third one.

What if a label cannot be injected ?

+1 on not using "" as the default but to stop reconciling the cluster instead. We might even be able to catch the user error of specifying non-existent labels in the validating webhook.

barkbay commented 2 years ago

Because it gets awkward to turn it off with ""?

Yes. If the operator is bundled with this flag enabled by default (through the eck.yaml configuration file in the "all-in-one", in the OperatoHub and in our Helm charts) then I think we have the same result, but it makes it explicit which labels are exposed.

My only worry is that we are adding more and more init containers.

Agreed, I can start working on a first implementation with a dedicated container and take the time to discuss this point later.

We might even be able to catch the user error of specifying non-existent labels in the validating webhook.

👍

barkbay commented 2 years ago

Labels are injected as annotations on Pods using an annotation on the Elasticsearch resource itself. The annotation contains a comma separated list of the expected labels (for example eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone")

A drawback is that the entire cluster must be restarted when this annotation is changed. Said differently, if the list of expected node labels is updated by the user on the Elasticsearch resource, then the operator does not know which specific Pods should be restarted. I think it is an acceptable tradeoff if we consider that the list of the expected annotations is unlikely to change frequently, and also that they are likely to be used by most of the node sets.

We might even be able to catch the user error of specifying non-existent labels in the validating webhook.

A challenge here is that node labels may not be consistent across the K8S nodes. A Pod may be eventually scheduled on a K8S node which does not have all the expected labels. We might still want to validate that all the expected labels exist on at least one (same) node, but I'm a bit reluctant to use the K8S client in a webhook for now because of #5025.

sebgl commented 2 years ago

I think #5054 is an "in-between" between doing nothing (people need to statically configure 3 StatefulSets with the right k8s + ES settings), and going all the way to making things simpler to the user.

Just to make it clear, here are 3 examples of different potential approaches to deploy a 3-nodes cluster into 3 different zones:

1) Current situation, users have to configure everything "manually":

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.5.0
  nodeSets:
  - name: hot-nodes-zone-a
    count: 1
    config:
      node.roles: ["data_hot"]
      node.attr.zone: europe-west3-a
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-a
  - name: hot-nodes-zone-b
    count: 1
    config:
      node.attr.zone: europe-west3-b
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-b
  - name: hot-nodes-zone-c
    count: 1
    config:
      node.attr.zone: europe-west3-c
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-c

2) With the proposal in https://github.com/elastic/cloud-on-k8s/pull/5054, we can rely on Kubernetes TopologySpreadConstraints feature to spread all Pods of the same StatefulSet across all available zones. Major benefit is that we have a single NodeSet in the spec. Important: K8s TopologySpreadConstraints does not allow users to have their hot nodes in 2 zones if their k8s cluster has 3 zones (or have their hot nodes in 3 zones if their k8s cluster has 4 zones).

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  annotations: 
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone" # allows specifying which node label should be used to determine the availability zone name
spec:
  version: 7.5.0
  nodeSets:
  - name: hot-nodes
    count: 3
    config:
      node.roles: ["data_hot"]
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
      # use the $ZONE env var that contains the zone name
      node.attr.zone: $ZONE
    podTemplate:
      spec:
        # K8s will spread Pods across all available zones through the "topology.kubernetes.io/zone" node label
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-sample
                elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-default
      containers:
      - name: elasticsearch
        env:
        # specify that the annotation that was added on the Pod by the operator due to the `eck.k8s.elastic.co/downward-node-labels` annotation feature should be automatically transformed to an env var
        - name: ZONE
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['topology.kubernetes.io/zone']

3) Same technical approach as 2), but more opinionated approach that goes further into simplifying what the users need to do and understand (no es config + env var indirection to handle):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.5.0
  nodeSets:
  - name: hot-nodes
    count: 3
    zoneAwareness:
      spreadAcrossAvailableZones: true # defaults to false
      # the 2 next fields are optional (this is their default value) to allow finer-grained control
      nodeLabel: topology.kubernetes.io/zone
      elasticsearchAttribute: zone
    config:
      node.roles: ["data_hot"]

What's specified in the spec in option 3) could technically also be specified through annotations, but that would become inconvenient if we want per-nodeset settings (we could also decide to let the user specify the PodTopologySpreadConstraints part at the nodeSet level, while we take care of the ES config if enabled through a cluster-wide annotation).

barkbay commented 2 years ago

A few questions/comments regarding the third proposal:

Users might want to use several node topology labels (for example region/zone/room/rack), the proposal seems to only handle one.
How users would tune topologySpreadConstraints parameters like maxSkew or whenUnsatisfiable ?
We might want to offer the same feature for the whole stack. Proposal 2 can be easily ported to other stack components. Regarding the third one, would it be the same syntax but with a different behaviour ?

My feeling is that the last proposal is maybe a bit too much opinionated, while some users might expect/need transparency and flexibility. Proposal 2, for which the PR is ready, is not incompatible with 3. We might just want to implement this feature later. Also the current PR, if merged, would not represent a very big burden to maintain IMHO.

sebgl commented 2 years ago

My feeling is that the last proposal it is maybe a bit too much opinionated, while some users might expect/need transparency and flexibility. Proposal 2, for which the PR is ready, is not incompatible with 3. We might just want to implement this feature later.

++ I'm fine with moving forward with option 2 now, it looks like it can be used as a building block to maybe propose option 3 later.

Kushmaro commented 2 years ago

Looking at this now @sebgl (sorry, had to catch up on the entire thread :D ) my hunch would actually be to go for opt.3

Reason being is that I think it makes for much much simpler UX/DX, i.e, I don't have to read much manuals before I understand what it does and how... I.e - it's just simple to use.

I'd actually say a better way of going about this would be to be opinionated in the implementation, get feedback, then relax and further allow controls to be more granular, if needed.

barkbay commented 2 years ago

Reopening as we might also want to implement the last proposal described in https://github.com/elastic/cloud-on-k8s/issues/3933#issuecomment-979340774

Edit: I think I'll open a dedicated issue

barkbay commented 2 years ago

Important: K8s TopologySpreadConstraints does not allow users to have their hot nodes in 2 zones if their k8s cluster has 3 zones (or have their hot nodes in 3 zones if their k8s cluster has 4 zones).

You can actually achieve this by mixing topologySpreadConstraints and nodeAffinity:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone" # allows specifying which node label should be used to determine the availability zone name
spec:
  version: 7.16.0
  nodeSets:
    - name: default
      count: 6
      config:
        node.store.allow_mmap: false
        cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
        node.attr.zone: $ZONE
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.gke.io/zone
                        operator: In
                        values: # restrict deployment to these 2 zones
                          - europe-west1-b
                          - europe-west1-c
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: quickstart
                  elasticsearch.k8s.elastic.co/statefulset-name: quickstart-es-default
          containers:
            - name: elasticsearch
              env:
                # specify that the annotation that was added on the Pod by the operator due to the `eck.k8s.elastic.co/downward-node-labels` annotation feature should be automatically transformed to an env var
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']

Edit: here is the result on 3 nodes K8S cluster:

NAME                                           STATUS   ROLES    AGE    VERSION            INTERNAL-IP   EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME   LABELS
gke-michael-dev-2-default-pool-0dc9aa68-drvb   Ready    <none>   3d8h   v1.21.6-gke.1500  Container-Optimized OS from Google   5.4.144+    topology.kubernetes.io/zone=europe-west1-b
gke-michael-dev-2-default-pool-14c3d8f0-v741   Ready    <none>   3d8h   v1.21.6-gke.1500     Container-Optimized OS from Google   5.4.144+    topology.kubernetes.io/zone=europe-west1-d
gke-michael-dev-2-default-pool-95fa1e08-mzds   Ready    <none>   3d8h   v1.21.6-gke.1500      Container-Optimized OS from Google   5.4.144+     topology.kubernetes.io/zone=europe-west1-c

NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
quickstart-es-default-0   1/1     Running   0          2m26s   10.45.112.26   gke-michael-dev-2-default-pool-95fa1e08-mzds   <none>           <none>
quickstart-es-default-1   1/1     Running   0          2m26s   10.45.114.38   gke-michael-dev-2-default-pool-0dc9aa68-drvb   <none>           <none>
quickstart-es-default-2   1/1     Running   0          2m26s   10.45.112.27   gke-michael-dev-2-default-pool-95fa1e08-mzds   <none>           <none>
quickstart-es-default-3   1/1     Running   0          2m26s   10.45.114.40   gke-michael-dev-2-default-pool-0dc9aa68-drvb   <none>           <none>
quickstart-es-default-4   1/1     Running   0          2m26s   10.45.112.28   gke-michael-dev-2-default-pool-95fa1e08-mzds   <none>           <none>
quickstart-es-default-5   1/1     Running   0          2m26s   10.45.114.39   gke-michael-dev-2-default-pool-0dc9aa68-drvb   <none>           <none>

Only the 2 nodes in the desired zones are used.

sebgl commented 2 years ago

You can actually achieve this by mixing topologySpreadConstraints and nodeAffinity

You're right. But then the user has to "statically" pick the zones, so we loose the benefit of letting the scheduler handle that.

And I was about to say: at which point the user might as well go back to defining one nodeSet per zone (also fully statically defined); but I think there's still some benefit in using a single nodeSet (less StatefulSet resources to "understand" and name, less plumbing in the manifest) 👍

barkbay commented 2 years ago

You're right. But then the user has to "statically" pick the zones, so we loose the benefit of letting the scheduler handle that.

Agreed, just wanted to underline that topologySpreadConstraints has been designed to be compatible with nodeAffinity