eedugon opened 3 years ago
Also related (probably the right k8s issue to track): https://github.com/kubernetes/kubernetes/issues/40610 I see Anya already commented there :)
A very hacky, unsupported way to achieve this, through an init container that retrieves the k8s node label and injects it into the Elasticsearch configuration:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.9.2
  nodeSets:
  - name: default
    count: 1
    config:
      cluster.routing.allocation.awareness.attributes: zone
    podTemplate:
      spec:
        serviceAccountName: node-fetcher
        automountServiceAccountToken: true
        containers:
        - name: elasticsearch
        initContainers:
        - name: inject-zone-attribute
          image: bskim45/helm-kubectl-jq
          # expose the K8s node name to the script below (this env var seems to be
          # missing from the original snippet; the command references $NODE_NAME)
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          command:
          - "bash"
          - "-cx"
          - "ZONE=$(kubectl get node $NODE_NAME -o json | jq -r '.metadata.labels[\"failure-domain.beta.kubernetes.io/zone\"]') \
             CONFIG_FILE=/usr/share/elasticsearch/config/elasticsearch.yml \
             CONFIG_FILE_BACKUP=/usr/share/elasticsearch/config/elasticsearch.yml.backup; \
             mv $CONFIG_FILE $CONFIG_FILE_BACKUP && \
             cat $CONFIG_FILE_BACKUP <(echo node.attr.zone: $ZONE) > $CONFIG_FILE"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fetch-node
rules:
- apiGroups: [ "" ]
  resources: [ "nodes" ]
  verbs: [ "get" ]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-fetcher
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fetch-node-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io # must be the RBAC API group, not ""
  kind: ClusterRole
  name: fetch-node
subjects:
- apiGroup: ""
  kind: ServiceAccount
  name: node-fetcher
  namespace: default
As of today the recommended way to set up multi-AZ Elasticsearch clusters is to create one NodeSet per zone, with the right Kubernetes affinity settings and the right Elasticsearch configuration bits:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.13.2
  nodeSets:
  - name: zone-a
    count: 1
    config:
      node.attr.zone: europe-west3-a
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-a
  - name: zone-b
    count: 1
    config:
      node.attr.zone: europe-west3-b
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-b
On one hand this is very explicit, gives a lot of control to the user, and is entirely transparent to the operator code which does not need to be aware of availability zones at all.
On the other hand it makes the Elasticsearch resource specification quite cumbersome, even more so if we assume any production-grade cluster must have this kind of setup.
We could simplify this setup by delegating availability zone management to the operator, so users just provide the number of zones they want per NodeSet.
Design ideas
I see 2 main paths to explore.
We could let the user configure Pod topology spread constraints if they want to, so the Kubernetes scheduler has full control over the Pod <-> zone mapping and spreads Pods around on its own.
However the zone into which a Pod is scheduled isn't easily available from within the Pod itself, making it quite hard to configure Elasticsearch for zone awareness. As of today, this would likely require an init container to grab the node labels and edit the elasticsearch.yml configuration.
I'm also afraid this may not play well with grow-and-shrink-like update strategies. If we create the replacement Pod before removing the existing one, it could be hard to ensure they are both scheduled into the same availability zone.
We could extend the Elasticsearch CRD with optional zone management settings:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.13.2
  nodeSets:
  - name: my-nodes
    count: 1
    availabilityZones:
      count: 3 # effectively means we get 3*1=3 nodes
      topologyKey: zone # label key set on the nodes to distinguish zones. Could default to "topology.kubernetes.io/zone" if not provided?
      allowedZones: # TBD - allows restricting to a subset of all available zones so users could for example remove an entire zone
      - us-east1a
      - us-east1b
      - us-east1c
The impact on the operator is quite important: it would likely manage one StatefulSet per zone (with a name such as my-nodes-us-east1a). For each managed StatefulSet, the operator would set the right node affinity rule in the Pod spec, and set up the Elasticsearch configuration accordingly.

From a user perspective, we've not seen any asks for it yet. There could be a few reasons.
Maybe we should validate how much of a pain it is.
More than a pain, I would say it's a bit ugly (and risky from a maintenance point of view, as a single change might need to be done in N places without a logical need).
I will review why I haven't raised an Enhancement Request, but I've seen a lot of users who have configured N nodeSets with 1 node each just to separate nodes into different zones, mentioning that the YAML doesn't make much sense or looks awkward. Moreover, when they want an extra type of nodes, that requires N extra nodeSets instead of one.
When looking at those manifests it's obvious that they would look much nicer if ECK could handle some attributes to simplify them, because the config is the same in all nodeSets except for the pod allocation.
I believe we have a mixture of your 1. and 3. statements, but not the second :-D
I temporarily assigned this one to myself while exploring the idea of using a webhook to catch the calls to the binding subresource (pod/<podname>/bind). The body of the API call contains the node name, which allows a webhook to inject information such as rack, zone or region into the Pod after it is created but not yet started.
From a high level perspective the workflow is:

1. Intercept the calls to the pods/binding subresource using a webhook:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: elastic-webhook.k8s.elastic.co
webhooks:
- clientConfig:
    service:
      name: elastic-webhook-server
      namespace: elastic-system
      path: /pod-binding-k8s-elastic-co
  name: pod-binding-k8s-elastic-co.k8s.elastic.co
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    resources:
    - bindings
    - pods/binding
    operations:
    - CREATE
An example of the body of the intercepted request, which contains the target node name:

{
  "name": "elasticsearch-sample-es-default-0",
  "namespace": "default",
  "operation": "CREATE",
  "userInfo": {
    "username": "system:kube-scheduler",
    "uid": "uid:system:kube-scheduler",
    "groups": ["system:authenticated"]
  },
  "object": {
    "kind": "Binding",
    "apiVersion": "v1",
    "metadata": {
      "name": "elasticsearch-sample-es-default-0",
      "namespace": "default",
      "uid": "3db74e24-524c-45e7-8dc6-cc9728c3d871"
    },
    "target": {
      "kind": "Node",
      "name": "gke-michael-dev1-default-pool-00c61a3c-1290"
    }
  }
}
2. Inject the Node's labels into the Pod through some annotations:

func (wh *mutatingWebhook) Handle(ctx context.Context, request admission.Request) admission.Response {
    binding := &v1.Binding{}
    // Decode the /bind request
    err := wh.decoder.DecodeRaw(request.Object, binding)
    if err != nil {...}
    // get the topology keys from the K8S Nodes
    kv, err := getTopologyKeys(ctx, wh.client, binding.Target.Name)
    if err != nil {...}
    if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // 1. Read the current Pod specification
        key := client.ObjectKey{Namespace: binding.ObjectMeta.Namespace, Name: binding.ObjectMeta.Name}
        pod := &v1.Pod{}
        if err := wh.client.Get(ctx, key, pod); err != nil {...}
        if pod.Annotations == nil {
            pod.Annotations = make(map[string]string)
        }
        // 2. Add topology keys to the Pod's annotations
        for k, v := range kv {
            pod.Annotations[k] = v
        }
        // 3. Update the Pod
        return wh.client.Update(ctx, pod)
    }); err != nil {...}
}
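The getTopologyKeys helper referenced above is not shown in the snippet. Setting aside the API call that fetches the Node object, its core filtering step might look like the following sketch; the function name filterTopologyLabels and the prefix allow-list are assumptions for illustration, not the actual POC code:

```go
package main

import (
	"fmt"
	"strings"
)

// filterTopologyLabels keeps only the well-known topology labels from a
// Node's label map, so that only those end up as Pod annotations.
// The prefix allow-list below is an assumption; the real helper also
// performs the API call to fetch the Node.
func filterTopologyLabels(nodeLabels map[string]string) map[string]string {
	prefixes := []string{
		"topology.kubernetes.io/",
		"failure-domain.beta.kubernetes.io/", // deprecated, but still present on some clusters
	}
	out := make(map[string]string)
	for k, v := range nodeLabels {
		for _, p := range prefixes {
			if strings.HasPrefix(k, p) {
				out[k] = v
				break
			}
		}
	}
	return out
}

func main() {
	labels := map[string]string{
		"topology.kubernetes.io/zone":   "europe-west1-b",
		"topology.kubernetes.io/region": "europe-west1",
		"kubernetes.io/hostname":        "gke-node-1", // not a topology label, filtered out
	}
	kv := filterTopologyLabels(labels)
	fmt.Println(kv["topology.kubernetes.io/zone"], len(kv)) // europe-west1-b 2
}
```

Filtering on the Node side keeps the webhook from leaking arbitrary node labels onto Pods, which matters for the permission-separation concerns discussed later in the thread.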
This removes the need to have a NodeSet for each zone:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.15.1
  nodeSets:
  - name: default
    config:
      ##############################
      # Setup allocation awareness #
      node.attr.zone: ${ZONE}
      cluster.routing.allocation.awareness.attributes: zone
      cluster.routing.allocation.awareness.force.zone.values: ${ZONES}
      ##############################
    podTemplate:
      spec:
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-sample
              elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-default
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 4Gi
              cpu: 1
          env:
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          - name: ZONES
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zones']
The code of a POC is available here. I think the use of a webhook might be an acceptable tradeoff here.
@barkbay your suggestion made me think about a slightly different way to achieve something similar, in which we would rely on the operator to set the topology.kubernetes.io/zone annotation on the Pods, rather than relying on a mutating webhook:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.15.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
      node.attr.zone: ${ZONE}
      cluster.routing.allocation.awareness.attributes: zone
      cluster.routing.allocation.awareness.force.zone.values: ${ZONES}
    podTemplate:
      spec:
        initContainers:
        - name: wait-for-annotations
          command:
          - sh
          - -c
          - |
            while [[ $(grep -L 'topology.kubernetes.io/zone=' /mnt/elastic-internal/downward-volume/annotations) ]]; do
              echo Annotation topology.kubernetes.io/zone not set yet
              sleep 10
            done
        containers:
        - name: elasticsearch
          env:
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          volumeMounts:
          - name: downward-volume
            mountPath: /mnt/elastic-internal/downward-volume
        volumes:
        - name: downward-volume
          downwardAPI:
            items:
            - path: annotations
              fieldRef:
                fieldPath: metadata.annotations
In parallel, watch the logs of that init container:
❯ k logs -f elasticsearch-sample-es-default-0 -c wait-for-annotations
Annotation topology.kubernetes.io/zone not set yet
Annotation topology.kubernetes.io/zone not set yet
Annotation topology.kubernetes.io/zone not set yet
And annotate the Pod (simulating what the operator would do):
kubectl annotate pods elasticsearch-sample-es-default-0 topology.kubernetes.io/zone=us-east-1a
In this example we inherit the $ZONE environment variable from the downward API (Pod annotations), and expect the operator to set the annotation after the Pod has been created.
If the main container starts before the operator sets the annotation, the environment variable will be empty, which is not what we want here.
So the trick is to set up an init container that checks the content of a file kept up to date by the downward API (something we could add to the prepare-fs script, for example). Once the annotation exists in that file, we assume we can move on, and the main container will have the right environment variable populated.
@sebgl I never realized that the downward API updated the content of the volume in (almost?) realtime! It's indeed more clever to rely on the operator to set the annotations 👍
I started to think again about this feature and some implementation details. I think two important points should drive the design:
We want fine-grained control over which node labels can be injected. There should be a separation of concerns between the operator, which must be set up with the relevant permissions to read the node labels, and the resource's owner. Said differently, I don't think we want to inject all of the node's labels as annotations. I share the concerns raised here and here: allowing the operator to read node labels should not allow other users to read them without further control.
The resource's owner (only Elasticsearch for now?) should be able to explicitly control which labels are injected. If a label does not exist, the operator should prevent the Pod from starting: falling back to a default value, like an empty string, might have consequences on the data topology.
Implementation idea:
The operator would get a new flag (for example --expose-node-labels). It would accept a list of node labels, as regular expressions, allowed to be injected as annotations into the Pods. A sensible value could be topology.kubernetes.io/*,failure-domain.beta.kubernetes.io/* (even if the latter is deprecated it might still be used on some clusters).

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
spec:
  ...
        env:
        - name: ZONE
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['topology.kubernetes.io/zone']
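As a sketch of how the proposed flag could be evaluated, here is a minimal matcher that treats each comma-separated entry as an anchored regular expression. The flag name, the helper name isExposed, and the translation of the glob-style topology.kubernetes.io/* into the regexp topology.kubernetes.io/.* are all assumptions, not the actual ECK implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// isExposed reports whether a node label key matches one of the patterns
// passed to the hypothetical --expose-node-labels operator flag. Each
// pattern is compiled as an anchored regular expression.
func isExposed(patterns []string, labelKey string) bool {
	for _, p := range patterns {
		re, err := regexp.Compile("^" + p + "$")
		if err != nil {
			continue // a real implementation would reject invalid patterns at startup
		}
		if re.MatchString(labelKey) {
			return true
		}
	}
	return false
}

func main() {
	flagValue := "topology.kubernetes.io/.*,failure-domain.beta.kubernetes.io/.*"
	patterns := strings.Split(flagValue, ",")
	fmt.Println(isExposed(patterns, "topology.kubernetes.io/zone")) // true
	fmt.Println(isExposed(patterns, "kubernetes.io/hostname"))      // false
}
```

Anchoring the patterns matters: without the ^...$ wrapping, a pattern like zone would match any label key containing that substring.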
The feature would be only available for Elasticsearch.
I would create a dedicated init container, for separation of concerns and because it might help the feature to be reused for other resources if we do not agree with the previous point.
What if a label cannot be injected?
No default value would be set for this flag; the feature is disabled by default.
What is your reason for disabling by default? I can think of a few like: making it an explicit choice to share node information (even if it is only a topology hint) on the Pod level and maybe consistency with current behaviour (so that users that upgrade the operator don't suddenly get node metadata added on the Pod level).
If we then opt to have a default, as you suggest topology.kubernetes.io/*,failure-domain.beta.kubernetes.io/*, through the bundled configuration we ship, then that argument gets weaker? Maybe I am missing another reason? Because it gets awkward to turn it off with ""?
The argument in favour of enabling it by default restricted to topology.kubernetes.io/*... is that it 1. makes feature adoption easier (no need to reconfigure the operator, which might be the responsibility of a different team/admin) 2. thereby encourages more resilient Elasticsearch cluster architectures with less chance of data loss? 3. topology.kubernetes.io/* should be generally OK to share on the Pod level. 4. users can opt out via the operator flag if so desired.
I would create a dedicated init container, for separation of concerns and because it might help the feature to be reused for other resources if we do not agree with the previous point.
I generally agree. My only worry is that we are adding more and more init containers. We had the init filesystem container since the very early days. I recently added one to suspend Elasticsearch if the user needs to run debug or disaster recovery tooling. This would be the third one.
What if a label cannot be injected?
+1 on not using "" as the default but to stop reconciling the cluster instead. We might even be able to catch the user error of specifying non-existent labels in the validating webhook.
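A webhook check along those lines could stay purely syntactic, without calling the API server. A minimal sketch, assuming the annotation holds a comma-separated list of label keys and that a loose label-key regex is acceptable (this is not the actual Kubernetes qualified-name validation, and the function name is hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelKeyRe loosely matches a "prefix/name" Kubernetes label key. It is a
// simplified stand-in for the real Kubernetes qualified-name validation.
var labelKeyRe = regexp.MustCompile(`^([a-z0-9]([-a-z0-9.]*[a-z0-9])?/)?[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?$`)

// validateDownwardNodeLabels checks the comma-separated value of the
// eck.k8s.elastic.co/downward-node-labels annotation without any API call,
// so it could run in a validating webhook.
func validateDownwardNodeLabels(annotation string) error {
	for _, l := range strings.Split(annotation, ",") {
		l = strings.TrimSpace(l)
		if !labelKeyRe.MatchString(l) {
			return fmt.Errorf("invalid node label key: %q", l)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateDownwardNodeLabels("topology.kubernetes.io/zone")) // <nil>
	fmt.Println(validateDownwardNodeLabels("not a label!") != nil)         // true
}
```

This only catches malformed keys; checking that a well-formed key actually exists on the cluster's Nodes would still require an API call, which is the concern raised later in the thread.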
Because it gets awkward to turn it off with ""?
Yes. If the operator is bundled with this flag enabled by default (through the eck.yaml configuration file in the "all-in-one", in the OperatorHub and in our Helm charts) then I think we have the same result, but it makes it explicit which labels are exposed.
My only worry is that we are adding more and more init containers.
Agreed, I can start working on a first implementation with a dedicated container and take the time to discuss this point later.
We might even be able to catch the user error of specifying non-existent labels in the validating webhook.
👍
Labels are injected as annotations on Pods using an annotation on the Elasticsearch resource itself. The annotation contains a comma-separated list of the expected labels (for example eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone").
A drawback is that the entire cluster must be restarted when this annotation is changed: if the list of expected node labels is updated by the user on the Elasticsearch resource, the operator does not know which specific Pods should be restarted. I think it is an acceptable tradeoff if we consider that the list of expected annotations is unlikely to change frequently, and that the labels are likely to be used by most of the node sets.
We might even be able to catch the user error of specifying non-existent labels in the validating webhook.
A challenge here is that node labels may not be consistent across the K8S nodes. A Pod may be eventually scheduled on a K8S node which does not have all the expected labels. We might still want to validate that all the expected labels exist on at least one (same) node, but I'm a bit reluctant to use the K8S client in a webhook for now because of #5025.
I think #5054 is an "in-between" between doing nothing (people need to statically configure 3 StatefulSets with the right k8s + ES settings), and going all the way to making things simpler to the user.
Just to make it clear, here are 3 examples of different potential approaches to deploy a 3-node cluster into 3 different zones:
1) Current situation, users have to configure everything "manually":
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.5.0
  nodeSets:
  - name: hot-nodes-zone-a
    count: 1
    config:
      node.roles: ["data_hot"]
      node.attr.zone: europe-west3-a
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-a
  - name: hot-nodes-zone-b
    count: 1
    config:
      node.attr.zone: europe-west3-b
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-b
  - name: hot-nodes-zone-c
    count: 1
    config:
      node.attr.zone: europe-west3-c
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - europe-west3-c
2) With the proposal in https://github.com/elastic/cloud-on-k8s/pull/5054, we can rely on the Kubernetes topologySpreadConstraints feature to spread all Pods of the same StatefulSet across all available zones. A major benefit is that we have a single NodeSet in the spec. Important: K8s topologySpreadConstraints does not allow users to have their hot nodes in 2 zones if their k8s cluster has 3 zones (or their hot nodes in 3 zones if their k8s cluster has 4 zones).
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  annotations:
    # allows specifying which node label should be used to determine the availability zone name
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
spec:
  version: 7.5.0
  nodeSets:
  - name: hot-nodes
    count: 3
    config:
      node.roles: ["data_hot"]
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
      # use the $ZONE env var that contains the zone name
      node.attr.zone: $ZONE
    podTemplate:
      spec:
        # K8s will spread Pods across all available zones through the "topology.kubernetes.io/zone" node label
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              # the selector must match this cluster's own Pods
              # (the original snippet referenced elasticsearch-sample here)
              elasticsearch.k8s.elastic.co/cluster-name: quickstart
              elasticsearch.k8s.elastic.co/statefulset-name: quickstart-es-hot-nodes
        containers:
        - name: elasticsearch
          env:
          # the annotation added on the Pod by the operator, due to the
          # eck.k8s.elastic.co/downward-node-labels annotation feature,
          # is automatically transformed into an env var
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
3) Same technical approach as 2), but more opinionated approach that goes further into simplifying what the users need to do and understand (no es config + env var indirection to handle):
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.5.0
  nodeSets:
  - name: hot-nodes
    count: 3
    zoneAwareness:
      spreadAcrossAvailableZones: true # defaults to false
      # the 2 next fields are optional (this is their default value) to allow finer-grained control
      nodeLabel: topology.kubernetes.io/zone
      elasticsearchAttribute: zone
    config:
      node.roles: ["data_hot"]
What's specified in the spec in option 3) could technically also be specified through annotations, but that would become inconvenient if we want per-nodeset settings (we could also decide to let the user specify the PodTopologySpreadConstraints part at the nodeSet level, while we take care of the ES config if enabled through a cluster-wide annotation).
A few questions/comments regarding the third proposal:
- There can be several levels of topology (region/zone/room/rack); the proposal seems to only handle one.
- What about the other topologySpreadConstraints parameters like maxSkew or whenUnsatisfiable?

My feeling is that the last proposal is maybe a bit too opinionated, while some users might expect/need transparency and flexibility. Proposal 2, for which the PR is ready, is not incompatible with 3. We might just want to implement this feature later. Also the current PR, if merged, would not represent a very big burden to maintain IMHO.
My feeling is that the last proposal is maybe a bit too opinionated, while some users might expect/need transparency and flexibility. Proposal 2, for which the PR is ready, is not incompatible with 3. We might just want to implement this feature later.
++ I'm fine with moving forward with option 2 now, it looks like it can be used as a building block to maybe propose option 3 later.
Looking at this now @sebgl (sorry, had to catch up on the entire thread :D), my hunch would actually be to go for opt. 3.
The reason is that I think it makes for a much simpler UX/DX, i.e. I don't have to read much documentation before I understand what it does and how. It's just simple to use.
I'd actually say a better way of going about this would be to be opinionated in the implementation, get feedback, then relax and allow more granular controls if needed.
Reopening as we might also want to implement the last proposal described in https://github.com/elastic/cloud-on-k8s/issues/3933#issuecomment-979340774
Edit: I think I'll open a dedicated issue
Important: K8s TopologySpreadConstraints does not allow users to have their hot nodes in 2 zones if their k8s cluster has 3 zones (or have their hot nodes in 3 zones if their k8s cluster has 4 zones).
You can actually achieve this by mixing topologySpreadConstraints and nodeAffinity:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  annotations:
    # allows specifying which node label should be used to determine the availability zone name
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
spec:
  version: 7.16.0
  nodeSets:
  - name: default
    count: 6
    config:
      node.store.allow_mmap: false
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
      node.attr.zone: $ZONE
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: topology.gke.io/zone
                  operator: In
                  values: # restrict deployment to these 2 zones
                  - europe-west1-b
                  - europe-west1-c
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: quickstart
              elasticsearch.k8s.elastic.co/statefulset-name: quickstart-es-default
        containers:
        - name: elasticsearch
          env:
          # the annotation added on the Pod by the operator, due to the
          # eck.k8s.elastic.co/downward-node-labels annotation feature,
          # is automatically transformed into an env var
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
Edit: here is the result on a 3-node K8S cluster:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
gke-michael-dev-2-default-pool-0dc9aa68-drvb Ready <none> 3d8h v1.21.6-gke.1500 Container-Optimized OS from Google 5.4.144+ topology.kubernetes.io/zone=europe-west1-b
gke-michael-dev-2-default-pool-14c3d8f0-v741 Ready <none> 3d8h v1.21.6-gke.1500 Container-Optimized OS from Google 5.4.144+ topology.kubernetes.io/zone=europe-west1-d
gke-michael-dev-2-default-pool-95fa1e08-mzds Ready <none> 3d8h v1.21.6-gke.1500 Container-Optimized OS from Google 5.4.144+ topology.kubernetes.io/zone=europe-west1-c
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
quickstart-es-default-0 1/1 Running 0 2m26s 10.45.112.26 gke-michael-dev-2-default-pool-95fa1e08-mzds <none> <none>
quickstart-es-default-1 1/1 Running 0 2m26s 10.45.114.38 gke-michael-dev-2-default-pool-0dc9aa68-drvb <none> <none>
quickstart-es-default-2 1/1 Running 0 2m26s 10.45.112.27 gke-michael-dev-2-default-pool-95fa1e08-mzds <none> <none>
quickstart-es-default-3 1/1 Running 0 2m26s 10.45.114.40 gke-michael-dev-2-default-pool-0dc9aa68-drvb <none> <none>
quickstart-es-default-4 1/1 Running 0 2m26s 10.45.112.28 gke-michael-dev-2-default-pool-95fa1e08-mzds <none> <none>
quickstart-es-default-5 1/1 Running 0 2m26s 10.45.114.39 gke-michael-dev-2-default-pool-0dc9aa68-drvb <none> <none>
Only the 2 nodes in the desired zones are used.
You can actually achieve this by mixing topologySpreadConstraints and nodeAffinity
You're right. But then the user has to "statically" pick the zones, so we lose the benefit of letting the scheduler handle that.
And I was about to say: at which point the user might as well go back to defining one nodeSet per zone (also fully statically defined); but I think there's still some benefit in using a single nodeSet (less StatefulSet resources to "understand" and name, less plumbing in the manifest) 👍
You're right. But then the user has to "statically" pick the zones, so we lose the benefit of letting the scheduler handle that.
Agreed, just wanted to underline that topologySpreadConstraints has been designed to be compatible with nodeAffinity.
Proposal
At the moment the only way to configure nodes with different attributes for zone awareness is to create different nodeSets, knowing in advance the specific values for the attributes. So it's purely a static configuration, as we suggest in this document: https://www.elastic.co/guide/en/cloud-on-k8s/1.2/k8s-advanced-node-scheduling.html#k8s-availability-zone-awareness
It would be great to have, within the same nodeSet, the possibility of having certain parameters retrieved from variables or pod metadata, for example to associate a node.attr setting with an environment variable available in the Pod. If we already have that (configuring settings based on the pod's environment variables), then we would need to find a way for a running Pod to know the availability zone where it's running :)
For that, apparently there's a feature request on Kubernetes side (https://github.com/kubernetes/kubernetes/issues/62078) to allow node labels to be passed to the pods. If that's implemented then probably this would be easier.
Use case. Why is this important?
Some kubernetes environments are based on Autoscaling groups on multiple AZs with dynamic zones allocation. In such cases it would be great to provide the requested mechanisms.
The Elasticsearch manifest of a 3-node cluster where each node has to run in a different AZ doesn't look very nice when we need to create 3 different nodeSets with 1 replica each (same settings in each nodeSet except for the specific attribute for the AZ). With a single nodeSet (replica 3) + affinity rules the same goal could be achieved from a scheduling point of view, but in that case there's no way for the 3 nodes to know where (in which AZ) they are running. If we are able to implement the requested mechanism then we could easily build this type of configuration with a single nodeSet, based on some extra dynamic parameters.