How many pods with the master role are in your Elasticsearch cluster? Fewer than 3? If so, your Elasticsearch cluster is not HA and will lose quorum during an upgrade anyway. ECK restarts all nodes at once for non-HA Elasticsearch clusters, as there is no point in orchestrating this differently.
Please reopen if your cluster actually has three or more master nodes, or if you think you have found a bug.
[Updated to be more precise about the role]
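To make the quorum point concrete: Elasticsearch needs a strict majority of the master-eligible nodes to be available in order to elect a master. A minimal Go sketch (illustrative only, not ECK or Elasticsearch code):

```go
package main

import "fmt"

// quorum returns how many master-eligible nodes must be available for the
// cluster to elect a master: a strict majority.
func quorum(masterEligible int) int {
	return masterEligible/2 + 1
}

func main() {
	// With 2 masters the quorum is 2: restarting either node loses quorum,
	// so there is no way to roll the cluster without an outage.
	fmt.Println(quorum(2)) // 2
	// With 3 masters the quorum is also 2: one node can restart at a time
	// while the remaining two keep the cluster available.
	fmt.Println(quorum(3)) // 2
}
```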
@pebrc The cluster consists of around 30 nodes, so more than 3 :)
@SiorMeir How many of those 30 are master nodes?
@pebrc 3 nodes are designated as master nodes
Were the three master nodes up and running at the point in time you were running the upgrade?
Can you maybe share your Elasticsearch YAML manifest, so we can better understand your setup/architecture?
@pebrc Update: we've increased the number of master nodes and saw a graceful rollout, so it seems there is no actual issue.
However, we do have a smaller monitoring cluster with 2 nodes, both designated as masters, that rolled out one at a time. There is no quorum there either, so we can't figure out the difference.
In addition, we saw a difference in the Manager attribute between clusters: some had node-fetch as the value while others (the ones that rolled out correctly) had elastic-operator. This is not an attribute we changed in our configs.
Moreover, this is the second upgrade we're performing. We hadn't noticed the need for quorum in the previous upgrade (version 8.9.0 to 8.12.2). Was this behavior added in a recent version of Elasticsearch? Of the operator?
I'm attaching the attributes of the monitoring cluster for your reference:
```
Namespace:    elasticsearch-recs
Labels:       app=elasticsearch-recs-monitoring
              app.kubernetes.io/managed-by=Helm
              assetuuid=a5d8e2d5-1596-48f7-9028-697490f1e53a
              base-chart-version=4.0.9
              commit=29059cae049a52f22a831ffcff126d797fbab613
              environment=staging
              k8slens-edit-resource-version=v1
              release=elasticsearch-recs-monitoring
Annotations:  eck.k8s.elastic.co/downward-node-labels: topology.kubernetes.io/zone
              eck.k8s.elastic.co/managed: true
              eck.k8s.elastic.co/orchestration-hints:
                {"no_transient_settings":true,"service_accounts":true,"desired_nodes":{"version":4,"hash":"1305681621"}}
              elasticsearch.k8s.elastic.co/cluster-uuid: ZAd4inqKS8uq6WbUydAALg
              meta.helm.sh/release-name: elasticsearch-recs-monitoring
              meta.helm.sh/release-namespace: elasticsearch-recs
API Version:  elasticsearch.k8s.elastic.co/v1
Kind:         Elasticsearch
Metadata:
  Creation Timestamp:  2024-07-01T12:49:40Z
  Generation:          6
  Resource Version:    1888763548
  UID:                 3f834e8e-e97b-41ac-a526-bc845a25ec51
Spec:
  Auth:
  Http:
    Service:
      Metadata:
      Spec:
    Tls:
      Certificate:
      Self Signed Certificate:
        Disabled:  true
  Monitoring:
    Logs:
    Metrics:
  Node Sets:
    Config:
      node.roles:
        master
        data
        ingest
      xpack.security.authc:
        Anonymous:
          authz_exception:  false
          Roles:            superuser
          Username:         anonymous
    Count:  2
    Name:   monitoring
    Pod Template:
      Metadata:
        Creation Timestamp:  <nil>
        Labels:
          App:      elasticsearch-recs-monitoring
          Release:  elasticsearch-recs-monitoring
      Spec:
        Containers:
          Image:              elasticsearch:8.14.3
          Image Pull Policy:  IfNotPresent
          Name:               elasticsearch
          Resources:
            Limits:
              Cpu:     2
              Memory:  6Gi
            Requests:
              Cpu:     2
              Memory:  6Gi
        Dns Config:
          Options:
            Name:   ndots
            Value:  1
        Service Account Name:  default
    Volume Claim Templates:
      Metadata:
        Annotations:
          k8s-pvc-tagger/tags:  {"Environment": "staging", "Assetuuid": "a5d8e2d5-1596-48f7-9028-697490f1e53a", "karpenter.sh/nodepool": "elasticsearch"}
        Labels:
          App:          elasticsearch-recs-monitoring
          Assetuuid:    a5d8e2d5-1596-48f7-9028-697490f1e53a
          Environment:  staging
          Release:      elasticsearch-recs-monitoring
        Name:  elasticsearch-data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:  100Gi
        Storage Class Name:  gp3
        Volume Mode:         Filesystem
  Pod Disruption Budget:
    Metadata:
    Spec:
  Transport:
    Service:
      Metadata:
      Spec:
    Tls:
      Certificate:
      Certificate Authorities:
  Update Strategy:
    Change Budget:
  Version:  8.14.3
Status:
  Available Nodes:  2
  Conditions:
    Last Transition Time:  2024-08-01T05:27:02Z
    Status:                True
    Type:                  ReconciliationComplete
    Last Transition Time:  2024-07-29T13:21:59Z
    Message:               All nodes are running version 8.14.3
    Status:                True
    Type:                  RunningDesiredVersion
    Last Transition Time:  2024-08-01T05:27:02Z
    Message:               Service elasticsearch-recs/elasticsearch-recs-monitoring-es-internal-http has endpoints
    Status:                True
    Type:                  ElasticsearchIsReachable
    Last Transition Time:  2024-07-29T13:21:05Z
    Message:               Successfully calculated compute and storage resources from Elasticsearch resource generation 6
    Status:                True
    Type:                  ResourcesAwareManagement
  Health:  green
  In Progress Operations:
    Downscale:
      Last Updated Time:  2024-07-01T12:49:42Z
    Upgrade:
      Last Updated Time:  2024-07-29T13:22:40Z
    Upscale:
      Last Updated Time:  2024-07-01T12:49:42Z
  Observed Generation:  6
  Phase:                Ready
  Version:              8.14.3
Events:  <none>
```
Appreciate the help!
So just to clarify: ECK does a full cluster restart if the update being applied is a version upgrade and the cluster is not HA (fewer than 3 master nodes). Here is the relevant code.
Any other non-version change (e.g. changes to environment variables and the like) is applied in a rolling fashion.
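For illustration, the decision described above boils down to roughly the following. This is a minimal Go sketch with made-up names, not the actual ECK implementation; the linked code is the authoritative version:

```go
package main

import "fmt"

type cluster struct {
	masterNodes      int  // pods carrying the master role
	isVersionUpgrade bool // does the pending change bump spec.version?
}

// fullClusterRestart reports whether all pods are deleted at once
// instead of being rotated one node at a time.
func fullClusterRestart(c cluster) bool {
	// Fewer than 3 master-eligible nodes means the cluster cannot keep
	// quorum while a master restarts, so a rolling version upgrade would
	// cause an outage anyway.
	highlyAvailable := c.masterNodes >= 3
	return c.isVersionUpgrade && !highlyAvailable
}

func main() {
	fmt.Println(fullClusterRestart(cluster{masterNodes: 2, isVersionUpgrade: true}))  // true: full restart
	fmt.Println(fullClusterRestart(cluster{masterNodes: 3, isVersionUpgrade: true}))  // false: rolling upgrade
	fmt.Println(fullClusterRestart(cluster{masterNodes: 2, isVersionUpgrade: false})) // false: non-version changes roll
}
```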
> Moreover, this is the second upgrade we're performing. We hadn't noticed the need for quorum in the previous upgrade (version 8.9.0 to 8.12.2). Was this behavior added in a recent version of Elasticsearch? Of the operator?
This logic has been in place since ECK 2.1.0 (March 2022). https://github.com/elastic/cloud-on-k8s/pull/5408 has more details on the motivation behind it.
> In addition, we saw a difference in the Manager attribute between clusters: some had node-fetch as the value while others (the ones that rolled out correctly) had elastic-operator. This is not an attribute we changed in our configs.
I am not sure I understand what you are referring to here.
Given that we haven't found any evidence of a bug so far, I am closing this issue again, as the behaviour you are seeing is expected.
Non-Graceful Cluster Rollout on Version Change on ECK 2.10

What did you do? Upgraded the version of Elasticsearch (& Kibana) from 8.12.2 to 8.14.1.
What did you expect to see? Graceful rollout of the cluster, where nodes are replaced one at a time.
What did you see instead? Under which circumstances? All nodes went down at the same time and the operator changed the status to UNKNOWN, resulting in downtime. However, manually changing the image of the different node types, or adding environment variables or settings, caused a graceful rollout of the cluster.
Environment
ECK version: 2.10
Kubernetes information:
Logs: