elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Non-Graceful Cluster Rollout on `Version` Change #7979

Closed: SiorMeir closed this issue 3 months ago

SiorMeir commented 3 months ago

Non-Graceful Cluster Rollout on Version Change on ECK 2.10

What did you do? Upgraded the version of Elasticsearch (& Kibana) from 8.12.2 to 8.14.1

What did you expect to see? Graceful rollout of cluster, where nodes are replaced one at a time.

What did you see instead? Under which circumstances? All nodes went down at the same time and the operator changed the cluster status to UNKNOWN, resulting in downtime.

However, manually changing the image of the different node types, or adding environment variables or settings, caused a graceful rollout of the cluster.

Environment

pebrc commented 3 months ago

How many pods with the master role are in your Elasticsearch cluster? Fewer than 3? If so, your Elasticsearch cluster is not HA and will lose quorum during an upgrade anyway. ECK restarts all nodes at once for non-HA Elasticsearch clusters, as there is no point in orchestrating this differently.
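For anyone who wants to check this count programmatically, here is a minimal client-go sketch. The namespace, cluster name, and the ECK role label used in the selector are assumptions based on this thread and ECK's usual pod labelling, not verified against your setup:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (the same credentials kubectl uses).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Count pods labelled as master-eligible for a given cluster.
	// The namespace, cluster name, and node-master label below are assumptions
	// taken from this thread; adjust them to your own deployment.
	pods, err := clientset.CoreV1().Pods("elasticsearch-recs").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "elasticsearch.k8s.elastic.co/cluster-name=elasticsearch-recs-monitoring," +
			"elasticsearch.k8s.elastic.co/node-master=true",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("master-eligible pods: %d (fewer than 3 means the cluster is not HA)\n", len(pods.Items))
}
```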

Please reopen if your cluster actually has 3 or more master nodes, or if you think you have found a bug.

[Updated to be more precise about the role]

SiorMeir commented 3 months ago

@pebrc The cluster consists of around 30 nodes, so more than 3 :)

pebrc commented 3 months ago

@SiorMeir How many of those 30 are master nodes?

SiorMeir commented 3 months ago

@pebrc 3 nodes are designated as master nodes

pebrc commented 3 months ago

Were the three master nodes up and running at the point in time you were running the upgrade?

Can you maybe share your Elasticsearch YAML manifest, so we can better understand your setup/architecture?

SiorMeir commented 3 months ago

@pebrc Update: we've increased the number of master nodes and saw a graceful rollout, so it seems there is no actual issue.

However, we do have a smaller monitoring cluster with 2 nodes, both designated as masters, that rolled out one at a time. There is no quorum there either, so we can't figure out the difference.

In addition, we saw a difference in the Manager attribute between clusters: some had node-fetch as the value while others (the ones that rolled out correctly) had elastic-operator. This is not an attribute we changed in our configs.

Moreover, this is the second upgrade we're performing; we didn't notice the need for quorum in the previous upgrade (version 8.9.0 to 8.12.2). Was this behavior added in a recent version of Elasticsearch, or of the operator?

I'm attaching the attributes of the monitoring cluster for reference:

Namespace:    elasticsearch-recs
Labels:       app=elasticsearch-recs-monitoring
              app.kubernetes.io/managed-by=Helm
              assetuuid=a5d8e2d5-1596-48f7-9028-697490f1e53a
              base-chart-version=4.0.9
              commit=29059cae049a52f22a831ffcff126d797fbab613
              environment=staging
              k8slens-edit-resource-version=v1
              release=elasticsearch-recs-monitoring
Annotations:  eck.k8s.elastic.co/downward-node-labels: topology.kubernetes.io/zone
              eck.k8s.elastic.co/managed: true
              eck.k8s.elastic.co/orchestration-hints:
                {"no_transient_settings":true,"service_accounts":true,"desired_nodes":{"version":4,"hash":"1305681621"}}
              elasticsearch.k8s.elastic.co/cluster-uuid: ZAd4inqKS8uq6WbUydAALg
              meta.helm.sh/release-name: elasticsearch-recs-monitoring
              meta.helm.sh/release-namespace: elasticsearch-recs
API Version:  elasticsearch.k8s.elastic.co/v1
Kind:         Elasticsearch
Metadata:
  Creation Timestamp:  2024-07-01T12:49:40Z
  Generation:          6
  Resource Version:    1888763548
  UID:                 3f834e8e-e97b-41ac-a526-bc845a25ec51
Spec:
  Auth:
  Http:
    Service:
      Metadata:
      Spec:
    Tls:
      Certificate:
      Self Signed Certificate:
        Disabled:  true
  Monitoring:
    Logs:
    Metrics:
  Node Sets:
    Config:
      node.roles:
        master
        data
        ingest
      xpack.security.authc:
        Anonymous:
          authz_exception:  false
          Roles:            superuser
          Username:         anonymous
    Count:                  2
    Name:                   monitoring
    Pod Template:
      Metadata:
        Creation Timestamp:  <nil>
        Labels:
          App:      elasticsearch-recs-monitoring
          Release:  elasticsearch-recs-monitoring
      Spec:
        Containers:
          Image:              https://hub.docker.com/_/elasticsearch:8.14.3
          Image Pull Policy:  IfNotPresent
          Name:               elasticsearch
          Resources:
            Limits:
              Cpu:     2
              Memory:  6Gi
            Requests:
              Cpu:     2
              Memory:  6Gi
        Dns Config:
          Options:
            Name:              ndots
            Value:             1
        Service Account Name:  default
    Volume Claim Templates:
      Metadata:
        Annotations:
          k8s-pvc-tagger/tags:  {"Environment": "staging", "Assetuuid": "a5d8e2d5-1596-48f7-9028-697490f1e53a", "karpenter.sh/nodepool": "elasticsearch"}
        Labels:
          App:          elasticsearch-recs-monitoring
          Assetuuid:    a5d8e2d5-1596-48f7-9028-697490f1e53a
          Environment:  staging
          Release:      elasticsearch-recs-monitoring
        Name:           elasticsearch-data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         100Gi
        Storage Class Name:  gp3
        Volume Mode:         Filesystem
  Pod Disruption Budget:
    Metadata:
    Spec:
  Transport:
    Service:
      Metadata:
      Spec:
    Tls:
      Certificate:
      Certificate Authorities:
  Update Strategy:
    Change Budget:
  Version:  8.14.3
Status:
  Available Nodes:  2
  Conditions:
    Last Transition Time:  2024-08-01T05:27:02Z
    Status:                True
    Type:                  ReconciliationComplete
    Last Transition Time:  2024-07-29T13:21:59Z
    Message:               All nodes are running version 8.14.3
    Status:                True
    Type:                  RunningDesiredVersion
    Last Transition Time:  2024-08-01T05:27:02Z
    Message:               Service elasticsearch-recs/elasticsearch-recs-monitoring-es-internal-http has endpoints
    Status:                True
    Type:                  ElasticsearchIsReachable
    Last Transition Time:  2024-07-29T13:21:05Z
    Message:               Successfully calculated compute and storage resources from Elasticsearch resource generation 6
    Status:                True
    Type:                  ResourcesAwareManagement
  Health:                  green
  In Progress Operations:
    Downscale:
      Last Updated Time:  2024-07-01T12:49:42Z
    Upgrade:
      Last Updated Time:  2024-07-29T13:22:40Z
    Upscale:
      Last Updated Time:  2024-07-01T12:49:42Z
  Observed Generation:    6
  Phase:                  Ready
  Version:                8.14.3
Events:                   <none>

Appreciate the help!

pebrc commented 3 months ago

So just to clarify: ECK does a full cluster restart if the update being applied is a version upgrade and the cluster is not HA (fewer than 3 master nodes). Here is the relevant code:

https://github.com/elastic/cloud-on-k8s/blob/d149f2352acacb2e2b33b978b63f0f5785f87121/pkg/controller/elasticsearch/driver/upgrade.go#L107-L114

Any other change that is not a version upgrade (e.g. changes to environment variables and the like) will be applied in a rolling fashion.
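In case it helps readers of the thread, here is a minimal paraphrase of that decision; the linked upgrade.go is the authoritative code, and the function name, signature, and constant below are invented for illustration only:

```go
package main

import "fmt"

// useFullClusterRestart is a simplified paraphrase of the decision described
// above, not the actual ECK implementation: a version upgrade of a cluster
// with fewer than 3 master-eligible nodes will lose quorum no matter how the
// pods are cycled, so all pods are restarted at once; any other change is
// rolled out one pod at a time.
func useFullClusterRestart(isVersionUpgrade bool, masterEligibleNodes int) bool {
	const minMastersForHA = 3 // HA threshold as described in this thread
	return isVersionUpgrade && masterEligibleNodes < minMastersForHA
}

func main() {
	fmt.Println(useFullClusterRestart(true, 2))  // true: version upgrade of a non-HA cluster, full restart
	fmt.Println(useFullClusterRestart(true, 3))  // false: HA cluster, rolling upgrade
	fmt.Println(useFullClusterRestart(false, 2)) // false: non-version changes are always rolled
}
```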

Moreover, this is the second upgrade we're performing; we didn't notice the need for quorum in the previous upgrade (version 8.9.0 to 8.12.2). Was this behavior added in a recent version of Elasticsearch, or of the operator?

This logic has been in place since ECK 2.1.0 (March 2022) https://github.com/elastic/cloud-on-k8s/pull/5408 has more details on the motivation behind it.

In addition, we saw a difference in the Manager attribute between clusters: some had node-fetch as the value while others (the ones that rolled out correctly) had elastic-operator. This is not an attribute we changed in our configs.

I am not sure I understand what you are referring to here.

Given that we haven't found any evidence of a bug so far, I am closing this issue again, as the behaviour you are seeing is expected.