elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Better Documentation for Performing Manual Rolling Restarts (of underlying hosts) of Persistent Clusters #5305

Open BenB196 opened 2 years ago

BenB196 commented 2 years ago

Proposal

Currently, if you manage an ECK cluster that uses persistent volumes and you need to perform a rolling restart of the underlying host nodes, the restart process is neither straightforward nor well documented.

There are several major gaps in the docs, and skipping the missing steps can leave a user in a bad state.

To best explain the issue, I'll write out the steps actually required to do something like this.

Assumptions:

Steps (note: these steps should be done for every host restart):

  1. Disable shard allocation

    • This is just part of the rolling upgrades docs
      PUT _cluster/settings
      {
        "persistent": {
          "cluster.routing.allocation.enable": "primaries"
        }
      }
  2. Exclude Elasticsearch cluster from ECK Operator management

    • This part is important for step 3
      ANNOTATION='eck.k8s.elastic.co/managed=false'
      kubectl annotate --overwrite elasticsearch quickstart $ANNOTATION
  3. Remove the transient cluster setting "transient.cluster.routing.allocation.exclude._name": "none_excluded"

    • This part is very important. If you don't do this, then setting "persistent.cluster.routing.allocation.enable": "primaries" does not work as intended, as shards will still reallocate.
    • Step 2 needs to be done before this, or else the ECK operator will just add this setting back
    • What happens if you leave this setting in?
      • If you leave this setting in, then shards will still attempt to reallocate themselves to other nodes. Depending on the size of your cluster (and even worse if you're using searchable snapshots), this could mean an immense amount of data trying to recover itself to other nodes.
        PUT _cluster/settings
        {
          "transient" : {
            "cluster" : {
              "routing" : {
                "allocation" : {
                  "enable" : null,
                  "exclude" : {
                    "_name" : null
                  }
                }
              }
            }
          }
        }
  4. You can follow the (Optional) steps in the Elasticsearch rolling upgrade guide (a typical example is sketched after this list)

  5. Drain your host node

    • Standard Kubernetes practice for restarting a host node
      kubectl drain <node_id> --delete-emptydir-data --ignore-daemonsets
  6. Restart host node

  7. Wait for the host node to come back online (see the readiness check sketched after this list)

  8. Uncordon node

    • Standard for Kubernetes
      kubectl uncordon <node_id>
  9. Wait for the Elasticsearch nodes to recover (see the node check sketched after this list)

  10. Remove "persistent.cluster.routing.allocation.enable": "primaries"

    • Standard step for Elasticsearch rolling restart
      PUT _cluster/settings
      {
        "persistent": {
          "cluster.routing.allocation.enable": null
        }
      }
  11. Reinclude Elasticsearch cluster into ECK operator management

    • What happens if you don't do this?
      • From what I've seen, if this isn't done between host restarts, the cluster can get into a bad state where, among other issues, nodes can no longer talk to the cluster. I've found that re-enabling operator management between restarts resolves this.
        RM_ANNOTATION='eck.k8s.elastic.co/managed-'
        kubectl annotate elasticsearch quickstart $RM_ANNOTATION
  12. Wait for the cluster to recover (see the health check sketched after this list).

  13. Repeat these steps for each host node.
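
For step 4, the (Optional) part of the rolling upgrade guide mostly boils down to flushing and, if you run machine learning jobs, pausing them. A minimal sketch of what that typically looks like (standard Elasticsearch APIs, not ECK-specific; adjust to your version):

    # Flush so shard recovery after the restart is faster
    POST _flush

    # Only if you run ML jobs: put ML into upgrade mode for the duration of the restart
    POST _ml/set_upgrade_mode?enabled=true

Remember to set enabled=false again once the restart is finished.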
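
For step 7, one way to wait for the host to report Ready again is plain kubectl (nothing ECK-specific; <node_name> is a placeholder):

    # Blocks until the node reports the Ready condition, or the timeout expires
    kubectl wait --for=condition=Ready node/<node_name> --timeout=15m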
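
For steps 9 and 12, the waits can be checked with standard Elasticsearch APIs; a minimal sketch (note that while allocation is still restricted to primaries the cluster will usually only reach yellow, so only wait for green after step 10):

    # Step 9: confirm the restarted Elasticsearch node(s) have rejoined the cluster
    GET _cat/nodes?v

    # Step 12: after allocation is re-enabled, block until the cluster is green
    # (returns with "timed_out": true if the timeout expires first)
    GET _cluster/health?wait_for_status=green&timeout=120s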

sebgl commented 2 years ago

@BenB196 thanks a lot for reporting this and providing very detailed instructions. We need to think about how we can make this whole process much simpler.

mdf-ido commented 2 years ago

I have PVs and ran into a situation where the volumes could not be re-attached. The error from describing the pod was "Multi-attach error for volume "pvc-####-####": Volume is already exclusively attached to one node". Would having an NFS share be a workaround? I know that having local disk is recommended, but I wanted to ask about the possibility.

BenB196 commented 2 years ago

@mdf-ido how are your PVs provisioned? If you're using something like Longhorn, look into https://longhorn.io/kb/troubleshooting-volume-with-multipath/. I use something similar for Dev/Stage clusters for easier maintenance, and have run into issues in the past with multipath locking volumes.

Edit:

Btw, I don't think that issue is directly related to ECK; it sounds more like a Kubernetes/host issue than an ECK one.

mdf-ido commented 2 years ago

Hi Ben! Thanks for the quick reply. I am using AKS, and the PVs are provisioned dynamically with the Azure built-in storage classes.

bmoe24x commented 1 year ago

@BenB196 @sebgl Anybody know if there were ever improvements made to this process? We are running into a similar issue that will likely put an end to the possibility of us upgrading to an Enterprise license.

We allowed the Operator to perform a rolling restart following our upgrade of the Operator to version 2.6.1. For a roughly 80-data-node deployment with around 1.5 TB of disk usage per node, the restart took over 40 hours and significantly impacted user latencies.

This isn't something that causes issues in smaller-volume environments or clusters, but in clusters with non-negligible data size we saw terabytes of I/O from shards (both primary and replica) being moved all over the cluster. Ideally the cluster should simply promote a replica to primary and wait for the restarted node to come back online, since the data is still available on its PVC. We tried to manually set both the persistent and transient allocation/rebalance settings, but the Operator immediately overrides them with a transient allocation setting of 'all'.
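
For illustration only, roughly the kind of manual settings change being described (the exact keys and values here are an assumption, not the request that was actually run):

    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries",
        "cluster.routing.rebalance.enable": "none"
      },
      "transient": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

The operator then immediately writes "cluster.routing.allocation.enable": "all" back as a transient setting, as described above.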

Would love to get more information about how to improve this process.

hartfordfive commented 1 year ago

Any updates on finalizing this documentation?