elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Better Documentation for Performing Manual Rolling Restarts (of underlying hosts) of Persistent Clusters #5305

Open BenB196 opened 2 years ago

BenB196 commented 2 years ago

Proposal

Currently, if you manage an ECK cluster that uses persistent volumes and you need to perform a rolling restart of the underlying host nodes, the restart process is neither straightforward nor well documented.

There are several major gaps in the docs, and skipping the missing steps can leave a user in a bad state.

To best explain the issue, I'll write out the steps actually required to do something like this.

Assumptions:

Steps (note: these steps should be done for every host restart):

  1. Disable shard allocation

    • This is just part of the rolling upgrades docs
      PUT _cluster/settings
      {
        "persistent": {
          "cluster.routing.allocation.enable": "primaries"
        }
      }
  2. Exclude Elasticsearch cluster from ECK Operator management

    • This part is important for step 3
      ANNOTATION='eck.k8s.elastic.co/managed=false'
      kubectl annotate --overwrite elasticsearch quickstart $ANNOTATION
  3. Remove the transient cluster setting "transient.cluster.routing.allocation.exclude._name": "none_excluded"

    • This part is very important. If you don't do this, then setting "persistent.cluster.routing.allocation.enable": "primaries" does not work as intended, as shards will still reallocate.
    • Step 2 needs to be done before this, or else the ECK operator will just add this setting back
    • What happens if you leave this setting in?
      • If you leave this setting in, then shards will still attempt to reallocate themselves to other nodes. Depending on the size of your cluster (and even worse if you're using searchable snapshots), this could mean an immense amount of data trying to recover itself to other nodes.
        PUT _cluster/settings
        {
          "transient" : {
            "cluster" : {
              "routing" : {
                "allocation" : {
                  "enable" : null,
                  "exclude" : {
                    "_name" : null
                  }
                }
              }
            }
          }
        }
  4. You can follow the (Optional) steps in the Elasticsearch rolling upgrade guide (a typical example is sketched after this list)

  5. Drain your host node

    • Standard Kubernetes practice for restarting a host node
      kubectl drain <node_id> --delete-emptydir-data --ignore-daemonsets
  6. Restart host node

  7. Wait for the host node to come back online (see the readiness check sketched after this list)

  8. Uncordon node

    • Standard for Kubernetes
      kubectl uncordon <node_id>
  9. Wait for the Elasticsearch nodes to recover (see the node check sketched after this list)

  10. Remove "persistent.cluster.routing.allocation.enable": "primaries"

    • Standard step for Elasticsearch rolling restart
      PUT _cluster/settings
      {
        "persistent": {
          "cluster.routing.allocation.enable": null
        }
      }
  11. Reinclude Elasticsearch cluster into ECK operator management

    • What happens if you don't do this?
      • From what I've seen, if this isn't done between host restarts, the cluster can get into a bad state where, among other issues, nodes can no longer talk to the cluster. I've found that re-enabling operator management between restarts resolves this.
        RM_ANNOTATION='eck.k8s.elastic.co/managed-'
        kubectl annotate elasticsearch quickstart $RM_ANNOTATION
  12. Wait for the cluster to recover (see the health check sketched after this list).

  13. Repeat these steps for each host node.
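
For step 4, the (Optional) part of the rolling upgrade guide mostly boils down to flushing and, if you run machine learning jobs, pausing them. A minimal sketch of what that typically looks like (standard Elasticsearch APIs, not ECK-specific; adjust to your version):

    # Flush so shard recovery after the restart is faster
    POST _flush

    # Only if you run ML jobs: put ML into upgrade mode for the duration of the restart
    POST _ml/set_upgrade_mode?enabled=true

Remember to set enabled=false again once the restart is finished.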
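
For step 7, one way to wait for the host to report Ready again is plain kubectl (nothing ECK-specific; <node_name> is a placeholder):

    # Blocks until the node reports the Ready condition, or the timeout expires
    kubectl wait --for=condition=Ready node/<node_name> --timeout=15m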
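
For steps 9 and 12, the waits can be checked with standard Elasticsearch APIs; a minimal sketch (note that while allocation is still restricted to primaries the cluster will usually only reach yellow, so only wait for green after step 10):

    # Step 9: confirm the restarted Elasticsearch node(s) have rejoined the cluster
    GET _cat/nodes?v

    # Step 12: after allocation is re-enabled, block until the cluster is green
    # (returns with "timed_out": true if the timeout expires first)
    GET _cluster/health?wait_for_status=green&timeout=120s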

sebgl commented 2 years ago

@BenB196 thanks a lot for reporting this and providing very detailed instructions. We need to think about how we can make this whole process much simpler.

mdf-ido commented 2 years ago

I have PVs and ran into a situation where the volumes could not be re-attached. The error from describing the pod was "Multi-attach error for volume "pvc-####-####": Volume is already exclusively attached to one node". Would having an NFS share be a workaround? I know that having local disk is recommended, but I wanted to ask about the possibility.

BenB196 commented 2 years ago

@mdf-ido how are your PVs provisioned? If you're using something like Longhorn, look into https://longhorn.io/kb/troubleshooting-volume-with-multipath/. I use something similar for Dev/Stage clusters for easier maintenance, and have run into issues in the past with multipath locking volumes.

Edit:

Btw, I don't think that issue is directly related to ECK; it sounds more like a Kubernetes/host issue than an ECK one.

mdf-ido commented 2 years ago

Hi Ben! Thanks for the quick reply. I am using AKS, and the PVs are provisioned dynamically with the Azure built-in storage classes.

bmoe24x commented 1 year ago

@BenB196 @sebgl Anybody know if there were ever improvements made to this process? We are running into a similar issue that will likely put an end to the possibility of us upgrading to an Enterprise license.

We allowed the Operator to perform a rolling restart following our upgrade of the Operator to version 2.6.1. For a roughly 80-data-node deployment with around 1.5 TB of disk usage per node, the restart took over 40 hours and significantly impacted user latencies.

This isn't something that causes issues in smaller-volume environments or clusters, but in clusters with non-negligible data size we saw terabytes of I/O from shards (both primary and replica) being moved all over the cluster. Ideally the cluster should simply promote a replica to primary and wait for the restarted node to come back online, since the data is still available on its PVC. We tried to manually set both the persistent and transient allocation/rebalance settings, but the Operator immediately overrides them with a transient allocation setting of 'all'.
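
For illustration only, roughly the kind of manual settings change being described (the exact keys and values here are an assumption, not the request that was actually run):

    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries",
        "cluster.routing.rebalance.enable": "none"
      },
      "transient": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

The operator then immediately writes "cluster.routing.allocation.enable": "all" back as a transient setting, as described above.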

Would love to get more information about how to improve this process.

hartfordfive commented 1 year ago

Any updates on finalizing this documentation?