Make shutdown API allocation_delay configurable and review default value

tommasopozzetti commented 2 years ago

Bug Report

After upgrading the operator to version 2+, we started noticing rolling upgrade procedures being much slower. Upon looking more into it, the utilization of the shutdown API instead of the legacy transient settings was causing Elasticsearch to start reallocation of indices before the restarted nodes were rejoining the cluster, thus taking a very long time to heal after each node restart.

This seems to be caused by the fact that the operator now relies on a timeout rather than waiting for the pod to be back up and manually restart allocations and therefore, if that timeout expires before the node rejoins the cluster, the shards start getting reallocated.

I believe all this is the expected behavior of the shutdown API, however, I think there might be a misalignment in terms of the default timeout. The shutdown API allows to specify a custom timeout (allocation_delay) when performing the shutdown call, but that doesn't seem to get used by the operator, which instead relies on the default allocation_delay of the indices as seen by the comment here. The comment seems to suggest the operator should rely on the default timeout of 5m, however, Elasticsearch documentation reports the default timeout to be 1m (see here.

1 minute definitely seems to be a very short timeout in Kubernetes where if anything in the lifecycle of the pod (termination, unmounting of volumes, scheduling of the new pod, image pulling, mounting of volumes, initcontainers....) takes a little too long to execute, there is a high risk of the node not rejoining the cluster in time (which is what has happened to us for multiple nodes).

To further confirm this is the issue, I have changed every index in the cluster to have a higher delayed_timeout (10m) and ran another upgrade and this time everything completed successfully without any delays. This solution, however, is very tedious because delayed_timeout is a per index setting (so it needs to be applied to every index and needs to be configured on every index template to ensure new indices inherit it). Furthermore, this setting is meant more for unplanned issues where possibly the 1m default response time makes sense, while it might make sense to have a higher one for planned maintenances if it is known that the maintenance could take longer.

My proposal is to set a custom allocation_delay in the shutdown call to override the default one and set a sane default of at least 5m in that call. Furthermore, possibly make that value configurable on the operator level so that, for clusters with particular needs that might expect an even higher timeout to be needed, it is possible to override it and set a proper timeout that matches the needs of that cluster.

pebrc commented 2 years ago

The default value for allocation_delay is still 5 minutes. I realise that the documentation I linked to in the comment does not (no longer?) mention that but the code is here:

https://github.com/elastic/elasticsearch/blob/51c1f27b94d2e8bb55a76b07027a535267f895d6/server/src/main/java/org/elasticsearch/cluster/metadata/SingleNodeShutdownMetadata.java#L78

I do however understand that there might be situations where a fixed 5 minute value is not enough. When you say you would like this to be configurable on the operator level then am I right in assuming you would like to have per cluster overrides for this? We could support an annotation per cluster or even an explicit field in the CRD to set this per cluster

tommasopozzetti commented 2 years ago

Hi @pebrc, thank you for taking a look at this.

This is interesting, because I am fairly sure none of my nodes took longer than 5 minutes to rejoin the cluster, but more in the 60 to 90 seconds range and yet shards were starting to be reallocated. I can run a further test about this and confirm if I'm seeing this behavior or not. I wonder if there might be some underlying issue in the allocation_delay behavior then and it simply ends up using the per-index delay setting which when I bumped seemed to fix the problem.

Either way, yes I agree that configuring this value per Elasticsearch cluster seems to be the most reasonable solution and I think either via annotation or CRD spec would work perfectly for this purpose.

pebrc commented 2 years ago

There is an open bug in Elasticsearch where shards in the cold tier start relocating even though the allocation_delay timeout has not yet expired. I would also be good to understand whether you are affected by this issue https://github.com/elastic/elasticsearch/issues/85052

tommasopozzetti commented 2 years ago

If I understand correctly, that issue should affect searchable snapshots in the cold tier. I do have some indices in the cold phase of ILM, but we do not have any separated data tiers (all data nodes can allocate all data tiers) and we don't use searchable snapshots so unless the issue is more extended than it seems I don't think it is the same one

tommasopozzetti commented 2 years ago

I just ran the same test on a brand new cluster with no ILM configured and just a handful of indices.

I edited a node config in the Elasticsearch CRD forcing the operator to run an upgrade procedure on the data nodes (let me note here that the data nodes are pinned to specific physical nodes due to the use of local storage). As soon as the operator has terminated the first pod, I cordoned the node so that the pod remained in pending status since it was unschedulable. In the meantime, I was monitoring the Elasticsearch cluster status. As soon as it went to yellow due to the node leaving, I started a timer. Exactly 1 minute after starting the timer, all shards that were delayed got reallocated to the other nodes.

Edit: I should note I am running ES version 7.16.3 and the operator version 2.1.0

pebrc commented 2 years ago

I think this is an Elasticsearch bug. I was able to reproduce your issue and have raised https://github.com/elastic/elasticsearch/issues/85922

tommasopozzetti commented 2 years ago

@pebrc Thank you for reproducing and confirming!

Looks like a combination of fixing that bug on the Elasticsearch side and possibly adding a configurable in the operator to override the default should address this issue!

pebrc commented 2 years ago

I have created another issue because it seems that the shorter re-allocation delay is caused by the operator deleting the node shutdown records too early and Elasticsearch falling back to the default 1 minute timeout because of that. https://github.com/elastic/cloud-on-k8s/issues/5617 (a separate issue so we can fix that with highest priority and include in a patch release if necessary)

The relevant discussion starts here https://github.com/elastic/elasticsearch/issues/85922#issuecomment-1108559732

pebrc commented 2 years ago

We released version 2.2 or the ECK operator which contains a fix for #5617 I wonder @tommasopozzetti whether you were able to validate if this is sufficient for your use case or whether the allocation delay still needs to be made configurable as discussed in this issue.

tommasopozzetti commented 2 years ago

Unfortunately at the moment I cannot easily test the new version, but if the timeout is now 5 minutes I am fairly sure that will solve our issues.

I still think there might be value in making it configurable for special scenarios in which the nodes might take longer to rejoin, but I think most use cases will now be covered by the 5 minute delay.

Thank you for fixing this!

elastic / cloud-on-k8s

Make shutdown API allocation_delay configurable and review default value #5582

Bug Report