elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Strict validation for cluster.routing.allocation.balance.threshold can lead to snapshot restore failure #116558

Open ywangd opened 2 days ago

ywangd commented 2 days ago

We enhanced the validation for cluster.routing.allocation.balance.threshold in https://github.com/elastic/elasticsearch/pull/115831 so that it no longer accepts values lower than 1.0 in v9.0+. If a snapshot is taken in an old cluster where the setting has an invalid value, i.e. in the range [0.0, 1.0), the snapshot cannot be restored in a new cluster, and the restore fails with the following exception:

[2024-11-08T04:54:27,351][WARN ][o.e.s.RestoreService     ] [test-cluster-0] [repo:old_snap/ZwUuFTIlQ1-qeTsp8Cehcg] failed to restore snapshot
java.lang.IllegalArgumentException: illegal value can't update [cluster.routing.allocation.balance.threshold] from [1.0] to [0.999]
    at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.settings.Setting$Updater.getValue(Setting.java:1304)
    at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.common.settings.AbstractScopedSettings.validateUpdate(AbstractScopedSettings.java:139)
    at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.snapshots.RestoreService$RestoreSnapshotStateTask.applyGlobalStateRestore(RestoreService.java:1546)
    at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.snapshots.RestoreService$RestoreSnapshotStateTask.execute(RestoreService.java:1477)
    at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.cluster.service.MasterService$UnbatchedExecutor.execute(MasterService.java:573)

We should reconsider the strict validation or make it possible for restore to ignore invalid cluster settings.
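
To make the two options concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not Elasticsearch code: validate_threshold, restore_settings and the ignore_invalid flag are hypothetical names invented for illustration, and the only facts taken from this issue are the setting key and the 1.0 lower bound.

```python
# Hypothetical model of the restore behaviour; not the Elasticsearch implementation.
BALANCE_THRESHOLD = "cluster.routing.allocation.balance.threshold"
MIN_THRESHOLD = 1.0  # lower bound enforced in v9.0+, per this issue


def validate_threshold(value: float) -> float:
    """Strict validation: reject values below the lower bound."""
    if value < MIN_THRESHOLD:
        raise ValueError(
            f"illegal value for [{BALANCE_THRESHOLD}]: {value} < {MIN_THRESHOLD}"
        )
    return value


def restore_settings(snapshot_settings: dict, ignore_invalid: bool = False) -> dict:
    """Apply settings from a snapshot, optionally skipping invalid ones."""
    applied = {}
    for key, value in snapshot_settings.items():
        if key == BALANCE_THRESHOLD:
            try:
                value = validate_threshold(float(value))
            except ValueError:
                if not ignore_invalid:
                    raise  # strict mode: the whole restore fails
                continue  # lenient mode: drop the invalid setting
        applied[key] = value
    return applied
```

With ignore_invalid left at False, a snapshot carrying 0.999 fails the whole restore, which matches the exception above. With ignore_invalid set to True, the invalid value is dropped and the remaining settings are applied.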

Relates: #115831
Relates: #116460

elasticsearchmachine commented 2 days ago

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

DaveCTurner commented 18 hours ago

IMO this is the expected and desired behaviour. It'd be the same if you tried to restore a 7.x index in 9.0 (at least, absent an enterprise license which permits some amount of extra bwc). Instead, you need to restore into an 8.x cluster and fix up everything that needs fixing before upgrading.
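
As a rough illustration of the "fix up before upgrading" step for this particular setting, the sketch below scans a cluster settings response and builds an update body that resets any too-low threshold. It assumes the input has the shape returned by GET _cluster/settings?flat_settings=true (top-level "persistent" and "transient" maps with flat keys); threshold_fixups is a hypothetical helper, not part of any Elasticsearch client.

```python
BALANCE_THRESHOLD = "cluster.routing.allocation.balance.threshold"


def threshold_fixups(cluster_settings: dict, minimum: float = 1.0) -> dict:
    """Build a settings-update body that resets any threshold below `minimum`.

    `cluster_settings` is assumed to look like the flat-settings response of
    GET _cluster/settings: {"persistent": {...}, "transient": {...}}.
    """
    body = {}
    for scope in ("persistent", "transient"):
        raw = cluster_settings.get(scope, {}).get(BALANCE_THRESHOLD)
        if raw is not None and float(raw) < minimum:
            # None serializes to JSON null, which resets the setting to its default.
            body.setdefault(scope, {})[BALANCE_THRESHOLD] = None
    return body
```

An empty result means nothing needs fixing. A non-empty result can be serialized to JSON and sent as the body of PUT _cluster/settings on the 8.x cluster before taking the snapshot or upgrading.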

ywangd commented 10 hours ago

Thanks David. In this case, do you think we should just close this issue? I guess this is similar to deprecating and eventually removing a setting, which needs to go through 18 months or one major version, whichever is longer. The deprecation for this setting (#92100) was released in 8.7.0 on Mar 30, 2023 (Jan 26, 2023 if we consider 8.6.1). This, combined with the 9.0 release, should be sufficient to justify the strict validation. Does this sound right?
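
The 18-month arithmetic can be checked with a short Python sketch. The two release dates are the ones quoted above; the reference date is the timestamp of the log line in this issue, used here only as a stand-in for "now", and add_months is a hypothetical helper.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return `start` shifted by `months`, clamping the day to the month's length."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Release dates quoted in the comment above.
DEPRECATION_RELEASES = {"8.7.0": date(2023, 3, 30), "8.6.1": date(2023, 1, 26)}
REFERENCE_DATE = date(2024, 11, 8)  # date of the log line in this issue

for version, released in DEPRECATION_RELEASES.items():
    window_end = add_months(released, 18)
    print(version, window_end.isoformat(), window_end <= REFERENCE_DATE)
```

Both 18-month windows (ending 2024-09-30 and 2024-07-26) have already elapsed as of the log date, so the time-based half of the rule is met either way; the major-version half is covered by the change landing in 9.0.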