elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Removing the "transient settings" feature #49540

Open ywelsch opened 4 years ago

ywelsch commented 4 years ago

This is a call for feedback. We would like to find out what the use cases for transient settings are, and whether we can address them in other ways instead, allowing us to remove the transient cluster settings feature and simplify our story around settings (and precedence order).

While Elasticsearch's configurability is very powerful, it can also lead to confusion and sometimes misconfigurations. Typical configuration options include:

We would like to simplify the story on how to configure Elasticsearch. The proposal here is to remove transient cluster settings. These are rarely used, often cause confusion when used in combination with persistent cluster settings, and make our APIs unnecessarily complex.
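To make the precedence problem concrete, here is a minimal Python sketch of how the effective value of a single setting is resolved. It assumes the documented order (transient, then persistent, then elasticsearch.yml, then the built-in default). The `resolve` function and the dict-based layers are hypothetical illustrations, not Elasticsearch code.

```python
from typing import Any, Mapping, Optional, Tuple

# Hypothetical model: each configuration layer is a plain dict of setting -> value.
# Layers are listed in precedence order, highest first.
LAYER_ORDER = ("transient", "persistent", "elasticsearch.yml", "default")


def resolve(key: str, layers: Mapping[str, Mapping[str, Any]]) -> Tuple[Optional[str], Any]:
    """Return (layer_name, value) from the first layer that defines `key`."""
    for name in LAYER_ORDER:
        layer = layers.get(name, {})
        if key in layer:
            return name, layer[key]
    return None, None


layers = {
    "transient": {"cluster.routing.allocation.enable": "none"},
    "persistent": {"cluster.routing.allocation.enable": "primaries"},
    "default": {"cluster.routing.allocation.enable": "all"},
}

# The transient value silently shadows the persistent one.
print(resolve("cluster.routing.allocation.enable", layers))  # ('transient', 'none')

# Without the transient layer, only persistent > elasticsearch.yml > default remains.
layers_without_transient = {k: v for k, v in layers.items() if k != "transient"}
print(resolve("cluster.routing.allocation.enable", layers_without_transient))  # ('persistent', 'primaries')
```

Dropping one layer removes exactly the case where two cluster-level values for the same key coexist and a user has to know which one wins.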

Transient cluster settings

Please raise any scenarios that you feel can currently only be solved by transient settings, so that we can look into alternative ways to address them.

elasticmachine commented 4 years ago

Pinging @elastic/es-core-infra (:Core/Infra/Settings)

pugnascotia commented 4 years ago

I've used transient settings when fiddling with a Cloud cluster, e.g. log settings or recovery settings. But I always put those settings back the way they were anyway.

I'm also reminded that we've made it possible to do a rolling upgrade between major versions, so it's plausible that the only time a user performs a full cluster restart is when they've run out of alternatives to fix an issue, or when disaster has struck and brought down the cluster. Both scenarios seem like a really bad time for Elasticsearch to suddenly reconfigure itself.

I vote for removal.

eedugon commented 4 years ago

As a note, if we decide to get rid of the transient level, we need to be aware of, and warn about, settings that might cause problems during a full cluster restart, for example "cluster.routing.allocation.enable": "none".

It's true that our docs for rolling restarts suggest the value "primaries", which is totally safe, but in the past it was very common to set it to "none" during rolling upgrades. In that case it was better (in my view) to set it at the transient level, just in case a full cluster restart was needed due to an unexpected issue during the upgrade or rolling restart.
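The two levels differ only in which top-level key the setting goes under in a `PUT _cluster/settings` request body. The Python sketch below only builds and prints those JSON bodies (it does not contact a cluster); the helper `allocation_body` is hypothetical. A value of `null` (Python `None`) removes the setting from that level.

```python
import json
from typing import Optional

SETTING = "cluster.routing.allocation.enable"


def allocation_body(level: str, value: Optional[str]) -> dict:
    """Build a request body for PUT _cluster/settings (hypothetical helper).

    `level` must be "persistent" or "transient". A value of None serializes
    to JSON null, which resets the setting at that level.
    """
    if level not in ("persistent", "transient"):
        raise ValueError(f"unknown settings level: {level}")
    return {level: {SETTING: value}}


# What the rolling-restart docs suggest: primaries can still be allocated.
print(json.dumps(allocation_body("persistent", "primaries")))
# The older habit: block all allocation, but only at the transient level.
print(json.dumps(allocation_body("transient", "none")))
# After the upgrade: reset the transient value so the next layer applies again.
print(json.dumps(allocation_body("transient", None)))
```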

A full restart with allocation enable set to "none" at the persistent level will be tricky: the .security index, for example, won't be allocated, and we lose access to the cluster via the native realm, even to apply the fix itself.
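A toy model of the scenario above: on a full cluster restart, transient settings are discarded while persistent ones survive. The `ClusterSettings` class and `full_restart` function are hypothetical; the point is only which value of `cluster.routing.allocation.enable` is in effect when the nodes come back.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ClusterSettings:
    # Hypothetical stand-in for the cluster-level settings store.
    persistent: dict = field(default_factory=dict)
    transient: dict = field(default_factory=dict)

    def effective(self, key: str, default: str) -> str:
        # Transient overrides persistent; otherwise use the default.
        return self.transient.get(key, self.persistent.get(key, default))


def full_restart(settings: ClusterSettings) -> ClusterSettings:
    # Model: transient settings do not survive a full cluster restart.
    return replace(settings, transient={})


KEY = "cluster.routing.allocation.enable"

set_transient = ClusterSettings(transient={KEY: "none"})
set_persistent = ClusterSettings(persistent={KEY: "none"})

# Transient "none": allocation falls back to the default after the restart.
print(full_restart(set_transient).effective(KEY, "all"))   # all
# Persistent "none": allocation stays blocked after the restart, so shards
# (including those of the .security index) remain unassigned.
print(full_restart(set_persistent).effective(KEY, "all"))  # none
```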

The other added benefit is visibility of the value to fall back to as soon as the transient setting is removed, but I don't think this benefit is worth the entire feature. If we remove it, we will simply fall back from the cluster setting to the default (or to the value from elasticsearch.yml, which is also available via the API).

I agree with @pugnascotia that removing the feature will improve predictability a lot, since an unexpected loss of transient settings during any given activity produces a lot of uncertainty about the results.

dakrone commented 4 years ago

One thing that transient settings are useful for is having the settings not restored from a snapshot. For example, if a user sets transient allocation settings, when restoring a snapshot with include_global_state those settings are not restored (as opposed to persistent settings where a user may need to explicitly unset settings when restoring or else shards may not be allocated).
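The snapshot interaction can be modeled the same way. The body `{"include_global_state": true}` is what the snapshot restore API (`POST _snapshot/<repository>/<snapshot>/_restore`) accepts; the `snapshot_global_state` and `restore_global_state` functions are a hypothetical, simplified model of the behavior described above, in which only persistent cluster settings are captured and restored.

```python
import json


def snapshot_global_state(persistent: dict, transient: dict) -> dict:
    # Hypothetical model: only persistent cluster settings are captured;
    # the `transient` argument is accepted but intentionally discarded.
    return {"persistent": dict(persistent)}


def restore_global_state(snapshot: dict, current_transient: dict) -> dict:
    # Hypothetical model: persistent settings come from the snapshot,
    # the target cluster's transient settings are left untouched.
    return {"persistent": dict(snapshot["persistent"]),
            "transient": dict(current_transient)}


KEY = "cluster.routing.allocation.enable"

# Snapshots taken while allocation was disabled at each level.
snap_persistent = snapshot_global_state(persistent={KEY: "none"}, transient={})
snap_transient = snapshot_global_state(persistent={}, transient={KEY: "none"})

# Request body for the restore call.
print(json.dumps({"include_global_state": True}))

after_a = restore_global_state(snap_persistent, current_transient={})
after_b = restore_global_state(snap_transient, current_transient={})
print(after_a["persistent"].get(KEY))  # none: must be unset explicitly
print(after_b["persistent"].get(KEY))  # None: nothing restored
```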

I do think the benefits of removing transient settings outweigh keeping them.

gwbrown commented 3 years ago

I'm labeling this team-discuss to finalize whether and when we should do this.

This is a breaking change, so we would need to do it in a major version. As the next major version is on the horizon, if we're going to do this, we should probably do it sooner rather than later. Otherwise, this will get put off until the next-next major, substantially further down the road.

colings86 commented 3 years ago

We discussed this in the FixIt meeting and decided that we would like to remove transient settings, as we cannot determine a use case where we would recommend using them over persistent settings, particularly since they can behave in unexpected ways for users.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-core-infra (Team:Core/Infra)