Provide clarifications on voting configuration changes timeouts

Description

Several docs pages cover automatic voting configuration changes, but they all boil down to this statement:

After a node has joined or left the cluster the elected master must issue a cluster-state update that adjusts the voting configuration to match, and this can take a short time to complete. It is important to wait for this adjustment to complete before removing more nodes from the cluster.

The docs give essentially no guidance on how an end user can tell when it is safe to proceed. Waiting for a fixed timeout is not enough either. In the naive scenario, the end user removes a node and expects it to be pulled out of the configuration automatically. But if a master election runs into trouble at exactly that moment, or the masters are stuck in a tight GC loop because of their memory configuration, or some other disaster strikes, the actual removal of the node from the cluster is delayed until the cluster has re-formed.

Pulling out a node does not necessarily happen in a healthy environment; it may be part of disaster recovery. As a concrete example, imagine that ES was deployed on VMs that have become unhealthy on their own, and the end user needs to recycle all the masters one by one to spin up fresh, unaffected VMs.

I assume it is possible to watch the cluster configuration, but doing that in an automated way requires some tooling, and doing it manually requires some insight. Neither approach is apparent from the corresponding documentation.

So this is a request for a more thorough explanation of the processes under the hood, and for clarification on how a user can detect that the automatic cluster-state change has kicked in so that it is safe to proceed.
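To illustrate the kind of check I have in mind, here is a minimal Python sketch that polls the cluster state until a given node ID no longer appears in the last committed voting configuration. It assumes the `GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config` request behaves as I understand it, returning a list of node IDs. The function names, the retry counts, and the base URL are placeholders of mine, not anything taken from the docs. As far as I understand, the configuration is also not shrunk below three nodes by default, so in a small cluster the removed node may stay listed and this loop would simply time out.

```python
import json
import time
import urllib.request

# Limits the cluster-state response to the committed voting configuration.
FILTER = "metadata.cluster_coordination.last_committed_config"


def extract_committed_config(state):
    """Pull the list of voting node IDs out of a (filtered) cluster-state dict."""
    coordination = state.get("metadata", {}).get("cluster_coordination", {})
    return coordination.get("last_committed_config", [])


def fetch_committed_config(base_url):
    """Fetch the last committed voting configuration from one node.

    base_url is a placeholder such as "http://localhost:9200".
    """
    url = f"{base_url}/_cluster/state?filter_path={FILTER}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_committed_config(json.load(resp))


def wait_until_removed(node_id, fetch, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until node_id is absent from a non-empty voting configuration.

    Returns True once the node is gone, False if attempts run out.
    Errors (no elected master, node unreachable) count as "not yet",
    so a struggling cluster makes this wait rather than succeed early.
    """
    for _ in range(attempts):
        try:
            config = fetch()
        except OSError:
            config = None  # includes HTTP errors and timeouts from urllib
        # An empty or missing configuration is treated as unknown, not as removed.
        if config and node_id not in config:
            return True
        sleep(delay)
    return False
```

A call could look like `wait_until_removed(node_id, lambda: fetch_committed_config("http://localhost:9200"))`, where `node_id` is the ID of the node that was shut down. Unlike a fixed timeout, this only reports success after a master has actually committed a configuration without the node.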

elastic / elasticsearch

Provide clarifications on voting configuration changes timeouts #99230