elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

The cluster cannot complete the master election, resulting in the cluster being unavailable #98185

Closed: weizijun closed this issue 1 year ago

weizijun commented 1 year ago

Elasticsearch Version

master

Installed Plugins

No response

Java Version

bundled

OS Version

all

Problem Description

When the cluster has a lot of indices, it may become impossible to elect a master. The problem is that the time spent on the 'elected-as-master' task plus the time spent computing the cluster state diff exceeds the cluster.election.initial_timeout setting (default 100ms). As a result, after a node wins the election, publication times out and the node moves on to the election for the next term. This process repeats over and over, and a master is never elected.

The default value of cluster.election.initial_timeout is 100ms, and many users are not aware of this behaviour. Could the default be raised to 1s or 2s? This might lengthen elections slightly, but should have little impact on users. Are there any other side effects of increasing this parameter?
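For reference, a minimal sketch of raising this timeout, assuming the static election settings in elasticsearch.yml (the values here are illustrative, not a recommendation):

# elasticsearch.yml (static, per master-eligible node; requires a restart)
# Upper bound on the random delay before the first election attempt (default 100ms)
cluster.election.initial_timeout: 2s
# Related knobs visible in the log lines below, shown with their defaults
cluster.election.back_off_time: 100ms
cluster.election.max_timeout: 10s

These are node-level static settings, so they cannot be changed through the cluster settings API at runtime.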

Steps to Reproduce

The reproduction cluster had 16 nodes, each acting as both a data node and a master-eligible node, with 5000 empty indices whose mappings were roughly 300 lines each. Kill the current master, and the cluster can end up unable to elect a new master.

Logs (if relevant)

scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=29, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=555, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=2, maxDelayMillis=300, delayMillis=670, ElectionScheduler{attempt=3, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=10, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=631, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=37, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=644, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
DaveCTurner commented 1 year ago

I think this duplicates https://github.com/elastic/elasticsearch/issues/97909 so I am closing this. It's strange, I'm pretty sure this problem has gone unnoticed for over 4 years, and then two of us have reported it within a few days of each other.

The preferred workaround is not to have so many master-eligible nodes. See these docs for more information:

However, it is good practice to limit the number of master-eligible nodes in the cluster to three. Master nodes do not scale like other node types since the cluster always elects just one of them as the master of the cluster. If there are too many master-eligible nodes then master elections may take a longer time to complete.

weizijun commented 1 year ago

The preferred workaround is not to have so many master-eligible nodes. See these docs for more information:

The case can also be reproduced with three masters; I used more masters to make it easier to reproduce.

DaveCTurner commented 1 year ago

Yes, I expect this could happen even with three masters, if the cluster state is very large and/or there's some other performance problem making the election process unreasonably slow. We haven't seen that happen in practice even on some of our very large clusters. Marking one of the three masters as voting-only should help too, but as a last resort you can try increasing cluster.election.initial_timeout.
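As a sketch of the voting-only suggestion, assuming a node.roles-based configuration in elasticsearch.yml (the role names are the standard ones; treat the exact layout as illustrative):

# elasticsearch.yml on the master-eligible node that should never be elected
# A voting_only master-eligible node participates in elections but is never
# elected itself, so it cannot get stuck in the publish-timeout loop described above.
node.roles: [ master, voting_only, data ]

The other two master-eligible nodes keep node.roles: [ master, data ] (or just [ master ] if they are dedicated masters).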

DaveCTurner commented 1 year ago

Yes, I expect this could happen even with three masters, if the cluster state is very large and/or there's some other performance problem making the election process unreasonably slow.

FWIW I am struggling to make this kind of election collision happen repeatedly with just three masters. The cluster state size doesn't matter, because the newly-elected leader suppresses other election attempts with lightweight follower checks, sent as soon as it is elected. As long as the follower checks are not delayed, the cluster stabilises pretty reliably. You have to be extremely unlucky, over and over again, for this not to happen.

In fact even with 16 masters the follower checks seem to prevent this kind of problem rather reliably. It's definitely theoretically possible for it to take a long time to stabilise, but in practice this hardly ever happens. I wonder if there's something else wrong in your setup.