elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
856 stars 24.8k forks source link

Start the clock on `cluster.publish.timeout` when cluster state is committed #80400

Open DaveCTurner opened 2 years ago

DaveCTurner commented 2 years ago

Today cluster.publish.timeout starts counting in the call to Coordinator#publish right at the very start of the publication, so it includes the time spent sending the cluster state out to all the nodes and waiting for them to write it to disk and respond. If the timeout elapses before the cluster state is committed then the master node considers itself to have failed and restarts the discovery process. The thinking here is that if it takes more than 30s to even commit the cluster state then the master is unhealthy and needs to be replaced. However I don't think this holds up in practice:

I'm proposing changing this behaviour to wait indefinitely until the publication is committed (or definitely won't commit) and then starting the clock to protect against individual/isolated slow nodes.

I also think we should fail the master if it's much slower to accept the publication than a quorum of the other nodes, although in practice I don't think we see this happen very often.

[^1]: For instance, consider the following five-node setup in which T are tiebreakers, S is painfully slow, and A cannot communicate with either tiebreaker.

```
    T
   ╱ ╲
  ╲   ╲
 ╱     ╲
A―――――――B
│╲     ╱│
│ ╲   ╱ │
│  ╲ ╱  │
─   ╳   │
│  ╱ ╲  │
│ ╱   ╲ │
│╱     ╲│
T       S
```

`A` is a viable master since it is still in contact with a majority of nodes, but one of those is the slow node so it will not be working very well. It would be better to fail over to `B` which is connected to a majority of fast nodes, but it would be difficult for `A` to determine this. The same thing is possible with three nodes:

```
    T
   ╱ ╲
  ╲   ╲
 ╱     ╲
S―――――――B
```

Here even if `S` is egregiously slow it can't tell it's slow because it's disconnected from `T`.
elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 2 years ago

We (the @elastic/es-distributed team) discussed this yesterday and decided to proceed with this idea: we will start the clock on the publish timeout when we've received a quorum of successful publish responses, or we've received sufficiently many failed responses that the remaining nodes cannot form a quorum. When the timeout elapses if the state is committed then the master will ignore any remaining nodes and apply the state locally. If the state is still not committed when the timeout elapses then the master will report itself faulty via the LeaderChecker and will stand down as master.

One slightly tricky bit is that we need to be able to detect the master being slow without needing to acquire Coordinator#mutex since we hold the mutex when writing the state to disk which is a possible location where it would get stuck. We therefore need to track publish responses as they arrive and before the handler acquires the mutex, and perform an independent quorum check to trigger the timeout and feed the LeaderChecker.

We also discussed various situations where the network was pathologically broken, e.g. only affecting specific connections or connections opened in specific directions. We're satisfied that properly-configured TCP keepalives mean that you cannot get stuck because of a broken network, because every request either gets a proper response or an eventual NodeDisconnectedException.

DaveCTurner commented 1 year ago

Another kinda-tricky bit that came up in the context of https://github.com/elastic/elasticsearch/pull/95833 is that a blackholed master will remain in mode LEADER even as the rest of the cluster moves on, which breaks tests. I think it might be enough to adjust org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase.Cluster#getAnyLeader to pick the leader with the greatest term, but there might be other subtleties in this area too.