DaveCTurner opened this issue 2 years ago
Pinging @elastic/es-distributed (Team:Distributed)
We (the @elastic/es-distributed team) discussed this yesterday and decided to proceed with this idea: we will start the clock on the publish timeout when we've received a quorum of successful publish responses, or when we've received sufficiently many failed responses that the remaining nodes cannot form a quorum. When the timeout elapses, if the state is committed then the master will ignore any remaining nodes and apply the state locally. If the state is still not committed when the timeout elapses then the master will report itself faulty via the `LeaderChecker` and will stand down as master.
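As a rough illustration, here's a minimal sketch of that decision in Java. The class and hook names (`schedulePublishTimeout`, `applyStateLocallyIgnoringLaggards`, `reportSelfFaultyAndStandDown`) are made up for the example and are not the real `Coordinator` API:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class PublishTimeoutSketch {
    private final int votingNodes;                          // size of the voting configuration
    private final AtomicInteger successes = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private final AtomicBoolean clockStarted = new AtomicBoolean();

    PublishTimeoutSketch(int votingNodes) {
        this.votingNodes = votingNodes;
    }

    private int quorum() {
        return votingNodes / 2 + 1;
    }

    void onPublishResponse(boolean success) {
        final int succeeded = success ? successes.incrementAndGet() : successes.get();
        final int failed = success ? failures.get() : failures.incrementAndGet();
        // Start the clock only once the outcome is decided: either a quorum has acked,
        // or so many nodes have failed that the remaining ones cannot form a quorum.
        final boolean decided = succeeded >= quorum() || votingNodes - failed < quorum();
        if (decided && clockStarted.compareAndSet(false, true)) {
            schedulePublishTimeout();
        }
    }

    void onPublishTimeout(boolean committed) {
        if (committed) {
            applyStateLocallyIgnoringLaggards();            // committed: laggards can catch up later
        } else {
            reportSelfFaultyAndStandDown();                 // not committed in time: give up leadership
        }
    }

    // Placeholders for the real scheduling / apply / fault-reporting hooks.
    void schedulePublishTimeout() {}
    void applyStateLocallyIgnoringLaggards() {}
    void reportSelfFaultyAndStandDown() {}
}
```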
One slightly tricky bit is that we need to be able to detect the master being slow without needing to acquire `Coordinator#mutex`, since we hold the mutex when writing the state to disk, which is a possible place for it to get stuck. We therefore need to track publish responses as they arrive, before the handler acquires the mutex, and perform an independent quorum check to trigger the timeout and feed the `LeaderChecker`.
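Roughly, the ordering would look like the following sketch, with made-up names standing in for the real handler and `Coordinator#mutex`; the point is that the response is counted and the quorum check runs before the handler can block on the mutex:

```java
import java.util.concurrent.atomic.AtomicInteger;

class PublishResponseHandlerSketch {
    private final Object mutex = new Object();              // stands in for Coordinator#mutex
    private final AtomicInteger successes = new AtomicInteger();
    private final int quorumSize;

    PublishResponseHandlerSketch(int quorumSize) {
        this.quorumSize = quorumSize;
    }

    void handlePublishResponse(boolean success) {
        // 1. Record the response lock-free, before going anywhere near the mutex, so the
        //    quorum check still runs even if the mutex is held by a slow disk write.
        if (success && successes.incrementAndGet() == quorumSize) {
            startPublishTimeoutClock();                      // this is also what would feed the LeaderChecker
        }

        // 2. Only then do the usual handling, which needs the mutex and may block.
        synchronized (mutex) {
            processResponseUnderMutex(success);
        }
    }

    void startPublishTimeoutClock() {}
    void processResponseUnderMutex(boolean success) {}
}
```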
We also discussed various situations where the network was pathologically broken, e.g. only affecting specific connections, or connections opened in specific directions. We're satisfied that properly-configured TCP keepalives mean that you cannot get stuck because of a broken network, because every request either gets a proper response or an eventual `NodeDisconnectedException`.
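For reference, a hedged illustration of what "properly configured" keepalives can look like at the socket level; the numbers here are arbitrary examples rather than Elasticsearch's transport defaults, and the `jdk.net.ExtendedSocketOptions` options are platform-dependent (Linux/macOS):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import jdk.net.ExtendedSocketOptions;

class KeepaliveSketch {
    // Opens a connection whose keepalive probes detect a dead peer in roughly
    // 10s + 3 * 2s, i.e. well inside the 30s publish timeout.
    static SocketChannel openWithKeepalives(String host, int port) throws IOException {
        final SocketChannel channel = SocketChannel.open();
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 10);      // start probing after 10s idle
        channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 2);   // probe every 2s
        channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 3);      // fail after 3 missed probes
        channel.connect(new InetSocketAddress(host, port));
        return channel;
    }
}
```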
Another kinda-tricky bit that came up in the context of https://github.com/elastic/elasticsearch/pull/95833 is that a blackholed master will remain in mode `LEADER` even as the rest of the cluster moves on, which breaks tests. I think it might be enough to adjust `org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase.Cluster#getAnyLeader` to pick the leader with the greatest term, but there might be other subtleties in this area too.
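Something along these lines, sketched with simplified types rather than the real `AbstractCoordinatorTestCase` harness:

```java
import java.util.List;
import java.util.Random;

class LeaderSelectionSketch {
    interface Node {
        boolean isLeaderMode();     // mode == LEADER
        long getCurrentTerm();
    }

    // Prefer leaders at the greatest term, so a blackholed node still claiming
    // leadership in a stale term is never picked as "the" leader.
    static Node getAnyLeader(List<Node> nodes, Random random) {
        final long maxLeaderTerm = nodes.stream()
            .filter(Node::isLeaderMode)
            .mapToLong(Node::getCurrentTerm)
            .max()
            .orElseThrow(() -> new AssertionError("no leader"));
        final List<Node> candidates = nodes.stream()
            .filter(n -> n.isLeaderMode() && n.getCurrentTerm() == maxLeaderTerm)
            .toList();
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```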
Today `cluster.publish.timeout` starts counting in the call to `Coordinator#publish` right at the very start of the publication, so it includes the time spent sending the cluster state out to all the nodes and waiting for them to write it to disk and respond. If the timeout elapses before the cluster state is committed then the master node considers itself to have failed and restarts the discovery process. The thinking here is that if it takes more than 30s to even commit the cluster state then the master is unhealthy and needs to be replaced. However I don't think this holds up in practice:

- We can hit this timeout if the cluster state is large enough, especially when restarting a cluster. This results in a timeout-and-retry loop that prevents the cluster from forming. A little more patience would get us past the initial large and slow publication and into diff-based publishing, and the cluster would then be fine.
- We can hit this timeout if there's a network partition and there is no reachable majority from the master. However we should be relying on the OS detecting this case and failing the affected connections - it can do this much more quickly than 30s if properly configured.
- We can hit this timeout if a majority of the voting nodes are writing their state to disk egregiously slowly. Failing and retrying doesn't help in this case, it just piles more work on these slow nodes, and this'll affect any other master too.
- The timeout doesn't protect against the master itself being slow to write the state to disk even if a majority of the other nodes are fast, but we should fail over to a different master in this case (relates #56707).
- There are admittedly some combinations of slow/faulty nodes and network partitions in which failing over to a new master is the appropriate response[^1], but they're rather improbable.
I'm proposing changing this behaviour to wait indefinitely until the publication is committed (or definitely won't commit) and then starting the clock to protect against individual/isolated slow nodes.
I also think we should fail the master if it's much slower to accept the publication than a quorum of the other nodes, although in practice I don't think we see this happen very often.
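A sketch of what such a check could look like, with a hypothetical lag threshold and made-up callback names (not anything that exists in the codebase today):

```java
class SlowMasterCheckSketch {
    private final long lagThresholdMillis;                  // hypothetical threshold, e.g. the publish timeout
    private volatile long quorumAcceptedAtMillis = -1;

    SlowMasterCheckSketch(long lagThresholdMillis) {
        this.lagThresholdMillis = lagThresholdMillis;
    }

    // Called when a quorum of the other nodes has accepted the publication.
    void onQuorumOfOthersAccepted(long nowMillis) {
        quorumAcceptedAtMillis = nowMillis;
    }

    // Called when the master finishes accepting (persisting) the publication itself.
    void onLocalAcceptanceCompleted(long nowMillis) {
        final long quorumAt = quorumAcceptedAtMillis;
        if (quorumAt >= 0 && nowMillis - quorumAt > lagThresholdMillis) {
            reportSelfFaulty();                              // the master was much slower than the quorum
        }
    }

    void reportSelfFaulty() {}
}
```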
[^1]: For instance, consider the following five-node setup in which T are tiebreakers, S is painfully slow, and A cannot communicate with either tiebreaker.