Open DaveCTurner opened 4 years ago
Pinging @elastic/es-distributed (:Distributed/Allocation)
Was introduced explicitly in #16096 as part of introducing allocation-ids.
The reason this existed was to provide a more balanced primary shard allocation after a full cluster restart (or a failure that brought down all shard copies), as allocation of existing primary shards does not take balancing into consideration. Selecting the same shard copies that had previously been primary to become primary again increases the likelihood that the cluster does not require any rebalancing after restart. The only thing which would otherwise "help" with the balancing is the allocation throttling. I'm not sure how much of an impact this logic actually has in a full cluster restart, but this would be interesting to find out in a test.
Spreading the primaries out in a full cluster restart is something I agree we want, but I think we'd get similar/better results by shuffling the list of in-sync shard copies (e.g. sorting by allocation ID) rather than by sorting on this flag.
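A minimal sketch of that shuffling idea (illustrative only — these allocation IDs are made up, and the real allocator works on richer structures than plain strings):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AllocationIdShuffle {

    // Sketch: pick a preferred primary by sorting the in-sync allocation IDs.
    // Allocation IDs are random UUIDs, so lexicographic order is effectively a
    // deterministic shuffle, spreading primaries across the surviving copies.
    static String preferredPrimary(List<String> inSyncAllocationIds) {
        List<String> sorted = new ArrayList<>(inSyncAllocationIds);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical in-sync IDs for one shard.
        List<String> ids = List.of("n6gzFZTgThmH", "8QvOS1uTZxzW", "LZb2a3dSQzqc");
        System.out.println(preferredPrimary(ids)); // prints 8QvOS1uTZxzW
    }
}
```

Because every node sorts the same ID list the same way, this preference is stable across allocator runs without needing any persisted flag.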
In particular, if you do a rolling restart of a cluster then the resulting cluster will contain at least one node with no primaries, and that's not something I think we should try to preserve across a full cluster restart. Relatedly, if you shut down the nodes one-by-one for a full cluster restart then some shards may come back with multiple copies having this primary flag set, and IIUC we tiebreak between them via the iteration order of a `HashMap<DiscoveryNode, ...>`, which I think will introduce quite a bit of skew.
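For reference, a standalone demonstration of why `HashMap` iteration order makes a poor tiebreaker (a sketch — the real map is keyed by `DiscoveryNode`, not strings):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashMapOrderDemo {
    public static void main(String[] args) {
        Map<String, String> copiesWithPrimaryFlag = new HashMap<>();
        // Insertion order: node-3, node-1, node-2.
        copiesWithPrimaryFlag.put("node-3", "copy-c");
        copiesWithPrimaryFlag.put("node-1", "copy-a");
        copiesWithPrimaryFlag.put("node-2", "copy-b");
        // Iteration order follows hash-bucket placement, not insertion order
        // and not any meaningful ranking, so "first entry wins" is an
        // arbitrary tiebreak rather than a balanced one.
        List<String> iterationOrder = new ArrayList<>(copiesWithPrimaryFlag.keySet());
        System.out.println(iterationOrder);
    }
}
```

The `HashMap` javadoc explicitly makes no guarantee about iteration order, so any balance it happens to produce is accidental.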
We (the @elastic/es-distributed team) discussed this today and agreed to proceed. There were a few concerns around losing today's behaviour, but full cluster restarts are very much the exception today and this flag has no effect in a rolling restart.
Today the primary shards allocator prefers to allocate to the location that previously held the primary (via `NodeGatewayStartedShards::primary`, which is read from the on-disk `ShardStateMetadata`). This means that when a replica is promoted to primary we must update its shard state metadata to record the promotion, and this update happens on the cluster applier thread.

In fact I don't think there's a good reason to prefer the old primary location in the primary shards allocator any more. I propose removing this flag from consideration and dropping the corresponding plumbing so that we don't have to synchronously write this state to disk during cluster state application.
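To make the proposed change concrete, here's a hedged sketch of the preference swap (simplified — `ShardCopy` and both comparators are invented for illustration and don't match the real allocator types):

```java
import java.util.Comparator;
import java.util.List;

public class PrimaryPreference {

    // Invented record standing in for a started-shard report from a node.
    record ShardCopy(String nodeId, String allocationId, boolean wasPrimary) {}

    // Today (sketch): copies that previously held the primary sort first,
    // which requires persisting the flag on every promotion.
    static final Comparator<ShardCopy> CURRENT =
        Comparator.comparing((ShardCopy copy) -> !copy.wasPrimary());

    // Proposed (sketch): ignore the flag entirely; allocation-ID order gives a
    // deterministic pseudo-random spread, and with no flag there is no
    // synchronous shard-state write during cluster state application.
    static final Comparator<ShardCopy> PROPOSED =
        Comparator.comparing(ShardCopy::allocationId);

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(
            new ShardCopy("node-1", "aQ3kTx", false),
            new ShardCopy("node-2", "zUv9Pm", true));
        System.out.println(copies.stream().min(CURRENT).orElseThrow().nodeId());  // node-2
        System.out.println(copies.stream().min(PROPOSED).orElseThrow().nodeId()); // node-1
    }
}
```

The observable difference is only which in-sync copy is preferred; either way the allocator still restricts its choice to in-sync copies, so safety is unaffected.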