Open DaveCTurner opened 4 years ago
Pinging @elastic/es-distributed (:Distributed/Allocation)
Was introduced explicitly in #16096 as part of introducing allocation-ids.
The reason this existed was to provide a more balanced primary shard allocation after a full cluster restart (or a failure that brought down all shard copies), as allocation of existing primary shards does not take balancing into consideration. Selecting the same shard copies that had previously been primary to become primary again increases the likelihood that the cluster does not require any rebalancing after restart. The only thing which would otherwise "help" with the balancing is the allocation throttling. I'm not sure how much of an impact this logic actually has in a full cluster restart, but this would be interesting to find out in a test.
Spreading the primaries out in a full cluster restart is something I agree we want, but I think we'd get similar/better results by shuffling the list of in-sync shard copies (e.g. sorting by allocation ID) rather than by sorting on this flag.
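A minimal sketch of that shuffling idea (illustrative only — these allocation IDs are made up, and the real allocator works on richer structures than plain strings):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AllocationIdShuffle {

    // Sketch: pick a preferred primary by sorting the in-sync allocation IDs.
    // Allocation IDs are random UUIDs, so lexicographic order is effectively a
    // deterministic shuffle, spreading primaries across the surviving copies.
    static String preferredPrimary(List<String> inSyncAllocationIds) {
        List<String> sorted = new ArrayList<>(inSyncAllocationIds);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical in-sync IDs for one shard.
        List<String> ids = List.of("n6gzFZTgThmH", "8QvOS1uTZxzW", "LZb2a3dSQzqc");
        System.out.println(preferredPrimary(ids)); // prints 8QvOS1uTZxzW
    }
}
```

Because every node sorts the same ID list the same way, this preference is stable across allocator runs without needing any persisted flag.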
In particular, if you do a rolling restart of a cluster then the resulting cluster will contain at least one node with no primaries, and that's not something I think we should try to preserve across a full cluster restart. Relatedly, if you shut down the nodes one-by-one for a full cluster restart then some shards may come back with multiple copies having this primary flag set, and IIUC we tiebreak between them via the iteration order of a `HashMap<DiscoveryNode, ...>`, which I think will introduce quite a bit of skew.
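For reference, a standalone demonstration of why `HashMap` iteration order makes a poor tiebreaker (a sketch — the real map is keyed by `DiscoveryNode`, not strings):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashMapOrderDemo {
    public static void main(String[] args) {
        Map<String, String> copiesWithPrimaryFlag = new HashMap<>();
        // Insertion order: node-3, node-1, node-2.
        copiesWithPrimaryFlag.put("node-3", "copy-c");
        copiesWithPrimaryFlag.put("node-1", "copy-a");
        copiesWithPrimaryFlag.put("node-2", "copy-b");
        // Iteration order follows hash-bucket placement, not insertion order
        // and not any meaningful ranking, so "first entry wins" is an
        // arbitrary tiebreak rather than a balanced one.
        List<String> iterationOrder = new ArrayList<>(copiesWithPrimaryFlag.keySet());
        System.out.println(iterationOrder);
    }
}
```

The `HashMap` javadoc explicitly makes no guarantee about iteration order, so any balance it happens to produce is accidental.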
We (the @elastic/es-distributed team) discussed this today and agreed to proceed. There were a few concerns around losing today's behaviour, but full cluster restarts are very much the exception today and this flag has no effect in a rolling restart.
Today the primary shards allocator prefers to allocate to the location that previously held the primary (via `NodeGatewayStartedShards::primary`, which is read from the on-disk `ShardStateMetadata`). This means that when a replica is promoted to primary we must update its shard state metadata to record the promotion, and this update happens on the cluster applier thread.

In fact I don't think there's a good reason to prefer the old primary location in the primary shards allocator any more. I propose removing this flag from consideration and dropping the corresponding plumbing so that we don't have to synchronously write this state to disk during cluster state application.
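To make the proposed change concrete, here's a hedged sketch of the preference swap (simplified — `ShardCopy` and both comparators are invented for illustration and don't match the real allocator types):

```java
import java.util.Comparator;
import java.util.List;

public class PrimaryPreference {

    // Invented record standing in for a started-shard report from a node.
    record ShardCopy(String nodeId, String allocationId, boolean wasPrimary) {}

    // Today (sketch): copies that previously held the primary sort first,
    // which requires persisting the flag on every promotion.
    static final Comparator<ShardCopy> CURRENT =
        Comparator.comparing((ShardCopy copy) -> !copy.wasPrimary());

    // Proposed (sketch): ignore the flag entirely; allocation-ID order gives a
    // deterministic pseudo-random spread, and with no flag there is no
    // synchronous shard-state write during cluster state application.
    static final Comparator<ShardCopy> PROPOSED =
        Comparator.comparing(ShardCopy::allocationId);

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(
            new ShardCopy("node-1", "aQ3kTx", false),
            new ShardCopy("node-2", "zUv9Pm", true));
        System.out.println(copies.stream().min(CURRENT).orElseThrow().nodeId());  // node-2
        System.out.println(copies.stream().min(PROPOSED).orElseThrow().nodeId()); // node-1
    }
}
```

The observable difference is only which in-sync copy is preferred; either way the allocator still restricts its choice to in-sync copies, so safety is unaffected.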