Open idegtiarenko opened 1 day ago
Pinging @elastic/es-distributed (Team:Distributed)
> write alias is updated immediately after index is created without waiting for assigning the shards
In addition to what has been suggested here, I also wonder whether the write alias should be updated only after the new shard comes online. This could help avoid potential downtime even if we implement what is suggested here. It is probably not an easy change, but it feels useful to at least have an assessment.
I think the 2-phase rollover described in ES-8377 is pretty much what I have suggested above.
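To make the ordering suggested above concrete, here is a minimal sketch of "create the index first, move the write alias only once the new primary is active". It does not use any Elasticsearch classes: `TwoPhaseRolloverSketch`, `rollover`, `primaryActive`, and `maxPolls` are hypothetical names, and the alias is modelled as a plain field. A real implementation would presumably react to cluster state updates rather than poll.

```java
import java.util.function.BooleanSupplier;

// Hypothetical in-memory model of a write alias; not Elasticsearch code.
class TwoPhaseRolloverSketch {
    // Name of the index the write alias currently points at.
    String writeIndex;

    TwoPhaseRolloverSketch(String initialWriteIndex) {
        this.writeIndex = initialWriteIndex;
    }

    /**
     * Phase 1 (assumed to have happened already): the new index is created.
     * Phase 2: the write alias is moved only after the new primary reports active.
     * Returns true if the alias was moved, false if the primary never became
     * active within maxPolls checks (the alias then stays on the old index).
     */
    boolean rollover(String newIndex, BooleanSupplier primaryActive, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (primaryActive.getAsBoolean()) {
                writeIndex = newIndex; // safe to accept writes on the new index
                return true;
            }
        }
        return false; // keep ingesting into the old index
    }
}
```

With this ordering, a delayed shard assignment postpones the rollover instead of blocking ingestion, at the cost of the old index growing past its rollover condition for a while.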
Today, new unassigned primary shards only change cluster health to yellow (as they do not affect the availability of any existing data):
https://github.com/elastic/elasticsearch/blob/a59c182f9f7e9d1bf3d6eecbc0e44f24ff91d053/server/src/main/java/org/elasticsearch/cluster/health/ClusterShardHealth.java#L186-L202
In certain situations (such as a long desired balance computation, or when the reroute computation is delayed by other pending tasks), assignment of a new primary shard could be delayed for tens of seconds or even minutes. This could affect data ingestion when it happens during an ILM rollover (as the write alias is updated immediately after the index is created, without waiting for its shards to be assigned).
We should degrade the cluster health to RED when a new primary cannot be assigned within a reasonably short interval of time, to make such situations easier to detect.
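As a rough illustration of the proposed rule, the sketch below keeps a newly created, unassigned primary at YELLOW during a grace period and escalates to RED once that period is exceeded. It is a standalone model, not the logic in `ClusterShardHealth`: the class name, the `newPrimaryHealth` method, and the 30-second `NEW_PRIMARY_GRACE` threshold are all assumptions chosen for the example.

```java
import java.time.Duration;
import java.time.Instant;

// Standalone model of the proposal; does not use Elasticsearch classes.
class NewPrimaryHealthSketch {
    enum Health { GREEN, YELLOW, RED }

    // Hypothetical grace period; the issue only asks for a "reasonably short interval".
    static final Duration NEW_PRIMARY_GRACE = Duration.ofSeconds(30);

    // Health contribution of a primary shard belonging to a newly created index.
    static Health newPrimaryHealth(boolean assigned, Instant unassignedSince, Instant now) {
        if (assigned) {
            return Health.GREEN;
        }
        Duration waited = Duration.between(unassignedSince, now);
        // Within the grace period the shard stays YELLOW, as it does today;
        // past it, the delay is surfaced as RED.
        return waited.compareTo(NEW_PRIMARY_GRACE) > 0 ? Health.RED : Health.YELLOW;
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(newPrimaryHealth(false, created, created.plusSeconds(5)));   // YELLOW
        System.out.println(newPrimaryHealth(false, created, created.plusSeconds(120))); // RED
    }
}
```

The threshold would likely need to be configurable, since a value that is too short could turn routine index creation on a busy cluster into RED alerts.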