Open fcofdez opened 2 years ago
Pinging @elastic/es-distributed (Team:Distributed)
I wonder if we should consider this a problem with auto-following or a problem with "put-follow" instead? Ideally, if the put-follow creates the index, we should also eventually start following.
I agree, but I suspect that we have some cases where we might end up trying to "auto-follow" an already followed index multiple times during master failovers; this needs some tests to prove it.
By my reading, if we do a resume-follow action on a shard that's already following then it will fail with a ResourceAlreadyExistsException rather than creating a duplicate follower task. It is tricky though: I don't know that there's a way to distinguish a shard which failed to create the initial follower task from one that was set up successfully and subsequently paused. We might need to add a flag to its index metadata and then do a single cluster state update which creates the follower task and flips the flag.
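To make the flag idea concrete, here is a rough sketch of what "one cluster state update that creates the follower task and flips the flag" could look like. It is only a toy model: `FollowState`, `followInitialized`, `startInitialFollow`, `pause` and `needsInitialFollow` are all hypothetical names, and none of this is the real Elasticsearch `ClusterState` or CCR code. The point is just that both changes land in the same new state object, so a paused index (flag set, no task) can be told apart from one whose initial task was never created (flag unset).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for cluster state; NOT the Elasticsearch ClusterState class.
final class FollowState {
    // index name -> whether the initial follower task was ever created (the proposed flag)
    final Map<String, Boolean> followInitialized;
    // indices that currently have a follower task
    final Set<String> followerTasks;

    FollowState(Map<String, Boolean> followInitialized, Set<String> followerTasks) {
        this.followInitialized = Map.copyOf(followInitialized);
        this.followerTasks = Set.copyOf(followerTasks);
    }

    // Single update: the task is created and the flag is flipped in the same new state,
    // so no observer can see one change without the other.
    FollowState startInitialFollow(String index) {
        if (followInitialized.getOrDefault(index, false)) {
            return this; // already initialized: either running or deliberately paused
        }
        Map<String, Boolean> flags = new HashMap<>(followInitialized);
        Set<String> tasks = new HashSet<>(followerTasks);
        flags.put(index, true);
        tasks.add(index);
        return new FollowState(flags, tasks);
    }

    // Pausing removes the task but leaves the flag set.
    FollowState pause(String index) {
        Set<String> tasks = new HashSet<>(followerTasks);
        tasks.remove(index);
        return new FollowState(followInitialized, tasks);
    }

    // True only for an index whose initial follower task was never created.
    boolean needsInitialFollow(String index) {
        return followInitialized.getOrDefault(index, false) == false;
    }
}
```

With something like this, a new master could safely re-run `startInitialFollow` for every index where `needsInitialFollow` is true after a failover, without touching indices that a user paused on purpose.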
Today, when the cluster is unstable and master failovers occur while new leader indices match an auto-follow pattern, the following index can end up in a state where it does not pull changes from the leader index, or where it is treated as an already following index when it is not actually followed.
One scenario where this is possible is after the following index is recovered from the leader index in: https://github.com/elastic/elasticsearch/blob/c7dc89f3cd86dbc9ad11c2f831c63651053e6e4a/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/TransportPutFollowAction.java#L269-L270
Even though that ends up calling an AcknowledgedTransportMasterNodeAction, it uses the default timeout (30s), meaning that if a failure lasts longer than 30s the listener just logs the failure instead of retrying or reporting back to the auto-follow coordinator, see: https://github.com/elastic/elasticsearch/blob/c7dc89f3cd86dbc9ad11c2f831c63651053e6e4a/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/TransportPutFollowAction.java#L244-L255
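For illustration, the difference between "log and drop" and "retry, then report back" might look roughly like the sketch below. Everything in it is hypothetical (`RetryingFollowStarter`, `Listener`, `FollowAction`, `startWithRetry`); it does not use the real `ActionListener` or any CCR class, it retries immediately on the calling thread, and a real fix would presumably need a delay between attempts and would have to decide which failures are retryable.

```java
// Hypothetical sketch; these types are NOT the Elasticsearch ActionListener or CCR classes.
final class RetryingFollowStarter {

    // Minimal callback, standing in for a listener on the follow request.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    // Stand-in for "send the request that starts following".
    interface FollowAction {
        void run(Listener listener);
    }

    // Retries the action up to maxAttempts times. Only when all attempts have failed
    // is the caller (e.g. the auto-follow coordinator) told about the failure,
    // instead of the failure being logged and dropped.
    static void startWithRetry(FollowAction action, int maxAttempts, Listener outer) {
        action.run(new Listener() {
            @Override
            public void onResponse() {
                outer.onResponse();
            }

            @Override
            public void onFailure(Exception e) {
                if (maxAttempts > 1) {
                    startWithRetry(action, maxAttempts - 1, outer);
                } else {
                    outer.onFailure(e); // propagate rather than only logging
                }
            }
        });
    }
}
```

Either way the coordinator gets a definite answer: the follow started, or it failed after the allowed number of attempts and the index can be picked up again later.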