Open DaveCTurner opened 3 years ago
Pinging @elastic/es-distributed (Team:Distributed)
We discussed this today and didn't come up with any objections. It would be useful to collect examples of support cases in which this idea would have helped so we can judge how important it would be to pursue. I will raise it with some support folks.
Does the planned recovery schema apply to CCR synchronizations as well?
A question has recently been raised about a situation where CCR is used to enable an active/standby configuration, and where the follower index becomes the "logical active index" when the leader index fails. If the failed leader is brought back online after significant downtime, re-synchronizing it with the current leader could take many hours. Would file-based recoveries apply to this scenario as well? It has been noted that bi-directional replication could work around this potential problem.
> Does the planned recovery schema apply to CCR synchronizations as well?
It's a good question, but no, CCR recoveries are out of scope for this discussion. Here's why...
> Would file-based recoveries be applicable to this scenario as well?
If the follower index is promoted to a leader (which is my interpretation of the phrase "logical active index" in your question) then this scenario requires recreating the follower: closing the original leader index and then calling the create-follower API to resynchronize it. That process already, in effect, always uses a file-based recovery today.
If the follower index is not promoted during the leader's downtime then there will be no writes to replicate, so today's behaviour (effectively an operations-based recovery) is the best choice anyway.
If the follower is unavailable then we also always perform an operations-based recovery on its return, which may be the wrong choice for similar reasons to the ones outlined above if the downtime lasts for several (but <12) hours. Unlike for peer recoveries, we don't have the pieces we would need to take an alternative route in this situation today, and developing those pieces is not in scope here.
Here's a nice graph of translog ops recovered vs time for a shard that was struggling to recover while there was ongoing indexing in a 7.9.2 cluster:
It's not clear why the first (file-based) recovery was so choppy, but the subsequent recoveries were ops-based and probably should have been file-based or at least should have fallen back to file-based ones. It's interesting to me that the replay rate is so consistent. The "knee" in the graph of the second recovery is, I believe, where we started to encounter ops that we hadn't already recovered.
Operations-based recoveries are effective when there are only a few operations to replay, but replaying a lot of operations can be much more expensive than just copying the files over again. Today we have relatively simple heuristics for choosing between these options, based on comparing an estimate for the number of operations to copy vs the total size of the index and assuming that the shape and size of the index is not too unusual. If those assumptions do not hold then these heuristics can lead us to a bad decision, perhaps spending orders of magnitude more time on an operations-based recovery than we would have needed for a file-based one.
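To make the shape of such a heuristic concrete, here is a minimal sketch (not Elasticsearch's actual code; the function, its parameter names, and the default thresholds are all hypothetical) of a size-based choice between the two recovery types:

```python
def choose_recovery_type(ops_to_replay: int,
                         index_size_bytes: int,
                         avg_op_size_bytes: int = 1024,
                         size_fraction: float = 0.1) -> str:
    """Illustrative only: prefer an ops-based recovery when the estimated
    volume of operations to replay is a small fraction of the index size.
    This kind of heuristic mis-fires when the assumed average operation
    size or replay cost doesn't match the actual index."""
    estimated_replay_bytes = ops_to_replay * avg_op_size_bytes
    if estimated_replay_bytes < size_fraction * index_size_bytes:
        return "ops-based"
    return "file-based"
```

Note that this compares data volumes only; it says nothing about replay *speed*, which is exactly the blind spot described above.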
Early in a recovery we do not have a better way to compare the two options, primarily because we have no way to estimate how fast the operations can be replayed. Once we have started to recover operations, however, a clearer picture starts to emerge. Occasionally we encounter an ongoing ops-based recovery that intuitively (based on its current progress) would complete much more quickly if we just cancelled it and performed a file-based recovery instead.
I'm proposing building this intuition into Elasticsearch by making empirical measurements of the recovery speed during an operations-based recovery and using these measurements to estimate the time remaining until the recovery is complete. If these measurements clearly indicate that the ops-based recovery was a bad choice then we would fall back to a file-based recovery, similarly to how one would do this "by eye" today. Specifically, I think we can calculate the balance of the options as follows:
This assumes that we replay operations at a constant rate, and that we get our fair share of the maximum permitted recovery bandwidth. It also assumes that ongoing indexing happens at a constant rate, measured by the change in the max seqno during the replay so far, and uses this to estimate how long we would have to spend replaying operations after the file copy was complete.
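Under those assumptions, the comparison could be sketched as follows (my reading of the proposal, not code from Elasticsearch; all names are hypothetical):

```python
def recovery_time_estimates(ops_remaining: float,
                            replay_rate: float,      # ops/sec, measured during the recovery so far
                            index_size_bytes: float,
                            bandwidth: float,        # bytes/sec, our fair share of recovery bandwidth
                            indexing_rate: float):   # ops/sec, from the change in max seqno
    """Return (ops_based_seconds, file_based_seconds) assuming constant
    replay, copy, and indexing rates."""
    ops_based = ops_remaining / replay_rate
    copy_time = index_size_bytes / bandwidth
    # After the file copy completes we must still replay the operations
    # indexed while the copy was running.
    file_based = copy_time + (copy_time * indexing_rate) / replay_rate
    return ops_based, file_based
```

For example, 3.6M operations remaining at a measured 100 ops/sec projects to 10 hours of replay, whereas copying a 50 GB index at 100 MB/sec (plus the small catch-up replay) projects to about 9 minutes.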
Some further thoughts:
- we would wait for some time (e.g. an hour) before making this comparison, so that we have a reasonable picture of its progress. Recoveries that complete in under an hour are of much less concern.
- we would only fall back if we didn't already try a file-based recovery, to avoid a retry loop.
- we could adjust these estimates to conservatively prefer not to discard completed work, perhaps requiring the fall-back estimate to be a fraction of the estimated time remaining before taking action. In the cases I have in mind, the estimated fall-back time has typically been less than 1/10th of the remaining ops-based-recovery time.
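The guard rails above could be combined into a single decision rule, sketched here with hypothetical names and the example thresholds mentioned in the list (one hour of warm-up, a 1/10 safety fraction):

```python
def should_fall_back(elapsed_seconds: float,
                     already_tried_file_based: bool,
                     ops_based_estimate: float,   # projected seconds remaining for ops replay
                     file_based_estimate: float,  # projected seconds for a file-based recovery
                     min_elapsed: float = 3600.0,
                     safety_fraction: float = 0.1) -> bool:
    """Illustrative decision rule: fall back to a file-based recovery only
    after observing the ops-based recovery for a while, never retry a
    file-based recovery, and require a clear projected win before
    discarding completed work."""
    if elapsed_seconds < min_elapsed:
        return False          # too early to trust the measured replay rate
    if already_tried_file_based:
        return False          # avoid a retry loop
    return file_based_estimate < safety_fraction * ops_based_estimate
```

The safety fraction makes the rule deliberately conservative: a marginal projected improvement is not worth abandoning work already done.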