joegallo opened this issue 2 years ago
Pinging @elastic/es-distributed (Team:Distributed)
Just a little more context:
ILM's searchable_snapshot action has a force_merge_index option that defaults to true. In the very reasonable and ordinary scenario where a policy already has a force_merge action that merged down to "max_num_segments": 1, the subsequent force merge in the searchable_snapshot action will likely be a no-op -- but in the wrong circumstances it could spend a whole lot of time waiting in a queue before it finally turns out to be a no-op.
We discussed this and while it sounds simple, we would prefer that the determination of whether a shard already satisfies the force-merge request be done by Lucene, to avoid subtle bugs like a one-segment force merge with a codec change not actually carrying out the codec change. Once that is in place, skipping the force merge queue inside Elasticsearch would be a simple change.
> we would prefer that the determination of whether a shard already satisfies the force-merge request be done by Lucene
+1, in Lucene this responsibility is fully encapsulated in the merge policy, and implementations are free to create singleton merges to reclaim deletes, change a codec, reorder documents, remove fields, or for any other reason.
I don't feel good about adding more API surface for this use case though, which feels a bit esoteric. One way to go about it could be to have a merge policy wrapper that we could temporarily configure to disable merging and only tell us whether the underlying merge policy would produce merges, then enable merges again and queue the shard for a force merge if the underlying merge policy would have produced one or more merges.
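A very rough sketch of what such a wrapper could look like (the class and its dry-run flag are made up for illustration; only FilterMergePolicy and the findForcedMerges hook are existing Lucene APIs):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

// Hypothetical wrapper: in dry-run mode it asks the delegate what it would do, records whether
// any forced merges were proposed, and suppresses them so nothing actually runs. The shard would
// only be queued for a real force merge if wouldForceMerge() returns true.
public class DryRunForceMergePolicy extends FilterMergePolicy {

    private volatile boolean dryRun = false;
    private final AtomicBoolean wouldMerge = new AtomicBoolean(false);

    public DryRunForceMergePolicy(MergePolicy delegate) {
        super(delegate);
    }

    public void setDryRun(boolean dryRun) {
        this.dryRun = dryRun;
        this.wouldMerge.set(false);
    }

    public boolean wouldForceMerge() {
        return wouldMerge.get();
    }

    @Override
    public MergeSpecification findForcedMerges(
        SegmentInfos segmentInfos,
        int maxSegmentCount,
        Map<SegmentCommitInfo, Boolean> segmentsToMerge,
        MergeContext mergeContext
    ) throws IOException {
        final MergeSpecification spec = in.findForcedMerges(segmentInfos, maxSegmentCount, segmentsToMerge, mergeContext);
        if (dryRun) {
            wouldMerge.set(spec != null && spec.merges.isEmpty() == false);
            return null; // suppress the merges while only probing
        }
        return spec;
    }
}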
Thinking out loud, maybe another approach that could work well in practice would be to use a priority queue for the force_merge thread pool that gives higher priority to shards that already have a single segment. It wouldn't always be true that these shards don't need merging, but probably often enough that this would practically address the problem? It could also help with force-merging small shards before large shards in order to allow ILM to move to the following steps as soon as possible.
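For illustration, a minimal sketch of that prioritization (ForceMergeTask and its fields are hypothetical stand-ins, not existing Elasticsearch code):

import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical queue entry; segmentCount and sizeInBytes stand in for whatever shard-level
// stats the real queue would carry.
record ForceMergeTask(String shardId, int segmentCount, long sizeInBytes, Runnable work) {}

class PrioritizedForceMergeQueue {
    // shards already at a single segment (likely no-ops) first, then smaller shards before larger ones
    private static final Comparator<ForceMergeTask> PRIORITY = Comparator
        .comparing((ForceMergeTask t) -> t.segmentCount() > 1) // false (already one segment) sorts first
        .thenComparingLong(ForceMergeTask::sizeInBytes);

    private final PriorityBlockingQueue<ForceMergeTask> queue = new PriorityBlockingQueue<>(11, PRIORITY);

    void submit(ForceMergeTask task) {
        queue.add(task);
    }

    // the single force_merge thread would loop on this
    ForceMergeTask take() throws InterruptedException {
        return queue.take();
    }
}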
We discussed this again and:
> could be to have a merge policy wrapper
This sounds like a reasonable direction, though preferably without temporarily reconfiguring the wrapper. Perhaps we can pass enough context information to allow the mode of operation you describe, where we are told whether a merge is necessary. We also think we could perhaps hook in when threads are created, or make the merge threads node-wide rather than per shard (essentially moving the determination of how many force merges to run down to the merge thread level).
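Purely to illustrate the node-wide idea (a sketch with made-up names; this is not how the merge scheduler is actually wired up today):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a single node-level executor that every shard submits its forced merges to, so the
// decision of how many force merges run at once lives here rather than in per-shard schedulers.
final class NodeWideForceMergePool {
    private final ExecutorService executor;

    NodeWideForceMergePool(int maxConcurrentForceMerges) {
        this.executor = Executors.newFixedThreadPool(maxConcurrentForceMerges);
    }

    // shards hand over the merge work (a plain Runnable here for simplicity) rather than running it themselves
    Future<?> submit(Runnable forceMerge) {
        return executor.submit(forceMerge);
    }
}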
> use a priority queue for the force_merge thread pool that gives higher priority to shards that already have a single segment
If the above turns out to be difficult, we agree this can be an alternative. A caveat is that a no-op merge would still have to wait for any ongoing force merge to complete.
> also help with force-merging small shards before large shards
We have concerns about starvation, or at least an extended wait time, for outliers here. I am not sure this ensures more progress of data moving through, but I can see how it could ensure more indices move through to the following steps sooner.
Now that I look again, is the Lucene API already rich enough to answer whether a force-merge will be a no-op?
private boolean forceMergeIsNoOp(int maxNumSegments) throws IOException {
    try (var reader = DirectoryReader.open(indexWriter)) {
        final var segmentCommitInfos = SegmentInfos.readCommit(reader.directory(), reader.getIndexCommit().getSegmentsFileName());
        // mark every live segment as eligible, the same way a real force merge would
        final var segmentsToMerge = new HashMap<SegmentCommitInfo, Boolean>();
        for (int i = 0; i < segmentCommitInfos.size(); i++) {
            segmentsToMerge.put(segmentCommitInfos.info(i), Boolean.TRUE);
        }
        // findForcedMerges may return null when the policy has nothing to do
        final var spec = indexWriter.getConfig()
            .getMergePolicy()
            .findForcedMerges(segmentCommitInfos, maxNumSegments, segmentsToMerge, indexWriter);
        return spec == null || spec.merges.isEmpty();
    }
}
This looks sort of right to me. I wonder if we would also need to flush first to be accurate, which could be a trade-off in some situations?
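If a flush does turn out to be needed, the call order would presumably just be something like the following (same indexWriter as above):

// flush so the merge policy sees the real on-disk segment layout before we ask it anything
indexWriter.flush();
boolean noOp = forceMergeIsNoOp(1); // e.g. for "max_num_segments": 1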
If we were considering a priority queue, would it be a good idea to prioritize finishing all shards of the same index ahead of starting new shards from other indices? Otherwise the data for an index can sit on a particular tier, unable to move off, just because one or two of its shards are pending force merge at the back of the queue and have to wait for all the shards ahead of them to finish before they can commence their force merges.
What I mean here is to prioritize the shards whose completion actually moves data through the lifecycle, rather than having a bunch of index data sit around until other blocking shards complete before that data can move off.
The problem this would solve is disk capacity stress from data hanging around unnecessarily.
For example, some shards might have been re-queued to the back of the merge queue for whatever reason (e.g. a cluster reroute with retry_failed was called after some data was held up). In that case, the data from indices that could already have moved off cannot, because it is waiting on the last couple of shards at the back of the merge queue to get their turn, while the shards from other indices that are currently merging consume additional disk space until their merges finish. Combined, this causes unnecessary disk capacity growth rather than completing those indices earlier and freeing up space.
If we were going to turn the merge queue into a priority queue anyway, then aside from prioritizing probable no-ops to the front, we could also try to prioritize finishing off the remaining shards of indices that already have shards merging or merged, rather than processing shards in plain queue order and starting merges for shards of indices that haven't started processing yet.
There might be edge cases here that need some cutoff (e.g. a 200-shard index holding up all the other work of smaller indices; or shards that have sat in the queue for too long simply getting their turn after some time threshold, i.e. making the prioritization a best-effort one rather than a hard ordering).
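As a rough sketch of that kind of best-effort ordering (all of the types and thresholds here are made up for illustration):

import java.time.Duration;
import java.util.Comparator;
import java.util.Set;

// Hypothetical ordering: indicesAlreadyProcessing is the set of index names that already have
// shards merging or merged; MAX_WAIT caps how long the affinity ordering can push a task back.
class IndexAffinityOrdering {
    private static final Duration MAX_WAIT = Duration.ofHours(1); // illustrative threshold only

    // stand-in for whatever the real queue entry would be
    record QueuedShard(String indexName, Duration waitTime, long enqueueOrder) {}

    static Comparator<QueuedShard> comparator(Set<String> indicesAlreadyProcessing) {
        return Comparator
            // anything that has waited past the threshold goes first, regardless of affinity
            .comparing((QueuedShard s) -> s.waitTime().compareTo(MAX_WAIT) < 0)
            // then prefer shards whose index already has shards merging or merged
            .thenComparing(s -> indicesAlreadyProcessing.contains(s.indexName()) ? 0 : 1)
            // finally fall back to plain arrival order
            .thenComparingLong(QueuedShard::enqueueOrder);
    }
}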
Description
Currently, the FORCE_MERGE thread pool is sized to 1 per node (see also https://github.com/elastic/elasticsearch/issues/84943). Given that, if you attempt to force merge many indices down to one segment, you can saturate the thread pool and get a lot of queueing.

In some scenarios, it's possible that some or many of the indices that are queued up waiting for the thread pool have already been merged down to just one segment -- that is, the eventual force merge is actually going to be a no-op (in the sense of there being no effect after having completed the operation). However, we don't have a path on TransportForceMergeAction that avoids doing work on the FORCE_MERGE thread pool in this case -- the task still gets queued up anyway.

As an enhancement, perhaps TransportForceMergeAction can learn to skip even enqueueing a task in the event that the force merge will be a no-op?

Alternatively, if it doesn't make sense to do this at the TransportForceMergeAction layer, then it might still make sense to do it in ILM -- e.g. ForceMergeAction or ForceMergeStep could check the number of segments first before calling out to forceMerge.
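For the ILM-flavored alternative, the shape of the check might be something like the sketch below (this is not the actual ForceMergeStep code; countSegmentsPerShard and the surrounding wiring are hypothetical stand-ins):

import java.util.Map;

// Hypothetical pre-check: only issue the force merge request if at least one shard still has
// more segments than requested; otherwise skip enqueueing entirely, which is the enhancement
// proposed above.
final class ForceMergePreCheck {

    boolean forceMergeNeeded(String index, int maxNumSegments) {
        Map<Integer, Integer> segmentsPerShard = countSegmentsPerShard(index);
        return segmentsPerShard.values().stream().anyMatch(count -> count > maxNumSegments);
    }

    void maybeForceMerge(String index, int maxNumSegments) {
        if (forceMergeNeeded(index, maxNumSegments)) {
            forceMerge(index, maxNumSegments); // the expensive, queued path
        }
    }

    // --- stand-ins, not real Elasticsearch APIs ---
    // would be backed by something like the per-shard segment stats in a real implementation
    private Map<Integer, Integer> countSegmentsPerShard(String index) { throw new UnsupportedOperationException(); }

    private void forceMerge(String index, int maxNumSegments) { throw new UnsupportedOperationException(); }
}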