joegallo opened this issue 2 years ago
Pinging @elastic/es-distributed (Team:Distributed)
Just a little more context:
ILM's searchable_snapshot action has a force_merge_index option that defaults to true. In the very reasonable and ordinary scenario where a policy already has a force_merge action that merged down to "max_num_segments": 1, the subsequent force merge in the searchable_snapshot action will likely be a no-op -- but in the wrong circumstances it could spend a whole lot of time waiting in a queue before it finally turns out to be a no-op.
We discussed this and while it sounds simple, we would prefer that the determination of whether a shard already satisfies the force-merge request be done by Lucene, to avoid subtle bugs like a one-segment force merge with a codec change not actually carrying out the codec change. Once that is in place, skipping the force merge queue inside Elasticsearch would be a simple change.
> we would prefer that the determination of whether a shard already satisfies the force-merge request be done by Lucene
+1, in Lucene this responsibility is fully encapsulated in the merge policy, and implementations are free to create singleton merges to reclaim deletes, change a codec, reorder documents, remove fields, or for any other reason.
I don't feel good about adding more API surface for this use case though, which feels a bit esoteric. One way to go about it could be to have a merge policy wrapper that we could temporarily configure to disable merging and only tell us whether the underlying merge policy would produce merges, then enable merges again and queue the shard for a force merge if the underlying merge policy would have produced one or more merges.
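A very rough sketch of what such a wrapper could look like (the class and its dry-run flag are made up for illustration; only FilterMergePolicy and the findForcedMerges hook are existing Lucene APIs):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

// Hypothetical wrapper: in dry-run mode it asks the delegate what it would do, records whether
// any forced merges were proposed, and suppresses them so nothing actually runs. The shard would
// only be queued for a real force merge if wouldForceMerge() returns true.
public class DryRunForceMergePolicy extends FilterMergePolicy {

    private volatile boolean dryRun = false;
    private final AtomicBoolean wouldMerge = new AtomicBoolean(false);

    public DryRunForceMergePolicy(MergePolicy delegate) {
        super(delegate);
    }

    public void setDryRun(boolean dryRun) {
        this.dryRun = dryRun;
        this.wouldMerge.set(false);
    }

    public boolean wouldForceMerge() {
        return wouldMerge.get();
    }

    @Override
    public MergeSpecification findForcedMerges(
        SegmentInfos segmentInfos,
        int maxSegmentCount,
        Map<SegmentCommitInfo, Boolean> segmentsToMerge,
        MergeContext mergeContext
    ) throws IOException {
        final MergeSpecification spec = in.findForcedMerges(segmentInfos, maxSegmentCount, segmentsToMerge, mergeContext);
        if (dryRun) {
            wouldMerge.set(spec != null && spec.merges.isEmpty() == false);
            return null; // suppress the merges while only probing
        }
        return spec;
    }
}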
Thinking out loud, maybe another approach that could work well in practice would be to use a priority queue for the force_merge thread pool that gives higher priority to shards that already have a single segment. It wouldn't always be true that these shards don't need merging, but probably often enough that this would practically address the problem? It could also help with force-merging small shards before large shards in order to allow ILM to move to the following steps as soon as possible.
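For illustration, a minimal sketch of that prioritization (ForceMergeTask and its fields are hypothetical stand-ins, not existing Elasticsearch code):

import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical queue entry; segmentCount and sizeInBytes stand in for whatever shard-level
// stats the real queue would carry.
record ForceMergeTask(String shardId, int segmentCount, long sizeInBytes, Runnable work) {}

class PrioritizedForceMergeQueue {
    // shards already at a single segment (likely no-ops) first, then smaller shards before larger ones
    private static final Comparator<ForceMergeTask> PRIORITY = Comparator
        .comparing((ForceMergeTask t) -> t.segmentCount() > 1) // false (already one segment) sorts first
        .thenComparingLong(ForceMergeTask::sizeInBytes);

    private final PriorityBlockingQueue<ForceMergeTask> queue = new PriorityBlockingQueue<>(11, PRIORITY);

    void submit(ForceMergeTask task) {
        queue.add(task);
    }

    // the single force_merge thread would loop on this
    ForceMergeTask take() throws InterruptedException {
        return queue.take();
    }
}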
We discussed this again and:
> could be to have a merge policy wrapper
This sounds like a reasonable direction, though preferably without temporarily reconfiguring the wrapper. Perhaps we can pass enough context information to allow the mode of operation you describe, where we are told whether a merge is necessary. We also think we could perhaps hook in when threads are created, or make the merge threads node-wide rather than per shard (essentially moving the determination of how many force merges to run down to the merge thread level).
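Purely to illustrate the node-wide idea (a sketch with made-up names; this is not how the merge scheduler is actually wired up today):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a single node-level executor that every shard submits its forced merges to, so the
// decision of how many force merges run at once lives here rather than in per-shard schedulers.
final class NodeWideForceMergePool {
    private final ExecutorService executor;

    NodeWideForceMergePool(int maxConcurrentForceMerges) {
        this.executor = Executors.newFixedThreadPool(maxConcurrentForceMerges);
    }

    // shards hand over the merge work (a plain Runnable here for simplicity) rather than running it themselves
    Future<?> submit(Runnable forceMerge) {
        return executor.submit(forceMerge);
    }
}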
> use a priority queue for the force_merge thread pool that gives higher priority to shards that already have a single segment
If the above turns out to be difficult, we agree this can be an alternative. A caveat is that a no-op merge would still have to wait for any ongoing force merge to complete.
> also help with force-merging small shards before large shards
We have concerns about starvation, or at least an extended wait time, for outliers here. I am not sure this ensures more progress of data moving through, but I can see how it could ensure more indices move through to the following steps sooner.
Now that I look again, is the Lucene API already rich enough to answer whether a force-merge will be a no-op?
private boolean forceMergeIsNoOp(int maxNumSegments) throws IOException {
    try (var reader = DirectoryReader.open(indexWriter)) {
        final var segmentCommitInfos = SegmentInfos.readCommit(reader.directory(), reader.getIndexCommit().getSegmentsFileName());
        // mark every live segment as eligible, the same way a real force merge would
        final var segmentsToMerge = new HashMap<SegmentCommitInfo, Boolean>();
        for (int i = 0; i < segmentCommitInfos.size(); i++) {
            segmentsToMerge.put(segmentCommitInfos.info(i), Boolean.TRUE);
        }
        // findForcedMerges may return null when the policy has nothing to do
        final var spec = indexWriter.getConfig()
            .getMergePolicy()
            .findForcedMerges(segmentCommitInfos, maxNumSegments, segmentsToMerge, indexWriter);
        return spec == null || spec.merges.isEmpty();
    }
}
This looks sort of right to me. I wonder if we would also need to flush first to be accurate, which could be a trade-off in some situations?
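If a flush does turn out to be needed, the call order would presumably just be something like the following (same indexWriter as above):

// flush so the merge policy sees the real on-disk segment layout before we ask it anything
indexWriter.flush();
boolean noOp = forceMergeIsNoOp(1); // e.g. for "max_num_segments": 1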
If we were considering a priority queue, would it be a good idea to prioritize finishing all shards of the same index ahead of starting new shards from other indices? Otherwise the data for an index can sit on a particular tier, unable to move off, just because one or two of its shards are pending force merge at the back of the queue and have to wait for all the shards ahead of them to finish before they can commence their force merges.
What I mean here is to prioritize the shards whose completion actually moves data through the lifecycle, rather than having a bunch of index data sit around until other blocking shards complete before that data can move off.
The problem this would solve is disk capacity stress from data hanging around unnecessarily.
For example, some shards might have been re-queued to the back of the merge queue for whatever reason (e.g. a cluster reroute with retry_failed was called after some data was held up). In that case, the data from indices that could already have moved off cannot, because it is waiting on the last couple of shards at the back of the merge queue to get their turn, while the shards from other indices that are currently merging consume additional disk space until their merges finish. Combined, this causes unnecessary disk capacity growth rather than completing those indices earlier and freeing up space.
If we were going to turn the merge queue into a priority queue anyway, then aside from prioritizing probable no-ops to the front, we could also try to prioritize finishing off the remaining shards of indices that already have shards merging or merged, rather than processing shards in plain queue order and starting merges for shards of indices that haven't started processing yet.
There might be edge cases here that need some cutoff (e.g. a 200-shard index holding up all the other work of smaller indices; or shards that have sat in the queue for too long simply getting their turn after some time threshold, i.e. making the prioritization a best-effort one rather than a hard ordering).
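As a rough sketch of that kind of best-effort ordering (all of the types and thresholds here are made up for illustration):

import java.time.Duration;
import java.util.Comparator;
import java.util.Set;

// Hypothetical ordering: indicesAlreadyProcessing is the set of index names that already have
// shards merging or merged; MAX_WAIT caps how long the affinity ordering can push a task back.
class IndexAffinityOrdering {
    private static final Duration MAX_WAIT = Duration.ofHours(1); // illustrative threshold only

    // stand-in for whatever the real queue entry would be
    record QueuedShard(String indexName, Duration waitTime, long enqueueOrder) {}

    static Comparator<QueuedShard> comparator(Set<String> indicesAlreadyProcessing) {
        return Comparator
            // anything that has waited past the threshold goes first, regardless of affinity
            .comparing((QueuedShard s) -> s.waitTime().compareTo(MAX_WAIT) < 0)
            // then prefer shards whose index already has shards merging or merged
            .thenComparing(s -> indicesAlreadyProcessing.contains(s.indexName()) ? 0 : 1)
            // finally fall back to plain arrival order
            .thenComparingLong(QueuedShard::enqueueOrder);
    }
}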
Description
Currently, the FORCE_MERGE thread pool is sized to 1 per node (see also https://github.com/elastic/elasticsearch/issues/84943). Given that, if you attempt to force merge many indices down to one segment, you can saturate the thread pool and get a lot of queueing.

In some scenarios, it's possible that some or many of the indices that are queued up waiting for the thread pool have already been merged down to just one segment -- that is, the eventual force merge is actually going to be a no-op (in the sense of there being no effect after having completed the operation). However, we don't have a path on TransportForceMergeAction that avoids doing work on the FORCE_MERGE thread pool in this case -- the task still gets queued up anyway.

As an enhancement, perhaps TransportForceMergeAction can learn to skip even enqueueing a task in the event that the force merge will be a no-op?

Alternatively, if it doesn't make sense to do this at the TransportForceMergeAction layer, then it might still make sense to do it in ILM -- e.g. ForceMergeAction or ForceMergeStep could check the number of segments first before calling out to forceMerge.
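For the ILM-flavored alternative, the shape of the check might be something like the sketch below (this is not the actual ForceMergeStep code; countSegmentsPerShard and the surrounding wiring are hypothetical stand-ins):

import java.util.Map;

// Hypothetical pre-check: only issue the force merge request if at least one shard still has
// more segments than requested; otherwise skip enqueueing entirely, which is the enhancement
// proposed above.
final class ForceMergePreCheck {

    boolean forceMergeNeeded(String index, int maxNumSegments) {
        Map<Integer, Integer> segmentsPerShard = countSegmentsPerShard(index);
        return segmentsPerShard.values().stream().anyMatch(count -> count > maxNumSegments);
    }

    void maybeForceMerge(String index, int maxNumSegments) {
        if (forceMergeNeeded(index, maxNumSegments)) {
            forceMerge(index, maxNumSegments); // the expensive, queued path
        }
    }

    // --- stand-ins, not real Elasticsearch APIs ---
    // would be backed by something like the per-shard segment stats in a real implementation
    private Map<Integer, Integer> countSegmentsPerShard(String index) { throw new UnsupportedOperationException(); }

    private void forceMerge(String index, int maxNumSegments) { throw new UnsupportedOperationException(); }
}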