Force merge behaviour unreliable

EgbertW commented 9 months ago

Elasticsearch Version

8.10.3

Installed Plugins

No response

Java Version

bundled

OS Version

Linux 5.15.107+ #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I monitor an index that relies on force merging to 1 segment for performance considerations. The process is:

Create index, replicas set to 0
Bulk insert millions of documents into it
Refresh index
Force merge with max_num_segments set to 1, wait_for_completion=false
Use task API to read task until task is done
Use stats API to get the number of primary segments
If primary segments is more than the number of shards, go back to step 4 and repeat
Set number_of_replicas to 5, wait for replication to complete
Start using index for querying

Multiple concurrent processes like this can run on the same cluster. We've noticed on multiple occassions that step 3&4 is not enough: the task completes successfully but still more than 1 segments exist for 1 or more shards. So we added step 6 to retry if that happens.

A new issue surfaced recently: even step 6 was not good enough. The stats API reported a number of primary segments equal to the amount of shards for several minutes and replication was initiated. During the allocation of additional replicas, the number of segments jumped back up to 44. Essentially, the force merge on one shard was reversed.

This mostly seems to happen when multiple indices are force merged at the same time. We noticed log messages like:

now throttling indexing: numMergesInFlight=10, maxNumMerges=9
stop throttling indexing: numMergesInFlight=8, maxNumMerges=9

This message appears to be a red flag - it looks like not only indexing is throttled but also force merging is partially cancelled.

This all feels like a bug somewhere in the force merge administration - the task is not monitored properly and not guaranteed to complete, and even if if does appear to complete correctly, there looks to be a change that the force merge is reversed.

Steps to Reproduce

Have a Elasticsearch cluster, set force_merge threadpool size to X=8
Create 2 separate indices with 0 replicas, with S=6 shards each (where 2xS is more than X=8)
Bulk insert millions of documents into both of them, concurrently (enough to be sure the force merge takes a significant amount of time)
As soon as all documents are indexed into an index, trigger a refresh on it
Directly after the refresh, initate a force merge with max_num_segments set to 1 (do this separately for both indices)
Use the task API to monitor the resulting tasks for completion
Initiate replicaton on the indices as soon as the tasks are done
Monitor the amount of primary shards throughout the entire process

Rinse & repeat a couple of times if necessary. Observe that the number of primary segments is not guaranteed to be 12 (6 on each index, 1 on each shard) after this.

The only lead we currently have is this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-forcemerge.html

that states:

Shards that are relocating during a forcemerge will not be merged.

Theoretically, the replication of one index could cause a rebalancing for primary shards on the other, effectively nullifying the effect of the force merge that is still ongoing.

It would be great to have some reliable grip on the process and being sure that it completed correctly.

Logs (if relevant)

now throttling indexing: numMergesInFlight=10, maxNumMerges=9 stop throttling indexing: numMergesInFlight=8, maxNumMerges=9

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-distributed (Team:Distributed)

tlrx commented 9 months ago

This all feels like a bug somewhere in the force merge administration - the task is not monitored properly and not guaranteed to complete, and even if if does appear to complete correctly, there looks to be a change that the force merge is reversed.

There is no such thing in Elasticsearch that reverts a completed force merge, but there are various operations that may produce new segments or retain existing segments while the force merge is running.

Among the ones I can think of:

indexing is not really stopped
refreshes produces new segments (think of the default refresh interval of 1 sec, or external client refreshing indices)
shard relocations may retain segments until the relocation is done
snapshots retain segments until snapshot is completed
indexing memory controller activate indexing throttling and flush largest shards (it produces more segments)
point-in-time and scroll search queries retain segments

Your use case probably needs to be adapted to take into consideration all these operations (and there are maybe more, sorry if I forgot some of them).

A good start would be to create the index with refresh interval of -1 and index through an alias so that the alias can be dropped once bulk indexing is done (or even better, use Security and specific permission to ensure your process is the only possible indexing client). Then flush the index and add a write block to the index (see Add Block API), this would guarantee not further indexing. The index can then be relocated to a node with less active indexing so that indexing throttling won't be activated and shard relocation due to disk space less likely to happen. Then force merge to 1 segment and use the Task API to see the results.

I'm keeping this issue opened since I think we can at least add a note in documentation about the various reason that could cause a shard to have more segments than expected after a force merge.

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-docs (Team:Docs)

EgbertW commented 8 months ago

@tlrx Thanks for your response. It took me some time to find some time to respond.

I understand there is no such thing as a reversal. Then there must be some misalignment, not sure what's causing it. In the indexing task, all documents are inserted into Elasticsearch, the force merge API is called and then periodically (every minute or so) the <indexname>/_segments endpoint is called to get the number of segments.

The behaviour I noticed was that at some point num_committed_segments in the reply of this was 1, and then a minute later it suddenly reverted back to its old value of 45. I definitely can't rule out that the force merge was aborted by any of the causes you list. Shard relocations seem the must likely culprit. However, the possibility of a force merge not succeeding completely are taken into account in the code - that is why the _segments endpoint is called after the task is finished. However, it turns out that even if _segments reports that there is one segment, there's still a possiblity of it not being accurate. This feels like a bug to me.

Prefably I'd see a forge merge task that allows me to state that no matter what happens and no matter how long it takes, I want the task to continue until there really is just one segment in every shard left. If this is unfeasible, it would at least be nice to have some reliable way to find out afterwards if it succeeded.

The indexing process already has refresh disabled. We're only calling _refresh endpoint after we got confirmation that all documents have been indexed by the _bulk endpoint.

Right now the solution we've going with is:

Do the index
Try to avoid all the conditions that may cause the forge merge to fail as much as possible
This includes postponing any shard allocations triggered by our process
Call forge merge and wait for the task to finish
Check _segments to see if it works

Because _segments has proven to be not completely accurate, we're now considering waiting for a couple of minutes and then calling _segments again to see if the number of segments is still 1 but this feels quite quirky.

So I don't think it would be solved by just updating the documentation, to be honest, there's something more going on.

tomekred commented 3 months ago

It seems like sending change from 0 replicas to 1 (or more):

PUT /<index_name>/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}

right (like 1-2 seconds at most) after calling the _forcemerge also makes the operation fail silently. What is worse, the task log on _forcemerge says "completed": "true", but number of segments stays the same.

elastic / elasticsearch