elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Combine ILM shrink and force merge #73499

Open dakrone opened 3 years ago

dakrone commented 3 years ago

It's common for an ILM policy to have both a shrink action and a forcemerge action in the warm phase. To reduce data transfer and storage (DTS) costs, we should investigate combining these actions.

Currently when performing a shrink, the following actions are taken by ILM (this is a subset):

The forcemerge action performs a simple force merge of the index, but the merge is duplicated on each shard copy, and because merging is nondeterministic, the segments will likely differ between nodes, leading to replication of segments.

There are at least two things we can do to help reduce DTS costs related to this:

Shrink into an index with zero replicas

When we shrink, ILM currently creates the shrunken index with the same replica count as the original. Since the shrink happens transparently in the background, there is no need to create the shrunken index with replicas up front. Instead, we can create it with zero replicas and increase the replica count to the original index's count before the original index is deleted.

Since shrink now has ILM resiliency, no data loss occurs if something goes wrong, and ILM can retry.
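A minimal sketch of the two requests this idea implies, assuming the standard shrink and update-settings REST endpoints. The helper names and index names are hypothetical, and this only builds request descriptions; it is not how ILM's internal steps are implemented.

```python
import json


def shrink_request(source: str, target: str, shards: int = 1):
    """Build (method, path, body) for a shrink into a zero-replica target.

    Assumption: the target is created with no replicas so that later work
    happens on a single shard copy.
    """
    body = {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_replicas": 0,
        }
    }
    return ("POST", f"/{source}/_shrink/{target}", body)


def restore_replicas_request(target: str, replicas: int):
    """Build (method, path, body) that raises the replica count back to
    the original index's value before the original index is deleted."""
    body = {"index": {"number_of_replicas": replicas}}
    return ("PUT", f"/{target}/_settings", body)


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    requests = [
        shrink_request("logs-000001", "shrink-logs-000001"),
        restore_replicas_request("shrink-logs-000001", 1),
    ]
    for method, path, body in requests:
        print(method, path, json.dumps(body))
```

The replica count passed to `restore_replicas_request` would have to be read from the original index before the shrink starts.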

By itself, this doesn't reduce DTS costs, because the data still has to be replicated across the zone boundary regardless. The savings come when it is combined with the next enhancement:

Perform forcemerge prior to increasing the replica count

Forcemerge also ends up causing replication across zone boundaries. However, if we perform the forcemerge while the index has no replicas, it only needs to be performed once, and the data is replicated to a different zone only a single time.
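The ordering described here can be sketched as a list of REST calls, assuming the standard `_shrink`, `_forcemerge`, and `_settings` endpoints. The function name and index names are hypothetical, and the sketch deliberately omits the other steps ILM performs around a shrink (such as single-node allocation and deleting the original index).

```python
def combined_plan(source: str, target: str, original_replicas: int,
                  max_num_segments: int = 1):
    """Return an ordered list of (method, path, body) tuples.

    Assumed ordering: shrink with zero replicas, force merge the single
    copy, and only then raise the replica count so that the merged
    segments are copied to other nodes once.
    """
    return [
        ("POST", f"/{source}/_shrink/{target}",
         {"settings": {"index.number_of_replicas": 0}}),
        ("POST", f"/{target}/_forcemerge?max_num_segments={max_num_segments}",
         None),
        ("PUT", f"/{target}/_settings",
         {"index": {"number_of_replicas": original_replicas}}),
    ]


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    for method, path, body in combined_plan("logs-000001",
                                            "shrink-logs-000001", 1):
        print(method, path, body)
```

The key property is the position of the force merge: it runs after the zero-replica shrink and before the replica count is increased.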

If we combine both of these behaviors, the new behavior looks like:

Here is a before picture: 71903033-F5B5-48DA-AD30-2DB01F26D696

And here is an after picture: 4D545A4E-3157-4B00-A796-AD4F6709E755

In both examples I treated the single node allocation rule (where ILM has to get a copy of each shard on the same node) as "smart" and not sending any data across zones. Still, this step is tedious, and it would be nice if we could skip it.

elasticmachine commented 3 years ago

Pinging @elastic/es-core-features (Team:Core/Features)

gaobinlong commented 3 years ago

@dakrone, can I work on this issue? I'm a heavy user of ILM and want to contribute more to the feature.

dakrone commented 3 years ago

@gaobinlong I appreciate the interest! For this one though, I think we should hold off. I'm not sure yet the best way to implement this, whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps to be optimized.

gaobinlong commented 3 years ago

@dakrone thanks for your reply. I will keep track of this issue and follow the development of ILM.

jpountz commented 2 years ago

whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps to be optimized

Maybe one argument for the latter is that we would likely want to also optimize the forcemerge + shrink + searchable_snapshot workflow to replace the step that increases the number of replicas of the shrunken index with taking a snapshot and doing a snapshot recovery?

dakrone commented 2 years ago

@jpountz yes with a logical plan we could re-order, elide, or enhance actions to make more combinations of actions efficient.

jpountz commented 2 years ago

In addition to the DTS costs, there is another aspect of this proposal that I like a lot, which is the fact that we would reduce the CPU cost of the forcemerge operation by 2x since it would run on a single shard copy.
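As a rough illustration of where the 2x figure comes from, here is a small sketch that counts how many shard copies run the force merge. It assumes every copy pays the same merge cost and that the original index has one replica; both function names are hypothetical.

```python
def forcemerge_runs(replicas: int) -> int:
    """Number of shard copies that each perform the force merge:
    the primary plus every replica."""
    return 1 + replicas


def cpu_reduction_factor(original_replicas: int) -> float:
    """Ratio of merge work before (all copies merge) to after
    (merge runs while the index has zero replicas)."""
    before = forcemerge_runs(original_replicas)
    after = forcemerge_runs(0)
    return before / after


if __name__ == "__main__":
    print(cpu_reduction_factor(1))
```

Under these assumptions, an index with more replicas would see a proportionally larger reduction.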

This would be a win on its own, plus we could then have more discussions about shifting some of the CPU cost from natural merges to forced merges, e.g.

VimCommando commented 2 years ago

There is a related discussion in "Can we avoid force-merging all shard copies?"