elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Suboptimal shard allocation with imbalanced shards #17213

Open bobrik opened 8 years ago

bobrik commented 8 years ago

Elasticsearch version:

# elasticsearch --version
Version: 2.2.0, Build: 8ff36d1/2016-01-27T13:32:39Z, JVM: 1.8.0_72-internal

JVM version:

# java -version
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)

OS version: Debian Jessie on kernel 4.1.3.

Description of the problem including expected versus actual behavior:

Elasticsearch does not try to spread shards for new indices equally if one node has fewer shards than the others.

Elasticsearch on one of 12 machines died early in the morning (UTC) and caused several indices to go red, even though each index has 1 replica. All of the red indices were today's indices. That is a separate issue; this one is about allocation. I unmounted the bad disk on the unlucky node and restarted it, roughly 6 hours after the incident. The indices did not recover.

Since there wasn't much to do at this point, I dropped the red indices to let them refill from Kafka. At the same time Elasticsearch started its recovery procedure to rebalance shards evenly. This happened for the new indices:

[screenshot: shard allocation for the new indices]

This is how the node list looked:

[screenshot: node list]

It seems like Elasticsearch ignored the fact that it was going to rebalance shards anyway and allocated more shards to the empty node. This caused much higher load on that node and made it a bottleneck.

Another issue is that all shards disappeared from the rejoined node. I think the following happened:

  1. After the incident Elasticsearch recovered the available indices to have 2 copies on healthy nodes.
  2. After the bad node rejoined, the same copies were removed from it since they were redundant.
  3. Since the bad node had fewer shards now, it started recovering inactive shards from old indices that had just been removed from it.

This looks suboptimal if my reading is right. This could be #17019 again, since I run optimize and flush every night and the synced flush happened after the disk failure.

ywelsch commented 8 years ago

The balancer currently has separate steps for: 1) allocating unassigned shards (this happens when a new index is created) and 2) rebalancing shards in the cluster. This means that if an empty node joins the cluster it will receive a large share of the new shards in step 1 while also receiving shards that come in due to rebalancing in step 2. This results in increased load on that node if the shards in 1) are hot (see also the discussion in #12279 and the solutions discussed there). Can you confirm that the newly created shards are indeed hot shards (get more load due to heavy indexing etc.)?

Note that as an initial step towards integrating steps 1 and 2 better, I've recently removed these separate steps at the interface level of the balancer (they were separate methods) so that we have the flexibility to have them collaborate better at the implementation level (#17028).

Our solution to the second issue is currently the use of index.unassigned.node_left.delayed_timeout (see https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html). How long was the bad node down? If it is down for a longer time than the timeout AND we have again 2 copies on healthy nodes, we indeed remove the old shard copies from the node even though we might in the future balance one of the copies to that node. Accounting for this specific scenario would add quite some complexity to the balancer. In the presence of throttling, this could also introduce new issues (e.g. wasting space on the new node when cluster is running tight on disk space).
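
For reference, this is a dynamic index setting; raising it ahead of planned maintenance looks roughly like the following, where the 6h value and the _all target are just placeholders:

curl -XPUT 'localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "6h"
  }
}'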

bobrik commented 8 years ago

Can you confirm that the newly created shards are indeed hot shards (get more load due to heavy indexing etc.)?

Yes, 100% of ingestion happens on new indices, since they are rotated daily.

index.unassigned.node_left.delayed_timeout is set to 20m and the node was down for a few hours. I understand that accounting for rebalancing introduces more complexity, but copying terabytes of data when you don't need to copy anything is not something you want to see.

Not sure how you could waste space with throttling. My idea is that you can remove extra copies, but avoid doing that on only one node. Since you remove extra copies anyway, how come you waste more space? It's the opposite: you free space equally on each node instead of freeing it all on one node.

Imagine a cluster with 3 nodes and completely idle indices. I kill one node and let the cluster go green again. Once it's green, I bring the killed node back.

Expected: an equal number of extra shards removed from each node, no rebalancing.
Actual: everything is removed from the rejoined node and rebalancing kicks in.

Does this make sense?

bobrik commented 8 years ago

To give you an idea of how bad this is for indexing, all shards were moved away from the fresh node at 12:30:

[graph: per-partition Kafka consumption rate]

The Y axis is messages per second consumed from each Kafka partition.

ywelsch commented 8 years ago

I understand your suggestion of no-copy balancing. In terms of implementation, we will have to investigate how much complexity this adds. The biggest impact on ingestion here seems to be the first issue though (too many hot shards on one node). To address this in the short term I suggest pre-allocating indices.

nik9000 commented 8 years ago

Another workaround is to set total_shards_per_node. You have to be careful with it because it is a rule rather than a suggestion, but it is useful. If the hot shards aren't all in one index, total_shards_per_node will help some, but isn't perfect there. I have it on my list to file an issue about getting it (or something like it) to work across indices.
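
For illustration, the setting is applied per index (or via an index template for daily indices); the index name and limit below are placeholders:

curl -XPUT 'localhost:9200/logs-2016.03.21/_settings' -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.total_shards_per_node": 2
}'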

DaveCTurner commented 6 years ago

@ywelsch and I discussed this and believe a possible way forward would be to add a user-configurable (and dynamically updatable) index setting shard_weight together with an allocation decider that places an upper bound on the total weight of shards that can be allocated to each node.

The expected usage pattern with daily indices would be for new indices to carry a high shard weight, reflecting the fact that they see a lot of indexing traffic while they are "current". Once they cease being current their weight is updated to something much lower, reflecting the fact that each node can support many more non-current indices.

Note that we do not propose balancing shard weights: this is just about putting an upper bound on the total shard weight on each node. This effectively generalises the total_shards_per_node setting to account for indices with different usage patterns, such as the case with daily indices.
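
To make the usage pattern concrete, a hypothetical sketch follows. Note that the shard_weight setting is only a proposal at this point and does not exist in any released version; the index name, setting key and values are illustrative only:

# hypothetical: mark today's index as hot while it receives all the indexing traffic
curl -XPUT 'localhost:9200/logs-2018.06.14/_settings' -H 'Content-Type: application/json' -d '{
  "index.shard_weight": 10
}'

# hypothetical: after rollover, lower the weight so many non-current indices can share a node
curl -XPUT 'localhost:9200/logs-2018.06.14/_settings' -H 'Content-Type: application/json' -d '{
  "index.shard_weight": 1
}'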

DaveCTurner commented 6 years ago

The @elastic/es-distributed team discussed this and had no major objections. Points raised:

DaveCTurner commented 6 years ago

I think https://discuss.elastic.co/t/shard-relocation-storms-when-cluster-disk-low/136385 would be helped by this idea.

vigyasharma commented 5 years ago

@ywelsch, @DaveCTurner -- Will shard-resource-cost have a default value for new shards, or is it an optional setting? Will it also be used for the weight function?

It is very difficult to predict the resource footprint of a shard. Assign too high a value and we declare an underutilized node exhausted; keep it too low and we overwhelm our nodes. Moreover, these values must change with indexing rates, lifecycle etc., so they need to be updated as part of index lifecycle management, which is another maintenance overhead.

I would like to understand how common it is in the Elasticsearch community to set balance.shard, balance.index and balance.threshold. In my experience, these are high stakes values and very hard to get right.

It seems more feasible to compute a shard's resource footprint by monitoring shard-level metrics from within the cluster itself. If the cluster can observe each shard's CPU, JVM heap etc. via InternalClusterInfoService, these values can be used for balancing as well. A balancing function based only on shard counts is limiting; it assumes similar resource use across all shards. Using the actual resource footprint could be more effective.

Multi dimensional weight functions

Having a multi-dimensional weight function (RAM, CPU, IOPS etc.) is a good approach, but it needs to have the right relative weights and value ranges for each dimension.

One problem with the current weight function is that node.numShards() can be in the hundreds while node.numShards(index) is only in the single digits, making it impossible for indexBalance to offset shardBalance. The complexity gets worse as more dimensions are added. Possible solutions include adding non-linear features to the function and normalizing the features, but that needs training on some data and the model becomes very use-case specific.
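
For context, the weight function in BalancedShardsAllocator being discussed is, in simplified form (per the source of the versions discussed here):

weight(node, index) =   shardBalance / (shardBalance + indexBalance) * (node.numShards() - avgShardsPerNode)
                      + indexBalance / (shardBalance + indexBalance) * (node.numShards(index) - avgShardsPerNode(index))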

Having a single shard-resource-weight variable is more viable. But instead of accepting an initial value (prediction), it should be computed by observing shard metrics. It could just be the most significant metric for resource use (heap used?).

This means moving away from proxy metrics for balancing like shard count. The problem is not too many shards on one node, it is too many hot shards on one node. A resource-use based balancer takes care of this.

Advanced users could monitor shard resource use across multiple dimensions, convert it into a final single weight value that is tuned to their usage pattern, and supply it to shards via an interface.

Why do we need initial shard weights (prediction)?

The whole requirement of predicting shard weights before they are created comes from the tight coupling between initial unassigned shard allocation, and shard rebalance (and move). The current count-based balancing is simple in the sense that the count is intrinsically present for a new shard (each shard adds count=1). It is accurate and doesn't change over the life of a shard. But count is not a resource we run out of; it is a proxy that maps to resource use. That works for homogeneous shards, but fails to account for dissimilar shards.

This is solved by using actual resource-use based balancing; but now the metric value is 0 when a shard is created. Allocating on the node with minimum weight will just put all shards on one node. Hence, we need to break this tight coupling.

This can be done by using random allocation for initial unassigned shards while ensuring that shards of the same index get spread across nodes. Any randomness introduces an unpleasant element of reduced predictability, but it is similar to the current implementation when all nodes are perfectly balanced. Random initial allocation eliminates the need to predict values, which is a valuable gain.

Resource Limit Rules

It makes sense to have upper bounds on a node's resource utilization for cpu, ram etc. Similar to disk watermarks, these should be triggered on actual resource usage and set to high thresholds.

DaveCTurner commented 5 years ago

It is very difficult to predict the resource footprint of a shard.

In due course we may be able to work towards automating this process, but I want to clarify that this idea is out of the scope of this proposal right now. Our experience is that in most use cases the resource costs of each shard are predictable enough that we do not need to automate these predictions to add value.

But instead of accepting an initial value (prediction), it should be computed by observing shard metrics.

Possibly, but again I want to clarify that this idea is out of the scope of this proposal right now.

The whole requirement of predicting shard weights before they are created comes from the tight coupling between initial unassigned shard allocation, and shard rebalance (and move)

This is not true. Relocating a misallocated shard is necessarily time-consuming and resource-intensive and we must account carefully for this problem in whatever solution we settle on.

DaveWHarvey commented 5 years ago

I have a different problem I'm trying to solve, but I see a possible tweak to what is being discussed here as a solution. On Kubernetes, a node/pod may want to terminate because of a rolling upgrade or because the underlying worker node is being replaced. The pod gets a "pre stop hook", and we have coded this to add the node to the exclusion list (and to periodically retry that in case another node also did so) and then wait for shards to drain. The issue we have is that this is extremely aggressive: even though each node only allows 2 shard relocations, we end up with 40 relocations in parallel for one node, because replicas on other nodes can be the data source. For this use case, we would prefer the shards to migrate from the node using less aggressive and controllable rebalancing rules. In the above discussion there is the concept of unequal nodes, where some nodes should get more shards. I'd like the ability to simply say that a particular node should not get new shards, and to rebalance shards away from it.

With the exclusion, the second node replaced in a rolling upgrade moves "2" of its shards to the first node replaced and a large number of shards to nodes that are due to be replaced, i.e., to places where those shards will need to be moved again. That is 2 relocations per shard, when only 1 relocation per shard is required after the first node is replaced.
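
For context, the exclusion described above uses standard allocation filtering, roughly like this (the node name is a placeholder):

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node-being-drained"
  }
}'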

DaveCTurner commented 5 years ago

The issue we have is that this is extremely aggressive: even though each node only allows 2 shard relocations, we end up with 40 relocations in parallel for one node, because replicas on other nodes can be the data source.

@DaveWHarvey you're right that relocations are really copies from the primary to the new location, and the concurrent recovery limit is enforced on a node-by-node basis, so you might get up to 2*#nodes concurrent recoveries at any time, but I don't understand why this is an issue for you. Each node should only see 2 concurrent recoveries, and that's what I expect to be the bottleneck. What are the consequences of this that cause you problems?
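
The per-node limit referred to here comes from cluster.routing.allocation.node_concurrent_recoveries, which defaults to 2 and can be adjusted dynamically, for example:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}'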

With the exclusion, the second node replaced in a rolling upgrade moves "2" of its shards to the first node replaced and a large number of shards to nodes that are due to be replaced

Do you mean that the "first node replaced" is empty and you're expecting Elasticsearch to move all the shards from the "second node replaced" onto this node? I'm not completely following the process you're describing. It isn't really a rolling upgrade because in a rolling upgrade nodes are upgraded in place with allocation disabled, so there shouldn't really be any shard movement.

DaveWHarvey commented 5 years ago

@DaveCTurner A rolling upgrade could be implemented by disabling shard allocation or by moving shards. The former creates a high risk of data loss if there is 1 replica, requires synchronization to ensure only one node is acting at a time (but I need a StatefulSet per AZ to support multiple availability zones), will cause transient search failures when the node goes down, and does not work for cases where the node configuration is changed, e.g., a larger disk is used. We want to be able to do arbitrary automated upgrades at arbitrary times without HA risk and without errors, and we really are unconcerned about the duration of the upgrade.

There are the following issues with the aggressive shard recovery that we see:

1) If you are running the Kubernetes cluster autoscaler and you spin up a new cluster, the odds are that even if you have specified anti-affinity, you get a lot of pods on the same worker node. We started looking at this issue because many of those 40 concurrent copies were on the same worker node, and we saturated the EC2 instance's network bandwidth.
2) The aggressive shard recovery pointlessly moves shards to the nodes with the highest shard count, when those nodes will later be evacuated, and it has caused those nodes to hit watermarks in one case.
3) For us, a search is two steps: real-time indexing of a small number of use-once documents, followed by a refresh and then a search across those documents and the main index. In 2.3 we used the separate thread pools for bulk and individual indexing to prioritize the real-time part, but it is a single thread pool in 6.5. We see much higher write queue rejections when more shards are moving. Perhaps you are correct that we would only spread these out in time because the per-node copies are limited, except for the fact that shards are being moved twice rather than once.

ppf2 commented 5 years ago

The expected usage pattern with daily indices would be for new indices to carry a high shard weight, reflecting the fact that they see a lot of indexing traffic while they are "current". Once they cease being current their weight is updated to something much lower, reflecting the fact that each node can support many more non-current indices.

+1 on the above suggestion to avoid situations like this in a production environment with very high indexing traffic causing a major hotspot situation when all shards for the current time-based index end up on the same node. Certainly, the workaround is to use total_shards_per_node (with tradeoffs):

[screenshot]

I also want to add that while total_shards_per_node is widely used to avoid this shard allocation issue today, apart from the potential of having unassigned shards (well known tradeoff), total_shards_per_node doesn't work well with ILM.

In order for ILM's shrink action to work, it needs to move a copy of every shard of the index to a single node, which is not possible unless total_shards_per_node is removed from the index before the shrink action. This makes it challenging to manage indices using ILM when the index in the hot phase has total_shards_per_node set, but the setting has to be removed before the index hits the shrink action in the warm phase (and added back post-shrink); ILM does not provide an action today to change other index-level settings like total_shards_per_node. @jakelandis
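
Today that removal has to happen out of band before the shrink action runs; setting the value to null resets it, for example (the index name is a placeholder):

curl -XPUT 'localhost:9200/logs-2019.07.01/_settings' -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.total_shards_per_node": null
}'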

vigyasharma commented 5 years ago

very high indexing traffic causing a major hotspot situation when all shards for the current time-based index end up on the same node.

This is a common problem arising due to how the weight function is structured.

Certainly, the workaround is to use total_shards_per_node (with tradeoffs):

@ppf2 The change proposed in #43350 should enable shard allocation to handle this implicitly, without the need for the total_shards_per_node setting.

seang-es commented 4 years ago

This is becoming more of a problem with the reduced shard counts we're starting to see in the field. The total_shards_per_node workaround completely falls apart if you have 40 single-shard indices that are all being actively written and you add a couple new nodes to your cluster.

Prior to ILM, if you timed this properly you could get things fairly well balanced before a daily rollover time hit for a standard logging use case. With ILM, we can have rollovers happening at any time, and if enough happen while a new node has not yet reached shard parity with the other hot nodes, the new node will get crushed with load and impair the entire cluster.

Either way, we are hitting a situation where a user may have a heavily loaded cluster, and in attempting to add capacity to address it, they will make things much worse for the immediate time after making the change.

Another nice-to-have feature here would be an automatic adjustment of the concurrency and bandwidth for the new nodes, since they will not be performing much active ingestion (if the shard weight is implemented as described above) and should be able to support more I/O than they ordinarily would as we attempt to get them to shard parity.
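
Until something automatic exists, the closest manual knobs are the cluster-wide recovery settings, which can be raised temporarily while the new nodes catch up (the values below are placeholders, not recommendations):

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "200mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}'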

steevo590 commented 4 years ago

One thing that I'm not sure has been discussed here, and that I think is relevant, is disk space on the nodes, especially when a node has multiple volumes presented as separate data paths. One issue I've run into is that there is plenty of disk space on the node as a whole, but one volume trips the low watermark and a shard always has to be moved off to another node. A different and quicker solution could be to move the shard internally to another volume on the same node. I'd expect this to resolve watermark breaches much faster and lead to more effective balancing. The same sort of thing could be done for disk IOPS/write rejections as well to improve performance.

tnycum commented 4 years ago

I've also been seeing this issue recently on my cluster where we have been pulling blocks of nodes in and out of the cluster to perform hardware upgrades. We drain all of the shards from the nodes prior to removing them from the cluster, and when they are reintroduced, the cluster will favor this small subset of nodes for new shard allocation. These nodes then get overwhelmed, which slows down indexing.

The only decent solution I could think of for my scenario was adding an index template matching our hourly indices for the next few days that excludes these nodes from shard allocation.
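
That workaround looks roughly like the following with the legacy template API (the template name, index pattern and node names are placeholders):

curl -XPUT 'localhost:9200/_template/exclude-fresh-nodes' -H 'Content-Type: application/json' -d '{
  "index_patterns": ["myindex-*"],
  "order": 10,
  "settings": {
    "index.routing.allocation.exclude._name": "node-07,node-08"
  }
}'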

It would be great if there were a setting to prevent these nodes from getting new shards allocated to them while they are still rebalancing. Or, if we could dynamically set node attributes (sadly the _node/settings API doesn't allow PUTs), I could try to do this myself by adding a temporary node.attr on these low-shard-count nodes, with a global index template that would prevent shard placement on nodes with this attribute. Then, once the nodes are rebalanced, I could remove the node.attr. This strategy is hard to use today because it requires editing elasticsearch.yml and doing a service restart.

DavidWHarvey commented 4 years ago

You are probably better off manipulating the indices' total_shards_per_node setting. Perhaps a cron job that counts candidate nodes, accounts for possible node failures, reads the primary shard count, and sets this setting. This is only going to be effective with more shards per index than total nodes. You could use exclusion to create subsets of nodes.
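
A minimal sketch of such a cron job, assuming curl and jq are available; the endpoint, index name and the one-node-failure headroom are placeholders:

#!/usr/bin/env bash
# Sketch: recompute index.routing.allocation.total_shards_per_node for one index
# from the current data node count, leaving headroom for one node failure.
ES=http://localhost:9200
INDEX=myindex-2020.11.20   # placeholder index name

data_nodes=$(curl -s "$ES/_cat/nodes?h=node.role" | grep -c d)
usable_nodes=$((data_nodes - 1))   # allow one node to fail

primaries=$(curl -s "$ES/$INDEX/_settings?flat_settings=true" | jq -r ".[\"$INDEX\"].settings[\"index.number_of_shards\"]")
replicas=$(curl -s "$ES/$INDEX/_settings?flat_settings=true" | jq -r ".[\"$INDEX\"].settings[\"index.number_of_replicas\"]")
total=$((primaries * (replicas + 1)))

# ceil(total / usable_nodes)
limit=$(( (total + usable_nodes - 1) / usable_nodes ))

curl -s -XPUT "$ES/$INDEX/_settings" -H 'Content-Type: application/json' \
  -d "{\"index.routing.allocation.total_shards_per_node\": $limit}"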

tnycum commented 4 years ago

I forgot to mention that I do have total_shards_per_node=1 on these indices, so the shards do get spread out for a single index. The issue is when highly active shards from different indices land on the same node. These different indices are generated with the same name other than the yyyyddmmhh tacked on to the end, which makes it hard to fence them off from each other using index templates or node attributes.

XANi commented 3 years ago

Balancing also breaks hilariously when there is a huge discrepancy between shard sizes in different indices.

We have a cluster with a bunch of indices that are very small (~100 MB), a few in the single-digit GB range that are pretty busy (and so have many shards), and one big one (with 50 GB shards), and the result of automatic rebalancing is an absolute disaster.

The average number of shards per node is ~160. One node has 12, most of them from the "big" index. It is also completely full (just at the edge of the low watermark) while others have hundreds of gigs of free space. This doesn't make load distribution any good either.

shards disk.indices disk.used disk.avail disk.total disk.percent
   163        154gb   154.8gb       95gb    249.9gb           61
   163      152.9gb   154.1gb     59.7gb    213.9gb           72
    12      332.9gb   335.4gb     65.5gb    400.9gb           83
   163      154.8gb   156.5gb     93.3gb    249.9gb           62
   163      157.2gb   158.5gb    241.3gb    399.9gb           39
   163      153.4gb   154.5gb    135.3gb    289.9gb           53
   163      158.9gb   160.1gb    179.8gb    339.9gb           47
   163      158.6gb   159.6gb     90.2gb    249.9gb           63
   162      154.7gb   157.5gb    122.3gb    279.9gb           56
   163      155.6gb   157.4gb    132.4gb    289.9gb           54

bra-fsn commented 3 years ago

Balancing also breaks hilariously when there is a huge discrepancy between shard sizes in different indices.

To fight this, I turned automatic shard balancing off long ago and run a script which tries to do this right. Sadly, when one of the nodes leaves the cluster and rejoins, the reshuffling (which is not aware of what's important here) causes massive problems. It would be nice to have either an easily scriptable allocator and/or an external API (e.g. the elected master could ask a REST service where to put its shards), so this could be implemented easily by anyone in their preferred environment.
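
For anyone who wants to do the same: turning off automatic rebalancing (while still allowing unassigned shards to be allocated) is done with something like:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}'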

XANi commented 3 years ago

In my case, kicking a few of the big shards off the node made it reasonable... for now. But an allocator that only considers shard count is almost useless with a big spread in both shard size and how often a given index is used.

ViggoC commented 2 years ago

Disk usage is a slowly changing metric compared to CPU usage. Would it be possible to add this metric first, to solve the problems caused by huge differences in shard size?