Open bobrik opened 8 years ago
The balancer currently has separate steps for: 1) allocating unassigned shards (this happens when a new index is created) and 2) rebalancing shards in the cluster. This means that if an empty node joins the cluster it will receive a large share of the new shards in step 1 while also receiving shards that come in due to rebalancing in step 2. This results in increased load on that node if the shards from step 1 are hot (see also the discussion and solutions in #12279). Can you confirm that the newly created shards are indeed hot shards (i.e. get more load due to heavy indexing etc.)?
Note that as an initial step towards integrating steps 1 and 2 better, I've recently removed these separate steps on the interface level of the balancer (which were separate methods) so that we have the flexibility to have them collaborate better on the implementation level #17028.
Our solution to the second issue is currently the use of index.unassigned.node_left.delayed_timeout (see https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html). How long was the bad node down? If it is down for longer than the timeout AND we again have 2 copies on healthy nodes, we indeed remove the old shard copies from the node, even though we might later balance one of the copies back to that node. Accounting for this specific scenario would add quite some complexity to the balancer. In the presence of throttling, this could also introduce new issues (e.g. wasting space on the new node when the cluster is running tight on disk space).
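The delayed-allocation timeout mentioned here is a dynamic index setting. As a minimal sketch (assuming only the setting name from the linked docs; the "6h" value is purely illustrative for a multi-hour outage, not a recommendation), the request body would look like:

```python
import json

# Illustrative settings body for the documented dynamic setting
# index.unassigned.node_left.delayed_timeout; "6h" is an example value.
settings_body = {
    "settings": {
        "index.unassigned.node_left.delayed_timeout": "6h"
    }
}

# This body would be PUT to /_all/_settings (or a specific index) so
# that shard copies on a briefly absent node are not deleted and
# re-replicated before the node returns.
payload = json.dumps(settings_body)
```

Raising the timeout above the expected outage duration avoids both the wasted copies and the subsequent rebalancing described above.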
Can you confirm that the newly created shards are indeed hot shards (get more load due to heavy indexing etc.)?
Yes, 100% ingestion happens on new indices since they are rotated daily.
index.unassigned.node_left.delayed_timeout is set to 20m, node was down for a few hours. I understand that accounting for rebalancing introduces more complexity, but copying terabytes of data when you don't need to copy anything is not something you want to see.
Not sure how you could waste space with throttling. My idea is that you can remove extra copies, but avoid doing that on only one node. Since you remove extra copies anyway, how come you waste more space? It's the opposite: you free space equally on each node instead of freeing it all on one node.
Imagine a cluster with 3 nodes and completely idle indices. I kill one node and let the cluster go green again. Once it's green, I start the killed node back up.
Expected: equal number of extra shards removed from each node, no rebalancing. Actual: everything is removed from rejoined node and rebalancing kicks in.
Does this make sense?
To give you an idea how bad it is for indexing, all shards were moved away from the fresh node at 12:30:
Y axis is messages per second consumed from each kafka partition.
I understand your suggestion of no-copy balancing. In terms of implementation, we will have to investigate how much complexity this adds. The biggest impact on ingestion here seems to be first issue though (too many hot shards on one node). To address this in the short-term I suggest pre-allocating indices.
Another work around is to set total_shards_per_node. You have to be careful with it because it is a rule rather than a suggestion, but it is useful. If the hot shards aren't all one index, well, total_shards_per_node will help some, but isn't perfect there. I have on my list to file an issue about getting it (or something like it) to work across indexes.
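For reference, the workaround above is a per-index routing setting; a minimal sketch (the value 2 is illustrative):

```python
# Sketch of the total_shards_per_node workaround: a per-index routing
# setting capping how many shards of one index a single node may hold.
# It is a hard rule, not a hint: if it cannot be satisfied, shards stay
# unassigned rather than violating it. The value here is illustrative.
hot_index_settings = {
    "index.routing.allocation.total_shards_per_node": 2
}
```

Because it is a hard rule, the cap has to be chosen with enough headroom that a node failure does not leave shards permanently unassigned.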
@ywelsch and I discussed this and believe a possible way forward would be to add a user-configurable (and dynamically updatable) index setting shard_weight, together with an allocation decider that places an upper bound on the total weight of shards that can be allocated to each node.
The expected usage pattern with daily indices would be for new indices to carry a high shard weight, reflecting the fact that they see a lot of indexing traffic while they are "current". Once they cease being current their weight is updated to something much lower, reflecting the fact that each node can support many more non-current indices.
Note that we do not propose balancing shard weights: this is just about putting an upper bound on the total shard weight on each node. This effectively generalises the total_shards_per_node setting to account for indices with different usage patterns, such as the case with daily indices.
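The proposed decider could be sketched as follows. This is entirely hypothetical: shard_weight and the budget below are names from this proposal, not existing Elasticsearch settings.

```python
# Hypothetical sketch of the proposed decider: each shard carries a
# user-assigned weight, and a node rejects new shards once the sum of
# hosted weights would exceed a per-node budget. No balancing is done
# on weights; this is purely an upper bound.
def can_allocate(shard_weight, node_assigned_weights, node_budget):
    return sum(node_assigned_weights) + shard_weight <= node_budget

# Daily-index pattern: today's "hot" shards weigh 10, older shards 1.
# With a budget of 30, a node holding one hot shard and ten cold ones
# (10 + 10*1 = 20) can still take one more hot shard, but not a third.
assert can_allocate(10, [10] + [1] * 10, 30)
assert not can_allocate(10, [10, 10] + [1] * 10, 30)
```

This generalises total_shards_per_node: setting every weight to 1 and the budget to N reproduces a cap of N shards per node.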
The @elastic/es-distributed team discussed this and had no major objections. Points raised:
A better name would be shard_resource_cost rather than shard_weight, since "weight" has balancing connotations that we want to avoid.
Multi-dimensional costs would help advanced users: being able to separately control RAM, CPU, IOPS, etc. is something that's come up before. This could be combined with numeric node attributes so that the limit could vary on a node-by-node basis. Although this'd be more complex to explain and administer, there were no explicit objections in the discussion.
Like total_shards_per_node, this won't play nicely with auto-expanding replicas: as currently implemented, we may expand an index and then be unable to allocate the resulting replicas because of resource constraints. It's hoped that Zen 2 will allow us to perform calculations on the cluster state at the right place in the pipeline to improve this issue.
It will be used for balancing, even though that's not the goal.
A hard limit could lead to a red cluster on index creation, either because of misconfiguration or because of resource exhaustion. The latter is the point of this, but the former is something to try and help users avoid. It'd be good to be able to tell when the limit is approaching in advance of hitting it so that evasive action can be taken.
I think https://discuss.elastic.co/t/shard-relocation-storms-when-cluster-disk-low/136385 would be helped by this idea.
@ywelsch, @DaveCTurner -- Will shard-resource-cost have a default value for new shards, or is it an optional setting? Will it also be used for the weight function?
It is very difficult to predict the resource footprint of a shard. Set the value too high and we declare an underutilized node exhausted; set it too low and we overwhelm our nodes. Moreover, these values must change with indexing rates, lifecycle etc., so the setting needs to be updated as part of index lifecycle management, which is another maintenance overhead.
I would like to understand how common it is in the Elasticsearch community to set balance.shard, balance.index and balance.threshold. In my experience, these are high-stakes values and very hard to get right.
It seems more feasible to compute a shard's resource footprint by monitoring shard-level metrics from within the cluster itself. If the cluster can observe each shard's CPU, JVM heap etc. (via InternalClusterInfoService), these values can be used for balancing as well. A balancing function based only on shard counts is limiting; it assumes similar resource use across all shards. Using the actual resource footprint could be more effective.
Having a multi-dimensional weight function (RAM, CPU, IOPS etc.) is a good approach, but it needs to have the right relative weights and value ranges for each dimension.
One problem with the current weight function is that node.numShards() can be in the hundreds while node.numShards(index) is only a single-digit value, making it impossible for indexBalance to offset shardBalance. This is exacerbated by adding more dimensions.
Some solutions to this include having non-linear features in the function and normalizing your features; but it needs training on some data and the model becomes very use case specific.
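The scale mismatch can be seen in a simplified sketch of the balancer's weight function (modeled loosely on BalancedShardsAllocator; the 0.45/0.55 defaults correspond to balance.shard and balance.index, everything else is simplified):

```python
# Simplified sketch of the shape of the balancer's weight function.
# Each term is the node's deviation from the cluster average, combined
# with normalized balance factors. Because the total-shard term ranges
# over hundreds while the per-index term is single digits, the count
# term tends to dominate.
def weight(num_shards_node, avg_shards_per_node,
           num_index_shards_node, avg_index_shards_per_node,
           shard_balance=0.45, index_balance=0.55):
    total = shard_balance + index_balance
    theta_shard = shard_balance / total
    theta_index = index_balance / total
    return (theta_shard * (num_shards_node - avg_shards_per_node)
            + theta_index * (num_index_shards_node - avg_index_shards_per_node))

# A node 20 shards above the overall average but 2 below average for a
# hot index still looks "heavy" (positive weight): 0.45*20 - 0.55*2 = 7.9.
w = weight(180, 160, 0, 2)
```

So a node starved of hot-index shards can still be passed over because its raw shard count is high, which is exactly the single-digit-versus-hundreds problem described above.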
Having a single shard-resource-weight variable is more viable. But instead of accepting an initial value (a prediction), it should be computed by observing shard metrics. It could just be the most significant metric for resource use (heap used?).
This means moving away from proxy metrics for balancing like shard count. The problem is not too many shards on one node, it is too many hot shards on one node. A resource-use based balancer takes care of this.
Advanced users could monitor shard resource use across multiple dimensions, convert it into a final single weight value that is tuned to their usage pattern, and supply it to shards via an interface.
The whole requirement of predicting shard weights before shards are created comes from the tight coupling between initial allocation of unassigned shards and shard rebalancing (and movement). The current count-based balance is simple in the sense that count is intrinsically present for a new shard (each shard adds count=1). It is accurate and doesn't change over the life of a shard. But count is not a resource we run out of; it is a proxy that maps to resource use. It works for homogeneous shards but fails to account for dissimilar shards.
This is solved by using actual resource-use based balancing; but now the metric value is 0 when a shard is created. Allocating on the node with minimum weight will just put all shards on one node. Hence, we need to break this tight coupling.
This can be done by using random allocation for initial unassigned shards while ensuring that shards of the same index get spread across nodes. Any randomness introduces an unpleasant element of reduced predictability, but it is similar to the current behavior when all nodes are perfectly balanced. Random initial allocation eliminates the need to predict values, which is a valuable gain.
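The random-but-spread initial allocation proposed here could be sketched like this (purely illustrative; none of these names exist in Elasticsearch):

```python
import random

# Sketch of the proposal: place brand-new shards with no history by
# preferring nodes holding the fewest shards of the same index, breaking
# ties randomly. This spreads same-index shards while avoiding any
# dependence on predicted resource weights.
def random_spread_allocation(shard_ids, nodes, seed=None):
    rng = random.Random(seed)
    per_node = {n: 0 for n in nodes}  # same-index shard count per node
    placement = {}
    for shard in shard_ids:
        low = min(per_node.values())
        candidates = [n for n, c in per_node.items() if c == low]
        chosen = rng.choice(candidates)  # random tie-break
        per_node[chosen] += 1
        placement[shard] = chosen
    return placement

# 6 shards of one index over 3 nodes always end up 2 per node,
# whichever random choices are made.
p = random_spread_allocation(range(6), ["a", "b", "c"], seed=7)
```

The observed-metrics balancer would then only move shards later, once real resource numbers exist, which breaks the coupling between initial allocation and rebalancing.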
It makes sense to have upper bounds on a node's resource utilization for CPU, RAM etc. Similar to disk watermarks, these should be triggered on actual resource usage and set to high thresholds.
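Such watermark-style caps on observed usage could look like this (an assumption extending the disk-watermark analogy above; the names and thresholds are invented for illustration):

```python
# Hypothetical watermark-style caps on observed node resource usage,
# analogous to the existing disk watermarks: set high, and triggered on
# actual measurements rather than predicted shard weights.
HIGH_WATERMARKS = {"cpu": 0.90, "heap": 0.85}

def node_accepts_new_shards(usage):
    """usage: mapping of metric name -> observed fraction in [0, 1]."""
    return all(usage.get(metric, 0.0) < limit
               for metric, limit in HIGH_WATERMARKS.items())

assert node_accepts_new_shards({"cpu": 0.50, "heap": 0.60})
assert not node_accepts_new_shards({"cpu": 0.95, "heap": 0.60})
```

Keeping the thresholds high means the cap only fires on genuinely exhausted nodes, rather than steering everyday balancing.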
It is very difficult to predict the resource footprint of a shard.
In due course we may be able to work towards automating this process, but I want to clarify that this idea is out of the scope of this proposal right now. Our experience is that in most use cases the resource costs of each shard are predictable enough that we do not need to automate these predictions to add value.
But instead of accepting an initial value (prediction), it should be computed by observing shard metrics.
Possibly, but again I want to clarify that this idea is out of the scope of this proposal right now.
The whole requirement of predicting shard weights before they are created comes from the tight coupling between initial unassigned shard allocation, and shard rebalance (and move)
This is not true. Relocating a misallocated shard is necessarily time-consuming and resource-intensive and we must account carefully for this problem in whatever solution we settle on.
I have a different problem I'm trying to solve, but I see a possible tweak to what is being discussed here as a solution. On Kubernetes, a node/pod may want to terminate because of a rolling upgrade or because the underlying worker node is being replaced. The pod gets a "pre stop hook", and we have coded this to add the node to the exclusion list (and to periodically retry that in case another node also did that) and then wait for shards to drain. The issue we have is that this is extremely aggressive, i.e., even though each node only allows 2 shard relocations, we end up with 40 relocations in parallel for one node, because replicas on other nodes can be the data source. For this use case, we would prefer the shards to migrate from the node using less aggressive and controllable rebalancing rules. In the above discussion, there is the concept of unequal nodes, where some nodes should get more shards. I'd like the ability to simply say that a particular node should not get new shards, and to rebalance shards away.
With the exclusion, the second node replaced in a rolling upgrade moves "2" of its shards to the 1 node already replaced, and a large number of shards to nodes that are still due to be replaced, i.e., where those shards will need to be moved again: 2 relocations per shard, when only 1 relocation per shard is required after the first node is replaced.
The issue we have is that this is extremely aggressive, i.e., even though each node only allows 2 shard relocations, we end up with 40 relocations in parallel for one node, because replicas on other nodes can be the data source.
@DaveWHarvey you're right that relocations are really copies from the primary to the new location, and the concurrent recovery limit is enforced on a node-by-node basis, so you might get up to 2*#nodes concurrent recoveries at any time, but I don't understand why this is an issue for you. Each node should only see 2 concurrent recoveries, and that's what I expect to be the bottleneck. What are the consequences of this that cause you problems?
With the exclusion, the second node replaced in a rolling upgrade, moves "2" of its shards to the 1 node replaced, and a large number of shards to nodes that are due to be replaced
Do you mean that the "1 node replaced" is empty and you're expecting Elasticsearch to move all the shards from the "second node replaced" onto this node? I'm not completely following the process you're describing. It isn't really a rolling upgrade because in a rolling upgrade nodes are upgraded in place with allocation disabled, so there shouldn't really be any shard movement.
@DaveCTurner A rolling upgrade could be implemented by disabling shard allocation or by moving shards. The former creates a high risk of data loss if there is 1 replica, requires synchronization to ensure only one node is acting at a time (but I need a statefulSet per AZ to support multiple availability zones), will cause transient search failures when the node goes down, and does not work for cases where the node configuration is changed, e.g., a larger disk is used. We want to be able to do arbitrary automated upgrades at arbitrary times without HA risk and without errors, and we really are unconcerned about the duration of the upgrade.
There are these issues with the aggressive shard recovery that we see:
1) If you are running the kubernetes cluster auto-scaler, and you spin up a new cluster, the odds are that even if you have specified anti-affinity, you get a lot of pods on the same worker node. We started looking at this issue because many of those 40 concurrent copies were on the same worker node, and we saturated the EC2 instance's network bandwidth.
2) The aggressive shard recovery pointlessly moves shards to the nodes with the highest shard count, when those nodes will later be evacuated, and in one case it caused those nodes to hit watermarks.
3) For us, a search is two steps: real-time indexing of a small number of use-once documents, followed by a refresh and then a search across those documents and the main index. In 2.3, we used the separate thread pools for bulk and individual indexing to prioritize the real-time indexing, but it is one thread pool in 6.5. We see much higher write queue rejections when more shards are moving. Perhaps you are correct that we would only spread these out in time because the per-node copies are limited, except for the fact that shards are being moved twice rather than once.
The expected usage pattern with daily indices would be for new indices to carry a high shard weight, reflecting the fact that they see a lot of indexing traffic while they are "current". Once they cease being current their weight is updated to something much lower, reflecting the fact that each node can support many more non-current indices.
+1 on the above suggestion, to avoid situations like this in a production environment where very high indexing traffic causes a major hotspot when all shards for the current time-based index end up on the same node. Certainly, the workaround is to use the total-shards-per-node setting (with tradeoffs).
I also want to add that while total_shards_per_node is widely used to avoid this shard allocation issue today, apart from the potential of having unassigned shards (a well-known tradeoff), total_shards_per_node doesn't work well with ILM. In order for ILM's shrink action to work, it needs to move a copy of every shard of the index to a single node, which will not be possible unless total_shards_per_node is removed from the index before the shrink action. This makes it challenging to manage indices using ILM when the index in the hot phase has total_shards_per_node set, but the setting has to be removed before it hits the shrink action in the warm phase (and added back post-shrink); ILM does not provide an action today to change other index-level settings like total_shards_per_node. @jakelandis
very high indexing traffic causing a major hotspot situation when all shards for the current time-based index end up on the same node.
This is a common problem arising due to how the weight function is structured.
Certainly, the workaround is to use total number of shards (with tradeoffs):
@ppf2 The change proposed in #43350 should enable shard allocation to implicitly handle this without the need for total number of shards settings.
This is becoming more of a problem with the reduced shard counts we're starting to see in the field. The total_shards_per_node workaround completely falls apart if you have 40 single-shard indices that are all being actively written and you add a couple new nodes to your cluster.
Prior to ILM, if you timed this properly you could get things fairly well balanced before a daily rollover time hit for a standard logging use case. With ILM, we can have rollovers happening at any time, and if enough happen while a new node has not yet reached shard parity with the other hot nodes, the new node will get crushed with load and impair the entire cluster.
Either way, we are hitting a situation where a user may have a heavily loaded cluster, and in attempting to add capacity to address it, they will make things much worse for the immediate time after making the change.
Another nice-to-have feature here would be an automatic adjustment of the concurrency and bandwidth for the new nodes, since they will not be performing much active ingestion (if the shard weight is implemented as described above) and should be able to support more I/O than they ordinarily would as we attempt to get them to shard parity.
One thing that I'm not sure has been discussed here, and that I think is relevant, is disk space on the nodes, especially when there are multiple volumes on the node presented as separate disk volumes. One issue I've run into is that there is lots of disk space on the node as a whole, but one volume trips the low watermark and a shard always has to be moved off to another node; a different and quicker solution could be to move the shard internally to another volume on the same node. I'd expect this to greatly increase the resolution of hitting the watermark and lead to more effective balancing. The same sort of thing could be done for disk IOPS/write rejections to improve performance.
I've also been seeing this issue recently on my cluster where we have been pulling blocks of nodes in and out of the cluster to perform hardware upgrades. We drain all of the shards from the nodes prior to removing them from the cluster, and when they are reintroduced, the cluster will favor this small subset of nodes for new shard allocation. These nodes then get overwhelmed, which slows down indexing.
The only decent solution I could think of for my scenario was adding an index template matching our hourly indices for the next few days that excludes these nodes from shard allocation.
It would be great if there was a setting to prevent these nodes from getting new shards allocated to them while they are still rebalancing. Or, if we could dynamically set node attributes (sadly the _node/settings API doesn't allow PUTs), I could try to do this myself by adding a temporary node.attr on these low-shard-count nodes, with a global index template that would prevent shard placement on nodes with this attribute. Then, once the nodes are rebalanced, I could remove the node.attr. This strategy is hard to execute today because it requires editing elasticsearch.yml and doing a service restart.
You are probably better off manipulating the indices' total_shards_per_node field. Perhaps a cron job that counts candidate nodes, accounts for possible node failures, reads the primary shard count and sets this field. This is only going to be effective with more shards per index than total nodes. You could use exclusion to create subsets of nodes.
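The cron-job calculation suggested here could be sketched as follows (the function and parameter names are illustrative, not part of any Elasticsearch API):

```python
import math

# Sketch of the suggested cron job's core calculation: derive a
# total_shards_per_node value from the current node count, leaving
# headroom for a configurable number of node failures so shards don't
# end up unassigned when a node drops out.
def compute_cap(primary_shards, replicas, data_nodes, tolerated_failures=1):
    copies = primary_shards * (replicas + 1)
    usable_nodes = max(1, data_nodes - tolerated_failures)
    return math.ceil(copies / usable_nodes)

# 3 primaries + 1 replica on 4 nodes, tolerating 1 failure: cap of 2.
cap = compute_cap(3, 1, 4)
settings_body = {"index.routing.allocation.total_shards_per_node": cap}
```

The job would then PUT this body to the affected indices' settings endpoint, re-running whenever the node count changes. As noted above, this only helps when an index has more shard copies than there are nodes.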
I forgot to mention that I do have total_shards_per_node=1 on these indices, so the shards do get spread out for a single index. The issue is when highly active shards from different indices land on the same node. These different indices are generated with the same name other than the yyyyddmmhh tacked on to the end, which makes it hard to fence them off from each other using index templates or node attributes.
Balancing also hilariously breaks when there is huge discrepancy between shard size in different indexes.
We have a cluster with a bunch of indices that are very small (~100mb), a few in the single-digit GB range that are pretty busy (so many shards), and one big one (with 50GB shards), and the result of automatic rebalancing is an absolute disaster.
The average number of shards per node is ~160. One node has 12, most of them from the "big" index. It is also completely full (right at the edge of the low watermark) while others have hundreds of gigs of free space. This doesn't make load distribution any good either.
```
shards disk.indices disk.used disk.avail disk.total disk.percent
   163        154gb   154.8gb      95gb    249.9gb           61
   163      152.9gb   154.1gb    59.7gb    213.9gb           72
    12      332.9gb   335.4gb    65.5gb    400.9gb           83
   163      154.8gb   156.5gb    93.3gb    249.9gb           62
   163      157.2gb   158.5gb   241.3gb    399.9gb           39
   163      153.4gb   154.5gb   135.3gb    289.9gb           53
   163      158.9gb   160.1gb   179.8gb    339.9gb           47
   163      158.6gb   159.6gb    90.2gb    249.9gb           63
   162      154.7gb   157.5gb   122.3gb    279.9gb           56
   163      155.6gb   157.4gb   132.4gb    289.9gb           54
```
Balancing also hilariously breaks when there is huge discrepancy between shard size in different indexes.
To fight this, I turned automatic shard balancing off long ago and run a script which tries to do this right. Sadly, when one of the nodes leaves the cluster and rejoins, the reshuffling (which is not aware of what's important here) causes massive problems. It would be nice to have either an easily scriptable allocator and/or an external API (e.g. the Elasticsearch master could ask a REST service where to put its shards), so this could be implemented easily by anyone in their preferred environment.
In my case, kicking a few of the big shards off the node made it reasonable... for now. But an allocator that only considers shard count is almost useless with a big spread in both shard size and how often a given index is used.
Disk usage is a slowly changing metric compared to CPU usage. Is it possible to add this metric first, to solve the problems caused by huge differences in shard size?
Elasticsearch version:
JVM version:
OS version: Debian Jessie on kernel 4.1.3.
Description of the problem including expected versus actual behavior:
Elasticsearch does not try to spread shards for new indices equally if one node has fewer shards than others.
Elasticsearch on one of 12 machines died early in the morning (UTC) and caused several indices to go red, even though each index has 1 replica. All of the red indices were today's indices. That is a separate issue; this one is about allocation. I unmounted the bad disk on the unlucky node and restarted it, roughly 6 hours after the incident. The indices did not recover.
Since there wasn't much to do at this point, I dropped the red indices to let them refill from kafka. At the same time elasticsearch started its recovery procedure to rebalance shards evenly. This happened for new indices:
This is how the node list looked:
It seems like elasticsearch decided to ignore the fact that it's going to rebalance shards anyway and allocated more shards to the emptiest node. This caused much higher load on this node and made it a bottleneck.
Another issue is that all shards disappeared. I think the following happened:
This looks suboptimal if it's right. This could be #17019 again, since I run optimize and flush every night and the synced flush happened after the disk failure.