elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add option to prevent recovering replica shards onto nodes that did not have them locally. #7288

Closed gibrown closed 9 years ago

gibrown commented 10 years ago

For large deployments, ES shard recovery after a node goes down or a networking event occurs seems to cause more problems than necessary: all of the shards on the affected node end up getting shuffled around the cluster, which can trigger other performance problems. It seems better to leave some replicas uninitialized until the node is alive again. Right now the recovery time from an event affecting a node is proportional to the total amount of data on that node; if we just waited for the node to come back up so it could recover locally, recovery time would instead be proportional to the amount of data that changed while it was away. The current behavior also makes it harder to keep a localized problem from affecting the entire cluster.

Backstory

We had a 4 min network disruption in one of our 3 data centers (14 data nodes per DC) at 23:04. As you would expect, ES queries to that DC started timing out, giving us a 30% error rate, but things recovered after about 5 minutes.

However, after about 30 minutes queries started timing out again, and that continued to get worse over the next two hours.

The root cause of these bad queries was shards getting shuffled around within the DC that had the networking problem.

Some nodes actually ended up getting overloaded with shards, causing server performance problems as the shards got initialized on those nodes and they could not handle the increasing query load. (The chart referenced here, not reproduced, showed disk space for the 14 nodes in the affected DC.)

Proposed Solution

I think a better set of "production" settings would say "do not re-allocate shards to other nodes when a node goes down".

It seems like cluster.routing.allocation.enable should have an additional option: recovery_and_new (this would still allow rebalancing). "Recovery" in this case would mean that shards can only be initialized on a node that already had them on it.

The assumption here is that the cluster has enough replication that losing a single replica is not such a large event that it is worth triggering a whole lot of network traffic and potentially causing other problems.
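To make the proposal concrete, here is a rough sketch of how the setting might be applied via the cluster settings API; the recovery_and_new value is hypothetical (it is not an option Elasticsearch actually supports) and is shown only to illustrate the idea:

    # "recovery_and_new" is a hypothetical value, illustration only
    curl -XPUT localhost:9200/_cluster/settings -d '{
      "transient": {
        "cluster.routing.allocation.enable": "recovery_and_new"
      }
    }'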

Other Workarounds

I don't think there is a way to do this right now. I could disable all allocation, but that would prevent shards for new indices from being allocated. recover_after_time could almost serve this purpose (I could set the timeout to 3 hours to give the ops team time to get a server back up), but that setting seems to only apply on a full cluster restart.
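For reference, the gateway settings mentioned above live in elasticsearch.yml and, as noted, only take effect on a full cluster restart; a sketch with illustrative node counts (the 3h value comes from the scenario above, the node counts are made up):

    # elasticsearch.yml (1.x) -- values are illustrative only
    gateway:
      recover_after_nodes: 30   # wait for at least this many nodes before recovering
      recover_after_time: 3h    # how long to wait for the remaining nodes
      expected_nodes: 42        # recover immediately once all expected nodes are present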

dakrone commented 10 years ago

@gibrown this sounds useful, but I think using the term "recovery" has too much baggage for this. Maybe naming it something like "existing" would be better suited (not 100% sure about the name).

So maybe "existing_and_new" would allow existing replicas to be recovered and brand new shards to be allocated, but not recovering replicas if a node went down. Does this sound like what you're after?

This does sound good. It sounds similar to (but not the same as) another issue (which I can't find right now) that has a "recover_after_time" setting for shard recovery, so you could say "wait 5 minutes after losing a replica before trying to recover it to another machine".

bleskes commented 10 years ago

@gibrown It seems you have a deployment where a single datacenter is not capable of hosting more than one copy of the data. I wonder if forced awareness ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html#forced-awareness ), where you use the datacenter as the attribute, would help, in combination with gateway.expected_nodes for full cluster restarts. This will make sure only one replica is assigned to each datacenter, but if a node goes down within a data center, another node in the same DC will pick up the missing replica.
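A minimal elasticsearch.yml sketch of the forced-awareness setup being suggested; the attribute name "dc", its values, and the node count are assumptions for illustration:

    # per-node attribute (name "dc" and value "dc1" are examples)
    node.dc: dc1
    # spread shard copies across the "dc" attribute; with forced awareness,
    # copies are divided across the listed values and are not all reallocated
    # into the remaining dcs if one dc is lost
    cluster.routing.allocation.awareness.attributes: dc
    cluster.routing.allocation.awareness.force.dc.values: dc1,dc2,dc3
    # full-cluster-restart recovery waits for all expected nodes (example value)
    gateway.expected_nodes: 42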

gibrown commented 10 years ago

@dakrone "existing_and_new" sounds much better, and exactly like what I'm after. I thought about suggesting a timeout to control this. I think its hard to decide what I'd set the time to though. If nodes aren't coming back up after some period of time, there is probably a reason that our ops team is well aware of. Having the system suddenly try to correct itself is more likely to exacerbate the problem IMO.

@bleskes We do use a DC awareness attribute, but we also have a secondary attribute ("parity", which indicates which network router in the rack a node is tied to) so we can have multiple replicas in one DC. In practice we see the replicas getting reallocated onto another node in the same DC (presumably because the DC attribute has higher priority). Ideally, though, I don't want the replica moved at all. If a DC is having problems, moving TBs of data around does not help.

sax commented 10 years ago

@gibrown we have a much smaller cluster than you, but are running into similar issues.

Have you tried setting index.routing.allocation.total_shards_per_node? We're thinking that if we set this to the exact number of shards expected per node in a healthy state (with all nodes available), then when nodes go down shards will not be reallocated. If multiple nodes become unavailable, this may leave the cluster in a red state; we are already catching this in our client applications and disabling associated features for users.
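For anyone following along, a sketch of the workaround being described; the index name and the limit of 2 are placeholders and would need to match your own healthy-state shard counts:

    # cap shards per node at the healthy-state count (example values)
    curl -XPUT localhost:9200/my_index/_settings -d '{
      "index.routing.allocation.total_shards_per_node": 2
    }'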

nik9000 commented 10 years ago

I think there are some things we can do that aren't as aggressive as not allowing reallocation on nodes that don't already have some of the index. That might still be a good idea but it isn't one I'm likely to use.

When a node goes down, all the shards that it hosted are shifted to the "unassigned" state. Right now the allocation algorithm tries to assign unassigned shards as quickly as possible. It makes some effort to balance them while it is allocating them, but if the most balanced node is throttled then it'll just assign the shard to the next most balanced node. This causes it to assign all the shards super fast (good), but it can make the cluster quite unbalanced (bad).

Maybe instead we need a way to be more leisurely and balanced when assigning shards: only assign a shard to the node that would be most balanced, and if that node is throttled then give up on assigning that shard for now and come back to it when the node isn't throttled. I don't think you want to do this if there aren't any replicas live, but if you already have a single replica and you are just looking to add another then maybe it's ok.
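A rough sketch, in plain Python rather than Elasticsearch internals, of the more leisurely allocation pass described above; node balance is modeled as a simple shard count and the throttling flag is a stand-in:

    # Sketch of "assign only to the most balanced node, otherwise defer".
    class Node:
        def __init__(self, name, shard_count, throttled=False):
            self.name = name
            self.shard_count = shard_count
            self.throttled = throttled

    def allocate(unassigned_shards, nodes):
        deferred = []
        for shard in unassigned_shards:
            best = min(nodes, key=lambda n: n.shard_count)  # most balanced target
            if best.throttled:
                deferred.append(shard)   # come back to it on a later pass
            else:
                best.shard_count += 1    # assign here instead of spilling elsewhere
        return deferred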

I think this would have prevented your massive shuffling problem. I think.

sax commented 10 years ago

@gibrown we just tried this and it appears to do what we expected. When we restarted an individual node, the shards on that node became unassigned. When the node rejoined the cluster, those shards were allocated back to it, and it was able to recover quickly from the shards already on disk (and in the disk cache).

gibrown commented 10 years ago

@sax that's a good workaround that I hadn't considered.

I don't think it works for my use case because we use index templates to auto create new indices, and on one of our clusters we are fairly constantly adding new indices (and hence shards).

sax commented 10 years ago

I'll double check, but I thought it was set per index.


gibrown commented 10 years ago

@nik9000 a smarter allocation algorithm would help for some use cases, but I think would also make the dynamics more complex and confusing. I'd prefer the system to be more predictable.

For our use case reusing the 2TB of data that is already on the disks of each node is almost always the fastest way to get our replicas back. Moving shards around will always be MUCH slower. Even if it takes hours to resolve a hardware problem, moving shards is often an event that takes 24-36 hours to complete. I'd rather intentionally decide to move that much data around.

In those cases where we do choose to move the data around though, I agree that paying more attention to the overall balance of shards would help.

gibrown commented 10 years ago

@sax oh you're totally right.

Still wouldn't be my preferred way to manage this, but does seem to be a good workaround.

sax commented 10 years ago

@gibrown agreed. I think we have a host of other tuning issues, but at least this will simplify the way we restart nodes until there's a better option.

sax commented 10 years ago

@nik9000 I think there are two competing priorities for us. One is the ability to do a rolling restart of a cluster as quickly as possible. The other is to tune recovery such that it does not cause the cluster to become unavailable. Since we're struggling mightily with the latter, being able to solve the former in a very simple fashion is nice.

Ideally there would be another way of configuring recovery timeouts, where you could tell the cluster to allow shards to remain unassigned for some time period. If a node rejoins the cluster, then expected_nodes would be met and the missing shards could be reassigned back (where they could be loaded from disk). After some timeout, however, the unassigned shards would be allocated to different nodes for redundancy.
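(For reference: later Elasticsearch versions ended up exposing a per-index delayed-allocation timeout that behaves much like this; a usage sketch, with an illustrative timeout value:)

    # delay reallocation of shards from a departed node (example: 5 minutes)
    curl -XPUT -H 'Content-Type: application/json' localhost:9200/_all/_settings -d '{
      "index.unassigned.node_left.delayed_timeout": "5m"
    }'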

Pardon if this is what is described above; my brain is spinning a bit trying to keep all the bits in.

nik9000 commented 10 years ago

@sax - what I was describing was more a way to prevent the cluster from becoming super unbalanced when it comes back up - a timeout would help, I think.

Maybe something like this:

I think it'd make a single node going down less exciting.

nik9000 commented 10 years ago

It'd change rolling restarts too. I'm not sure if it'd help them or hurt them. Ultimately I think full cluster restarts are a lost cause until we have a way to restore from the master shard's translog even if the files differ.

dakrone commented 10 years ago

Ultimately I think full cluster restarts are a lost cause until we have a way to restore from the master shard's translog even if the files differ.

Yes, unfortunately this is something that will require sequence numbers, which we're working on, but until then a super-fast recovery after full restart (without pre-optimizing and changing replica settings) is not possible.

clintongormley commented 10 years ago

Stalled by #6069

gibrown commented 10 years ago

@clintongormley wasn't @dakrone's comment saying that only full cluster restarts require sequence numbers? I would think implementing 'existing_and_new' as an additional option for cluster.routing.allocation.enable would be independent of the translog.

clintongormley commented 10 years ago

@gibrown No, even a node restart will benefit from sequence numbers. Imagine that you have a primary on one node and a replica on the other. You keep indexing documents, refreshes happen at different times on primary and replica, so the segments diverge. Now the node with the replica disappears temporarily (eg network disconnect or node restart).

To ensure that the replica is in sync with the primary, we can compare segments and copy over all of the segments that the primary has and the replica doesn't. But these segments have diverged over time, so this can end up being a lot of data. With sequence numbers the situation is different. As long as the last sequence number that the replica knows about is still in the transaction log of the primary, we can just replay the translog from that point on.
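A conceptual sketch (plain Python, not Elasticsearch code) of the recovery decision being described: if the replica's last known sequence number is still covered by the primary's translog, only the missed operations are replayed; otherwise recovery falls back to copying segment files:

    def recover_replica(replica_last_seq_no, translog_start_seq_no, translog_ops):
        # translog_ops: operations retained on the primary, each with a "seq_no"
        if replica_last_seq_no >= translog_start_seq_no:
            # fast path: replay only the operations the replica missed
            return [op for op in translog_ops if op["seq_no"] > replica_last_seq_no]
        # slow path: the gap is no longer covered by the translog,
        # so copy the diverged segment files from the primary
        return "full_file_copy"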

gibrown commented 10 years ago

Right, I see how they can help restart times in lots of scenarios, and I'm all for it and agree sequence numbers would be a huge improvement. But I don't think they address the original problem of large shards being moved from a node that is temporarily down but then returns to production.

If shard A is on node 1, and node 1 goes down due to a hardware failure, then shard A is going to get reallocated to another node that has none of its data, so all the data needs to be copied over the network. I would much rather have the reduced redundancy until our ops team resolves the hardware failure and node 1 comes back up than have TBs of data moving around the cluster, exacerbating the problems that already exist in the cluster.

I would think this problem is orthogonal to sequence numbers, but it's quite likely I'm misunderstanding something.

s1monw commented 9 years ago

Just as a side note, I just pushed #8190 to master, which might help here as well @gibrown

clintongormley commented 9 years ago

This is now closed by #11438, #12421, and #11417

gibrown commented 9 years ago

@clintongormley thanks to you and everyone else for these improvements.