elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Creating effective back-pressure in ES Write Path #59116

Open getsaurabh02 opened 4 years ago

getsaurabh02 commented 4 years ago

Related to https://github.com/elastic/elasticsearch/pull/58885/, where we are adding rejections when the indexing memory limits are exceeded for primary or coordinating operations. The ongoing PR also plans to add a safeguard memory limit for replica operations, currently proposed to default to 1.5 times the indices.write.limit.

A few challenges might come up with the proposed approach in PR https://github.com/elastic/elasticsearch/pull/58885/.

These are high-level thoughts, with the primary motivation of not failing the replica shard instantly, while also providing shard-level isolation for both primary and replica shards. Since failing the replica shard is more intrusive, we would want to do it only when there is a real issue. It would also allow us to distinguish between transient and steady-state failures with more confidence. I have some thoughts on how these accounting limits and thresholds can be defined, and can share more details. Let me know your thoughts.

getsaurabh02 commented 4 years ago

@tbrooks8 Since you are actively working on https://github.com/elastic/elasticsearch/pull/58885/, I will be happy to get your thoughts on this. Thanks.

elasticmachine commented 4 years ago

Pinging @elastic/es-distributed (:Distributed/CRUD)

Tim-Brooks commented 4 years ago

Currently in Elasticsearch the primary mechanism of indexing back pressure is the write queue size. This does not relate effectively to the amount of work outstanding, as the bulk shard requests inside the queue can vary significantly in size. Additionally, once the act of "indexing" is complete, the work moves to async coordination work. This means that the request no longer contributes to back pressure, even if there is still a "cost" associated with the work.

The PR that you linked is the first step towards addressing this issue. It accounts for the actual size of the requests, and the cost is accounted for until the request has completed: the coordinating cost is held until the coordinating, primary, and replica components are all complete, and the primary cost is held until the primary and replica components are complete.

As you have noted, the replica limit is greater than the primary limit to prioritize replica work. If this replica work is significant it will naturally push back against new primary and coordinating work on a node.
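
Roughly, the accounting works along the following lines (a simplified sketch for illustration only, not the actual code in the PR; class and method names and the rejection exception are placeholders):

```java
// Simplified sketch of byte-based indexing pressure accounting (illustrative only).
// Coordinating/primary bytes are tracked together and checked against the node
// limit; replica bytes are tracked separately against a higher (1.5x) limit so
// that in-flight replica work is shed last.
import java.util.concurrent.atomic.AtomicLong;

public class IndexingPressureSketch {
    private final long coordinatingAndPrimaryLimitBytes; // e.g. 10% of the heap
    private final long replicaLimitBytes;                // e.g. 1.5x the above

    private final AtomicLong coordinatingAndPrimaryBytes = new AtomicLong();
    private final AtomicLong replicaBytes = new AtomicLong();

    public IndexingPressureSketch(long limitBytes) {
        this.coordinatingAndPrimaryLimitBytes = limitBytes;
        this.replicaLimitBytes = (long) (limitBytes * 1.5);
    }

    /** Account for a coordinating or primary operation; the caller runs the returned
     *  callback once the whole request (including replication) has completed. */
    public Runnable markCoordinatingOrPrimaryStarted(long bytes) {
        long combined = coordinatingAndPrimaryBytes.addAndGet(bytes) + replicaBytes.get();
        if (combined > coordinatingAndPrimaryLimitBytes) {
            coordinatingAndPrimaryBytes.addAndGet(-bytes);
            throw new IllegalStateException("rejecting: coordinating/primary indexing pressure limit exceeded");
        }
        return () -> coordinatingAndPrimaryBytes.addAndGet(-bytes);
    }

    /** Replica operations only check the (higher) replica limit. */
    public Runnable markReplicaStarted(long bytes) {
        if (replicaBytes.addAndGet(bytes) > replicaLimitBytes) {
            replicaBytes.addAndGet(-bytes);
            throw new IllegalStateException("rejecting: replica indexing pressure limit exceeded");
        }
        return () -> replicaBytes.addAndGet(-bytes);
    }
}
```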

You raise a concern about replica failures. The linked PR will combine with the work in #55633 to retry failed replication operations. Since the retry is still occurring the primary and coordinating bytes will still be marked, pushing back against a complete overload of the system.

In a few of your bullets you make reference to the primary knowing the state of the replica node and not sending unnecessary requests if it is aware that the replica is already overloaded. There is some potential follow-up work being discussed around this area (and similarly at the coordinating level, maintaining knowledge about the state of the primary). It just does not make sense as a first step, as it is tricky (node roles can change, and outstanding work could be generated from other nodes).

In terms of shard level quotas, that's not something we're looking at for this first step.

getsaurabh02 commented 4 years ago

Thank you @tbrooks8 for the detailed explanation.

My concern with separate memory limit accounting for primary and replica shards: if I have understood the linked PR changes correctly, a primary write operation on a node checks against the total write limit (primary + replica), while a replica operation considers only the replica write limit (1.5 times). For nodes with varying shard allocation, this might end up enforcing different effective memory limits for write operations. For example, one node with 2 primaries and another node with 1 primary and 1 replica can have different effective write workload limits before they actually start any rejection. Also, in the latter case, primary writes on the node (with relatively less traffic) could get impacted by a memory spike caused by replica shard degradation on the same node, thereby taking away the fairness.
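
To put rough numbers on this (illustrative only, assuming a 32 GB heap and a 10% write limit): the primary/coordinating limit would be about 3.2 GB and the replica limit about 4.8 GB. On the node with 2 primaries, writes are rejected only once the combined bytes cross 3.2 GB. On the node with 1 primary and 1 replica, a degraded replica could legitimately hold, say, 3 GB (still under its 4.8 GB replica limit), leaving primary operations on that node only about 0.2 GB of headroom against the combined 3.2 GB check, so they start getting rejected even though their own traffic is light.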

One way to address this, with the approach I initially suggested, is by having a dedicated write memory limit at the primary level to account for the memory build-up of all pending and outstanding requests, which also provides primary-to-coordinator back-pressure, while primary-to-replica interactions are controlled by the primaries. This involves the primary being aware of the replica state and not sending more requests, while rejecting early, if a replica shows signs of overload (such as active throughput degradation). I get the complexity around this, especially with different allocation schemes; however, to begin with, this can be achieved by primary shards accounting for the outstanding replication requests and the active replication throughput. Memory limits such as the RM CB etc. could still be the last line of defence for replica nodes (preferably with higher thresholds).
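
A rough sketch of the kind of primary-side tracking I have in mind (class and method names, thresholds, and the moving-average scheme are all illustrative, not existing Elasticsearch APIs):

```java
// Illustrative sketch: the primary tracks outstanding replication bytes and
// recent throughput per replica, so it can stop sending and reject early when
// a replica shows signs of overload, instead of waiting for the replica-side
// memory limit to trip.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class ReplicaBackPressureTracker {
    static final class ReplicaStats {
        final AtomicLong outstandingBytes = new AtomicLong();
        volatile double recentBytesPerSec;   // short-term moving average
        volatile double baselineBytesPerSec; // longer-term moving average
    }

    private final Map<String, ReplicaStats> perReplica = new ConcurrentHashMap<>();
    private final long maxOutstandingBytesPerReplica;
    private final double degradationFactor; // e.g. 0.5: degraded if below 50% of baseline

    public ReplicaBackPressureTracker(long maxOutstandingBytesPerReplica, double degradationFactor) {
        this.maxOutstandingBytesPerReplica = maxOutstandingBytesPerReplica;
        this.degradationFactor = degradationFactor;
    }

    /** Decide on the primary whether a new replication request should be sent. */
    public boolean shouldSend(String replicaNodeId, long requestBytes) {
        ReplicaStats stats = perReplica.computeIfAbsent(replicaNodeId, id -> new ReplicaStats());
        boolean tooManyOutstanding =
            stats.outstandingBytes.get() + requestBytes > maxOutstandingBytesPerReplica;
        boolean throughputDegraded =
            stats.baselineBytesPerSec > 0
                && stats.recentBytesPerSec < degradationFactor * stats.baselineBytesPerSec;
        return !(tooManyOutstanding && throughputDegraded);
    }

    public void onReplicationSent(String replicaNodeId, long requestBytes) {
        perReplica.computeIfAbsent(replicaNodeId, id -> new ReplicaStats())
            .outstandingBytes.addAndGet(requestBytes);
    }

    public void onReplicationCompleted(String replicaNodeId, long requestBytes, double observedBytesPerSec) {
        ReplicaStats stats = perReplica.computeIfAbsent(replicaNodeId, id -> new ReplicaStats());
        stats.outstandingBytes.addAndGet(-requestBytes);
        // exponentially-weighted updates; a real implementation would need stricter bookkeeping
        stats.recentBytesPerSec = 0.5 * stats.recentBytesPerSec + 0.5 * observedBytesPerSec;
        stats.baselineBytesPerSec = 0.95 * stats.baselineBytesPerSec + 0.05 * observedBytesPerSec;
    }
}
```

The primary would consult shouldSend before dispatching a replication request and reject (or defer) early when a replica looks both backed up and degraded, keeping the node-level memory limits as the last line of defence.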

Thanks for sharing the insight on PR https://github.com/elastic/elasticsearch/pull/55633 related to the replication retry strategy. While I understand this will be useful for addressing transient failures such as network blips, I am not sure a retry will be highly effective in situations where memory limits are breached under heavy workload. In non-happy scenarios, if a node with multiple replicas is under duress and is breaching the replica write limits, the situation can get even worse with more retries coming in, and primaries might end up failing the shards eventually. The fairness issue arises here as well. With the above approach, would it make more sense for primaries to cut off the traffic instead, allowing a cool-off and letting the replica shards first process the pending requests? This might provide effective load shedding, without impacting the stable cluster state, in longer traffic-burst scenarios.

If you think some of these thoughts might be useful and are worth experimenting with, I can share more insights by running a few benchmarks to verify the behaviour in both scenarios.

Also, some of the fairness problems discussed above can be effectively addressed by bringing in shard-level isolation. Since I already have some context on this, and as you are not planning to pick it up yet, I would like to take it up. Please let me know if I can work on it. I have done some ground work on the approach.

Tim-Brooks commented 4 years ago

@getsaurabh02 If there are specific benchmarks that you think would provide useful information to this work feel free to share.

As I said, this work is in progress. Some of the potentially more advanced components (primaries being aware of the state of replicas, coordinating nodes being aware of primaries, etc.) are definitely a ways off and will need more exploration.

> if a node with multiple replicas is under duress and is breaching the replica write limits, the situation can get even worse with more retries coming in

Rejections happen prior to enqueuing. They will not add to existing indexing load.

getsaurabh02 commented 4 years ago

Hi @tbrooks8,

Based on our discussion here, I have taken a stab at extending the current indexing pressure logic, which accounts for the memory build-up at the node level, to have shard-level granularity as well. This allows the cluster to track the indexing pressure build-up per shard, for every node role (coordinator, primary and replica), and then take an isolated rejection decision. Rejections become much more optimal and fair, as only the (transport) requests targeted at the problematic shards/node are rejected, while still guaranteeing the availability of other nodes and shards.

I have also run some benchmark tests to demonstrate the gains and verify the functional behaviour. Sharing a quick result below:

Cluster Setup: 3-node m5 cluster with 32 GB heap.
Total Shards: 2 primary shards with one replica each.
Allocation: Both primaries on Node-1, while replica-0 is on Node-2 and replica-1 on Node-3.
Induced Failure: Induced delay between Node-1 and Node-2. This drops indexing requests sent from Node-1 and targeted at replica-0 on Node-2, causing a build-up on the primary node, so the primary starts rejecting requests.

Result Metrics

Scenario 1: With Node level Indexing Pressure

(Graph 1: benchmark results with node-level indexing pressure)

Scenario 2: With Shard level Indexing Pressure

(Graph 2: benchmark results with shard-level indexing pressure)

Key Inferences and Gains observed from these tests

1. Optimal Rejections with Shard level Indexing Pressure

2. Fairness in Rejections with Shard level Indexing Pressure

3. Granular Control and Visibility into the system

In addition to the above gains, I have also found that, with the availability of granular information, it is possible to take smarter rejection decisions than depending on a static threshold such as 10% of the heap, since static thresholds might not fit all use-cases across varying instance types.

Thoughts from a PoC done on a Dynamic Rejection Threshold in conjunction with Static Thresholds

Do let me know your thoughts on this. Since I have the basic frame ready and have experimented with this to some extent, I would like to take up and build the Shard Level Indexing Pressure piece as an extension to https://github.com/elastic/elasticsearch/pull/58885. Please let me know if I can come up with a task breakdown along with a low-level plan for further discussion, which can then be followed by individual PRs.

getsaurabh02 commented 4 years ago

@tbrooks8 Can you please take a look and let me know your thoughts.

Tim-Brooks commented 4 years ago

Yes. Thank you for the ping. I will take a look tomorrow.

Tim-Brooks commented 4 years ago

> Induced Failure: Induced delay between Node-1 and Node-2. This drops indexing requests sent from Node-1 and targeted at replica-0 on Node-2, causing a build-up on the primary node, so the primary starts rejecting requests.

Can you tell me a little more about the failure scenario here? How is your test introducing delay?

> build the Shard Level Indexing Pressure

Can you describe high level specifics about what you are proposing? As I have mentioned in previous comments, there are a number of layers we intend to add. That does not mean we are ready to commit to shard level rejections without further exploration and discussion.

getsaurabh02 commented 4 years ago

Thank you @tbrooks8 for the reply.

> Can you tell me a little more about the failure scenario here? How is your test introducing delay?

For the specific test here, I ran it by introducing network delays (via an iptables rule) between Node-1 (primary shard) and Node-2 (replica shard). This is essentially to represent the replica acting as a black hole. Other node-level failure tests, with packet drops at Node-2 for all requests originating from Node-1, have shown very similar results.

Additionally, we have also performed tests to reproduce shard-level issues, where one of the shards on a node acts slow (by inducing an artificial sleep), representing scenarios such as a slow disk/partition. This has also helped demonstrate fairness in the rejections, which happen only for the problematic shard.

> Can you describe high level specifics about what you are proposing?

The proposal is to track the indexing pressure build-up for every shard, at every node role (i.e. coordinator, primary and replica). Under duress, perform rejections at the relevant node role, but only for the specific shard whose shard-level quota is breached, thereby guaranteeing fairness while performing rejections on the nodes. Approach-wise, this is a logical extension of the original frame you added as part of the node-level indexing pressure metrics. The granular calculation is achieved by plumbing the memory-tracking logic into the different code paths of the bulk execution flow, for every node role, while maintaining dynamic per-shard structures on the node. We have evaluated and benchmarked this to ensure there is no regression or overhead from the additional tracking.
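
For clarity, here is a minimal sketch of the shape of the per-shard accounting (illustrative names only; the actual change would plug into the existing indexing pressure plumbing in the bulk execution flow):

```java
// Illustrative sketch: node-level indexing pressure accounting extended with a
// per-shard breakdown, so rejections are limited to the shard whose quota is
// breached while other shards on the node keep indexing.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class ShardIndexingPressureSketch {
    private final AtomicLong nodeBytes = new AtomicLong();
    private final long nodeLimitBytes; // last line of defence, e.g. 10% of the heap
    private final Map<String, AtomicLong> shardBytes = new ConcurrentHashMap<>();
    private final Map<String, Long> shardQuotaBytes = new ConcurrentHashMap<>(); // updated by a separate quota policy
    private final long defaultShardQuotaBytes;

    public ShardIndexingPressureSketch(long nodeLimitBytes, long defaultShardQuotaBytes) {
        this.nodeLimitBytes = nodeLimitBytes;
        this.defaultShardQuotaBytes = defaultShardQuotaBytes;
    }

    /** Returns a release callback, or throws if the shard quota or the node limit is breached. */
    public Runnable markOperationStarted(String shardId, long bytes) {
        long quota = shardQuotaBytes.getOrDefault(shardId, defaultShardQuotaBytes);
        AtomicLong current = shardBytes.computeIfAbsent(shardId, id -> new AtomicLong());
        long newShardTotal = current.addAndGet(bytes);
        long newNodeTotal = nodeBytes.addAndGet(bytes);
        if (newShardTotal > quota || newNodeTotal > nodeLimitBytes) {
            current.addAndGet(-bytes);
            nodeBytes.addAndGet(-bytes);
            throw new IllegalStateException("rejecting request for shard " + shardId);
        }
        return () -> {
            current.addAndGet(-bytes);
            nodeBytes.addAndGet(-bytes);
        };
    }
}
```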

Also, I am proposing that the shard-level rejection limits be dynamic, based on the latency and throughput degradation observed by the node (for every role). The node-level indexing pressure limits can still act as the last line of defence, to prevent complete node failures in the worst cases.

getsaurabh02 commented 3 years ago

Hi @tbrooks8 Can you please share your thoughts on the proposal here? Let me know if you would like more details on the implementation.

Tim-Brooks commented 3 years ago

I raised this and your benchmark scenario as a discussion topic for the team. I will respond with some feedback from those conversations in the next day or two.

Tim-Brooks commented 3 years ago

I had another team member take a look at your description and am still gathering feedback. Will get back to you soon.

Tim-Brooks commented 3 years ago

Hey, as I mentioned I raised this as a discussion topic with the team to see what others thought. We considered your scenario where replica retries can influence the total amount of coordinating / primary requests. After discussing, we wanted to know a little more about this scenario before we decide how to proceed. Was your system configured with the TCP settings we suggest (https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config-tcpretries.html)? Generally we would view a network blackhole (a replica which hangs, but stays in the replication group) as an issue beyond just back-pressure. That replica needs to fail and be removed. We encourage setting the TCP settings and expose retry timeouts to limit this scenario.

We are still considering whether a replica should lead to rejections at the primary / coordinating level. Is there a specific scenario that happens in production where you see this outside of a synthetic failure in a benchmark?

I'm not sure that we view shard level rejections as the next step, as we believe the next step is a better queue sizing strategy (as opposed to the current situation of 10K requests). So we are trying to get a handle on what scenarios arise in the wild where this might happen. We did discuss whether we should implement some type of limit on the number of replica retries, but decided to hold off until we have an indication that this is a common problem in production.

As for your proposal, generally the problem is that there is not a clear proposal or a set of defined changes here. The write pathway is under active development, so it is difficult for me to encourage other changes (which require committing to APIs, docs, etc.) without more understanding of what exactly is being proposed.

getsaurabh02 commented 3 years ago

Thank you @tbrooks8 for the reply. I will get back to you with more details on the failure mode testing scenarios that we have considered, and will share more insights on the implementation strategy (proposal) in a day.

getsaurabh02 commented 3 years ago

Hi @tbrooks8 The scenario we demonstrated above was one of a few use cases we had identified in the write path that benefit from Shard level Indexing Pressure, broadly covering shard-level failures and node-level failures.

Here, as part of node-level failures, we can think of a Slow Node and a Network Blackhole as distinct scenarios. While the former will not hit the TCP retransmission limit and will eventually lead to a build-up on the primary and coordinator nodes, the latter will hit the limit within a few minutes, based on the configured value. Even in our case, as seen in the first graph above, we saw it in action in about 8 minutes, when the problematic replica shard failed on the original node and got re-assigned to a different node. This allowed indexing to auto-resume without any manual mitigation (refer to the 16:18 time frame in graph 1). However, under such blackhole scenarios we are still looking at the wins until the replica shard is failed. The TCP limit could need more thinking, as making it aggressive, such as 5 (~6 seconds), might have other implications, for example during longer GC pauses. Irrespective of the replica shard failure, once the slowness is observed, shard-level indexing pressure contains the situation and avoids pushing the other nodes (in the request path) to their limits. Identifying the degradation early and performing active rejections in the problematic path helps keep the memory profile of the other nodes (coordinator and primary) low, preventing the cluster from running into stress and allowing productive work (with fairness).

> Is there a specific scenario that happens in production where you see this outside of a synthetic failure in a benchmark?

We have seen multiple issues in the indexing path arising from shard-level and node-level failures. The primary contributors to such issues are shard-level issues (slow disk, stuck tasks, hot shards, traffic bursts, software bugs) and node-level failures (slow hardware, network bottlenecks, asymmetric partitioning, etc.). There are also unknowns which we keep discovering, and that adds to the need for the necessary resiliency in the write path, to protect the cluster when it is under duress.

Additionally, the focus here is not just on replica failures, since a similar slow-down can happen in the other paths as well, such as on the primary shard, resulting in a build-up on the coordinator node. Taking an active and fair rejection decision on the coordinators can prevent a large number of nodes from getting overwhelmed, thereby reducing cluster stress. We have often seen coordinator node drops due to OOMs and long GC pauses, since the stuck requests cannot be garbage collected quickly.

> next step is a better queue sizing strategy (as opposed to the current situation of 10K requests)

I agree, and that has been one of the primary motivations for building a smarter mechanism for accounting for the request build-up on nodes with different roles; overall, an informed workload management. In order to keep the decisions more granular, shard-level isolation of incoming requests makes perfect sense. In our proposal, I am suggesting keeping the node-level indexing pressure limit (10% of heap) as the hard limit and last line of defence on the node, while for every request, shard-level tracking of request count and memory occupancy is performed for every node role. This shard-level tracking is against the allocated shard-level quota on the node. The shard-level quota for each shard starts from a small default and increases based on several factors, such as new incoming requests for the shard, current shard occupancy, shard throughput, and the overall node-level available room. More details on the shard-level tracking and rejection strategy below.
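
A minimal sketch of how such a dynamic shard quota could evolve (all names and parameters here are hypothetical, purely to illustrate the idea):

```java
// Illustrative sketch: a shard's quota grows from a small default when the shard
// is busy, healthy, and the node has headroom, and shrinks back when the shard's
// throughput degrades, with the node-level limit remaining the hard cap.
public class ShardQuotaPolicy {
    private final long nodeLimitBytes;
    private final long defaultShardQuotaBytes;

    public ShardQuotaPolicy(long nodeLimitBytes, long defaultShardQuotaBytes) {
        this.nodeLimitBytes = nodeLimitBytes;
        this.defaultShardQuotaBytes = defaultShardQuotaBytes;
    }

    /**
     * @param currentShardQuota   the shard's current quota in bytes
     * @param shardOccupancyBytes bytes currently held by in-flight requests for this shard
     * @param nodeOccupancyBytes  bytes currently held across all shards on this node
     * @param throughputRatio     recent shard throughput divided by its longer-term baseline
     */
    public long nextQuota(long currentShardQuota, long shardOccupancyBytes,
                          long nodeOccupancyBytes, double throughputRatio) {
        long nodeHeadroom = nodeLimitBytes - nodeOccupancyBytes;
        if (throughputRatio < 0.5) {
            // shard looks degraded: shrink towards the default so rejections kick in early
            return Math.max(defaultShardQuotaBytes, currentShardQuota / 2);
        }
        if (shardOccupancyBytes > 0.9 * currentShardQuota && nodeHeadroom > currentShardQuota) {
            // shard is busy but healthy and the node has room: let the quota grow
            return Math.min(currentShardQuota * 2, nodeLimitBytes);
        }
        return currentShardQuota;
    }
}
```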

> what exactly is being proposed

We have taken due care to ensure the secondary parameters are adaptive in nature, to avoid rejections arising from transient issues. Let me know your thoughts on this. For further discussion and clarity, I can also raise a draft PR with the proposed changes. Hopefully this can help us decide whether to pursue this further.

getsaurabh02 commented 3 years ago

Hi @tbrooks8 Can you please share your thoughts on the detailed proposal here?

Tim-Brooks commented 3 years ago

Thanks for the ping. I raised this as a discussion topic with the team. Let me circle back and summarize the feedback and I will get back to you soon.