Creating effective back-pressure in ES Write Path

getsaurabh02 commented 4 years ago

Related to https://github.com/elastic/elasticsearch/pull/58885/ where we are adding rejections when the indexing memory limits are exceeded for primary or coordinating operations. The ongoing PR also plans to add a safeguard memory limit for replica operations, which is currently proposed to default at 1.5 times the indices.write.limit.

Few challenges which might come in with the proposed approach in the PR https://github.com/elastic/elasticsearch/pull/58885/

Instant replica shard failures on the node: The lower limit of replica write queue memory (at 45% compared to 95% of RM CB) can result into immediate multiple shard failures on the node, if there is build up on node, even due to transient issue. This could adversely affect the cluster states and can result into repeated churns. Transient issue could be sudden traffic spike, slow disk, blocked IO etc.
Does not guarantee shard level isolation: All the replica shards on the node could get affected, resulting into failures, even if there is only one shard which contributed heavily to the build up, causing the limit breach. Same is true for primary shards rejecting the write requests on the node if the primary limit is breached.
Tricky to define/tune the exclusive limits: Hard to tune separate limit for replica based on memory (such as 1.5 times or primary in this case) for different workloads, even though this is related to the total available heap on the node. Keeping replica proportionate to pending primary shard operations might help here, as primary already has the limit now.
Possible approach at a high level to address these**
Memory/Request accounting of queued primary request at every primary shard level.
Primary shards also to keep the state (count/memory) for the outstanding replication requests.
Reject the request at Primary Shard, if the limit for primary shard is breached.
Reject the requests at Primary shard itself, without failing the replica shards immediately, if outstanding replication request limit for replica shard is breached. This to cater any transient build up.
Once rejection at primary due to replica build up breaches the time threshold, or total request build up breaches the accounting threshold (proportionate to primary limit), fail the replica shard, to continue useful work.

These are high level thoughts with primary motivation to not fail the replica shard instantly, while also providing the shard level isolation for both primary and replica shards. Since failing the replica shard is more intrusive, we would want to do it only in case there is some real issue. It will also allow us to distinguish between the transient vs steady state failures with more confidence. I have some thoughts around on how these accounting limits and thresholds can be defined, and can share more details on it. Let me know your thoughts.

getsaurabh02 commented 4 years ago

@tbrooks8 As you are actively working on https://github.com/elastic/elasticsearch/pull/58885/. Will be happy to get your thoughts on this. Thanks.

elasticmachine commented 4 years ago

Pinging @elastic/es-distributed (:Distributed/CRUD)

Tim-Brooks commented 4 years ago

Currently in Elasticsearch the primary mechanism of indexing back pressure is the write queue size. This does not relate effectively to the amount of work outstanding as the bulk shard requests can vary in size significantly inside the queue. Additionally, once the act of "indexing" is complete, the work moves to async coordination work. This means that the request no longer contributes to back pressure, even if there is still a "cost" associated to the work.

The PR that you linked is the first step towards addressing this issue. This PR accounts for the actual size of the requests. And the cost is accounted for until the request has completed. So the coordinating cost will be accounted until the coordinating, primary, and replica components are all complete. The primary cost will be accounted until the primary and replica components are all complete.

As you have noted, the replica limit is greater than the primary limit to prioritize replica work. If this replica work is significant it will naturally push back against new primary and coordinating work on a node.

You raise a concern about replica failures. The linked PR will combine with the work in #55633 to retry failed replication operations. Since the retry is still occurring the primary and coordinating bytes will still be marked, pushing back against a complete overload of the system.

In a few of your bullets you make reference to the primary knowing the state of the replica node and not sending unnecessary requests if it is aware that the replica is already overloaded. There is some potential follow-up work being discussed around this area (same thing with at the coordinating level, maintaining knowledge about the state of the primary). It just does not make sense as a first step as it is tricky (node roles can change, outstanding work could be generated from other nodes).

In terms of shard level quotas, that's not something we're looking at for this first step.

getsaurabh02 commented 4 years ago

Thank you @tbrooks8 for detailed explanation.

My concern with separate memory limit accounting for primary and replica shards : If I have understood the linked PR changes correctly, primary write operation on a node checks for total write limit (primary + replica), while replica operation considers only replica write limit (1.5 times). For nodes with varying shard allocation, this might end up enforcing different effective memory limits for write operations. Such as one node with 2 primaries and another node with 1 primary and 1 replica can have different effective write workload limits, before they actually start any rejection. Also in the latter case, primary writes on the node (with relatively less traffic) could get impacted due the memory spike caused by replica shard degradation on the same node, thereby taking away the fairness.

One way to address this with the approach I initially suggested is by having a dedicated write memory limit at primary level, to address the memory build up for all pending & outstanding requests, which also provides the primary to coordinator back-pressure. While primary to replica interactions are controlled by primaries. This involves primary being aware of the replica states, and not sending more requests, while rejecting early, if replica shows a signs of overload (such as active throughput degradation). I get the the complexity around this, especially with different allocation schemes, however to begin with this can be achieved by primary shards accounting the outstanding replication requests and active replication throughput. Memory limits such as RM CB etc could still be the last line of defence for replica nodes (preferably with higher thresholds).

Thanks for sharing the insight on the PR https://github.com/elastic/elasticsearch/pull/55633 related to replication retry strategy. While I understand this will be useful for addressing transient failures such as network blips, I am not really sure if retry will be highly effective in situations with memory limits breach under heavy workload. In non happy scenarios, if a node with multiple replica is under duress and is breaching the replica write limits, the situation can get even worse with more retries coming in, and primaries might end up failing the shards eventually. Again the fairness issue arising here as well. With the above approach, does primaries cutting off the traffic instead, allowing a cool off, and letting replica shards first process the pending requests, makes more sense here? This might provide effective load shedding, without impacting the stable cluster state, in longer traffic bursts scenarios.

If you think some of these thoughts might be useful, and are worth experimenting, I can share more insights, by running few benchmarks to verify the behaviour in both the scenarios.

Also, some of the fairness problems discussed above can be effectively addressed by bringing in shard level isolation. Since I already have some context to this, and as you are not planning yet, I would like to take that up. Please let me know if I can work on it. I have some ground work done around the approach.

Tim-Brooks commented 4 years ago

@getsaurabh02 If there are specific benchmarks that you think would provide useful information to this work feel free to share.

As I said this work in in-progress. Some of the potentially more advanced components (primaries being aware of the state of replicas, coordinating nodes being aware of primaries, etc) are definitely a ways off and will need more exploration.

if a node with multiple replica is under duress and is breaching the replica write limits, the situation can get even worse with more retries coming in

Rejections happen prior to enqueuing. They will not add to existing indexing load.

getsaurabh02 commented 4 years ago

Hi @tbrooks8,

Based on our discussion here, I have taken a stab at extending the current indexing pressure logic, which accounts for the memory build-up at the node level, to have shard level granularity as well. This allows cluster to track the indexing pressure build up, per shard, for every node role (coordinator, primary and replica), and then take an isolated decision for rejection. Now, Rejections are much more optimal and fair, as now only the requests (transport), which are targeted for the problematic shards/node are rejected, while still guaranteeing the availability of other nodes and shards.

I have also run some benchmark tests to demonstrate the gains, and assert on the functional behaviour. Sharing a quick result as below:

Cluster Setup : 3 Node m5 cluster with 32 GB heap.
Total Shards : 2 primary shards with one replica each. Allocation : Both primaries on Node-1, while replica-0 on Node-2 and replica-1 on Node-3. Induced Failure : Induced delay between Node-1 and Node-2. This will drop indexing requests targeted for replica-0 on Node-2, from Node-1. This causes a build up on primary node and hence primary will start rejecting the requests.

Result Metrics

Scenario 1 : With Node level Indexing Pressure

image (1)

Scenario 2 : With Shard level Indexing Pressure

Key Inferences and Gains observed from these tests

1. Optimal Rejections with Shard level Indexing Pressure :

The total number of Primary Rejections in Scenario-1 are 6 times the Scenario-2. This is because with node level indexing pressure, the request targeted for Node-3 (replica-1) were also discarded, and the total indexing rate dropped to zero. However with Scenario-2, the build up had an insolation, and only requests targeted for Node-2 (replica-0) were rejected. This allowed the cluster to continue to do the useful work, it was actually capable of doing.

The total number of Coordinator Rejections were 30 times less compared with shard level isolation again for the same reason. Thereby improved cluster availability.

2. Fairness in Rejections with Shard level Indexing Pressure :

Fairness is key with the granular tracking. Especially for large cluster and workload, having multiple indices and shards being operated simultaneously, having shard level accounting brings in the notion of fairness, against transient and partial failures.

3. Granular Control and Visibility into the system

Currently I am able to track and publish the memory build up metric (current and total) per shard, for every Node role. Such as for 'Shard-0' Coordinations node metrics on all node which can receive the request, Primary node metric on the node which holds the Primary copy of 'Shard-0' and Replica node metrics where replica copies for 'shard-0' are present.
These metric provide granular view of the system, and also the breakup of memory/rejections per shard. This allows better view of the workload, and understanding on how the request is behaving, from each node-role perspective. This allow an early indication of issues due to stuck states, hardware failure, hot shards etc.

In addition to the above gains, I have also found that with the availability of granular information, it is possible to take smarter decisions on rejections, than depending on static threshold, such as of 10% of heap. Since static thresholds might not always fit all the use-cases, across varying instance types.

Thoughts from PoC done on Dynamic Rejection Threshold in conjunction to Static Thresholds

With shard level accounting, I have been able to capture few advance metrics, such as mean turnaround time & throughput per shard, per node role. These can be modelled to be adaptive in nature.
Any sudden degradation in the mean latency or throughput can be detected at the shard level, for every role ie. coordinator, primary and replica.
These anomaly once detected at shard level can acts a secondary parameter. This when triggered in conjunction to soft-threshold limit breached (fixed), results in an actual rejection.
Node level limits can still act as the next line of defence, on top of shard level rejections, if required.

Do let me your thoughts on this. Since I have have the basic frame ready, and have experimented this out to some extent, I will like to take up and build the Shard Level Indexing Pressure piece as an extension to https://github.com/elastic/elasticsearch/pull/58885. Please let me know if I can come up with the task breakdown along with the low level plan for further discussions, which can then be followed by individual PRs.

getsaurabh02 commented 4 years ago

@tbrooks8 Can you please take a look and let me know your thoughts.

Tim-Brooks commented 4 years ago

Yes. Thank you for the ping. I will take a look tomorrow.

Tim-Brooks commented 4 years ago

Induced Failure : Induced delay between Node-1 and Node-2. This will drop indexing requests targeted for replica-0 on Node-2, from Node-1. This causes a build up on primary node and hence primary will start rejecting the requests.

Can you tell me a little more about the failure scenario here? How is your test introducing delay?

build the Shard Level Indexing Pressure

Can you describe high level specifics about what you are proposing? As I have mentioned in previous comments there are number of layers we intend to add. That does not mean we are ready to commit to shard level rejections without further exploration and discussions.

getsaurabh02 commented 4 years ago

Thank you @tbrooks8 for the reply.

Can you tell me a little more about the failure scenario here? How is your test introducing delay?

For the specific test here, I have run it by introducing network delays (via ip-table rule) between Node-1 (primary shard) and Node-2 (replica-shard). This is essentially to represent replica acting as a black hole. Other node level failure tests, with packet drop at Node-2 for all requests originating from Node-1, has shown very similar results.

Additionally, we have also performed tests to reproduce shard level issues, where one of the shards on node acts slow, by inducing artificial sleep, representing scenarios such as slow disk/partition. This has also helped demonstrate fairness in the rejection, happening only for problematic shard.

Can you describe high level specifics about what you are proposing?

Proposal is to track the indexing pressure build up, for every shard, at every node-role (i.e. coordinator, primary and replica). Under duress perform rejections at relevant node-role, but only for the specific shard, based on where the shard level quota is breached. Therefore guaranteeing fairness while performing the rejections on the nodes. Approach wise, this is a logical extension to the original frame you have added as part of Node level indexing pressure metrics. Granular calculation is achieved by plumbing the memory tracking logic into different code paths of the bulk execution flow, for every node-role, while maintaining dynamic structures per shard on the node. We have evaluated and performed benchmarks to ensure there are no regression or overhead out of the additional tracking being done here.

Also, I am proposing the shard level rejection limits to be dynamic, based on the latency and throughput degradation, as observed by the node (for every role). The node-level indexing pressure limits, can still act as the last line of defence, to prevent complete node failures in worst cases.

getsaurabh02 commented 3 years ago

Hi @tbrooks8 Can you please share your thoughts around the proposal here. Let me know if you would want to have more details around the implementation.

Tim-Brooks commented 3 years ago

I raised this and your benchmark scenario as a discussion topic for the team. I will respond with some feedback from those conversations in the next day or two.

Tim-Brooks commented 3 years ago

I had another team member take a look at your description and am still gathering feedback. Will get back to you soon.

Tim-Brooks commented 3 years ago

Hey, as I mentioned I raised this as a discussion topic with the team to see what others thought. We considered your scenario where replica retries can influence the total amount of coordinating / primary requests. After discussing we wanted to know a little more about this scenario before we decide how to proceed. Was your system configured with the TCP settings we suggest (https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config-tcpretries.html)? Generally we would view a network blackhole (a replica which hangs, but stays in the replication group) an issue beyond just back-pressure. That replica needs to fail and be removed. We encourage setting the TCP settings and expose retry timeouts to limit this scenario.

We are still considering whether a replica should lead to rejections at the primary / coordinating level. Is there a specific scenario that happens in production where you see this outside of a synthetic failure in a benchmark?

I'm not sure that we view shard level rejections as the next step as we believe the next step is a better queue sizing strategy (opposed to the current situation of 10K requests). So we are trying to get a handle on what scenario arise in the wild where this might happen. We did discuss if we should implement some type of limit to the number of replica retries, but decided to hold off until we had an indication this is a common problem in production.

As for your proposal, generally the problem is that there is not a clear proposal or defined changes here. The write pathway is under active development so it is difficult for me to encourage other changes (which requires committing to APIs, docs, etc) without more understanding of what exactly is being proposed.

getsaurabh02 commented 3 years ago

Thank you @tbrooks8 for the reply. I will get back to you with more details on the failure mode testing scenarios that we have considered, and will share more insights on the implementation strategy (proposal) in a day.

getsaurabh02 commented 3 years ago

Hi @tbrooks8 The scenario which we had demonstrated above was one of the few use cases which we had identified in the write path, getting benefitted with the Shard level Indexing Pressure. Broadly covering - Shard level failures & Node level failures.

Here, as part of Node level failures, we can think of Slow Node & Network Blackhole as distinct scenarios. While the former will not hit the TCP retransmission limit and eventually lead to build up on primary & coordinator nodes, the latter will hit the limit in few minutes, based on the configured value. Even in our case, as seen in the first graph above, we saw it in action in about 8 minutes, as the problematic replica shard failed on the original node, and got re-assigned to a different node. This allowed indexing to auto-resume without any manual mitigation (refer 16:18 time frame in graph 1). However, under such Blackhole scenarios we are still looking at the wins until the replica shard is failed. The TCP limit could need more thinking as making it aggressive such as 5 (~6 seconds) might have more implications, such as during longer GC pauses. Irrespective of the replica shard failure, once the slowness is observed, with shard level indexing pressure we are containing the situation and not pushing the other nodes (in the request path) to their limits. By identifying the degradation early, and doing active rejections in the problematic path, helps in keeping the memory profile of other nodes (coordinator & primary) low, preventing cluster from running into stress, and allowing productive work (with fairness).

Is there a specific scenario that happens in production where you see this outside of a synthetic failure in a benchmark

We have seen multiple issues in the indexing path arising due to Shard level and Node level failures. Primary contributors to such issue are due to Shard Level Issues - slow disk, stuck tasks, hot shards, traffic bursts, software bugs and Node Level Failures - slow hardware, network bottlenecks, asymmetric partitioning, etc. There are also unknowns which we keep discovering, and that adds to the need of having the necessary resiliency in the write path, to prevent the cluster when under duress.

Additionally, focus here is not just on the replica failure, since a similar slow-down can happen in the other paths as well, such as on the Primary Shard, resulting into a build-up on the Coordinator node. Taking an active and fair rejection decision on the coordinators, can prevent a large number fo nodes from getting overwhelmed, thereby reducing the cluster stress. Often we have seen coordinator node-drops due to OOM & long GC pauses, since the stuck requests cannot be garbage collected fast.

next step is a better queue sizing strategy (opposed to the current situation of 10K requests)

I agree and that has been one of the primary motivation of building the smarter mechanism for accounting the request build up on nodes with different roles. Overall an informed workload management. In order to keep the decisions more granular, shard level isolation of incoming requests makes perfect sense. In our proposal, I am suggesting to keep the node level indexing pressure limit (10% of heap) as the hard limit & last line of defence on the node. While for every request, shard-level tracking of count and memory occupancy to be performed for every node-role. This shard-level tracking is against the allocated shard-level quota on the node. Shard level quota for each shard starts from a small default, and increases based on several factors such as new incoming request for shard, current shard occupancy, shard throughput, and overall node level available room. More details on the shard level tracking and rejection strategy below.

what exactly is being proposed

Shard-Level Tracking (Primary Parameters) : Add shard level accounting on the node (memory and count) for every node role. For each index request targeted for a particular shard (index + shard-id), based upon the node's role in the write path, tracking is done. Similar to current Node level tracking, it is achieved by plumbing the tracking logic in the TransportActions such as TransportBulkAction for Coordinator Role, TransportWriteAction for Primary & Replica Operation. The tracking is done at the beginning of operation with Action Listeners supplied to mark the completion.
Shard-Level tracking view across cluster : With the above logic, for a Shard-0, coordinator tracking will be done on every node which receives a new index request for Shard-0. Primary View tracking will be done on the Node with primary Shard-0, while Replica View tracking will be done on the Node with replica Shard-0.
Secondary Parameters being tracked : Per shard tracking object on the node, along with the inflight request count and memory contribution, also tracks the shard throughput & last success time-stamp for the shard. This gives a clear breakup of a request, from the view of how coordinator, primary and replica are performing.
Shard Level Quota : From the Node level limit (10% of the heap size), shard level quota are carved and updated dynamically based on the current occupancy & throughput of the shard on the node. It also acknowledges the fact that there could be more than one hot shards on the node at any given point in time.
Dynamic update of shard level quota : Starts from a base value which is small (few MBs). Once the inflight request reaches 80% of the shard occupancy, the shard quota is evaluated for an increase. It is successfully increased to a higher value based on the performance of the shard's secondary parameters and overall available room at the node level. Similarly quota is adjusted as the traffic subsides (below 60%).
Shard State Transition - Shard can be assumed transitioning between the following state:
- Green : While shard quota utilisation is under 80%, and no increase is required.
- Yellow : When shard utilisation breaches 80% of the allocated quota, and secondary parameter needs to be evaluated to offer an increase.
- Red : When shard utilisation breaches 100% of the allocated quota, but no further increase can be offered. Rejection of additional request is bound to happen.
Rejection Strategy - Once the Primary Parameter is breached i.e. upon reaching 80% of the allocated shard quota shard transits to a Yellow state. Now, secondary Parameters are evaluated to verify if quota can be increased further to move it back to a Green state. This verifies any throughput degradation or stuck requests for the shard. If none are found and neither the Node level limit is reached, shard-level quota is increased. However, if secondary parameter signals a throughput degradation, shard is denied of any further increase in its quota, and thus moves to a Red state leading to rejections of additional requests.

We have taken due measures in ensuring the secondary parameters are adaptive in nature, to avoid any rejections arising due to transient issues. Let me know your thoughts on this. For further discussion and clarity, I can also raise a draft PR with the proposed changes. Hope this can help us take a decision if we can pursue this further.

getsaurabh02 commented 3 years ago

Hi @tbrooks8 Can you please share your thoughts around the detailed proposal here.

Tim-Brooks commented 3 years ago

Thanks for the ping. I raised this as a discussion topic with the team. Let me circle back and summarize the feedback and I will get back to you soon.

elastic / elasticsearch

Creating effective back-pressure in ES Write Path #59116