dbarbosapn opened 1 week ago
FWIW, similar issues discussed in the past
https://github.com/envoyproxy/envoy/issues/23517 https://github.com/envoyproxy/envoy/issues/24960
I think the async mechanism can still have problems, because there is a time gap between when Envoy detects the failure and when requests are queued for the failed node?
> I think the async mechanism can still have problems, because there is a time gap between when Envoy detects the failure and when requests are queued for the failed node?
For sure it can! But it's optional and can be tuned by having a smaller refresh window. In production workloads, we rather have 1s longer on a failed shard rather than sacrificing commands targeting different shards which are completely healthy.
The whole idea of the proposal is to make it optional always 😄
cc @weisisea @mattklein123 as codeowners
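For context, the refresh window mentioned above maps to the existing Redis cluster refresh settings. A minimal sketch, trimmed to the relevant fields (values are illustrative and the Struct-based shape follows my reading of the current Redis cluster docs):

```yaml
clusters:
- name: redis_main
  connect_timeout: 1s
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.redis
    typed_config:
      "@type": type.googleapis.com/google.protobuf.Struct
      value:
        cluster_refresh_rate: 1s        # smaller refresh window: failed nodes are noticed sooner
        cluster_refresh_timeout: 500ms  # bound how long a topology refresh may take
```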
> The whole idea of the proposal is to make it optional always
I get that it is optional :-). The other point is that a slow shard can also block or slow down requests from other, faster shards. If there is a common solution for that as well, it would be good.
> The other point is that a slow shard can also block or slow down requests from other, faster shards. If there is a common solution for that as well, it would be good.
This is a great point. On a single cluster we can't really speed that up, since we need to return the command responses in order anyway. But once we implement the failover policy, we could do a follow-up improvement with a request hedging mechanism, where it would go cross-cluster for shards that are too slow (due to high load, for example) and return from whichever responds first.
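Just to sketch that follow-up idea (nothing here exists today; the field names are purely illustrative and would build on the `failover_policy` proposed below):

```yaml
# Hypothetical, not an existing or proposed API yet:
# hedge to the failover cluster when the primary shard is slow.
failover_policy:
  cluster: redis_backup
  hedge_after: 250ms   # if the owning shard in the primary cluster hasn't answered by now,
                       # also send the request to redis_backup and return whichever replies first
```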
A 2-step Plan for Improving Envoy Redis Proxy Reliability
Intro
Contributing to the Envoy Redis Proxy filter has helped me get a deeper understanding of how it works, and especially of how it behaves under certain failure modes. For now, this issue can be assigned to me while I slowly get to work on a PoC for it, but I'm open to other takers, as long as @mattklein123 approves the idea 😄
Step 1: Solving the Latency snowball ⛄
Overview 👁️
We're using the project internally in production, and one of the most noticeable issues when sharing connections through a proxy, rather than connecting directly to the cluster nodes, is the following:

Essentially, since Envoy acts as a single Redis node and results have to come back in order (see image below), requests to shards completely unrelated to a failed one still have to wait for its result before they can return. Unfortunately, requests to failed shards do not fail immediately, and usually translate into higher latencies (1 to 2 seconds, from a quick experiment). This gets significantly worse if further requests to the failed shard get scheduled sequentially, causing a snowball effect ⛄

We see this happening in production. Below is the proposal on how to improve this.
Proposal 👨‍💻
The proposal involves two changes.
1. Redis Cluster - On the Redis cluster configuration, along with the refresh options, there would be a new `use_cluster_nodes` (bool) option. When enabled, cluster refreshes would run `CLUSTER NODES` instead of `CLUSTER SLOTS`. This still provides the slot information, but additionally lets Envoy store the fail state of each node, so the cluster would keep track of a list of failed slot ranges.
2. Redis Filter - On the filter side, we would provide a new `skip_failed_nodes` option. This would add a check after running CRC16: if the calculated slot falls within a failed slot range, the request is aborted immediately. This is what would mitigate the latency snowball.

Arguably, in my opinion, `skip_failed_nodes` should even be true by default, but that would be a breaking change for the API shepherds to decide (cc @wbpcode 😄). Anyway, I'll make it optional for now, of course 👍
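To make the shapes concrete, here is a rough sketch of how the two proposed options could look in a config. Neither `use_cluster_nodes` nor `skip_failed_nodes` exists today, and their exact names and placement are open; the rest is trimmed boilerplate:

```yaml
static_resources:
  listeners:
  - name: redis_listener
    address: { socket_address: { address: 0.0.0.0, port_value: 6379 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: redis
          settings: { op_timeout: 1s }
          prefix_routes:
            catch_all_route: { cluster: redis_main }
          skip_failed_nodes: true   # proposed: abort requests whose CRC16 slot is in a failed range
  clusters:
  - name: redis_main
    connect_timeout: 1s
    lb_policy: CLUSTER_PROVIDED
    cluster_type:
      name: envoy.clusters.redis
      typed_config:
        "@type": type.googleapis.com/google.protobuf.Struct
        value:
          cluster_refresh_rate: 5s
          use_cluster_nodes: true   # proposed: refresh via CLUSTER NODES to learn each node's fail state
```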
The result ✅
Step 2: Failover policy ▶️
Overview 👁️
Suppose that step 1 is done. Now, we could go the extra mile and provide an alternative when a cluster node is in a failed state.
The Redis filter already provides a very useful Mirroring Policy, which can be used not only to warm up caches but also to easily create a backup cluster.
Proposal 👨‍💻
Envoy could provide a `failover_policy` option. This one would receive a `cluster` to use as failover. The idea is that the user could reuse an existing cluster that is already provided on the `mirroring_policy`. Whenever a node is in a failed state, the algorithm would call the provided cluster instead.
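A hedged sketch, assuming `failover_policy` ends up shaped like the existing `request_mirror_policy` (the failover field itself is hypothetical):

```yaml
prefix_routes:
  catch_all_route:
    cluster: redis_main
    request_mirror_policy:    # existing: keeps redis_backup warm with mirrored traffic
    - cluster: redis_backup
    failover_policy:          # proposed (hypothetical name/shape)
      cluster: redis_backup   # when the target node in redis_main is marked failed,
                              # send the request to redis_backup instead
```

Reusing the mirror cluster as the failover target has the nice property that the backup is already warm, so failed-over requests should hit a populated cache.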
The result ✅