envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

[Proposal] A 2-step Plan for Improving Envoy Redis Proxy Reliability #36962

Open dbarbosapn opened 1 week ago

dbarbosapn commented 1 week ago

A 2-step Plan for Improving Envoy Redis Proxy Reliability

Intro

Contributing to the Envoy Redis Proxy filter has helped me gain a deeper understanding of how it works, and especially of how it behaves under certain failure modes. For now, this issue can be assigned to me while I slowly get to work on a PoC, but I'm open to other takers, as long as @mattklein123 approves the idea 😄

Step 1: Solving the Latency snowball ⛄

Overview 👁️

We're using the project internally in production, and one of the most noticeable issues when sharing connections through a proxy, rather than connecting directly to cluster nodes, is the following:

sequenceDiagram
    actor Client as Redis Client
    participant Envoy as Envoy proxy
    participant Node1 as Redis cluster node (slot 1)
    participant Node2 as Redis cluster failed node (slot 2)

    Client->>Envoy: Read request for slot 2
    Envoy->>Node2: Read request for slot 2
    Client->>Envoy: Read request for slot 1
    Envoy->>Node1: Read request for slot 1
    Node1->>Envoy: Result of read request for slot 1
    Note over Envoy: Envoy waits for slot 2
    Node2-->>Envoy: Failure for slot 2
    Envoy-->>Client: Failure for slot 2
    Envoy-->>Client: Result for slot 1

Essentially, since Envoy acts as a single Redis node and results have to be returned to the client in order (see the sequence diagram above), requests to shards completely unrelated to the failed one still have to wait for the failed shard's result before they can be returned. Unfortunately, requests to failed shards do not fail fast, and usually translate into higher latencies (1 to 2 seconds, from a quick experiment). This gets significantly worse if further requests to the failed shard keep getting scheduled, causing a snowball effect ⛄

We see this happening in production. Below is the proposal for how to improve it.

Proposal 👨‍💻

The proposal involves two changes.

Redis Cluster - On the Redis cluster configuration, alongside the existing refresh options, there would be a new use_cluster_nodes (bool) option. When enabled, Envoy would run CLUSTER NODES during refreshes instead of CLUSTER SLOTS. This still provides the slot information, but additionally exposes the fail state of each node. With this, the cluster would keep track of a list of failed slot ranges.
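
For illustration, here is a rough sketch of how the refresh could derive failed slot ranges from a raw CLUSTER NODES reply. This is not actual Envoy code; the function name and the plain STL types are made up for the example:

```cpp
// Illustrative sketch only (not Envoy's actual code or types): derive failed slot
// ranges from a raw CLUSTER NODES reply. Each line looks roughly like:
//   <id> <ip:port@cport> <flags> <master-id> <ping> <pong> <epoch> <link-state> <slots...>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<uint32_t, uint32_t>>
parseFailedSlotRanges(const std::string& cluster_nodes_reply) {
  std::vector<std::pair<uint32_t, uint32_t>> failed_ranges;
  std::istringstream lines(cluster_nodes_reply);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string id, address, flags, master, ping, pong, epoch, link_state;
    fields >> id >> address >> flags >> master >> ping >> pong >> epoch >> link_state;

    // Flags are comma separated; compare whole tokens so that e.g. "nofailover"
    // is not mistaken for a failure. "fail?" is suspected, "fail" is confirmed.
    bool failed = false;
    std::istringstream flag_stream(flags);
    for (std::string flag; std::getline(flag_stream, flag, ',');) {
      failed = failed || flag == "fail" || flag == "fail?";
    }

    // Only master lines carry slot specs such as "0-5460" or a single "5461".
    for (std::string slot_spec; fields >> slot_spec;) {
      if (!failed || slot_spec.empty() || slot_spec[0] == '[') {
        continue; // skip healthy nodes and importing/migrating entries like "[93-<-...]"
      }
      const auto dash = slot_spec.find('-');
      const uint32_t start = std::stoul(slot_spec.substr(0, dash));
      const uint32_t end =
          dash == std::string::npos ? start : uint32_t(std::stoul(slot_spec.substr(dash + 1)));
      failed_ranges.emplace_back(start, end);
    }
  }
  return failed_ranges;
}
```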

Redis Filter - On the filter side, we would provide a new skip_failed_nodes option. This would add a check after computing the CRC16 hash of the key: if the calculated slot falls within a failed slot range, the request is aborted immediately. This is what would mitigate the latency snowball (a sketch of the lookup follows the note below).

[!NOTE]
For optimal performance, we can store the failed slot ranges in a binary search tree keyed by range start, making the lookup O(log n).
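
As a rough sketch (illustrative names, not Envoy's actual classes), the failed ranges could live in an ordered map, which gives exactly that O(log n) membership check:

```cpp
// Illustrative sketch (not an Envoy class): failed slot ranges kept in an ordered
// map keyed by range start. Ranges do not overlap because every slot is owned by
// exactly one master.
#include <cstdint>
#include <map>

class FailedSlotRanges {
public:
  void addRange(uint32_t start, uint32_t end) { ranges_[start] = end; }

  // `slot` is the usual CRC16(key) % 16384 result computed by the filter.
  bool contains(uint32_t slot) const {
    auto it = ranges_.upper_bound(slot); // first range starting strictly after `slot`
    if (it == ranges_.begin()) {
      return false; // no range starts at or before `slot`
    }
    --it; // the only candidate: the last range starting at or before `slot`
    return slot <= it->second;
  }

private:
  std::map<uint32_t, uint32_t> ranges_; // range start -> inclusive range end
};
```

With skip_failed_nodes enabled, the filter would run this check right after hashing the key and short-circuit with an error response instead of forwarding the command to the failed shard.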

Arguably, this should even be enabled by default, but that would be a breaking change for the API shepherds to decide (cc @wbpcode 😄). Anyway, I'll make it optional for now, of course 👍

The result ✅

sequenceDiagram
  actor Client as Redis Client
  participant Filter as Envoy Filter
  participant Cluster as Envoy Cluster
  participant Node1 as Redis cluster node (slot 1)
  participant Node2 as Redis cluster failed node (slot 2)

  rect rgba(0, 0, 255, 0.2)
    Note over Cluster: Async Refresh
    Cluster ->> Node1: CLUSTER NODES
    Node1 -->> Cluster: Cluster information
  end
  Client ->> Filter: Read request for slot 2
  Client ->> Filter: Read request for slot 1
  Filter ->> Cluster: Is slot 2 failed?
  Cluster -->> Filter: Yes, slot 2 is failed
  Filter -->> Client: Failed response for slot 2
  Filter ->> Cluster: Is slot 1 failed?
  Cluster -->> Filter: No, slot 1 is not failed
  Filter ->> Cluster: Forward read request for slot 1
  Cluster ->> Node1: Read request for slot 1
  Node1 -->> Cluster: Result of read request for slot 1
  Cluster -->> Filter: Result for slot 1
  Filter -->> Client: Result for slot 1

Step 2: Failover policy ▶️

Overview 👁️

Suppose that step 1 is done. Now, we could go the extra mile and provide an alternative when a cluster node is in a failed state.

The Redis filter already provides a very useful Mirroring Policy, which can be used not only to warm up caches but also to easily create a backup cluster.

Proposal 👨‍💻

Envoy could provide a failover_policy option, which would receive a cluster to use as the failover target. The idea is that the user could reuse an existing cluster that is already provided in the mirroring_policy.

Whenever a node is in a failed state, the algorithm would send the request to the provided failover cluster instead.
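
The routing decision itself would be tiny; something along these lines (purely illustrative, assuming the step 1 failed-slot tracking is in place):

```cpp
// Purely illustrative routing decision combining step 1 and step 2; the real logic
// would live in the Redis filter's command splitter / router.
enum class RouteTarget { Primary, Failover, Reject };

RouteTarget chooseTarget(bool slot_failed, bool has_failover_cluster, bool skip_failed_nodes) {
  if (!slot_failed) {
    return RouteTarget::Primary; // healthy slot: route as today
  }
  if (has_failover_cluster) {
    return RouteTarget::Failover; // step 2: send the command to the failover cluster
  }
  return skip_failed_nodes ? RouteTarget::Reject   // step 1: fail fast with an error reply
                           : RouteTarget::Primary; // current behavior: wait for the failed node
}
```

This also keeps the step 1 behavior intact for users who enable skip_failed_nodes without configuring a failover cluster.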

The result ✅

sequenceDiagram
  actor Client as Redis Client
  participant Filter as Envoy Filter
  participant Cluster as Envoy Cluster
  participant Node1 as Redis cluster 1 (failed nodes)
  participant Node2 as Redis cluster 2

  %% Section for async refresh
    rect rgba(0, 0, 255, 0.2)
    Note over Cluster: Async Refresh
    Cluster->>Node1: CLUSTER NODES
    Node1-->>Cluster: Cluster information
    end

  Client ->> Filter: Read request
  Filter ->> Cluster: Is slot failed?
  Cluster -->> Filter: Yes, slot is failed
  Filter ->> Node2: Read request
  Node2 -->> Filter: Result of read request
  Filter -->> Client: Result

ramaraochavali commented 1 week ago

FWIW, similar issues were discussed in the past:

https://github.com/envoyproxy/envoy/issues/23517 https://github.com/envoyproxy/envoy/issues/24960

I think the async mechanism can still have problems, because there is a time gap between when Envoy detects the failure and when requests get queued for the failed node?

dbarbosapn commented 1 week ago

FWIW, similar issues were discussed in the past:

#23517 #24960

I think the async mechanism can still have problems, because there is a time gap between when Envoy detects the failure and when requests get queued for the failed node?

For sure it can! But it's optional and can be tuned by using a smaller refresh window. For our production workloads, we'd rather spend an extra second on an already failed shard than sacrifice commands targeting completely healthy shards.

The whole idea of the proposal is to always make it optional 😄

KBaichoo commented 1 week ago

cc @weisisea @mattklein123 as codeowners

ramaraochavali commented 6 days ago

The whole idea of the proposal is to always make it optional

I get that it is optional :-). The other point is that a slow shard can also block/slow down requests from other, faster shards. It would be good if there were a common solution for that as well.

dbarbosapn commented 6 days ago

The other point is that a slow shard can also block/slow down requests from other, faster shards. It would be good if there were a common solution for that as well.

This is a great point. With a single cluster we can't really speed that up, since we need to return the commands in order anyway. But once we implement the failover policy, we could do a follow-up improvement with a request hedging mechanism, where Envoy would go cross-cluster for shards that are too slow (due to high load, for example) and return whichever response arrives first.
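
To make the idea concrete, here is a toy sketch of the hedging semantics. It is deliberately simplified (blocking futures instead of Envoy's event loop, invented function names) and only meant to show the "first response wins" behaviour:

```cpp
// Toy sketch of request hedging semantics only; names and structure are made up
// for the example, and Envoy would not block on futures like this.
#include <chrono>
#include <functional>
#include <future>
#include <string>

std::string hedgedRequest(std::function<std::string()> primary,
                          std::function<std::string()> failover,
                          std::chrono::milliseconds hedge_delay) {
  auto primary_future = std::async(std::launch::async, primary);
  // Fast path: the primary cluster answers before the hedge delay expires.
  if (primary_future.wait_for(hedge_delay) == std::future_status::ready) {
    return primary_future.get();
  }
  // Primary is slow: also fire the request at the failover cluster and take
  // whichever response arrives first.
  auto failover_future = std::async(std::launch::async, failover);
  while (true) {
    if (primary_future.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) {
      return primary_future.get();
    }
    if (failover_future.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) {
      return failover_future.get();
    }
  }
}
```

In the real filter this would be built on Envoy's existing asynchronous request handling rather than threads, but the first-response-wins logic would be the same.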