envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

high-latency connections experience report #9112

bobby-stripe opened this issue 4 years ago (status: Open)

bobby-stripe commented 4 years ago

Summary: Implementing HTTP/2 connection pooling and connection-level outlier detection could improve Envoy's resiliency over high-latency links and when Envoy is used behind transparent L4 load balancers.

We’ve been running Envoy to connect parts of our infrastructure together, as Dylan talked about in his EnvoyCon talk. I wanted to share a bit of an experience report to highlight a few issues we’ve hit along the way. I realize this isn't a traditional issue report - let me know if there is a better venue or format that would help.

Load Balancers

We use L4 load balancers as the gateways for new connections between isolated network segments, as in the diagram below:

[diagram: a downstream Envoy in VPC 1 reaching upstream hosts in VPC 2 only through an L4 load balancer]

In this example, the downstream Envoy in VPC 1 can reach several upstream hosts in VPC 2 only through an L4 load balancer. This load balancer's IP address is the single upstream in the host 1 Envoy's cluster.

The major consequence of this is that Envoy's active health checks are of limited utility: they can detect a cluster-wide issue, but otherwise reflect only the health of a single host behind the load balancer (unless reuse_connection is explicitly set to false, in which case each health check is serviced by a random host).
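
For concreteness, here's a minimal sketch of the kind of cluster we're describing - the L4 LB's address is the only endpoint, and reuse_connection is disabled on the health check so each probe lands on a (random) backend. The cluster name, address, and health check path are made up:

```yaml
clusters:
- name: vpc2_via_l4_lb            # hypothetical cluster fronted by the L4 load balancer
  type: STRICT_DNS
  connect_timeout: 1s
  load_assignment:
    cluster_name: vpc2_via_l4_lb
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: lb.vpc2.internal   # the LB is the cluster's only "host"
              port_value: 443
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 1
    reuse_connection: false       # new connection per check, so each check hits a random backend
    http_health_check:
      path: /healthz              # hypothetical health check path
```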

Few Upstreams

Related to the use of load balancers, we have cases where we have a limited number of upstreams in a cluster, often 1 or 2:

[diagram: a downstream Envoy whose cluster contains only two upstream L4 load balancers, one fronting VPC 2 and one fronting VPC 3]

In a steady state we may split traffic between two upstream load balancers, but during a deploy (utilizing blue/green clusters) we send all traffic through a single load balancer.

This has given us FUD around enabling outlier detection - if we're routing all traffic to VPC 2 during a deploy to VPC 3, and a single host in VPC 2 (out of 12) is having an issue, we wouldn't want the downstream host 1's Envoy to enter panic routing and split traffic between VPC 2 and VPC 3. Some of this may be inexperience with outlier detection (I would love to hear if there is a straightforward misunderstanding or a way to work around this), but because outlier detection attributes what it observes on connections to upstream hosts - and our "upstream host" is really a load balancer fronting many hosts - it feels like there is a fundamental mismatch here for us.
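
For reference, the two knobs that interact with the scenario above are the panic threshold on the cluster's load balancer and the ejection cap in outlier detection. A minimal sketch (cluster name and values are hypothetical, endpoints elided):

```yaml
clusters:
- name: vpc2_and_vpc3_lbs         # hypothetical cluster containing both LB endpoints
  # ... endpoints elided ...
  common_lb_config:
    healthy_panic_threshold:
      value: 0                    # 0 disables panic routing entirely; the default is 50(%)
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 50      # caps how many hosts can be ejected at once (default 10)
```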

Another consequence that we have worked around here is that we end up with uneven request distribution across upstream hosts. We believe this is an interaction between having a small number of upstreams and the fact that, by default, new incoming connections to the downstream Envoy are likely to end up on the same thread/event loop (as that thread is cache/scheduler hot and likely to win the accept(2) race). We are excited to test the connection balancer in Envoy 1.12. Our current workaround is listing upstreams multiple times in a cluster, so that each event loop will open multiple connections through the load balancer.
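
For anyone else hitting this, the 1.12 listener option we plan to test looks roughly like this (listener name and port are made up, filters elided):

```yaml
listeners:
- name: ingress
  address:
    socket_address: { address: 0.0.0.0, port_value: 10000 }
  # Envoy 1.12+: hand accepted connections out evenly across worker threads
  # instead of letting whichever thread wins the accept(2) race keep them all
  connection_balance_config:
    exact_balance: {}
  filter_chains:
  - filters: []                   # HTTP connection manager etc. elided
```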

One thing to note is that we use mTLS with client and server certificates for all connections, which means that when connecting through an L4 LB, the established connection can be tied to a specific upstream identity. It would be possible for a downstream Envoy to learn the members of an upstream cluster behind a load balancer over time. (This feels like either a great idea or a terrible idea).

Connection-level problems

When envoys on either end of long-lived HTTP/2 connections are in different data centers with tens or hundreds of milliseconds of latency between them, we regularly observe connection-level problems:

[diagram: long-lived HTTP/2 connections between Envoys in different data centers, with tens to hundreds of milliseconds of latency between them]

We most often see network issues where a small percentage of TCP connections between hosts in different regions are affected. I believe most commonly a single direction of the TCP stream is affected - either packets containing requests have a hard time getting from host 1 to host 2, or packets containing responses have a hard time getting from host 2 to host 1.

While a network issue is happening, active healthchecks typically keep passing. We run with reuse_connection set to false for healthchecks, and new connections rarely experience a problem (presumably new connections get routed around whatever the underlying physical problem is).

Relatively quickly, one side or the other of the connection will fill the kernel's TCP write buffer and Envoy's flow control will trip. I think Envoy will continue queueing new requests onto upstream connections that are paused.

Next Steps

A few things that have already been discussed in the community seem like they would significantly improve the situation, so I want to highlight them here.

Connection-level outlier detection

Outlier detection should be able to mark a connection as unhealthy (in addition to an upstream host). I think the tricky part is figuring out how this interacts with existing upstream-level outlier detection. Maybe if the percentage of connections to an upstream triggering outlier detection passes a threshold, the upstream is marked as unhealthy. If that threshold is 0, you get the current behavior, but users could configure it higher for situations like ours.

HTTP/2 Connection pooling and per-connection concurrency limits

Ideally we would want to limit the blast radius of a bad HTTP/2 connection - one way to do this would be to have a concurrency limit on HTTP/2 connections. If we hit the limit (as an example: 500 outstanding requests/streams), we open a new connection. This is tracked in #7403.

snowp commented 4 years ago

Support for load balancing through an L4 LB is generally not great with Envoy, for many of the reasons you mentioned. Some thoughts:

I think a relatively easy feature to add to the outlier detection feature set is having the outlier detection not mark an upstream as unhealthy, but just close the connection. This would trigger a new connection to be created without marking the host as down.

We do support a per-connection concurrency limit on H/2 streams - see max_concurrent_streams on the Cluster. This does create a new connection when the concurrency limit is hit, but because we don't pool HTTP/2 connections beyond that, it ends up just redirecting the traffic to another upstream, not splitting it. It also doesn't help if you're staying below the threshold, which can be a problem for long-lived, expensive streams. Having support for multiple HTTP/2 connections being managed at the same time would definitely help, and is tracked in the issue you linked.
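
A minimal sketch of where that knob lives (the cluster name and value are made up, and newer Envoy versions move HTTP/2 options under typed_extension_protocol_options):

```yaml
clusters:
- name: upstream_h2               # hypothetical cluster
  # ... endpoints elided ...
  http2_protocol_options:
    max_concurrent_streams: 500   # above this, Envoy opens a new connection (in practice, to another upstream host)
```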

An approach I've seen here when doing client side load balancing through an L4 LB is to periodically re-establish connections (say every 5 min) as a way to even out the connections. Events such as scaling up the number of upstream hosts tend to leave connections imbalanced until they are recycled in some way, and periodically recycling the connections achieves that even for connections with very low concurrency.
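
One Envoy-native way to get a similar recycling effect (by request count rather than elapsed time) is the cluster-level max_requests_per_connection field; a hypothetical sketch:

```yaml
clusters:
- name: upstream_via_lb           # hypothetical cluster
  # ... endpoints elided ...
  max_requests_per_connection: 10000   # recycle each connection after ~10k requests
```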

Re active health checking, depending on the L4 LB in use, a plain TCP connect health check might work better: if the LB fails the TCP handshake when all backend hosts are down, it would be a reasonable way to detect that the LB itself (rather than any one particular backing host) is unable to service traffic.
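
A connect-only check is just a health check entry with an empty tcp_health_check (timing values here are made up):

```yaml
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 3
  healthy_threshold: 1
  tcp_health_check: {}            # no send/receive payload: success == the TCP handshake completes
```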

mattklein123 commented 4 years ago

Thanks for the awesome write-up @bobby-stripe!

I think a relatively easy feature to add to the outlier detection feature set is having the outlier detection not mark an upstream as unhealthy, but just close the connection. This would trigger a new connection to be created without marking the host as down.

+1. This might be a little tricky as outlier detection is disjoint from the connection pool, but I think we could figure out a way to do it without major surgery.

bobby-stripe commented 4 years ago

thanks @mattklein123 and @snowp!

An approach I've seen here when doing client side load balancing through an L4 LB is to periodically re-establish connections (say every 5 min) as a way to even out the connections.

I see under HTTPProtocolOptions there is a max_connection_duration that can be set on the listener (but not on an upstream) - we could give that a try. I think our FUD around this in the past has been the cost of establishing a large number of TLS connections at once if many clients end up on a synchronized schedule (like when Envoys are HUPed in a deploy). Is there any current way to add or ensure jitter here?
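
For concreteness, a sketch of where that field sits today, on the listener side inside the HTTP connection manager (the duration is made up):

```yaml
# inside the listener's HttpConnectionManager config
common_http_protocol_options:
  max_connection_duration: 300s   # drain and close each downstream connection after ~5 minutes
```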

Another thought is that we could use HTTP/2 PING frames as cheap connection-level healthchecks, or as another signal (stats about RTT) for outlier detection.
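
(For reference, newer Envoy versions expose an HTTP/2 keepalive knob along these lines - a hedged sketch to be verified against the docs for the version in use, with made-up values:)

```yaml
# cluster-level HTTP/2 options, assuming a version with keepalive support
http2_protocol_options:
  connection_keepalive:
    interval: 30s          # send an HTTP/2 PING after 30s without activity
    timeout: 5s            # close the connection if the PING isn't acked in time
    interval_jitter:
      value: 15            # add up to 15% jitter to the interval
```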