gardener / gardener


Distribute HTTP/2 calls across all API server instances #8810

Open timebertt opened 7 months ago

timebertt commented 7 months ago

How to categorize this issue?

/area control-plane networking scalability
/kind enhancement

Context

Since GEP-08, the istio ingress gateway forwards traffic to the respective shoot API server based on the SNI in the TLS ClientHello (see control plane endpoints). In this architecture, TLS connections are terminated by the individual API server instances. Once a TLS connection is established, all requests sent over it end up on the same API server instance. In other words, the istio ingress gateway performs L4 load balancing only and doesn't distribute individual requests across API server instances.
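For illustration, here is a minimal Go sketch of such SNI-based L4 routing – a toy stand-in for what the istio ingress gateway does, not its actual implementation; the listen address and the SNI-to-backend mapping are made up:

```go
package main

import (
	"bytes"
	"crypto/tls"
	"errors"
	"io"
	"net"
	"time"
)

// readOnlyConn lets crypto/tls parse a ClientHello from a reader without
// ever writing a reply, so the handshake can be aborted after peeking.
type readOnlyConn struct{ r io.Reader }

func (c readOnlyConn) Read(p []byte) (int, error)       { return c.r.Read(p) }
func (c readOnlyConn) Write(p []byte) (int, error)      { return 0, io.ErrClosedPipe }
func (c readOnlyConn) Close() error                     { return nil }
func (c readOnlyConn) LocalAddr() net.Addr              { return nil }
func (c readOnlyConn) RemoteAddr() net.Addr             { return nil }
func (c readOnlyConn) SetDeadline(time.Time) error      { return nil }
func (c readOnlyConn) SetReadDeadline(time.Time) error  { return nil }
func (c readOnlyConn) SetWriteDeadline(time.Time) error { return nil }

// peekSNI parses the SNI from the ClientHello and returns the bytes
// consumed so far, so they can be replayed to the chosen backend.
func peekSNI(conn net.Conn) (sni string, consumed []byte, err error) {
	var buf bytes.Buffer
	_ = tls.Server(readOnlyConn{io.TeeReader(conn, &buf)}, &tls.Config{
		GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
			sni = hello.ServerName
			return nil, errors.New("peek only") // abort the handshake
		},
	}).Handshake()
	if sni == "" {
		return "", nil, errors.New("no SNI in ClientHello")
	}
	return sni, buf.Bytes(), nil
}

func main() {
	// Hypothetical SNI -> backend mapping; the real gateway resolves the
	// shoot API server endpoints dynamically.
	backends := map[string]string{"api.shoot.example": "10.0.0.1:443"}

	ln, err := net.Listen("tcp", ":8443")
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go func(client net.Conn) {
			defer client.Close()
			sni, consumed, err := peekSNI(client)
			if err != nil {
				return
			}
			backend, err := net.Dial("tcp", backends[sni])
			if err != nil {
				return
			}
			defer backend.Close()
			backend.Write(consumed) // replay the ClientHello bytes
			go io.Copy(backend, client)
			io.Copy(client, backend) // TLS stays end-to-end: pure L4 relay
		}(conn)
	}
}
```

Note that the proxy never decrypts anything: it only peeks at the ClientHello and then relays raw bytes, which is exactly why all streams of one connection stick to one backend.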

Typical HTTP/2 implementations (e.g., in Go) reuse a single TCP/TLS connection as long as MAX_CONCURRENT_STREAMS is not reached – this was basically the promise of HTTP/2: many concurrent streams over one L4 connection. As a result, a typical controller based on client-go sends all of its API requests to a single API server instance. When HTTP/2 is deactivated, however, client-go opens a pool of TCP/TLS connections instead and distributes API requests across these L4 connections, which yields a good distribution across API server instances because the istio ingress gateway balances load on this layer.
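For reference, a minimal sketch of deactivating HTTP/2 in a client-go based component (the kubeconfig path is a placeholder; client-go also honors the DISABLE_HTTP2 environment variable):

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newHTTP1Clientset builds a clientset that negotiates HTTP/1.1 only.
// Without "h2" in the ALPN protocol list, client-go's transport keeps a
// pool of TCP/TLS connections and spreads requests across them instead of
// multiplexing everything over a single HTTP/2 connection.
func newHTTP1Clientset(kubeconfigPath string) (kubernetes.Interface, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	config.TLSClientConfig.NextProtos = []string{"http/1.1"}
	return kubernetes.NewForConfig(config)
}

func main() {
	// Placeholder path; in a real component the config usually comes from
	// in-cluster configuration or a flag.
	if _, err := newHTTP1Clientset("/path/to/kubeconfig"); err != nil {
		panic(err)
	}
}
```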

Problem Statement

With the current architecture, we can't make use of the HTTP/2 protocol in shoot API requests. In fact, activating HTTP/2 can even come with a performance and scalability penalty compared to HTTP/1.1. However, HTTP/2 is used by default by most clients, including client-go. We can observe an unequal distribution of API requests across API servers, especially in clusters with "big" operators that perform most of the API requests. This risks overloading individual API server instances while others sit idle.

As a consequence, the efficiency of autoscaling the API server vertically is reduced, because the resource footprint of the instances can differ significantly – VPA works best with equally sized instances. Also, when API server instances are terminated/rolled, their TLS connections are destroyed. The affected clients then reconnect and tend to flood a single other instance with requests (especially re-list and watch requests) instead of distributing them across all remaining healthy instances.

Ideas

This is not a fully-fledged proposal yet. The issue should serve as a starting point for discussing the problem and ideas. Based on the feedback, we could create a full proposal later on.

We could introduce L7 load balancing for shoot API servers, i.e., multiplex HTTP/2 streams from a single TLS connection to multiple instances. For this, we would need to terminate TLS connections earlier in the network flow, in a proxy in front of the API server. This proxy could either be the existing istio ingress gateway (global proxy) or an additional proxy per shoot control plane (local proxy). It would open backend connections to all API server instances and multiplex incoming streams over these backend connections.
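As a rough illustration of the local-proxy variant, a Go sketch using httputil.ReverseProxy that terminates TLS and picks a backend per request, i.e., per HTTP/2 stream (backend addresses, certificate paths, and the round-robin policy are assumptions, not a concrete design):

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical endpoints, one per kube-apiserver instance.
	backends := []*url.URL{
		{Scheme: "https", Host: "10.0.0.1:443"},
		{Scheme: "https", Host: "10.0.0.2:443"},
	}

	var next uint64
	proxy := &httputil.ReverseProxy{
		// Director runs once per request, so every HTTP/2 stream of a
		// single client connection can land on a different instance.
		Director: func(req *http.Request) {
			b := backends[atomic.AddUint64(&next, 1)%uint64(len(backends))]
			req.URL.Scheme = b.Scheme
			req.URL.Host = b.Host
		},
		// A real proxy would also set a Transport that trusts the API
		// server CA and presents the proxy's own client credentials.
	}

	// The proxy presents the server certificate clients expect from the
	// shoot API server (placeholder file names).
	panic(http.ListenAndServeTLS(":443", "tls.crt", "tls.key", proxy))
}
```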

In addition to presenting the expected server certificate to the client, the proxy would also need to translate L4 authentication (TLS client cert) into L7 authentication information, i.e., put the client certificate's CN/O into the --requestheader-*-headers configured in the API server (similar to https://github.com/envoyproxy/envoy/issues/6601). Envoy already supports the XFCC header for this, but the API server doesn't understand XFCC (https://github.com/kubernetes/kubernetes/issues/78252). Envoy could probably still be configured to pass the information in the format the API server expects, e.g., via a wasm plugin.
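To make the translation concrete, here is a sketch of what such a proxy would do per request, assuming the API server is started with --requestheader-username-headers=X-Remote-User and --requestheader-group-headers=X-Remote-Group (the header names and the demo certificate are made up for illustration):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"net/http"
)

// setFrontProxyHeaders translates the verified TLS client certificate into
// the request-header authentication headers, assuming the API server is
// configured with --requestheader-username-headers=X-Remote-User and
// --requestheader-group-headers=X-Remote-Group.
func setFrontProxyHeaders(req *http.Request) {
	// Drop any client-supplied values first; these headers must only ever
	// be set by the trusted front proxy.
	req.Header.Del("X-Remote-User")
	req.Header.Del("X-Remote-Group")

	if req.TLS == nil || len(req.TLS.PeerCertificates) == 0 {
		return
	}
	cert := req.TLS.PeerCertificates[0]
	req.Header.Set("X-Remote-User", cert.Subject.CommonName) // CN -> username
	for _, org := range cert.Subject.Organization {          // O  -> groups
		req.Header.Add("X-Remote-Group", org)
	}
}

func main() {
	// Tiny demonstration with a fabricated, already-verified client cert.
	req, _ := http.NewRequest(http.MethodGet, "https://api.shoot.example/api", nil)
	req.TLS = &tls.ConnectionState{PeerCertificates: []*x509.Certificate{{
		Subject: pkix.Name{CommonName: "gardener", Organization: []string{"system:masters"}},
	}}}
	setFrontProxyHeaders(req)
	fmt.Println(req.Header.Get("X-Remote-User"), req.Header["X-Remote-Group"])
}
```

The proxy itself would then authenticate to the API server with a client certificate signed by the CA configured via --requestheader-client-ca-file, which is what makes the API server trust these headers in the first place.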

timebertt commented 7 months ago

Things that still need to be discussed:

Please add your feedback to this issue :)

mwennrich commented 7 months ago

shouldn't the kube-apiserver --goaway-chance flag prevent this?

--goaway-chance float To prevent HTTP/2 clients from getting stuck on a single apiserver, randomly close a connection (GOAWAY). The client's other in-flight requests won't be affected, and the client will reconnect, likely landing on a different apiserver after going through the load balancer again. This argument sets the fraction of requests that will be sent a GOAWAY. Clusters with single apiservers, or which don't use a load balancer, should NOT enable this. Min is 0 (off), Max is .02 (1/50 requests); .001 (1/1000) is a recommended starting point.
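For illustration, a minimal Go sketch of the mechanism behind this flag (not the actual kube-apiserver code; the chance value and certificate paths are placeholders). It relies on Go's HTTP/2 server turning a "Connection: close" response header into a graceful GOAWAY:

```go
package main

import (
	"math/rand"
	"net/http"
)

// withProbabilisticGoaway mimics the idea behind --goaway-chance: for a
// small fraction of HTTP/2 requests, ask the server to close the client's
// connection gracefully. Go's HTTP/2 server translates the "Connection:
// close" response header into a GOAWAY frame, so in-flight streams finish
// while new requests force the client to reconnect (and be re-balanced).
func withProbabilisticGoaway(next http.Handler, chance float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && rand.Float64() < chance {
			w.Header().Set("Connection", "close")
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	handler := withProbabilisticGoaway(http.DefaultServeMux, 0.001)
	// HTTP/2 in net/http requires TLS; certificate paths are placeholders.
	panic(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", handler))
}
```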

timebertt commented 7 months ago

Interesting, I will try this out. So far, I assumed that sending a GOAWAY causes the client to establish a new TLS connection and send all future requests to a new server (only in-flight requests, i.e., long-running requests like watches, will stick to the old connection). However, this wouldn't distribute concurrent requests across API server instances, but rather make the client randomly jump from one server to the next at a regular rate. I might be mistaken though.

gardener-ci-robot commented 4 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to its staleness rules.

/lifecycle stale

gardener-ci-robot commented 3 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to its staleness rules.

/lifecycle rotten

timebertt commented 3 months ago

/remove-lifecycle rotten

I still want to try out the mentioned API server flag.

gardener-ci-robot commented 2 weeks ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to its staleness rules.

/lifecycle stale