Open · timebertt opened this issue 1 year ago
Things that still need to be discussed:
Please add your feedback to this issue :)
Shouldn't the kube-apiserver --goaway-chance flag prevent this?
--goaway-chance float To prevent HTTP/2 clients from getting stuck on a single apiserver, randomly close a connection (GOAWAY). The client's other in-flight requests won't be affected, and the client will reconnect, likely landing on a different apiserver after going through the load balancer again. This argument sets the fraction of requests that will be sent a GOAWAY. Clusters with single apiservers, or which don't use a load balancer, should NOT enable this. Min is 0 (off), Max is .02 (1/50 requests); .001 (1/1000) is a recommended starting point.
Interesting, I will try this out.
So far, I assumed that sending a GOAWAY will cause the client to establish a new TLS connection and send all future requests to a new server (only in-flight requests, i.e. long-running requests like watches, will stick). However, this won't distribute concurrent requests across API server instances, but rather make the client randomly jump from one server to the next over time.
I might be mistaken though.
The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle stale
- /lifecycle rotten
- /close

/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
I still want to try out the mentioned API server flag.
/lifecycle stale
/lifecycle rotten
/close
@gardener-ci-robot: Closing this issue.
/reopen
This is still a thing we haven't solved: HTTP/2 requests are unevenly distributed across KAPI instances, causing very uneven load and issues with vertical scaling that are hard to solve by other means.
@voelzmo: Reopened this issue.
@timebertt In case you're still thinking about the --goaway-chance flag, @hendrikKahl and I recently found an interesting blog post containing some more implications: https://newsroom.aboutrobinhood.com/goaway-chance-chronicles-of-client-connection-lost/
And regarding:

"Interesting, I will try this out. So far, I assumed that sending a GOAWAY will cause the client to establish a new TLS connection and send all future requests to a new server (only in-flight requests, i.e. long-running requests like watches, will stick). However, this won't distribute concurrent requests across API server instances, but rather make the client randomly jump from one server to the next over time. I might be mistaken though."
I think this is exactly how it works. So as long as more than one client is responsible for the uneven load, this flag might be able to help and re-distribute the clients across the KAPI instances. If the high load is caused by a single client, it doesn't do much and just moves that client to another KAPI instance.
/lifecycle frozen
How to categorize this issue?
/area control-plane networking scalability
/kind enhancement
Context
Since GEP-08, the istio ingress gateway passes traffic on to the respective shoot API server based on the TLS SNI extension (see control plane endpoints). In this architecture, TLS connections are terminated by the individual API server instances. After establishing a TLS connection, all requests sent over this connection end up on the same API server instance. I.e., the istio ingress gateway performs L4 load balancing only and doesn't distribute individual requests across API server instances.
Usual HTTP/2 implementations (e.g., in Go) use a single TCP/TLS connection as long as MAX_CONCURRENT_STREAMS is not reached (this was basically the promise of HTTP/2 – to reuse a single L4 connection for many concurrent streams). With this, a typical controller based on client-go sends all of its API requests to a single API server instance. When HTTP/2 is deactivated, however, client-go opens a pool of TCP/TLS connections instead and distributes API requests across these L4 connections, which achieves a good distribution across API server instances because the istio ingress gateway balances load on this layer.
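To make that fallback concrete, here is a minimal client-go sketch (not existing Gardener code; the function name and kubeconfig wiring are illustrative) that opts out of HTTP/2 by offering only http/1.1 during the TLS handshake, so the transport keeps a pool of L4 connections:

```go
package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// buildHTTP1Client returns a clientset whose TLS handshake only offers
// HTTP/1.1. With HTTP/2 disabled, the Go transport maintains a pool of
// TCP/TLS connections instead of multiplexing all requests over a single
// connection, so an L4 load balancer can spread them across API servers.
func buildHTTP1Client(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Offering only http/1.1 via ALPN opts this client out of HTTP/2.
	cfg.TLSClientConfig.NextProtos = []string{"http/1.1"}
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := buildHTTP1Client(os.Getenv("KUBECONFIG")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

With such a client, the istio ingress gateway can spread the individual connections (and thereby the requests) across API server instances, at the cost of more TCP/TLS connections per client.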
Problem Statement

With the current architecture, we can't make use of the HTTP/2 protocol in shoot API requests. In fact, activating HTTP/2 can come with a performance and scalability penalty compared to HTTP/1.1. However, HTTP/2 is used by default in most clients like client-go. We can observe an unequal distribution of API requests across API servers, especially in clusters with "big" operators performing most of the API requests. This comes with the potential of overloading individual API server instances while other instances are idling.
As a consequence, the efficiency of autoscaling the API server vertically is reduced, because the resource footprint of instances can differ significantly. VPA works best with equally sized instances. Also, when API servers are terminated/rolled, individual TLS connections are destroyed. This leads to clients reconnecting to other instances and flooding one of them with requests (especially re-list and watch requests) instead of distributing requests across other healthy instances.
Ideas
This is not a fully-fledged proposal yet. The issue should serve as a starting point for discussing the problem and ideas. Based on the feedback, we could create a full proposal later on.
We could introduce L7 load balancing for shoot API servers, i.e., multiplexing HTTP/2 streams from a single TLS connection to multiple instances. For this, we would need to terminate TLS connections earlier in the network flow in a proxy before the API server. This proxy could either be the existing istio ingress gateway (global proxy) or an additional proxy per shoot control plane (local proxy). This proxy would open backend connections to all API server instances and multiplex incoming streams over these backend connections.
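As a rough sketch of the "local proxy" variant, the following Go snippet terminates TLS itself and balances each incoming request (i.e., each HTTP/2 stream) across a static list of backend API server instances. The backend hostnames, certificate paths, and plain round-robin policy are illustrative assumptions only, and backend mTLS/transport configuration is omitted for brevity:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// backends are the individual kube-apiserver instances (illustrative names).
var backends = []*url.URL{
	{Scheme: "https", Host: "kube-apiserver-0.local:443"},
	{Scheme: "https", Host: "kube-apiserver-1.local:443"},
	{Scheme: "https", Host: "kube-apiserver-2.local:443"},
}

var next uint64

// pick returns the next backend in round-robin order, so each HTTP/2 stream
// of a single client connection can land on a different instance.
func pick() *url.URL {
	n := atomic.AddUint64(&next, 1)
	return backends[n%uint64(len(backends))]
}

func main() {
	proxy := &httputil.ReverseProxy{
		Director: func(req *http.Request) {
			// Load balancing happens per request (L7), not per connection (L4).
			target := pick()
			req.URL.Scheme = target.Scheme
			req.URL.Host = target.Host
		},
		// Note: a real proxy would also configure a Transport with the API
		// server CA and suitable backend credentials, omitted here.
	}
	// The proxy terminates the client's TLS connection; cert/key paths are
	// placeholders for the shoot API server's serving certificate.
	log.Fatal(http.ListenAndServeTLS(":443", "tls.crt", "tls.key", proxy))
}
```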
In addition to presenting the expected server certificate to the client, the proxy would also need to translate L4 authentication (TLS client certificates) into L7 authentication information, i.e., put the client certificate's CN/O into the --requestheader-*-headers configured in the API server (similar to https://github.com/envoyproxy/envoy/issues/6601). Envoy already supports the XFCC header for this, but the API server doesn't understand the XFCC header (https://github.com/kubernetes/kubernetes/issues/78252). Probably, Envoy can still be configured to pass the information in the format expected by the API server using a wasm plugin.
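A possible shape of that translation step, sketched in Go rather than as an Envoy/wasm configuration: the proxy copies the verified client certificate's CN/O into identity headers before forwarding the request. The header names X-Remote-User and X-Remote-Group are only common defaults and would have to match whatever --requestheader-*-headers is set to on the API server; none of this is existing Gardener code:

```go
package main

import (
	"log"
	"net/http"
)

// withRequestHeaderAuth copies the verified client certificate's subject into
// the headers that kube-apiserver's request-header authenticator reads. The
// header names must match --requestheader-username-headers and
// --requestheader-group-headers; X-Remote-User / X-Remote-Group are just
// common defaults used here for illustration.
func withRequestHeaderAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Never forward identity headers supplied by the client itself.
		r.Header.Del("X-Remote-User")
		r.Header.Del("X-Remote-Group")

		if r.TLS != nil && len(r.TLS.PeerCertificates) > 0 {
			cert := r.TLS.PeerCertificates[0]
			// CN becomes the user name, O entries become the groups, mirroring
			// how kube-apiserver itself interprets client certificates.
			r.Header.Set("X-Remote-User", cert.Subject.CommonName)
			for _, org := range cert.Subject.Organization {
				r.Header.Add("X-Remote-Group", org)
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Illustrative wiring only: a real proxy would also configure the TLS
	// listener to request and verify client certificates (ClientAuth/ClientCAs)
	// and would forward to the API server instead of a placeholder handler.
	log.Fatal(http.ListenAndServeTLS(":443", "tls.crt", "tls.key",
		withRequestHeaderAuth(http.NotFoundHandler())))
}
```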