istvanballok opened this issue 2 years ago
Initially we explored parsing the etcd debug logs and adding up the response sizes of the range requests, broken down by resource type. We used a heuristic to guess the resource type from the etcd object key string.
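For context, a minimal sketch of that kind of heuristic, assuming the default `/registry/<resource>/...` key layout used by the kube-apiserver (the function name `resourceTypeFromKey` is made up for this illustration; custom resources and aggregated APIs use extra path segments, so this is best effort only):

```go
package main

import (
	"fmt"
	"strings"
)

// resourceTypeFromKey guesses the resource type from an etcd object key.
// It assumes the default key layout /registry/<resource>/<namespace>/<name>;
// keys with additional segments (e.g. custom resources) are not handled exactly.
func resourceTypeFromKey(key string) string {
	parts := strings.Split(strings.TrimPrefix(key, "/registry/"), "/")
	if len(parts) == 0 || parts[0] == "" {
		return "unknown"
	}
	return parts[0]
}

func main() {
	for _, key := range []string{
		"/registry/secrets/kube-system/bootstrap-token",
		"/registry/leases/kube-node-lease/node-1",
	} {
		fmt.Printf("%s -> %s\n", key, resourceTypeFromKey(key))
	}
}
```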
This approach requires a custom etcd image. In the meantime, new metrics have been added to the api server in Kubernetes 1.23.
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/CHANGELOG/CHANGELOG-1.23.md#feature-6

> The kube-apiserver's Prometheus metrics have been extended with some that describe the costs of handling LIST requests. They are as follows. ...

https://github.com/kubernetes/kubernetes/pull/104983
We shall explore those metrics and possibly extend them with a new metric that reports not the number of fetched objects but their total size. That should help us reason about the network bandwidth usage between the api server and etcd.
We could get the response size from the size of the `data` variable, around here: https://github.com/kubernetes/kubernetes/blob/5b489e2846a7fb959252dc5a04fe21ec844e9fad/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L773-L778.
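A minimal sketch of the kind of instrumentation we have in mind, using plain client_golang for brevity (the api server actually registers its metrics via k8s.io/component-base/metrics): the metric name `apiserver_storage_list_fetched_bytes_total` and the helper `recordFetchedBytes` are hypothetical and would be decided in the implementation PR.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// fetchedBytes is a hypothetical counter that would accumulate the size of the
// objects read from etcd for list requests, labelled by resource type,
// alongside the existing fetched-objects counters.
var fetchedBytes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_storage_list_fetched_bytes_total",
		Help: "Total size in bytes of the objects read from etcd for list requests.",
	},
	[]string{"resource"},
)

func init() {
	prometheus.MustRegister(fetchedBytes)
}

// recordFetchedBytes would be called in the etcd3 store's list path, roughly
// where the data variable referenced above holds the raw value read from etcd.
func recordFetchedBytes(resource string, data []byte) {
	fetchedBytes.WithLabelValues(resource).Add(float64(len(data)))
}
```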
**What would you like to be added**:
Provide metrics for end users so that they can check which api server requests contribute to the network bandwidth usage between the api server and etcd.
**Why is this needed**:
The api server filters list requests with label selectors in memory, so client requests that look reasonable and return a small response can still incur high bandwidth usage in the "backend", between the api server and etcd. (See https://github.com/gardener/gardener/issues/5374)
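For illustration, a list request like the following sketch looks cheap from the client's point of view, yet because the label selector is evaluated in the api server rather than in etcd, etcd still sends the full range of secrets over the network (client-go sketch; the `default` namespace and the `app=example` selector are placeholders):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration; a kubeconfig-based client would behave the same way.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The label selector is not pushed down to etcd: the api server fetches the
	// full range of secrets and filters them in memory, so the backend traffic
	// is proportional to all secrets in the namespace, not to the response size.
	secrets, err := client.CoreV1().Secrets("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=example"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("received %d secrets\n", len(secrets.Items))
}
```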
We have seen that when the network link between the api server and etcd is saturated, multiple components start to fail.
The goal of this issue is to provide metrics for shoot owners so that they can identify the clients that contribute to excessive network usage and optimize their requests accordingly.