crossplane / terrajet

Generate Crossplane Providers from any Terraform Provider
https://crossplane.io
Apache License 2.0

Investigate why installing 700+ CRDs causes performance degradation in the apiserver #47

Closed: muvaf closed this issue 2 years ago

muvaf commented 2 years ago

What problem are you facing?

Today, if you run kubectl apply -f package/crds in provider-tf-aws, your cluster gets really slow. In GKE, the kubectl command just stops after around 50 CRDs.

How could Terrajet help solve your problem?

We have some ideas around sharding the controllers and API types, allowing customers to install only a subset of them. But we haven't identified the actual problem yet, so we need to make sure we know the root cause before choosing a solution, and keep that root cause in mind for future designs.

chlunde commented 2 years ago

I suspect kubectl behaves slowly not because the control plane responds slowly, but because it self-throttles with a hardcoded QPS and downloads CRD discovery data frequently (though not for every command; there is a cache).

Example output with provider-aws, flux, crossplane, and a few other types installed (191 CRDs in total):

... $ kubectl get managed
I0909 20:16:19.782205  893807 request.go:665] Waited for 1.128844238s due to client-side throttling, not priority and fairness, request: GET:https://....../apis/helm.toolkit.fluxcd.io/v2beta1?timeout=32s

References for throttling:
https://github.com/kubernetes/client-go/blob/master/rest/request.go#L584-L593
https://github.com/kubernetes/kubectl/issues/773

Background about the kubectl discovery cache:
https://www.reddit.com/r/kubernetes/comments/bpfi48/why_does_kubectl_get_abc_take_10x_as_long_to/enuhn5v/?utm_source=reddit&utm_medium=web2x&context=3

"Discovery burst" of 100 requests:
https://github.com/kubernetes/cli-runtime/blob/233e5fcb7112e0ab966b6a1e941a152416372ba4/pkg/genericclioptions/config_flags.go#L371
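
To make the client-side throttling concrete, here is a minimal client-go sketch (my own illustration, not from this thread). Full discovery fans out one request per API group/version, so with many CRD-backed groups the default client-side limiter alone adds seconds of waiting; raising QPS/Burst on rest.Config removes that self-imposed wait for your own tooling, while the apiserver remains protected by API Priority and Fairness.

```go
// Minimal sketch, my own illustration: client-go throttles itself with a fixed
// QPS/Burst on rest.Config. Full discovery issues one GET per group/version
// (e.g. /apis/helm.toolkit.fluxcd.io/v2beta1), which is what produces the
// "Waited for ... due to client-side throttling" messages shown above.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Illustrative values only; the point is that the waiting comes from this
	// client-side limiter, not from the control plane itself.
	cfg.QPS = 50
	cfg.Burst = 200

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// One request per discovered group/version goes through the limiter above.
	groups, resources, err := clientset.Discovery().ServerGroupsAndResources()
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d API groups and %d group/version resource lists\n",
		len(groups), len(resources))
}
```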

Please also note that kubectl on AWS with aws eks update-kubeconfig is slow (hack/workaround); this is also unrelated, but it makes the EKS control plane feel sluggish.

muvaf commented 2 years ago

@chlunde but the slowness in the apiserver is also experienced after the kubectl apply operation has completed.

ulucinar commented 2 years ago

I did some experiments using provider-tf-aws and provider-tf-azure with the full set of resources generated for both. provider-tf-aws has 765 CRDs and provider-tf-azure has 658 CRDs when the full set of supported resources is generated (we skip generating certain resources).

Experiment Setup # 1:

The experiments were performed on a darwin_arm64 machine with 8 CPU cores. A control plane consisting of etcd, kube-apiserver, and kube-controller-manager, all running as native (darwin_arm64) binaries, was used. Debug-level logging was enabled for etcd with the --log-level=debug command-line option, and the API server was run with log verbosity 7 via --v=7. CPU utilization and physical memory consumption metrics were collected for the control plane components, and CPU profiling was performed on kube-apiserver during the experiments. Please note that the verbose logging enabled for etcd and kube-apiserver contributes to higher CPU utilization, which I have not quantified; however, I collected CPU utilization and memory metrics before the CRDs were registered to establish a baseline and to observe the impact of registering large numbers of CRDs on the control plane components. Please also note that the providers are not running in this set of experiments, because we first want to focus on the impact of registering large numbers of CRDs and avoid the cost of the watches the providers would establish. After a warm-up period of 3 minutes to establish a baseline, all ~765 CRDs from provider-tf-aws are registered. The Kubernetes version is v1.21.0.
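
As an aside, anyone wanting to reproduce a similar "local control plane, no providers running" setup can do so with controller-runtime's envtest; the sketch below is my own illustration, not the exact setup used here, and the CRD directory path is hypothetical.

```go
// Rough reproduction sketch: envtest starts a local etcd + kube-apiserver
// (binaries located via KUBEBUILDER_ASSETS) and installs CRDs from disk,
// with no controllers or providers running against them.
package main

import (
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func main() {
	env := &envtest.Environment{
		CRDDirectoryPaths:     []string{"package/crds"}, // hypothetical path to the generated CRDs
		ErrorIfCRDPathMissing: true,
	}

	start := time.Now()
	cfg, err := env.Start() // boots etcd and kube-apiserver, then installs the CRDs
	if err != nil {
		panic(err)
	}
	defer env.Stop()

	fmt.Printf("control plane up and CRDs installed in %s (apiserver at %s)\n",
		time.Since(start), cfg.Host)
}
```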

Registering 765 CRDs of provider-tf-aws

CPU profiling data collected from kube-apiserver during these experiments reveals that, while we are trying to register hundreds of CRDs, considerable CPU time is spent updating the OpenAPI spec served at the /openapi/v2 endpoint (for instance, in the data provided, which covers a period of 1 hour, 42.74% of CPU time), due to the expensive Swagger JSON marshal (accounting for a total of 17.74% of CPU time), proto binary serialization (a total of 17.88%), and JSON unmarshaling by the json-iterator library (frozenConfig.Unmarshal, a total of 18.16%). Please also note that we cause high heap churn, as expected: runtime.memclrNoHeapPointers accounts for a total of 39.34% of the CPU time sampled.

The following figure shows CPU utilization for etcd, kube-apiserver, and kube-controller-manager. The kubectl command that registers the 765 CRDs completes in ~1 min. However, kube-apiserver exhibits high CPU utilization for ~20 min as it publishes the OpenAPI spec. After this period of high CPU load, kube-apiserver has an increased baseline CPU utilization, as expected, because there are watches in place for the registered CRDs and other periodic tasks running in the background.

[figure: cpu-time-preazure]
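
A quick way to see how large the aggregated document gets, and hence how much data kube-apiserver has to recompute and re-serialize as CRDs are added, is to fetch /openapi/v2 directly. The sketch below is my own probe (assuming a working kubeconfig), not part of the experiment; the same number can be obtained with kubectl get --raw /openapi/v2.

```go
// Sketch, assuming a working kubeconfig: fetch the aggregated OpenAPI v2
// document to see how much data kube-apiserver has to recompute and
// re-serialize when CRDs are added. With hundreds of large CRD schemas this
// document can grow to tens of megabytes.
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Equivalent to `kubectl get --raw /openapi/v2`.
	raw, err := clientset.Discovery().RESTClient().
		Get().
		AbsPath("/openapi/v2").
		SetHeader("Accept", "application/json").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Printf("/openapi/v2 is %.1f MiB of JSON\n", float64(len(raw))/(1<<20))
}
```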

State of the Art for the Established Kubernetes Scalability Thresholds:

Unfortunately, the Kubernetes scalability thresholds file from the sig-scalability group does not consider CRDs per cluster as a dimension, although it establishes the base guidelines for understanding issues around Kubernetes scalability. However, here the sig-api-machinery group suggests 500 as the maximum scalability target for the number of CRDs per cluster. They also note in the same document that this suggested limit of 500 CRDs is not driven by API call latency SLOs but by the fact that background OpenAPI publishing is very expensive, as we also observed in our experiments.

Summary

Further tests are needed to measure API latency, but I do not expect that a high number of registered CRDs would by itself increase latency enough to violate the Kubernetes API call latency SLOs, excluding high-saturation cases (i.e., kube-apiserver or some other control plane component starving for CPU). An interesting question is whether we currently violate latency SLOs with the Terraform providers during synchronous operations like observing remote infrastructure via the Terraform CLI; we had better keep an eye on that. Selectively registering CRDs and selectively starting the related controllers is certainly a more scalable approach. We should also keep an eye on the Kubernetes scalability thresholds document and probably honor the suggested maximum of 500 CRDs as a scale target in all cases (or any scalability threshold established in the future). In the next experiment, I'd like to discuss the additional overhead introduced when even more CRDs are added (for example, when a second provider is installed in the cluster) to reason about whether the overhead increases linearly with the number of CRDs.
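
On the "further tests are needed to measure API latency" point, a crude probe like the sketch below (my own illustration, not something that was run in these experiments) could be left running while CRDs are registered, to spot-check read latency against the SLOs:

```go
// Crude latency probe sketch: repeatedly time a lightweight read against the
// apiserver to spot-check API call latency while CRDs are being registered
// and the OpenAPI spec is being republished in the background.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	for {
		start := time.Now()
		if _, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{}); err != nil {
			fmt.Println("list failed:", err)
		} else {
			fmt.Println("namespace LIST took", time.Since(start))
		}
		time.Sleep(time.Second)
	}
}
```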

I have some other experiments whose results I will publish in separate comments to this issue.

ulucinar commented 2 years ago

Experiment Setup # 2:

In a cluster where all provider-tf-aws CRDs are already available, additionally register the 658 CRDs of provider-tf-azure with kubectl. Here, we would like to observe whether an issue like the one described here for the Kubernetes Endpoints object also exists for CRDs during OpenAPI spec publishing. Again, the providers are not started. As can be seen in the figure below, while the initial registration and publishing of the 765 provider-tf-aws CRDs took ~20 min, registration of the additional 658 provider-tf-azure CRDs took ~74 min, even though there was no CPU saturation:

[figure: cpu-time-postazure]

Due to the client-side throttling reported by kubectl, as @chlunde describes here (and maybe due to some other factors I have not investigated), provisioning the last 658 CRDs with kubectl also takes more wall-clock time (~75 s), roughly 145% of the per-CRD provisioning time needed for the initial set of 765 CRDs.
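
Part of that ~75 s is kubectl's own discovery refresh and client-side throttling rather than server-side work. A follow-up could time the CRD creations directly against the apiextensions API; the sketch below is only an assumption of how such a measurement could look (the package/crds glob is hypothetical), and it captures the synchronous create call, not the asynchronous OpenAPI republishing discussed above.

```go
// Sketch, assumptions noted above: create each CRD manifest directly through
// the apiextensions client and time the synchronous create call, bypassing
// kubectl's discovery and client-side rate limiting.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := apiextclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	files, err := filepath.Glob("package/crds/*.yaml") // hypothetical path, one CRD per file assumed
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			panic(err)
		}
		var crd apiextv1.CustomResourceDefinition
		if err := yaml.Unmarshal(data, &crd); err != nil {
			panic(err)
		}

		start := time.Now()
		if _, err := client.ApiextensionsV1().CustomResourceDefinitions().
			Create(context.TODO(), &crd, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("%s created in %s\n", crd.Name, time.Since(start))
	}
}
```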

Summary

As expected, as we increase the number of CRDs in the cluster, it becomes more expensive per CRD to compute the OpenAPI spec, as described in this comment. With a back-of-the-envelope calculation, kube-apiserver spent an average of 20 min / 765 ≈ 1.57 s per provider-tf-aws CRD for publishing, but for each provider-tf-azure CRD the average time spent increased to 74 min / 658 ≈ 6.75 s, roughly 4.3x the per-CRD cost. I have not collected CPU profiling data for this experiment; the assumption is that kube-apiserver has again been busy publishing the OpenAPI spec for the additional provider-tf-azure CRDs, as described here, during the observed 74-minute interval of high utilization.