fluxcd / kustomize-controller

The GitOps Toolkit Kustomize reconciler
https://fluxcd.io
Apache License 2.0

client-side throttling #757

Closed mateusz-lubanski-sinch closed 6 months ago

mateusz-lubanski-sinch commented 2 years ago

Error message:

Throttling logs for kustomize-controller:

kubectl logs -n flux-system -f deployments/kustomize-controller | grep 'Waited for'
I1107 10:10:55.887548       7 request.go:682] Waited for 1.044697004s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/authorization.k8s.io/v1beta1?timeout=32s
I1107 10:11:05.913711       7 request.go:682] Waited for 6.196340718s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/iam.aws.crossplane.io/v1alpha1?timeout=32s
I1107 10:11:25.824441       7 request.go:682] Waited for 1.045998726s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/mq.aws.crossplane.io/v1alpha1?timeout=32s
I1107 10:11:35.862387       7 request.go:682] Waited for 3.446082493s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/acmpca.aws.crossplane.io/v1beta1?timeout=32s
I1107 10:11:45.900275       7 request.go:682] Waited for 2.794960921s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/crd.k8s.amazonaws.com/v1alpha1?timeout=32s
I1107 10:11:55.938719       7 request.go:682] Waited for 6.845825892s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/kafka.aws.crossplane.io/v1alpha1?timeout=32s
I1107 10:12:05.966524       7 request.go:682] Waited for 2.846021857s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/apiextensions.k8s.io/v1beta1?timeout=32s
I1107 10:12:15.981709       7 request.go:682] Waited for 6.146203539s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/image.toolkit.fluxcd.io/v1alpha1?timeout=32s
I1107 10:12:25.992888       7 request.go:682] Waited for 6.546170792s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/prometheusservice.aws.crossplane.io/v1alpha1?timeout=32s
I1107 10:12:36.038011       7 request.go:682] Waited for 1.446133559s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/notification.aws.crossplane.io/v1alpha1?timeout=32s
...

Of all the deployed Flux controllers, the throttling logs occur only on kustomize-controller:

kubectl get deployments.apps -n flux-system
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
helm-controller               1/1     1            1           413d
image-automation-controller   1/1     1            1           413d
image-reflector-controller    1/1     1            1           413d
kustomize-controller          1/1     1            1           413d
notification-controller       1/1     1            1           413d
source-controller             1/1     1            1           413d
ww-gitops-weave-gitops        1/1     1            1           160d

Additional context

kustomize-controller version:

kubectl get deployments.apps -n flux-system kustomize-controller -o json | jq '.spec.template.spec.containers[].image'
"ghcr.io/fluxcd/kustomize-controller:v0.30.0"

--kube-api-burst argument passed to the container:

kubectl get deployments.apps -n flux-system kustomize-controller -o json | jq '.spec.template.spec.containers[].args'
[
  "--events-addr=http://notification-controller.flux-system.svc.cluster.local./",
  "--watch-all-namespaces",
  "--log-level=info",
  "--log-encoding=json",
  "--enable-leader-election",
  "--kube-api-burst=250"
]

We faced the above issue after deploying Crossplane with provider-aws to our cluster, which added a bunch of new CRDs. Today we have 173 API GroupVersions on our cluster:

kubectl api-versions | wc -l
173

The client-side throttling issue is explained in detail here: https://github.com/crossplane/crossplane/blob/master/design/one-pager-crd-scaling.md#client-side-throttling
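
To illustrate the mechanism (a minimal sketch of my own, not the controller's actual code): client-go throttles each client with a token bucket driven by the QPS and Burst fields on rest.Config, and API discovery issues roughly one GET per group-version, which is why 173 GroupVersions drain the bucket on every reconciliation. The --kube-api-qps/--kube-api-burst flags ultimately feed values like these:

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from the local kubeconfig (sketch only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// Client-side throttling is a token bucket fed by these two fields;
	// the --kube-api-qps / --kube-api-burst flags end up setting values
	// like these on the controller's rest.Config.
	cfg.QPS = 500
	cfg.Burst = 1000

	// Discovery walks every API group-version (one GET each), which is
	// exactly the traffic shown in the throttling logs above.
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_, resourceLists, err := dc.ServerGroupsAndResources()
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d group-versions\n", len(resourceLists))
}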

Expected behavior

After setting the --kube-api-burst container argument, the throttling logs should disappear, or the Waited for time should be close to 1s (e.g. Waited for 1.045801429s due to client-side throttling).

kingdonb commented 2 years ago

Could you confirm if you have already seen this doc:

https://fluxcd.io/flux/cheatsheets/bootstrap/#increase-the-number-of-workers

Regarding how to increase the performance-tuning settings in Flux, you can follow this doc as a more general guide to customizing Flux:

https://fluxcd.io/flux/installation/#customize-flux-manifests

(but the first link includes a reference about your specific inquiry, how to set --kube-api-burst)

It sounds from your report like you tried setting this value and it did not have the desired effect. Could you clarify this detail please? The value suggested in our performance-tuning docs is 1000; I see the Crossplane docs suggest 300 (maybe try a higher value?).
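
For reference, the cheatsheet approach boils down to a kustomize patch in the flux-system overlay, roughly along these lines (a sketch only; adjust the target and values to your setup):

# flux-system/kustomization.yaml (sketch based on the cheatsheet above)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kube-api-qps=500
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kube-api-burst=1000
    target:
      kind: Deployment
      name: kustomize-controller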

It is also possible that some changes in the latest version of Flux have impacted the behavior in an unexpected way. Is this a new behavior that you just noticed in the latest version, v0.30.0, of Kustomize Controller? There is a change in the behavior of this latest release (ref: https://github.com/fluxcd/kustomize-controller/pull/745), and I'm trying to ascertain whether it's related or not.

mateusz-lubanski-sinch commented 1 year ago

Thanks @kingdonb for quick answer

Yes, that's correct: I tried using the --kube-api-burst setting and it did not have the desired effect. Based on my calculations from https://github.com/crossplane/crossplane/blob/master/design/one-pager-crd-scaling.md#client-side-throttling, 250 should be sufficient.

To be sure, I just updated both the --kube-api-burst and --kube-api-qps settings to the values recommended in https://fluxcd.io/flux/cheatsheets/bootstrap/#increase-the-number-of-workers, but I can still see a lot of throttling errors in kustomize-controller:

I1116 08:28:56.890662       7 request.go:682] Waited for 3.047259872s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/vpcresources.k8s.aws/v1beta1?timeout=32s
I1116 08:29:06.915595       7 request.go:682] Waited for 2.196979557s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/external-secrets.io/v1alpha1?timeout=32s
I1116 08:29:16.934432       7 request.go:682] Waited for 4.946498296s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/secrets.crossplane.io/v1alpha1?timeout=32s
I1116 08:29:27.411427       7 request.go:682] Waited for 1.0474381s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/rds.aws.crossplane.sinch.com/v1alpha1?timeout=32s
I1116 08:29:37.415356       7 request.go:682] Waited for 3.597358097s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/crd.projectcalico.org/v1?timeout=32s
I1116 08:29:47.444735       7 request.go:682] Waited for 6.245726056s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/database.aws.crossplane.io/v1beta1?timeout=32s

I also tried downgrading kustomize-controller (to v0.28.0) to make sure the latest features had no impact on the throttling issue, but I can see the same throttling errors in the logs.

Kubernetes version: v1.21.14-eks-fb459a0

mateusz-lubanski-sinch commented 1 year ago

@kingdonb do you maybe have any other advice?

stefanprodan commented 1 year ago

Bumping the rate limits has no effect on newer Kubernetes versions due to https://github.com/fluxcd/pkg/pull/270

I guess this will be solved by using the new AggregatedDiscoveryEndpoint https://github.com/kubernetes/enhancements/issues/3352. We'll need to revisit this in 6 months' time, after that flag becomes GA.

mateusz-lubanski-sinch commented 1 year ago

Is there anything that can be done on older Kubernetes versions? As of today we are running on EKS 1.21, and soon we will upgrade to 1.22.

stefanprodan commented 6 months ago

This is now solved upstream with Aggregated Discovery being made GA in Kubernetes 1.30. On Kubernetes 1.30 and newer, Flux will no longer spam calls to discover all available APIs; instead it will make a single call.
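
If you want to confirm that your API server serves aggregated discovery, a quick check (my own sketch; the Accept header below is the one I'd assume for the GA apidiscovery.k8s.io API) looks like this:

# Run against a 1.30+ cluster: a single /apis response listing every group
# indicates the server supports aggregated discovery.
kubectl proxy --port=8001 &
PROXY_PID=$!
sleep 1
curl -s http://127.0.0.1:8001/apis \
  -H 'Accept: application/json;v=v2;g=apidiscovery.k8s.io;as=APIGroupDiscoveryList' \
  | jq '.kind, (.items | length)'
kill "$PROXY_PID"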