fluxcd / kustomize-controller

The GitOps Toolkit Kustomize reconciler
https://fluxcd.io
Apache License 2.0

kustomize-controller concurrency reduces performance #816

Closed dhumphries-sainsburys closed 1 year ago

dhumphries-sainsburys commented 1 year ago

We have increasingly been hitting problems with the apply times for our Kustomizations in our EKS clusters. This is partly due to the design of our main GitRepository, which fires webhooks that tell the kustomize-controller to reconcile ~200 Kustomizations at the same time. We are in the process of breaking this down further to reduce the storm this causes. In the meantime, as we had done no tuning of the controllers, we thought it worth looking into, since we can see that the concurrency does not meet what we currently need.

I am aware of https://fluxcd.io/flux/cheatsheets/bootstrap/#increase-the-number-of-workers and have been using it as the basis for my config to stop things becoming too painful for our users. In practice, though, increasing the number of concurrent threads slows each individual thread by more than the extra concurrency gains.
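For readers following along, the worker settings from the linked cheatsheet are applied as a patch in the `kustomization.yaml` of the `flux-system` directory. The following is a minimal sketch of that kind of patch, not the configuration used in this issue: the concurrency of 8 and the QPS and burst numbers are illustrative assumptions, and the resource file names assume a standard `flux bootstrap` layout.

```yaml
# flux-system/kustomization.yaml (assumed bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Number of Kustomizations reconciled in parallel (illustrative value)
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=8
      # Kubernetes API client rate limits, raised alongside concurrency
      # (illustrative values, not taken from the issue)
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kube-api-qps=500
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kube-api-burst=1000
    target:
      kind: Deployment
      name: kustomize-controller
```

The "kube values" mentioned in the timings are presumably the `--kube-api-qps` and `--kube-api-burst` flags, which the cheatsheet raises together with `--concurrent`.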

These are rough average completion times for a Kustomization, taken from the example Grafana dashboards (from one of the install docs somewhere), for different numbers of threads (kube values increased at the same rate):

4 threads - 10-15 seconds
8 threads - 25-45 seconds
...
20 threads - 3-4 minutes

As you can see, increasing the concurrency actually gives us worse performance overall: we may be processing 20 objects at once, but each one takes a lot longer to complete. I have tried throwing more resources at it but it doesn't seem to change anything. Is there additional tuning that I am missing that would make the controller scale linearly, or are we just hitting some internal limitation?

flux version: 0.39.0

Example kustomization:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  labels:
    kustomize.toolkit.fluxcd.io/name: bosun-infra-generated-kustomizations
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: generated-resources-trial-dan-test
  namespace: flux-system
spec:
  force: false
  interval: 10m
  path: ./platform/generated/dev-ie-02/trial-suresh-gollamudi
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: terraform-outputs
        optional: false
  prune: true
  sourceRef:
    kind: GitRepository
    name: bosun-resources
  timeout: 5m

stefanprodan commented 1 year ago

I suspect this is a side effect of CPU throttling. Can you please try it without setting CPU limits, or set the limits to 4 CPUs? Also make sure the machine where the controller runs has enough free CPU.
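One way to try this suggestion is a patch on the controller Deployment in the `flux-system` `kustomization.yaml`. This is a sketch under assumptions, not a confirmed fix: it assumes the controller container is named `manager` (as in the upstream Flux manifests) and uses the 4 CPU figure from the comment above.

```yaml
# flux-system/kustomization.yaml (assumed bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: kustomize-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager   # assumed container name
                resources:
                  limits:
                    cpu: 4000m  # raise the CPU limit to 4 CPUs
    target:
      kind: Deployment
      name: kustomize-controller
```

To test with no CPU limit at all, a JSON 6902 patch with `op: remove` on `/spec/template/spec/containers/0/resources/limits/cpu` is the usual alternative, since a strategic merge patch cannot simply omit a field to delete it.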

dhumphries-sainsburys commented 1 year ago

Just as an update, I did an initial test of this last night and the results looked positive, although I couldn't force the load on Flux to be great enough to test consistently as it was out of hours. I'm going to try to pull some colleagues in to ramp up the load and see if we can get some good data.

dhumphries-sainsburys commented 1 year ago

With a bit more testing, it appears that increasing or even removing the limit has no effect under load. The generally higher times do seem to have gone, so this is a partial improvement, but unfortunately it doesn't tackle the biggest part of the problem we were seeing: individual threads take longer the more threads are configured.

The left of the screenshot is with the values at Concurrency: 4, CPU limit: 4.

The right is with Concurrency: 8, CPU limit: 8, CPU request: 6 (just to ensure nothing else causes throttling).

Screenshot 2023-03-24 at 14 35 13

During the spike with a concurrency of 8, the CPU doesn't come near the point of throttling, so unfortunately I don't think this is the problem.

Screenshot 2023-03-24 at 15 21 20

stefanprodan commented 1 year ago

If vertical scaling doesn't work for you, you can now use sharding and horizontal scaling. Please see the docs here: https://fluxcd.io/flux/cheatsheets/sharding/
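As a rough illustration of what the sharding docs describe, a Kustomization is assigned to a shard with the `sharding.fluxcd.io/key` label, and each controller instance picks up only its own shard through the `--watch-label-selector` flag. The sketch below reuses the Kustomization from this issue; the shard name `shard1` and the Deployment name `kustomize-controller-shard1` are hypothetical, and it assumes a Flux version that supports sharding.

```yaml
# 1) Assign the Kustomization to a shard (shard name is hypothetical)
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: generated-resources-trial-dan-test
  namespace: flux-system
  labels:
    sharding.fluxcd.io/key: shard1
spec:
  interval: 10m
  path: ./platform/generated/dev-ie-02/trial-suresh-gollamudi
  prune: true
  sourceRef:
    kind: GitRepository
    name: bosun-resources
---
# 2) Separate file: patch fragment for the kustomization.yaml that deploys
#    the extra controller instance, so it reconciles only objects in shard1
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --watch-label-selector=sharding.fluxcd.io/key in (shard1)
    target:
      kind: Deployment
      name: kustomize-controller-shard1  # hypothetical shard Deployment
```

The main kustomize-controller would then need `--watch-label-selector=!sharding.fluxcd.io/key` so that it skips sharded objects. With ~200 Kustomizations, spreading them over a few shards keeps the per-instance concurrency low, which is the range where the timings reported above were best.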