Closed dhumphries-sainsburys closed 1 year ago
I suspect this is a side effect of CPU throttling. Can you please try it without setting CPU limits, or set the limits to 4 CPUs? Also make sure the machine where the controller runs has enough free CPU.
Just as an update, I did an initial test of this last night and the results looked positive, although I couldn't push enough load onto Flux to test consistently as it was out of hours. I'm going to try to pull some colleagues in to ramp up the load and see if we can get some good data.
With a bit more testing, it appears that increasing or even removing the limit has no effect under load. The generally higher times do seem to have gone, so this is a partial improvement, but unfortunately it doesn't tackle the biggest part of the problem we were seeing: individual threads take longer the more threads are configured.
The left of the screenshot is with Concurrency: 4, CPU limit: 4.
The right is with Concurrency: 8, CPU limit: 8, CPU request: 6 (just to ensure nothing else causes throttling).
During the spike with a concurrency of 8, the CPU doesn't come near the point of throttling, so unfortunately I don't think this is the problem.
If vertical scaling doesn't work for you, you can now use sharding and horizontal scaling. Please see the docs here: https://fluxcd.io/flux/cheatsheets/sharding/
We have increasingly been hitting problems with the apply times for our kustomizations in our EKS clusters. This is partly due to the design of our main GitRepository, which fires webhooks telling the kustomize-controller to reconcile ~200 kustomizations at the same time. We are in the process of breaking this down further to reduce the resulting storm, but in the meantime, as we had done no tuning of the controllers, we thought it worth looking into, since we can see that concurrency is not meeting our current needs.
I am aware of https://fluxcd.io/flux/cheatsheets/bootstrap/#increase-the-number-of-workers and have been using it as the basis for my config to stop things becoming too painful for our users. In practice, though, I have been hitting problems: increasing the number of concurrent threads slows each individual thread by more than the extra concurrency gains.
These are rough average completion times for a kustomization, taken from the example Grafana dashboards (from one of the install docs somewhere), for different numbers of threads (kube values increased at the same rate).
4 threads - 10-15 seconds
8 threads - 25-45 seconds
...
20 threads - 3-4 minutes
As you can see, increasing the concurrency actually gives us worse performance overall: we may be processing 20 objects at once, but each one takes a lot longer to complete. I have tried throwing more resources at it, but that doesn't seem to change anything. Is there additional tuning I am missing that would make the controller scale linearly, or are we just hitting some internal limitation?
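To put rough numbers on "worse performance overall", here is a small back-of-the-envelope sketch in Python. It is only an illustration, not a model of how kustomize-controller actually schedules work: it assumes exactly 200 queued kustomizations (the issue says ~200), that they are processed in full "waves" of one item per worker, and that every item takes the per-item time reported above. The names `drain_time` and `summarize` are hypothetical helpers made up for this example.

```python
import math

# Per-kustomization completion times in seconds, (low, high), keyed by
# worker count. Taken from the rough averages reported in this issue.
REPORTED = {
    4: (10, 15),
    8: (25, 45),
    20: (180, 240),  # 3-4 minutes
}

# Assumption: the issue says "~200"; exactly 200 is used for the arithmetic.
QUEUE_SIZE = 200


def drain_time(queue_size, workers, seconds_per_item):
    """Idealized time to empty the queue.

    Assumes items are processed in waves of `workers` items and that each
    item takes `seconds_per_item`. Ignores dependencies, retries and
    queueing overhead.
    """
    waves = math.ceil(queue_size / workers)
    return waves * seconds_per_item


def summarize(queue_size=QUEUE_SIZE, reported=REPORTED):
    """Return {workers: (low_total_seconds, high_total_seconds)}."""
    return {
        workers: (
            drain_time(queue_size, workers, low),
            drain_time(queue_size, workers, high),
        )
        for workers, (low, high) in sorted(reported.items())
    }


if __name__ == "__main__":
    for workers, (low, high) in summarize().items():
        print(f"{workers:>2} workers: {low}-{high} s to drain {QUEUE_SIZE} items")
```

Under these assumptions the total time to work through the storm rises from roughly 500-750 seconds with 4 workers to 1800-2400 seconds with 20, which matches the impression that adding workers makes things slower end to end rather than faster.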
Flux version: 0.39.0
Example kustomization: