So I did some digging into the Helm SDK source code. The culprit seems to be getCapabilities, which invalidates the client's CRD cache and then queries the Kubernetes API to get all CRDs. This function is called at upgrade, and to make things even worse, getCapabilities is called again in renderResources, so basically all CRDs are loaded 4 times into memory for each upgrade: here and here.

This not only fills helm-controller's memory, it also puts huge pressure on the Kubernetes API when running helm-controller with a high --concurrent number.
Not sure how this can be avoided while still keeping the Helm Capabilities feature working. I see that we could pass our own Capabilities, so maybe we could cache them globally in helm-controller and only refresh them when we install CRDs, but CRDs can also be in templates, so we risk breaking Helm Capabilities and also the render logic, which relies on the getCapabilities result...
@Shaked, to validate my assumptions, you could modify helm-controller to load the Capabilities at startup only, then run your test and see if the memory usage drops.
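To make the idea concrete, here is a minimal sketch of such a startup-time capabilities cache, assuming the controller already has a discovery client at hand. This is not helm-controller code; the capabilitiesCache type and its Invalidate hook are hypothetical names.

```go
// A minimal sketch of the caching idea discussed above: build
// chartutil.Capabilities once (or on explicit invalidation) and reuse them,
// instead of letting every upgrade re-run discovery.
package capscache

import (
	"sync"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chartutil"
	"k8s.io/client-go/discovery"
)

type capabilitiesCache struct {
	mu   sync.Mutex
	caps *chartutil.Capabilities
}

// Get queries the cluster only on the first call or after Invalidate.
func (c *capabilitiesCache) Get(dc discovery.DiscoveryInterface) (*chartutil.Capabilities, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.caps != nil {
		return c.caps, nil
	}
	kubeVersion, err := dc.ServerVersion()
	if err != nil {
		return nil, err
	}
	// action.GetVersionSet enumerates all served API versions,
	// including those contributed by installed CRDs.
	apiVersions, err := action.GetVersionSet(dc)
	if err != nil {
		return nil, err
	}
	c.caps = &chartutil.Capabilities{
		KubeVersion: chartutil.KubeVersion{
			Version: kubeVersion.GitVersion,
			Major:   kubeVersion.Major,
			Minor:   kubeVersion.Minor,
		},
		APIVersions: apiVersions,
		HelmVersion: chartutil.DefaultCapabilities.HelmVersion,
	}
	return c.caps, nil
}

// Invalidate forces the next Get to hit the API again, e.g. after the
// controller itself applied new CRDs.
func (c *capabilitiesCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.caps = nil
}
```

Assigning the cached value to the Capabilities field of Helm's action configuration before running an install/upgrade should then let the SDK skip its own discovery round-trip, with the caveat from above: CRDs installed from chart templates would not be reflected in .Capabilities.APIVersions until the cache is refreshed.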
Some good news 🎉 A combination of improvements in Flux 2.3 and Kubernetes API 1.29/1.30 makes this issue less impactful.

Compared to Flux 2.2 and Kubernetes 1.28, where a large number of CRDs would drive helm-controller into OOM, with Flux 2.3 and Kubernetes 1.29, even with 500 CRDs, helm-controller reconciles 1K HelmReleases in under 9 minutes when configured with --concurrent 10, 2 CPU and 1GB RAM limits. Benchmark results here: https://github.com/fluxcd/flux-benchmark/pull/6
Hey folks,
I am running Flux on an AKS cluster (Server Version: v1.27.3).

I have been experiencing a memory issue with the helm-controller, to the point where I faced OOMKilled a couple of times a day.
I have followed the advanced debugging instructions to profile the controller and got some interesting results:
After posting this on the Slack channel, @stefanprodan suggested that it is related to the number of CRDs (or their size), since the Helm SDK uses all of the CRDs for discovery purposes and there's no way to disable that.
To test this issue, I have created https://github.com/fluxcd/flux-benchmark/pull/4, which automatically installs N CRDs on a k8s cluster and runs the controller against it. While running this on my M2, I tried 500 CRDs with 100 HelmReleases, and at some point I think I crossed 1 CPU. Managed to catch this screenshot:
I also ran this on an Azure AMD D2as_v5 node.
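Purely for illustration, a minimal Go sketch of what "N CRDs" as test input could look like: generating N placeholder CRD manifests to apply to the cluster. This is a hypothetical example, not the mechanism actually used in the flux-benchmark PR; the bench.example.com group and Widget kind are made up.

```go
// Hypothetical generator for N minimal placeholder CRDs, printed as a
// multi-document YAML stream that could be piped to kubectl apply -f -.
package main

import (
	"fmt"
	"os"
)

const crdTemplate = `---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets%d.bench.example.com
spec:
  group: bench.example.com
  names:
    kind: Widget%d
    listKind: Widget%dList
    plural: widgets%d
    singular: widget%d
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
`

func main() {
	const n = 500 // number of CRDs to generate, matching the test above
	for i := 0; i < n; i++ {
		fmt.Fprintf(os.Stdout, crdTemplate, i, i, i, i, i)
	}
}
```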
After that run, I increased the limits to 2 CPU and 2Gi memory, moved the helm-controller to a more powerful node (4 vCPU / 16GB memory), and made sure that the helm-controller doesn't share a node with Prometheus/Grafana/cert-manager. The restart count decreased, but restarts were still happening:
Currently I have managed to stop the restarts by increasing the limits again, to 2 CPU and 3Gi memory.
While I think that removing the CPU limit might help, the origin of this issue is directly related to @stefanprodan's suggestion regarding the Helm SDK, the way it uses the installed CRDs, and how GC works.
Extra info