[BUG] memory usage grows exponentially when there are lots of CRDs

Shaked commented 8 months ago

Hey folks,

I am running flux on a AKS cluster Server Version: v1.27.3 with:

60 HelmReleases
108 CRDs

I have been experiencing a memory issue with the helm controller, to the point where I faced OOMKilled a couple of times a day.

I have followed the advanced debugging instructions to profile the controller and got some interesting results:

(pprof) top10
Showing nodes accounting for 668.27MB, 89.98% of 742.69MB total
Dropped 280 nodes (cum <= 3.71MB)
Showing top 10 nodes out of 113
      flat  flat%   sum%        cum   cum%
  335.59MB 45.19% 45.19%   335.59MB 45.19%  reflect.New
  112.06MB 15.09% 60.28%   112.06MB 15.09%  google.golang.org/protobuf/internal/impl.consumeStringValidateUTF8
   82.07MB 11.05% 71.33%    82.07MB 11.05%  io.ReadAll
   41.01MB  5.52% 76.85%   105.53MB 14.21%  k8s.io/kube-openapi/pkg/util/proto.(*Definitions).parseKind
   20.50MB  2.76% 79.61%       29MB  3.91%  k8s.io/kube-openapi/pkg/util/proto.(*Definitions).parsePrimitive
   18.01MB  2.43% 82.03%    18.01MB  2.43%  github.com/go-openapi/swag.(*NameProvider).GetJSONNames
   17.50MB  2.36% 84.39%    33.01MB  4.44%  k8s.io/kube-openapi/pkg/util/proto.VendorExtensionToMap
      15MB  2.02% 86.41%       15MB  2.02%  google.golang.org/protobuf/internal/impl.consumeStringSliceValidateUTF8
   14.01MB  1.89% 88.30%    54.03MB  7.27%  k8s.io/kube-openapi/pkg/validation/spec.(*Schema).UnmarshalNextJSON
   12.51MB  1.68% 89.98%    12.51MB  1.68%  reflect.mapassign0

After posting this on the Slack channel, @stefanprodan suggested that it is related to the amount of CRDs (or their size), since Helm SDK uses all of the CRDs for discovery purposes and there's no way to disable that.

To test this issue, I have created https://github.com/fluxcd/flux-benchmark/pull/4 which automatically installs N CRDs on a k8s cluster and runs the controller against it. While running this on my M2, I tried 500 CRDs with 100 HR and at some point I think I crossed 1 CPU. Managed to catch this screenshot:

I also ran this on a Azure AMD D2as_v5 node.

The yellow line shows a test with HR=100 and CRD=100, while limits where set to 1 CPU, 2Gi memory.
The next lines are from the same experiment, using HR=100 and CRD=150

Once I did this, I tried to increase the limits to 2 cpu and 2Gi memory, moved the helm-controller to a more powerful node (4vcpu/16Gb mem) and also made sure that the helm-controller doesn't share the same node as Prometheus/Grafana/cert-manager, the restarts count decreased but were still happening:

Currently I managed to stop the restarts by increasing the limits again using 2cpu and 3Gi memory.

While I think that removing the cpu limit might help, the origin of this issue is directly related to @stefanprodan suggestion regarding the Helm SDK and the way it uses the installed CRDs and how GC works.

Extra info

2vcpu/8gb memory node
helm-controller limits:
```
limits:
cpu: 1000m
memory: 2Gi
```
Flux's extra arguments
```
--concurrent=10
--requeue-dependency=5s
```

Flux version

$ flux version
flux: v2.2.3
distribution: flux-v2.2.3
helm-controller: v0.37.4
kustomize-controller: v1.2.2
notification-controller: v1.2.4
source-controller: v1.2.4

stefanprodan commented 7 months ago

So i did some digging into Helm SDK source code. The culprit seems to be the getCapabilities which invalidates the client CRD cache and then queries the Kubernetes API to get all CRDs. This function is called at upgrade, and to make things even worse, getCapabilities is called again in renderResources so basically all CRDs are loaded 4 times into memory for each upgrade: here and here.

This not only fills helm-controller memory, it also puts a huge pressure on Kubernetes API when running helm-controller with a high --concurrent number.

Not sure how this can be avoided and still have the Helm Capabilities feature working. I see that we could pass our own Capabilities, so maybe we could cache them globally in helm-controller and only refresh them when we install CRDs, but CRDs can also be in templates so we risk breaking Helm Capabilities and also the render logic which relies on the getCapabilities result...

@Shaked to validate my assumptions, you could modify helm-controller to load the Capabilities at startup only, then run your test and see if the memory usage drops.

stefanprodan commented 7 months ago

Some good new 🎉 A combination of improvements in Flux 2.3 and Kubernetes API 1.29/1.30 make this issue less impactful.

Compared to Flux 2.2 and Kubernetes 1.28 where large number of CRDs would drive helm-controller into OOM, in Flux 2.3 and Kubernetes 1.29, even with 500 CRDs, helm-controller reconciles 1K HelmReleases in under 9 minutes when configured with concurrent 10, 2CPU and 1GB RAM limits. Benchmark results here: https://github.com/fluxcd/flux-benchmark/pull/6

fluxcd / helm-controller

[BUG] memory usage grows exponentially when there are lots of CRDs #923

Extra info