fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

source-controller Deployment v0.35.1 needs higher memory limit #3620

Closed · praxiscode closed this issue 1 year ago

praxiscode commented 1 year ago

Describe the bug

Running Flux2 v0.40.0 in our cluster resulted in the source-controller pod getting OOMKilled into a CrashLoopBackOff state.

Adjusting its spec.template.spec.containers[].resources.limits.memory from 1Gi to 2Gi and then to 3Gi alleviated the condition. We've applied a Kustomization to make it persist, but perhaps the default limit should be raised.
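
For reference, a minimal sketch of the kind of kustomize patch we use to make the bump persist (assuming the standard bootstrap layout with gotk-components.yaml and gotk-sync.yaml in flux-system/kustomization.yaml; the limit value is illustrative):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # raise the default 1Gi memory limit on the source-controller container
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 3Gi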

Steps to reproduce

Run flux install against the cluster and observe that the source-controller pod goes into CrashLoopBackOff because it gets OOMKilled.

Expected behavior

I would expect the source-controller pod not to get OOMKilled.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v0.40.0

Flux check

N/A

Git provider

N/A

Container Registry provider

N/A

Additional context

N/A


stefanprodan commented 1 year ago

@praxiscode can you please post the output of flux stats -A here?

praxiscode commented 1 year ago
RECONCILERS             RUNNING FAILING SUSPENDED       STORAGE
GitRepository           16      2       0               1.1 MiB
OCIRepository           0       0       0               -
HelmRepository          38      0       0               58.0 MiB
HelmChart               40      2       0               2.0 MiB
Bucket                  0       0       0               -
Kustomization           22      2       0               -
HelmRelease             40      3       0               -
Alert                   0       0       0               -
Provider                0       0       0               -
Receiver                7       2       0               -
ImageUpdateAutomation   15      0       0               -
ImagePolicy             58      5       0               -
ImageRepository         43      4       0               -
stefanprodan commented 1 year ago

Thanks, can you also post the output of kubectl -n flux-system top pods here?

praxiscode commented 1 year ago
NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               1m           216Mi
image-automation-controller-86dcc5b9cc-gsktx   1m           49Mi
image-reflector-controller-8845fc876-2kv7t     10m          125Mi
kustomize-controller-7b55798444-ptnv7          1m           103Mi
notification-controller-5fb846755d-9szsz       2m           27Mi
source-controller-75b89899cf-n4qxb             25m          228Mi

I wish I'd run this when the pod was crash-looping.

stefanprodan commented 1 year ago

Can you please enable Helm repository caching as shown here: https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching. Please set --helm-cache-max-size=40; this should help a lot with the memory spikes.

praxiscode commented 1 year ago

Here's kubectl top pods immediately after a kubectl rollout restart deploy/source-controller:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               14m          84Mi
image-automation-controller-86dcc5b9cc-gsktx   204m         49Mi
image-reflector-controller-8845fc876-2kv7t     12m          127Mi
kustomize-controller-7b55798444-ptnv7          7m           100Mi
notification-controller-5fb846755d-9szsz       4m           27Mi
source-controller-56d9974784-5c9hr             463m         553Mi
praxiscode commented 1 year ago

@stefanprodan , should I also remove the patch to /spec/template/spec/containers/0/resources/limits/memory?

stefanprodan commented 1 year ago

Leave it in, and after the restart please run top in a loop to capture the memory spike, so we can see if it's better with the cache on. Then, if the spike stays under 1Gi, you can remove the memory bump.
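
Something like this is enough to capture the spike (just a sketch; adjust the interval to taste):

while true; do
  # record the source-controller reading every 5 seconds
  kubectl -n flux-system top pods | grep source-controller | tee -a sc-top.log
  sleep 5
done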

stefanprodan commented 1 year ago

To test it, delete the source-controller pod a couple of times.
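
For example (the app=source-controller pod label is an assumption here; it is what gotk-components.yaml sets by default):

# delete the running pod so the Deployment recreates it
kubectl -n flux-system delete pod -l app=source-controller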

praxiscode commented 1 year ago

With just the --helm-cache-max-size=40 value, I see the following:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               2m           75Mi
image-automation-controller-86dcc5b9cc-gsktx   2m           49Mi
image-reflector-controller-8845fc876-2kv7t     30m          129Mi
kustomize-controller-7b55798444-ptnv7          15m          102Mi
notification-controller-5fb846755d-9szsz       1m           28Mi
source-controller-6d7c7b7c94-89nll             13m          682Mi
praxiscode commented 1 year ago

Re-adding --helm-cache-ttl=60m and --helm-cache-purge-interval=5m (as shown on the linked page) resulted in the following:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               5m           134Mi
image-automation-controller-86dcc5b9cc-gsktx   2m           50Mi
image-reflector-controller-8845fc876-2kv7t     11m          125Mi
kustomize-controller-7b55798444-ptnv7          1m           116Mi
notification-controller-5fb846755d-9szsz       2m           28Mi
source-controller-75487584c8-lh5l8             37m          672Mi
stefanprodan commented 1 year ago

OK, so the initial limit of 1Gi could be put back now; you may want to increase the cache size to match your repo count. Can you please tell me whether the OOMs started happening with 0.40, or did this also happen with 0.39?

praxiscode commented 1 year ago
NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               25m          82Mi
image-automation-controller-86dcc5b9cc-gsktx   1m           48Mi
image-reflector-controller-8845fc876-2kv7t     10m          129Mi
kustomize-controller-7b55798444-ptnv7          8m           114Mi
notification-controller-5fb846755d-9szsz       2m           28Mi
source-controller-b65475969-hsj68              12m          1719Mi
stefanprodan commented 1 year ago

Which patch are you using now?

praxiscode commented 1 year ago

That's using the following:

patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 3Gi
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=40
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
praxiscode commented 1 year ago
NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               2m           76Mi
image-automation-controller-86dcc5b9cc-gsktx   32m          44Mi
image-reflector-controller-8845fc876-2kv7t     11m          162Mi
kustomize-controller-7b55798444-ptnv7          257m         119Mi
notification-controller-5fb846755d-9szsz       2m           28Mi
source-controller-75487584c8-mpx44             3m           1509Mi
stefanprodan commented 1 year ago

Can you please say from which version you've upgraded? I guess this didn't happen before; to find the culprit, it would be helpful to know which version was OK.

hiddeco commented 1 year ago

To add to the information Stefan requested, it would be useful to know the memory spike from the previous version of the source-controller you were on (without caching enabled). This would ensure it's really due to changes in the latest release, and not because of some other issue.

Downgrading the image in the Deployment should work fine in combination with the newer CRD (the changes to the Artifact will just be ignored, and the other controllers know how to work with the older version).

If you observe a difference there, it would be good to collect a heap profile as well from the latest version (or just collect in advance before downgrading). Instructions on how to do this can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
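
Roughly, collecting and packaging the profile looks like this (a sketch only; take the exact port and path from the linked doc, the pprof handlers are assumed here to be served on the metrics port 8080):

# forward the controller's metrics port (pprof handlers assumed there, per the debugging doc)
kubectl -n flux-system port-forward deploy/source-controller 8080 &
# grab a heap profile and package it for sharing
curl -Sk http://localhost:8080/debug/pprof/heap > heap.out
tar -czf source-controller-heap.tar.gz heap.out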

All this information together should allow us to get to the bottom of the issue. Thank you for your patience and help so far :bow:

praxiscode commented 1 year ago

We hadn't paid it much mind, as it wasn't an issue before. The cluster was running fluxcd v0.34.0 previously.

Downgrading would be a bit of an issue due to our use of multi-tenancy; we would need to unconfigure this cluster from the main configuration. Not insurmountable, but rather inconvenient.

Would it be possible just to patch the container image instead? As mentioned, this is source-controller container image v0.35.1. If patching the image is acceptable, to which version should it be patched? v0.29.0 is the image used in fluxcd v0.34.0.

hiddeco commented 1 year ago

Would it be possible just to patch the container image instead?

If I am following you correctly that's precisely what I meant with

Downgrading the image in the Deployment should work fine [..]

I would start with patching the image to v0.34.0, as there have been some other changes between v0.29.0 and this controller version whose compatibility I am less sure about, and this should be sufficient to give us a clue whether it's due to recently rolled-out changes.

praxiscode commented 1 year ago

OK. I'll add

      - op: replace
        path: /spec/template/spec/containers/0/image
        value: ghcr.io/fluxcd/source-controller:v0.34.0

to the other patches on the source-controller deployment.

praxiscode commented 1 year ago

v0.34.0 source-controller:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               1m           87Mi
image-automation-controller-86dcc5b9cc-gsktx   1m           43Mi
image-reflector-controller-8845fc876-2kv7t     20m          144Mi
kustomize-controller-7b55798444-ptnv7          210m         119Mi
notification-controller-5fb846755d-9szsz       2m           29Mi
source-controller-7654ffc8db-478lw             2m           323Mi
praxiscode commented 1 year ago

After killing the running pod to force a restart:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               14m          80Mi
image-automation-controller-86dcc5b9cc-gsktx   146m         50Mi
image-reflector-controller-8845fc876-2kv7t     10m          143Mi
kustomize-controller-7b55798444-ptnv7          3m           104Mi
notification-controller-5fb846755d-9szsz       6m           29Mi
source-controller-7654ffc8db-czmnr             961m         501Mi
hiddeco commented 1 year ago

Okay, can you please collect a heap profile from v0.40.0 and share this (either here by dropping it in a .tar.gz, or by sending it via mail to hidde@weave.works)?

Also, at which moment was the following taken (right after boot, or after running it for some time)? https://github.com/fluxcd/flux2/issues/3620#issuecomment-1440283463

praxiscode commented 1 year ago

Erm... Do you mean the v0.35.1 source-controller pod, the v0.34.0 one, or some other component?

The high readings are taken within two minutes of pod start.

hiddeco commented 1 year ago

Sorry about not being more explicit, I meant the v0.35.1 source-controller pod which comes with v0.40.0 of Flux :-)

praxiscode commented 1 year ago

After reverting the patch (resulting in v0.35.1):

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               6m           92Mi
image-automation-controller-86dcc5b9cc-gsktx   1m           48Mi
image-reflector-controller-8845fc876-2kv7t     10m          160Mi
kustomize-controller-7b55798444-ptnv7          2m           113Mi
notification-controller-5fb846755d-9szsz       3m           28Mi
source-controller-75487584c8-scfvf             542m         1510Mi
praxiscode commented 1 year ago

source-controller v0.35.1 spiked up to 970Mi at around 134 minutes, but now it's back down to 380Mi.

praxiscode commented 1 year ago

source-controller v0.35.1 at 3 hours, 13 minutes:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               1m           208Mi
image-automation-controller-86dcc5b9cc-gsktx   1m           52Mi
image-reflector-controller-8845fc876-2kv7t     23m          144Mi
kustomize-controller-7b55798444-ptnv7          448m         118Mi
notification-controller-5fb846755d-9szsz       1m           28Mi
source-controller-75487584c8-scfvf             5m           1644Mi
hiddeco commented 1 year ago

Thanks for providing the heap dumps via mail. I think I have identified the core of the issue, which has been addressed in https://github.com/fluxcd/source-controller/pull/1035. A release candidate is available as ghcr.io/fluxcd/source-controller:rc-e5c0313d.

praxiscode commented 1 year ago

Numbers from rc-e5c0313d at 76 seconds:

NAME                                           CPU(cores)   MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh               5m           83Mi
image-automation-controller-86dcc5b9cc-gsktx   33m          49Mi
image-reflector-controller-8845fc876-2kv7t     25m          139Mi
kustomize-controller-7b55798444-ptnv7          1m           108Mi
notification-controller-5fb846755d-9szsz       3m           28Mi
source-controller-6897b77d4f-pks84             23m          315Mi
praxiscode commented 1 year ago

Down to 218Mi at 4 minutes.

Holding steady ± 2Mi after ten minutes.

hiddeco commented 1 year ago

Looks like that's back to how it used to be. :tada:

If you have since omitted the --helm-cache-* flags from your configuration, you may want to play around with them to see if they further improve the resource usage of the pod (they should help with spikes when HelmChart objects check the repositories for new chart versions).

praxiscode commented 1 year ago

I left the other patches in place; just updated the image name.

patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 3Gi
      - op: replace
        path: /spec/template/spec/containers/0/image
        value: ghcr.io/fluxcd/source-controller:rc-e5c0313d
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=40
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
hiddeco commented 1 year ago

I would maybe increase --helm-cache-max-size (and lower the cache TTL, depending on the interval of your charts and repositories) to allow new indexes to be added to the cache while the previous ones have not been purged yet, as this is entirely dependent on the purging mechanism (and separate from the garbage collection of the storage).
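
For example, counting the HelmRepository objects cluster-wide gives a rough number to size the cache against (a sketch, not a hard rule; your flux stats output above showed 38, so 40 is already close):

# rough upper bound for --helm-cache-max-size: one cached index per HelmRepository
kubectl get helmrepositories.source.toolkit.fluxcd.io -A --no-headers | wc -l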

hiddeco commented 1 year ago

Official release (ghcr.io/fluxcd/source-controller:v0.35.2) is now available with Flux v0.40.1.