@praxiscode can you please post here the output of flux stats -A
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          16       2        0          1.1 MiB
OCIRepository          0        0        0          -
HelmRepository         38       0        0          58.0 MiB
HelmChart              40       2        0          2.0 MiB
Bucket                 0        0        0          -
Kustomization          22       2        0          -
HelmRelease            40       3        0          -
Alert                  0        0        0          -
Provider               0        0        0          -
Receiver               7        2        0          -
ImageUpdateAutomation  15       0        0          -
ImagePolicy            58       5        0          -
ImageRepository        43       4        0          -
Thanks, can you also post here kubectl -n flux-system top pods.
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 1m 216Mi
image-automation-controller-86dcc5b9cc-gsktx 1m 49Mi
image-reflector-controller-8845fc876-2kv7t 10m 125Mi
kustomize-controller-7b55798444-ptnv7 1m 103Mi
notification-controller-5fb846755d-9szsz 2m 27Mi
source-controller-75b89899cf-n4qxb 25m 228Mi
I wish I'd run this while the pod was crashlooping.
Can you please enable Helm repositories caching as shown here: https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching and set --helm-cache-max-size=40? This should help a lot with the memory spikes.
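For reference, this is roughly what such a patch looks like in the flux-system kustomization.yaml (a sketch that assumes the default bootstrap layout; the flag values are the ones suggested above and in the cheatsheet, and can be tuned later):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      # append the Helm repository index cache flags to the source-controller args
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=40
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m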
Here's kubectl top pods immediately after a kubectl rollout restart deploy/source-controller:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 14m 84Mi
image-automation-controller-86dcc5b9cc-gsktx 204m 49Mi
image-reflector-controller-8845fc876-2kv7t 12m 127Mi
kustomize-controller-7b55798444-ptnv7 7m 100Mi
notification-controller-5fb846755d-9szsz 4m 27Mi
source-controller-56d9974784-5c9hr 463m 553Mi
@stefanprodan, should I also remove the patch to /spec/template/spec/containers/0/resources/limits/memory?
Leave it in, and after the restart please run top in a loop to capture the memory spike, so we can see if it's better with the cache on. If the spike stays under 1Gi, you can remove the memory bump.
To test it, delete the source-controller pod a couple of times.
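For example, something like this (a rough sketch; adjust the interval as needed, and note that the app=source-controller label selector is an assumption based on the default component labels):

# in one terminal: capture memory usage every few seconds
while true; do
  kubectl -n flux-system top pods
  sleep 5
done

# in another terminal: force a restart of source-controller a couple of times
kubectl -n flux-system delete pod -l app=source-controller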
With just the --helm-cache-max-size=40 value, I see the following:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 2m 75Mi
image-automation-controller-86dcc5b9cc-gsktx 2m 49Mi
image-reflector-controller-8845fc876-2kv7t 30m 129Mi
kustomize-controller-7b55798444-ptnv7 15m 102Mi
notification-controller-5fb846755d-9szsz 1m 28Mi
source-controller-6d7c7b7c94-89nll 13m 682Mi
Re-adding --helm-cache-ttl=60 and --helm-cache-purge-interval=5m (as shown on the linked page) gave the following:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 5m 134Mi
image-automation-controller-86dcc5b9cc-gsktx 2m 50Mi
image-reflector-controller-8845fc876-2kv7t 11m 125Mi
kustomize-controller-7b55798444-ptnv7 1m 116Mi
notification-controller-5fb846755d-9szsz 2m 28Mi
source-controller-75487584c8-lh5l8 37m 672Mi
OK, so the initial limit of 1Gi could be put back now; you may want to increase the cache size to match your repo count. Can you please tell me if the OOMs started happening with 0.40, or did they already happen with 0.39?
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 25m 82Mi
image-automation-controller-86dcc5b9cc-gsktx 1m 48Mi
image-reflector-controller-8845fc876-2kv7t 10m 129Mi
kustomize-controller-7b55798444-ptnv7 8m 114Mi
notification-controller-5fb846755d-9szsz 2m 28Mi
source-controller-b65475969-hsj68 12m 1719Mi
Which patch are you using now?
That's using the following:
patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 3Gi
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=40
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 2m 76Mi
image-automation-controller-86dcc5b9cc-gsktx 32m 44Mi
image-reflector-controller-8845fc876-2kv7t 11m 162Mi
kustomize-controller-7b55798444-ptnv7 257m 119Mi
notification-controller-5fb846755d-9szsz 2m 28Mi
source-controller-75487584c8-mpx44 3m 1509Mi
Can you please say from which version you've upgraded? I guess this didn't happen before; to find the culprit, it would be helpful to know which version was OK.
To add to the information Stefan requested, it would be useful to know the memory spike from the previous version of the source-controller you were on (without caching enabled). This would ensure it's really due to changes in the latest release, and not because of some other issue.
Downgrading the image in the Deployment should work fine in combination with the newer CRD (the changes to the Artifact will just be ignored, and the other controllers know how to work with the older version).
If you observe a difference there, it would be good to collect a heap profile as well from the latest version (or just collect in advance before downgrading). Instructions on how to do this can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
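A rough sketch of what this looks like (assuming the pprof endpoints are exposed on the controller's metrics port, 8080; the linked guide is authoritative):

# forward the source-controller metrics port to localhost
kubectl -n flux-system port-forward deploy/source-controller 8080

# in a second terminal, fetch the heap profile
curl -s http://localhost:8080/debug/pprof/heap -o heap.out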
All this information together should allow us to get to the bottom of the issue. Thank you for your patience and help so far :bow:
We hadn't paid it much mind, as it wasn't an issue before. The cluster was running fluxcd v0.34.0 previously.
Downgrading would be a bit of an issue due to our use of multi-tenancy; we would need to unconfigure this cluster from the main configuration. Not insurmountable, but rather inconvenient.
Would it be possible just to patch the container image instead? As mentioned, this is source-controller container image v0.35.1. If patching the image is acceptable, then to which version should it be patched? v0.29.0 is the image used in fluxcd v0.34.0.
Would it be possible just to patch the container image instead?
If I am following you correctly, that's precisely what I meant with:
Downgrading the image in the Deployment should work fine [..]
I would start with patching the image to v0.34.0, as there have been some other changes between v0.29.0 and this controller version whose compatibility I am less sure about; this should be sufficient to give us a clue whether it's due to recently rolled out changes.
OK. I'll add

- op: replace
  path: /spec/template/spec/containers/0/image
  value: ghcr.io/fluxcd/source-controller:v0.34.0

to the other patches on the source-controller deployment.
With the v0.34.0 source-controller:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 1m 87Mi
image-automation-controller-86dcc5b9cc-gsktx 1m 43Mi
image-reflector-controller-8845fc876-2kv7t 20m 144Mi
kustomize-controller-7b55798444-ptnv7 210m 119Mi
notification-controller-5fb846755d-9szsz 2m 29Mi
source-controller-7654ffc8db-478lw 2m 323Mi
After killing the running pod to force a restart:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 14m 80Mi
image-automation-controller-86dcc5b9cc-gsktx 146m 50Mi
image-reflector-controller-8845fc876-2kv7t 10m 143Mi
kustomize-controller-7b55798444-ptnv7 3m 104Mi
notification-controller-5fb846755d-9szsz 6m 29Mi
source-controller-7654ffc8db-czmnr 961m 501Mi
Okay, can you please collect a heap profile from v0.40.0 and share it (either here by dropping it in a .tar.gz, or by sending it via mail to hidde@weave.works)?
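If useful, packaging it and taking a quick local look can be done with something like the following (assuming the profile was saved as heap.out):

tar -czf source-controller-heap.tar.gz heap.out
# optional: quick look at the top allocations locally (needs a Go toolchain)
go tool pprof -top heap.out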
Also, at which moment was the following taken (right after boot, or after running it for some time)? https://github.com/fluxcd/flux2/issues/3620#issuecomment-1440283463
Erm... do you mean the v0.35.1 source-controller pod, the v0.34.0 one, or some other component?
The high readings are taken within two minutes of pod start.
Sorry about not being more explicit; I meant the v0.35.1 source-controller pod which comes with v0.40.0 of Flux :-)
After reverting the patch (resulting in v0.35.1):
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 6m 92Mi
image-automation-controller-86dcc5b9cc-gsktx 1m 48Mi
image-reflector-controller-8845fc876-2kv7t 10m 160Mi
kustomize-controller-7b55798444-ptnv7 2m 113Mi
notification-controller-5fb846755d-9szsz 3m 28Mi
source-controller-75487584c8-scfvf 542m 1510Mi
source-controller v0.35.1 spiked up to 970Mi at around 134 minutes, but now it's back down to 380Mi.
source-controller v0.35.1 at 3 hours, 13 minutes:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 1m 208Mi
image-automation-controller-86dcc5b9cc-gsktx 1m 52Mi
image-reflector-controller-8845fc876-2kv7t 23m 144Mi
kustomize-controller-7b55798444-ptnv7 448m 118Mi
notification-controller-5fb846755d-9szsz 1m 28Mi
source-controller-75487584c8-scfvf 5m 1644Mi
Thanks for providing the heap dumps via mail. I think I have identified the core of the issue, which has been addressed in https://github.com/fluxcd/source-controller/pull/1035. A release candidate is available as ghcr.io/fluxcd/source-controller:rc-e5c0313d.
Numbers from rc-e5c0313d at 76 seconds:
NAME CPU(cores) MEMORY(bytes)
helm-controller-68f955bc4b-fhvvh 5m 83Mi
image-automation-controller-86dcc5b9cc-gsktx 33m 49Mi
image-reflector-controller-8845fc876-2kv7t 25m 139Mi
kustomize-controller-7b55798444-ptnv7 1m 108Mi
notification-controller-5fb846755d-9szsz 3m 28Mi
source-controller-6897b77d4f-pks84 23m 315Mi
Down to 218Mi at 4 minutes.
Holding steady ± 2Mi after ten minutes.
Looks like that's back to how it used to be. :tada:
If you have omitted the --helm-cache-* flags from your configuration again, you may want to play around with them to see if they further improve the resource usage of the pod (they should help with spikes when HelmChart objects check the repositories for new chart versions).
I left the other patches in place; just updated the image name.
patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 3Gi
      - op: replace
        path: /spec/template/spec/containers/0/image
        value: ghcr.io/fluxcd/source-controller:rc-e5c0313d
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=40
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
I would maybe increase --helm-cache-max-size (and lower --helm-cache-ttl, depending on the intervals of your charts and repositories), to allow new indexes to be added to the cache while the previous ones have not been purged yet, as this is completely dependent on the purging mechanism (and separate from the garbage collection of the storage).
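For example, adjusting the cache flags in the patch above to something like the following (illustrative values only; tune them to your repository count and chart intervals):

--helm-cache-max-size=60
--helm-cache-ttl=30m
--helm-cache-purge-interval=5m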
The official release (ghcr.io/fluxcd/source-controller:v0.35.2) is now available with Flux v0.40.1.
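Once the rc- image override is dropped from the patch and v0.40.1 is rolled out, something like the following should confirm which image the controller is running (a quick sketch):

flux check
kubectl -n flux-system get deployment source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'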
Describe the bug
Running Flux2 v0.40.0 in our cluster resulted in the source-controller pod getting OOMKilled into a CrashLoopBackOff state.
Adjusting its spec.template.spec.containers[].resources.limits.memory from 1Gi to 2Gi and then to 3Gi alleviated the condition. We've applied a Kustomization to make the change persist, but perhaps the default value needs to be raised.
Steps to reproduce
Run flux install against the cluster and notice that the source-controller pod goes into CrashLoopBackOff due to the OOM killer.
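For reference, the reproduction on our side boils down to something like the following (a sketch; the OOM presumably depends on having a similar number of Helm repositories and charts, and the app=source-controller label selector assumes the default component labels):

flux install
kubectl -n flux-system get pods -w
# source-controller ends up OOMKilled / CrashLoopBackOff
kubectl -n flux-system describe pod -l app=source-controller | grep -i -A 2 'last state'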
Expected behavior
I would expect the source-controller pod not to get OOMKilled.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v0.40.0
Flux check
N/A
Git provider
N/A
Container Registry provider
N/A
Additional context
N/A
Code of Conduct