fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Source-controller enters eviction loop due to memory pressure #5343

Open Nicola-Sergio opened 1 month ago

Nicola-Sergio commented 1 month ago

Describe the bug

Hi everyone,

I'm observing an issue where the source-controller starts in a healthy state (1/1 Running), but after an initial OOMKilled event, it enters a loop where Kubernetes continuously creates new pods that are almost immediately Evicted.

Over time, this leads to a large number of failed source-controller pods accumulating in the flux-system namespace.

The situation is the following:

[Image: screenshot attached to the original issue]

Steps to reproduce

I'm running a single AKS cluster which hosts three separate development environments, each for a different project.

Each project is managed via its own Git repository, and I've structured Flux in the following way:

kubectl get helmrelease --all-namespaces:

flux-system       application-gateway         67d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-application-gateway/microservice-1.0.5.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-application-gateway/microservice-1.0.5.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       dapr                        90d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-dapr/dapr-1.15.3.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-dapr/dapr-1.15.3.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       tservite                    294d   False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-tservite/tservite-0.2.0.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-tservite/tservite-0.2.0.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       fluentd                     42d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-fluentd/fluentd-6.5.13.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-fluentd/fluentd-6.5.13.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       mongo                       204d   False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-mongo/mongodb-15.6.26.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-mongo/mongodb-15.6.26.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       santeramoinco               156d   False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-santeramoinco/microservice-1.0.5.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-santeramoinco/microservice-1.0.5.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       santeramoinco-worker        157d   False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-santeramoinco-worker/microservice-1.0.5.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-santeramoinco-worker/microservice-1.0.5.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       rabbitmq                    198d   False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-rabbitmq/rabbitmq-14.7.0.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-rabbitmq/rabbitmq-14.7.0.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       redis                       216d   False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-redis/redis-20.1.4.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-redis/redis-20.1.4.tgz": dial tcp 11.4.65.61:80: connect: connection refused
flux-system       stubsapi                    67d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-stubsapi/microservice-1.0.5.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/flux-system-stubsapi/microservice-1.0.5.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    flowwi                      34d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -flowwi/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -flowwi/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    tastermata                  36d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -tastermata/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -tastermata/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    monitoring                  44d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -monitoring/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -monitoring/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    papago                      44d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -papago/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -papago/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    grammelottecnologigateway   44d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -grammelottecnologigateway/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -grammelottecnologigateway/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    posimiton                   44d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -posimiton/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -posimiton/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
implementation    posimiton-worker            44d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -posimiton-worker/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/implementation -posimiton-worker/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused
mountebank        mountebank                  22d    False   failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/mountebank-mountebank/microservice-1.0.6+5cf11bd40c32.tgz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/mountebank-mountebank/microservice-1.0.6+5cf11bd40c32.tgz": dial tcp 11.4.65.61:80: connect: connection refused

That is 18 HelmReleases in total, all failing with the same "connection refused" error.

flux stats:

RECONCILERS             RUNNING FAILING SUSPENDED       STORAGE
GitRepository           4       0       0               545.3 KiB
OCIRepository           0       0       0               -
HelmRepository          4       0       0               26.4 MiB
HelmChart               18      0       0               414.0 KiB
Bucket                  0       0       0               -
Kustomization           4       3       0               -
HelmRelease             10      10      0               -
Alert                   2       0       0               -
Provider                2       0       0               -
Receiver                0       0       0               -
ImageUpdateAutomation   0       0       0               -
ImagePolicy             4       0       0               -
ImageRepository         4       0       0               -

Would it be possible to estimate the RAM usage of the source-controller in my case, similar to what @stefanprodan explained here?

Expected behavior

None

Screenshots and recordings

No response

OS / Distro

Ubuntu 22.04.3 LTS

Flux version

v0.41.2

Flux check

► checking prerequisites
✗ flux 0.41.2 <2.5.1 (new version is available, please upgrade)
✔ Kubernetes 1.28.3 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.30.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.30.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.25.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.34.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.32.1
✗ source-controller: deployment not ready
► ghcr.io/fluxcd/source-controller:v0.35.2
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta2
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1beta2
✗ check failed

Git provider

No response

Container Registry provider

No response

Additional context

kubectl get nodes -o wide:

NAME                              STATUS   ROLES   AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-default-xxxx-vmss000006   Ready    agent   217d   v1.28.3   11.0.1.166    <none>        Ubuntu 22.04.3 LTS   5.15.0-1053-azure   containerd://1.7.5-1
aks-default-xxxx-vmss000007   Ready    agent   217d   v1.28.3   11.0.0.199    <none>        Ubuntu 22.04.3 LTS   5.15.0-1053-azure   containerd://1.7.5-1
aks-default-xxxx-vmss000008   Ready    agent   217d   v1.28.3   11.0.0.69     <none>        Ubuntu 22.04.3 LTS   5.15.0-1053-azure   containerd://1.7.5-1
aks-default-xxxx-vmss000009   Ready    agent   217d   v1.28.3   11.0.0.10     <none>        Ubuntu 22.04.3 LTS   5.15.0-1053-azure   containerd://1.7.5-1


stefanprodan commented 1 month ago

You are using Flux v0.41.2, which reached end-of-life almost 2 years ago. Upgrade to Flux 2.5 and, if the problem persists, report it here. No one can help you on v0.41.2.

Nicola-Sergio commented 1 month ago

OK, I will upgrade as soon as possible. In the meantime, could you help me with this, if possible?

Would it be possible to estimate the RAM usage of the source-controller in my case, similar to what @stefanprodan explained here?

stefanprodan commented 1 month ago

After you upgrade to Flux 2.5, configure the Helm index caching and with the default 1GB RAM limit it should work fine.

Docs here: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
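For readers following along, here is a minimal sketch of what enabling the Helm index cache might look like in the flux-system kustomization.yaml, modeled on the linked vertical-scaling docs. The cache size, TTL and purge interval values are illustrative assumptions, not values tuned for this cluster, and the resource file names assume a standard `flux bootstrap` layout.

```yaml
# flux-system/kustomization.yaml (assumed bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Append Helm index cache flags to the source-controller container args.
      # Values below are illustrative; adjust to the number of HelmRepositories.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
    target:
      kind: Deployment
      name: source-controller
```

With four HelmRepositories reported by flux stats, a max size of 10 would leave room for every index to stay cached, but that is an estimate rather than a measured requirement.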

Nicola-Sergio commented 1 month ago

Will it work even if the Helm charts are stored in a GitRepository rather than a HelmRepository?

stefanprodan commented 1 month ago

Will it work even if the Helm charts are stored in a GitRepository rather than a HelmRepository?

I don't think the OOM is related to the Git operations but to Helm.

Each of these Kustomization resources has a spec.interval set to 1 minute, so changes are pulled frequently.

The Kustomization interval has nothing to do with the Git pull frequency; see the recommended settings here: https://fluxcd.io/flux/components/kustomize/kustomizations/#recommended-settings
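To illustrate the distinction, here is a hedged sketch of how the two intervals are set on separate resources in Flux 2.x. The resource names, repository URL and path are hypothetical placeholders, and the interval values are examples in the spirit of the recommended settings rather than values taken from this cluster.

```yaml
# The GitRepository interval controls how often source-controller checks Git for new commits.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: project-a            # hypothetical name
  namespace: flux-system
spec:
  interval: 5m               # Git pull frequency
  url: https://example.com/org/project-a.git   # placeholder URL
  ref:
    branch: main
---
# The Kustomization interval controls how often the cluster state is re-checked for drift.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: project-a            # hypothetical name
  namespace: flux-system
spec:
  interval: 1h               # drift detection and correction, not Git polling
  retryInterval: 2m          # retry sooner after a failed reconciliation
  timeout: 5m
  prune: true
  path: ./deploy             # hypothetical path inside the repository
  sourceRef:
    kind: GitRepository
    name: project-a
```

A Kustomization is also reconciled whenever its source produces a new revision, so a long spec.interval on the Kustomization does not delay the rollout of new commits.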