Open · Andrea-Gallicchio opened this issue 2 years ago
The image-reflector-controller has nothing to do with Helm. Can you please post here the kubectl describe deployment output for the controller that runs into OOM?
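A minimal invocation, assuming the controller runs in the default flux-system namespace:

# Describe the deployment of the controller that gets OOM-killed
kubectl -n flux-system describe deployment image-reflector-controller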
Name:                   image-reflector-controller
Namespace:              flux-system
CreationTimestamp:      Thu, 23 Dec 2021 11:29:24 +0100
Labels:                 app.kubernetes.io/instance=flux-system
                        app.kubernetes.io/part-of=flux
                        app.kubernetes.io/version=v0.30.2
                        control-plane=controller
                        kustomize.toolkit.fluxcd.io/name=flux-system
                        kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:            deployment.kubernetes.io/revision: 6
Selector:               app=image-reflector-controller
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=image-reflector-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  image-reflector-controller
  Containers:
   manager:
    Image:       ghcr.io/fluxcd/image-reflector-controller:v0.18.0
    Ports:       8080/TCP, 9440/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    Limits:
      cpu:     100m
      memory:  640Mi
    Requests:
      cpu:     50m
      memory:  384Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  (v1:metadata.namespace)
    Mounts:
      /data from data (rw)
      /tmp from temp (rw)
  Volumes:
   temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   image-reflector-controller-db97c765d (1/1 replicas created)
Events:          <none>
@Andrea-Gallicchio can you confirm whether just before the OOM occurred there was anything abnormal in the logs?
We regularly reproduce the problem. Before the OOM kill there is nothing unusual in the logs; it's just the regular scanning for new tags:
2023-12-26T06:45:47+04:00 {"level":"info","ts":"2023-12-26T02:45:47.803Z","msg":"Latest image tag for 'public.ecr.aws/gravitational/teleport-distroless' resolved to 14.2.4","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"teleport","namespace":"flux-system"},"namespace":"flux-system","name":"teleport","reconcileID":"4f4771ff-7dd2-4b8e-9803-075f0a2460c4"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.332Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"7a100653-c4f0-45c6-aac6-a15f09f01de6"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.312Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"f0a9b715-e8ed-46fd-972a-e4852b2746a2"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.296Z","msg":"no new tags found, next scan in 5m0s","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"9c4644d4-bed0-4c0a-ab90-d74fa197f61c"}
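(For reference, the pre-kill logs above can be pulled from an already-restarted pod via the --previous flag; a minimal sketch, assuming the controller's default app=image-reflector-controller label:)

# Grab the first image-reflector-controller pod and dump the logs of its previous (killed) container instance
POD=$(kubectl -n flux-system get pod -l app=image-reflector-controller -o name | head -n 1)
kubectl -n flux-system logs "$POD" --previous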
But the main problem is that after the OOM kill the container can't recover and enters the CrashLoopBackOff state.
Here are the logs from the container starting up after the OOM kill:
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.414Z","logger":"runtime","msg":"attempting to acquire leader lease flux-system/image-reflector-controller-leader-election...\n"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.413Z","msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.309Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.308Z","logger":"setup","msg":"starting manager"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.302Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Deleting empty file: /data/000004.vlog
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Set nextTxnTs to 1657
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Discard stats nextEmptySlot: 0
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: All 0 tables opened in 0s
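To confirm that the restarts really are OOM kills rather than, say, liveness-probe failures, checking the container's last termination state should help (a sketch, again assuming the default label selector; for an OOM kill the reason is OOMKilled and the exit code 137):

# Print the reason and exit code of the previous container termination
kubectl -n flux-system get pod -l app=image-reflector-controller \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}{"\n"}{.items[0].status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'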
We are seeing the same issue, despite increasing memory requests and limits: currently at 512M/1G.
Nothing in the logs just before the OOM kill (which happened on Mon, 06 May 2024 22:17:39 +0100).
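For anyone else bumping the limits: since the deployment is managed by Flux, an ad-hoc in-cluster change will typically be reverted by kustomize-controller on the next reconciliation, so the bump has to live in the flux-system manifests as well. For a quick test, though, something like this works (the values are illustrative, not a recommendation):

# Temporary, in-cluster only; make the same change in Git or it will be reverted
kubectl -n flux-system set resources deployment/image-reflector-controller \
  --containers=manager --requests=memory=512Mi --limits=memory=1Gi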
Describe the bug
I run Flux on AWS EKS 1.21.5. I've noticed that after the last Flux update, the image-reflector-controller pod sometimes gets restarted due to OOMKilled, even though it has a high CPU and memory request/limit. The number of Helm Releases is between 30 and 40.
Steps to reproduce
N/A
Expected behavior
I expect image-reflector-controller to not restart due to OOMKilled.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v0.31.3
Flux check
► checking prerequisites
✔ Kubernetes 1.21.12-eks-a64ea69 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.22.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.18.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.24.4
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta1
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct