fluxcd / kustomize-controller

The GitOps Toolkit Kustomize reconciler
https://fluxcd.io
Apache License 2.0

CrashLoopBackOff OOMKilled for kustomize-controller #725

Closed · sofiasikandar123 closed this 2 years ago

sofiasikandar123 commented 2 years ago

We created a Flux configuration in our Kubernetes cluster, and the kustomize-controller pod keeps getting stuck in a CrashLoopBackOff state. The logs don't point to any particular root cause for the crash.

Pod status:

$ kubectl get pods -n flux-system
NAME                                       READY   STATUS             RESTARTS        AGE
fluxconfig-agent-7bbdd4f98f-b6gpk          2/2     Running            0               2d22h
fluxconfig-controller-cc788b88f-wjs9l      2/2     Running            0               2d22h
helm-controller-67c6cf57b-qmgrz            1/1     Running            0               2d22h
kustomize-controller-7cfb84c5fd-bg94s      0/1     CrashLoopBackOff   21 (114s ago)   2d22h
notification-controller-5485c8d468-xgvg9   1/1     Running            0               2d15h
source-controller-95c44bbf8-f9cht          1/1     Running            0               2d22h

Pod events:

$ kubectl describe pod kustomize-controller-7cfb84c5fd-bg94s -n flux-system
...
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulled   30m (x13 over 86m)   kubelet  Container image "mcr.microsoft.com/oss/fluxcd/kustomize-controller:v0.27.1" already present on machine
  Normal   Created  30m (x13 over 86m)   kubelet  Created container manager
  Normal   Started  30m (x13 over 86m)   kubelet  Started container manager
  Warning  BackOff  67s (x306 over 85m)  kubelet  Back-off restarting failed container

Pod logs:

$ kubectl logs kustomize-controller-7cfb84c5fd-bg94s -n flux-system
{"level":"info","ts":"2022-09-09T17:51:36.868Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-09-09T17:51:36.870Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-09-09T17:51:36.871Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-09-09T17:51:36.871Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0909 17:51:36.973225       7 leaderelection.go:248] attempting to acquire leader lease flux-system/kustomize-controller-leader-election...
I0909 17:52:21.794522       7 leaderelection.go:258] successfully acquired lease flux-system/kustomize-controller-leader-election
{"level":"info","ts":"2022-09-09T17:52:21.795Z","logger":"controller.kustomization","msg":"Starting EventSource","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind source: *v1beta2.Kustomization"}
{"level":"info","ts":"2022-09-09T17:52:21.795Z","logger":"controller.kustomization","msg":"Starting EventSource","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind source: *v1beta2.OCIRepository"}
{"level":"info","ts":"2022-09-09T17:52:21.795Z","logger":"controller.kustomization","msg":"Starting EventSource","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-09-09T17:52:21.795Z","logger":"controller.kustomization","msg":"Starting EventSource","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind source: *v1beta2.Bucket"}
{"level":"info","ts":"2022-09-09T17:52:21.795Z","logger":"controller.kustomization","msg":"Starting Controller","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization"}
{"level":"info","ts":"2022-09-09T17:52:21.901Z","logger":"controller.kustomization","msg":"Starting workers","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","worker count":4}
{"level":"info","ts":"2022-09-09T17:52:21.901Z","logger":"controller.kustomization","msg":"All dependencies are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"main-config4-data-api","namespace":"main-config4"}
{"level":"info","ts":"2022-09-09T17:52:21.925Z","logger":"controller.kustomization","msg":"Dependencies do not meet ready condition, retrying in 30s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"main-config5-frontend","namespace":"main-config5"}
{"level":"info","ts":"2022-09-09T17:52:21.930Z","logger":"controller.kustomization","msg":"All dependencies are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smilr-data-api","namespace":"smilr"}
{"level":"info","ts":"2022-09-09T17:52:21.932Z","logger":"controller.kustomization","msg":"Dependencies do not meet ready condition, retrying in 30s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smilr-frontend","namespace":"smilr"}
{"level":"info","ts":"2022-09-09T17:52:21.935Z","logger":"controller.kustomization","msg":"All dependencies are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"main-config-data-api","namespace":"main-config"}
{"level":"info","ts":"2022-09-09T17:52:22.856Z","logger":"controller.kustomization","msg":"server-side apply completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smilr-data-api","namespace":"smilr","output":{"Deployment/default/data-api":"unchanged","Service/default/data-api":"unchanged"}}
{"level":"info","ts":"2022-09-09T17:52:22.898Z","logger":"controller.kustomization","msg":"server-side apply completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"main-config-data-api","namespace":"main-config","output":{"Deployment/default/data-api":"configured","Service/default/data-api":"configured"}}
{"level":"info","ts":"2022-09-09T17:52:22.903Z","logger":"controller.kustomization","msg":"server-side apply completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"main-config4-data-api","namespace":"main-config4","output":{"Deployment/default/data-api":"configured","Service/default/data-api":"configured"}}
stefanprodan commented 2 years ago

Can you post the logs from the previous container instance here? (kubectl logs --previous)

sofiasikandar123 commented 2 years ago

Here they are. Thoughts?

$ kubectl logs kustomize-controller-7cfb84c5fd-bg94s --previous -n flux-system
{"level":"info","ts":"2022-09-15T14:57:33.854Z","logger":"controller-runtime.metrics","msg":"Metrics server
 is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-09-15T14:57:33.856Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-09-15T14:57:33.857Z","msg":"Starting server","path":"/metrics","kind":"metrics",
"addr":"[::]:8080"}
{"level":"info","ts":"2022-09-15T14:57:33.858Z","msg":"Starting server","kind":"health probe","addr":"[::]:
9440"}
I0915 14:57:33.959090       6 leaderelection.go:248] attempting to acquire leader lease flux-system/kustomi
ze-controller-leader-election...
I0915 14:58:16.309400       6 leaderelection.go:258] successfully acquired lease flux-system/kustomize-cont
roller-leader-election
{"level":"info","ts":"2022-09-15T14:58:16.309Z","logger":"controller.kustomization","msg":"Starting EventSo
urce","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind sou
rce: *v1beta2.Kustomization"}
{"level":"info","ts":"2022-09-15T14:58:16.309Z","logger":"controller.kustomization","msg":"Starting EventSo
urce","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind sou
rce: *v1beta2.OCIRepository"}
{"level":"info","ts":"2022-09-15T14:58:16.309Z","logger":"controller.kustomization","msg":"Starting EventSo
urce","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind sou
rce: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-09-15T14:58:16.309Z","logger":"controller.kustomization","msg":"Starting EventSo
urce","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","source":"kind sou
rce: *v1beta2.Bucket"}
{"level":"info","ts":"2022-09-15T14:58:16.309Z","logger":"controller.kustomization","msg":"Starting Control
ler","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization"}
{"level":"info","ts":"2022-09-15T14:58:16.410Z","logger":"controller.kustomization","msg":"Starting workers
","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","worker count":4}
{"level":"info","ts":"2022-09-15T14:58:16.411Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config10-data-api","namespace":"main-config10"}
{"level":"info","ts":"2022-09-15T14:58:16.411Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config4-frontend","namespace":"main-config4"}
{"level":"info","ts":"2022-09-15T14:58:16.445Z","logger":"controller.kustomization","msg":"Dependencies do 
not meet ready condition, retrying in 30s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kin
d":"Kustomization","name":"main-config5-frontend","namespace":"main-config5"}
{"level":"info","ts":"2022-09-15T14:58:17.241Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config4-mongodb","namespace":"main-config4","output":{"Deployment/default/mongodb":"unchanged","Service/d
efault/database":"unchanged"}}
{"level":"info","ts":"2022-09-15T14:58:17.270Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config10-data-api","namespace":"main-config10","output":{"Deployment/default/data-api":"configured","Serv
ice/default/data-api":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:17.398Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config-mongodb","namespace":"main-config","output":{"Deployment/default/mongodb":"configured","Service/de
fault/database":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:17.475Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config4-frontend","namespace":"main-config4","output":{"Deployment/default/frontend":"unchanged","Ingress
{"level":"info","ts":"2022-09-15T14:58:17.491Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.079400252s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config4-mongodb","namespace":"main-config4","revision":"external-ip-issue/
f4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:17.493Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config4-data-api","namespace":"main-config4"}
{"level":"info","ts":"2022-09-15T14:58:17.528Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.117169472s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config10-data-api","namespace":"main-config10","revision":"external-ip-iss
ue/f4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:17.654Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.192463409s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config-mongodb","namespace":"main-config","revision":"external-ip-issue/f4
af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:17.657Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config-data-api","namespace":"main-config"}
{"level":"info","ts":"2022-09-15T14:58:17.685Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.273614025s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config4-frontend","namespace":"main-config4","revision":"external-ip-issue
/f4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:18.352Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config4-data-api","namespace":"main-config4","output":{"Deployment/default/data-api":"configured","Servic
e/default/data-api":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:18.433Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smi
lr-mongodb","namespace":"smilr","output":{"Deployment/default/mongodb":"configured","Service/default/databa
se":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:18.560Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config-data-api","namespace":"main-config","output":{"Deployment/default/data-api":"configured","Service/
default/data-api":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:18.567Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.073585572s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config4-data-api","namespace":"main-config4","revision":"external-ip-issue
/f4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:18.568Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config10-frontend","namespace":"main-config10"}
{"level":"info","ts":"2022-09-15T14:58:18.648Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.107941545s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"smilr-mongodb","namespace":"smilr","revision":"external-ip/00e82d39c28b829dbd0c
0533b683aa8d691db736"}
{"level":"info","ts":"2022-09-15T14:58:18.678Z","logger":"controller.kustomization","msg":"Dependencies do 
not meet ready condition, retrying in 30s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kin
d":"Kustomization","name":"main-config5-data-api","namespace":"main-config5"}
{"level":"info","ts":"2022-09-15T14:58:18.680Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"smilr-data-api","namespace":"smilr"}
{"level":"info","ts":"2022-09-15T14:58:18.725Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.067581889s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config-data-api","namespace":"main-config","revision":"external-ip-issue/f
4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:18.730Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config-frontend","namespace":"main-config"}
{"level":"info","ts":"2022-09-15T14:58:19.509Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smi
lr-data-api","namespace":"smilr","output":{"Deployment/default/data-api":"configured","Service/default/data
-api":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:19.682Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config10-frontend","namespace":"main-config10","output":{"Deployment/default/frontend":"configured","Ingr
ess/default/smilr":"configured","Service/default/frontend":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:19.721Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.041445431s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"smilr-data-api","namespace":"smilr","revision":"external-ip/00e82d39c28b829dbd0
c0533b683aa8d691db736"}
{"level":"info","ts":"2022-09-15T14:58:19.923Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.355282249s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config10-frontend","namespace":"main-config10","revision":"external-ip-iss
ue/f4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:19.925Z","logger":"controller.kustomization","msg":"All dependencies
 are ready, proceeding with reconciliation","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"smilr-frontend","namespace":"smilr"}
{"level":"info","ts":"2022-09-15T14:58:19.967Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config-frontend","namespace":"main-config","output":{"Deployment/default/frontend":"configured","Ingress/
default/smilr":"configured","Service/default/frontend":"configured"}}
{"level":"info","ts":"2022-09-15T14:58:20.047Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.316768619s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config-frontend","namespace":"main-config","revision":"external-ip-issue/f
4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:20.542Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"mai
n-config10-mongodb","namespace":"main-config10","output":{"Deployment/default/mongodb":"configured","Servic
e/default/database":"configured"}}
I0915 14:58:21.046324       6 request.go:601] Waited for 1.090126399s due to client-side throttling, not pr
iority and fairness, request: GET:https://10.96.0.1:443/apis/source.toolkit.fluxcd.io/v1beta2?timeout=32s
{"level":"info","ts":"2022-09-15T14:58:21.103Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 1.374601414s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"main-config10-mongodb","namespace":"main-config10","revision":"external-ip-issu
e/f4af09260f726795fdcc5be3ddef9e3bc3882567"}
{"level":"info","ts":"2022-09-15T14:58:21.203Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smi
lr-frontend","namespace":"smilr","output":{"Namespace/default":"unchanged"}}
{"level":"info","ts":"2022-09-15T14:58:21.861Z","logger":"controller.kustomization","msg":"server-side appl
y completed","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"smi
lr-frontend","namespace":"smilr","output":{"ClusterRole/ingress-nginx":"unchanged","ClusterRole/ingress-ngi
nx-admission":"unchanged","ClusterRoleBinding/ingress-nginx":"unchanged","ClusterRoleBinding/ingress-nginx-
admission":"unchanged","ConfigMap/default/ingress-nginx-controller":"unchanged","Deployment/default/fronten
d":"configured","Deployment/default/ingress-nginx-controller":"unchanged","Ingress/default/smilr":"configur
ed","IngressClass/nginx":"unchanged","Job/default/ingress-nginx-admission-create":"unchanged","Job/default/
ingress-nginx-admission-patch":"unchanged","Role/default/ingress-nginx":"unchanged","Role/default/ingress-n
ginx-admission":"unchanged","RoleBinding/default/ingress-nginx":"unchanged","RoleBinding/default/ingress-ng
inx-admission":"unchanged","Service/default/frontend":"configured","Service/default/ingress-nginx-controlle
r":"unchanged","Service/default/ingress-nginx-controller-admission":"unchanged","ServiceAccount/default/ing
ress-nginx":"unchanged","ServiceAccount/default/ingress-nginx-admission":"unchanged","ValidatingWebhookConf
iguration/ingress-nginx-admission":"unchanged"}}
{"level":"info","ts":"2022-09-15T14:58:22.043Z","logger":"controller.kustomization","msg":"Reconciliation f
inished in 2.118346948s, next run in 10m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler ki
nd":"Kustomization","name":"smilr-frontend","namespace":"smilr","revision":"external-ip/00e82d39c28b829dbd0
c0533b683aa8d691db736"}
stefanprodan commented 2 years ago

I see no panic nor any other errors in the logs you posted, so something else must be going on. Can you please post the output of:

kubectl -n flux-system get deployment kustomize-controller -oyaml
kubectl -n flux-system describe deployment kustomize-controller
sofiasikandar123 commented 2 years ago

Here you go! Thanks for your quick response.

$ kubectl -n flux-system get deployment kustomize-controller -oyaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: flux
    meta.helm.sh/release-namespace: flux-system
  creationTimestamp: "2022-08-23T20:30:23Z"
  generation: 2
  labels:
    app.kubernetes.io/component: kustomize-controller
    app.kubernetes.io/instance: flux
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: microsoft.flux
    app.kubernetes.io/part-of: flux
    app.kubernetes.io/version: v0.33.0
    clusterconfig.azure.com/extension-version: 1.6.0
    control-plane: controller
  name: kustomize-controller
  namespace: flux-system
  resourceVersion: "5607736"
  uid: 50b552ef-7c39-4fa2-ab17-a2630fcbd567
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: kustomize-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: kustomize-controller
        app.kubernetes.io/name: microsoft.flux
    spec:
      containers:
      - args:
        - --events-addr=http://notification-controller.$(RUNTIME_NAMESPACE).svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        - --no-cross-namespace-refs=true
        - --no-remote-bases=true
        - --default-service-account=flux-applier
        env:
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: mcr.microsoft.com/oss/fluxcd/kustomize-controller:v0.27.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: healthz
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: manager
        ports:
        - containerPort: 9440
          name: healthz
          protocol: TCP
        - containerPort: 8080
          name: http-prom
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: healthz
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: temp
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1337
      serviceAccount: kustomize-controller
      serviceAccountName: kustomize-controller
      terminationGracePeriodSeconds: 60
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      volumes:
      - emptyDir: {}
        name: temp
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-08-23T20:34:26Z"
    lastUpdateTime: "2022-09-06T19:38:22Z"
    message: ReplicaSet "kustomize-controller-7cfb84c5fd" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-15T15:09:19Z"
    lastUpdateTime: "2022-09-15T15:09:19Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
$ kubectl -n flux-system describe deployment kustomize-controller
Name:                   kustomize-controller
Namespace:              flux-system
CreationTimestamp:      Tue, 23 Aug 2022 20:30:23 +0000
Labels:                 app.kubernetes.io/component=kustomize-controller
                        app.kubernetes.io/instance=flux
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=microsoft.flux
                        app.kubernetes.io/part-of=flux
                        app.kubernetes.io/version=v0.33.0
                        clusterconfig.azure.com/extension-version=1.6.0
                        control-plane=controller
Annotations:            deployment.kubernetes.io/revision: 2
                        meta.helm.sh/release-name: flux
                        meta.helm.sh/release-namespace: flux-system
Selector:               app=kustomize-controller
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=kustomize-controller
                    app.kubernetes.io/name=microsoft.flux
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  kustomize-controller
  Containers:
   manager:
    Image:       mcr.microsoft.com/oss/fluxcd/kustomize-controller:v0.27.1
    Ports:       9440/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.$(RUNTIME_NAMESPACE).svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
      --no-cross-namespace-refs=true
      --no-remote-bases=true
      --default-service-account=flux-applier
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:   (v1:metadata.namespace)
    Mounts:
      /tmp from temp (rw)
  Volumes:
   temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   kustomize-controller-7cfb84c5fd (1/1 replicas created)
Events:          <none>
stefanprodan commented 2 years ago

Hmm, really strange, there are no events for the deployment. Could it be that the Kubernetes node has issues? Please delete the pod and, as soon as the new one starts and fails, post the output of kubectl describe pod here.
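
Something like this (the replacement pod will get a different name suffix, so grab it from the get pods output):

kubectl -n flux-system delete pod kustomize-controller-7cfb84c5fd-bg94s
kubectl -n flux-system get pods -w
kubectl -n flux-system describe pod <new-kustomize-controller-pod>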

azure-claudiu commented 2 years ago

Hi, I'm working with Sofia on this case. What happens with the kustomize-controller pod is that it actually gets OOMKilled. I tried raising its memory limit (from 1 GiB to 2 GiB), and it still runs out of memory. I confirmed that memory ballooning is the cause by attaching to the pod during the short window it runs: after about 20 seconds, the kustomize-controller process inside the container grows from about 700 MiB to more than 3.2 GiB, and then the container gets killed with exit code 137.

Here's a snippet of describe pod kustomize-controller:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 12 Sep 2022 14:23:32 +0000
      Finished:     Mon, 12 Sep 2022 14:24:12 +0000
    Ready:          False
    Restart Count:  731
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
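
For reference, a sketch of one way to raise the limit for a test (the microsoft.flux Helm release that manages this Deployment may reconcile the value back):

kubectl -n flux-system patch deployment kustomize-controller --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}]'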
stefanprodan commented 2 years ago

@azure-claudiu To fix the memory leak we need to reproduce it first. Can you please create a repo with the YAML files that make the controller behave like this?

azure-claudiu commented 2 years ago

Just to clarify, we created these Flux configs using an Azure extension. The steps are basically the ones here: https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/tutorial-use-gitops-flux2

Would you need just the YAML files, or also the Flux configs created by this extension?

sofiasikandar123 commented 2 years ago

This is our file structure; there are three YAML files in our base/mongodb folder.

base/
  mongodb/
    kustomization.yaml
    mongo-deployment.yaml
    mongo-service.yaml

kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: default
resources:
  - mongo-deployment.yaml
  - mongo-service.yaml

mongo-deployment.yaml:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: mongodb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb-container
          image: mongo:5.0
          imagePullPolicy: Always
          ports:
            - containerPort: 27017
          env:
            - name: MONGO_INITDB_ROOT_USERNAME
              value: admin
            - name: MONGO_INITDB_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mongo-creds
                  key: admin-password

mongo-service.yaml:

kind: Service
apiVersion: v1
metadata:
  name: database
spec:
  type: ClusterIP
  selector:
    app: mongodb
  ports:
    - protocol: TCP
      port: 27017
      targetPort: 27017
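
Note: the mongo-creds Secret referenced by the Deployment above isn't part of these files; it's assumed to already exist in the cluster, created with something along the lines of:

kubectl -n default create secret generic mongo-creds \
  --from-literal=admin-password=<your-password>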

Lastly, we created a GitOps configuration with the following command:

az k8s-configuration flux create -g YOUR_RESOURCE_GROUP \
-c YOUR_ARC_ENABLED_CLUSTER \
-n cluster-configs \
--namespace cluster-configs \
-t connectedClusters \
--scope cluster \
-u https://your.git.repository/repo \
--https-user=YOUR_USERNAME \
--https-key=YOUR_PASSWORD \
--branch YOUR_BRANCH_NAME \
--kustomization name=mongodb path=/base/mongodb prune=true 
stefanprodan commented 2 years ago

With those files I can't reproduce the OOM. Can you please swap the image with the upstream one and see if it fails the same way? If it does, then please take a heap dump and share it with me. Set the controller image in its deployment to: ghcr.io/fluxcd/kustomize-controller:v0.28.0.
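
For example (a sketch; if the Azure extension keeps reconciling the Deployment, the image change may get reverted, in which case suspend that reconciliation first):

kubectl -n flux-system set image deployment/kustomize-controller \
  manager=ghcr.io/fluxcd/kustomize-controller:v0.28.0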

azure-claudiu commented 2 years ago

I changed the image; it still leaks memory: https://fluxissues1.blob.core.windows.net/videos/top-102.mp4

Here's a heap dump taken roughly every second, until it dies: https://fluxissues1.blob.core.windows.net/heaps/heap-b.zip

stefanprodan commented 2 years ago

Can you please create a zip of your repo and share it with me? I also need the GitRepository and all Flux Kustomization manifests from the cluster. If the repo contains sensitive information, you can reach out to me on CNCF Slack and share it privately.

stefanprodan commented 2 years ago

I suspect that one of the repositories used here is very large, and that may cause the OOM since we need to load the content into memory to verify the checksum. Can you please exec into source-controller and post the output of:

$ kubectl -n flux-system exec -it source-controller-7b66c9d497-r4d68 -- sh
~ $ du -sh /data/*
160.0K  /data/helmchart
252.0K  /data/helmrepository
88.0K   /data/ocirepository
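
Or, without looking up the exact pod name (assuming a single source-controller replica):

kubectl -n flux-system exec deploy/source-controller -- du -sh /data/*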
stefanprodan commented 2 years ago

I managed to reproduce the OOM with a repo containing over 100MB of dummy files. I guess we need to reject such an artifact in source-controller and error out, telling people to use .sourceignore to exclude files that are not meant for Flux.

azure-claudiu commented 2 years ago

We definitely have a directory in the source tree that contains large ML model files. Here's the output from our source-controller:

~ $ du -sh /data/*
681.6M  /data/gitrepository
stefanprodan commented 2 years ago

OK, then mystery solved: add a .sourceignore file to that repo and exclude all paths which don't contain Kubernetes YAMLs. Still, downloading GBs of unrelated data in-cluster is very problematic, so I suggest you create a dedicated repo for that app, called <app-name>-deploy, with only YAMLs, and make Flux look at that.
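
For example, a minimal .sourceignore at the repository root could look like this (the paths and extensions below are only placeholders for wherever the large files live; the syntax follows .gitignore patterns):

# exclude everything that isn't a Kubernetes manifest
# (placeholder paths/extensions - adjust to your repo layout)
/models/
*.bin
*.onnx
*.h5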

A better option would be to push the manifests from that repo to ACR and let Flux sync the manifests from there; see https://fluxcd.io/flux/cheatsheets/oci-artifacts/
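
A rough sketch of that flow with the Flux CLI (the registry name is a placeholder; check the cheatsheet above for the exact flags in your Flux version):

flux push artifact oci://<your-registry>.azurecr.io/smilr-manifests:latest \
  --path=./base \
  --source="https://your.git.repository/repo" \
  --revision="YOUR_BRANCH_NAME/$(git rev-parse HEAD)"

and then point an OCIRepository plus a Kustomization at that artifact instead of the Git repo.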

sofiasikandar123 commented 2 years ago

Thanks for your help, Stefan. That resolved it for us. Closing the issue now!