fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

Operator managed secrets being pruned by flux #3619

Closed Maximebb closed 2 years ago

Maximebb commented 2 years ago

Describe the bug

We use flux (v2) to deploy applications that are bundled with operator-managed resources. Specifically, we use the ECK operator to deploy Elasticsearch clusters, and the RabbitMQ Cluster Operator for RabbitMQ clusters.

Both require custom resources referred to by our kustomization file, which is synced by flux. Visually, the repo layout:

cluster/
|- flux-app.yaml
|- application/
  |- kustomization.yaml
  |- rabbitmqcluster.yaml
  |- elasticcluster.yaml

flux.yaml

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
    name: app
spec:
    interval: 10m0s
    path: ./cluster/application
    prune: true
    force: true
    sourceRef:
        kind: GitRepository
        name: cluster-repository
    validation: client

kustomization.yaml

namespace: app-ns
resources:
  - rabbitmqcluster.yaml
  - elasticcluster.yaml

rabbitmqcluster.yaml

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
    name: rmq-worker
spec:
    resources:
        limits:
            cpu: 1000m
            memory: 1024Mi

elasticcluster.yaml

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
    name: elastic-server
spec:
  version: 8.2.2
  auth:
    roles:
    - secretName: elpha-elasticsearch-roles
    fileRealm:
    - secretName: elpha-elasticsearch-creds
  nodeSets:
    - name: default
      count: 2
      config:
        node.store.allow_mmap: true
      volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi

Both rmq-worker and elastic-server cause the associated operator to produce additional resources. In particular, each generates a secret with default credentials that our application code uses. So after reconciliation there will be a rmq-worker-default-user secret and an elastic-server-es-elastic-user secret, both containing credentials.

Now, we're observing flux behaving differently with each resource. With RabbitMQ, it will garbage collect the secret once in a while (we haven't identified the trigger yet), but it never touches the Elasticsearch one. In fact, we observed that the RabbitMQ secret carries flux annotations and labels, while the ES one doesn't.

...
metadata:
  annotations:
    kustomize.toolkit.fluxcd.io/checksum: b44d23ff2dd3a5de295b8262fc110e02705d7086
  labels:
    app.kubernetes.io/component: rabbitmq
    app.kubernetes.io/name: rmq-worker
    app.kubernetes.io/part-of: rabbitmq
    kustomize.toolkit.fluxcd.io/name: app
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: rmq-worker-default-user
  ownerReferences:
    - apiVersion: rabbitmq.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: RabbitmqCluster
      name: rmq-worker
      uid: 6957e7f7-ee0f-44d4-8890-5a56961c5b9a
...

The trace that shows flux garbage collecting this secret:

{
  "level": "info",
  "ts": "2022-06-08T18:33:42.189Z",
  "logger": "controller.kustomization",
  "msg": "garbage collection completed: Secret/app-ns/rmq-worker-default-user deleted\nSecret/app-ns/rmq-worker-erlang-cookie deleted\n",
  "reconciler group": "kustomize.toolkit.fluxcd.io",
  "reconciler kind": "Kustomization",
  "name": "app",
  "namespace": "flux-system"
}

Our workaround was to disable pruning on the Kustomization, which is less than ideal. We suspect a difference in how each operator creates its resources may be causing issues with flux pruning. Any clue as to why this is happening?
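For reference, the workaround is just the Kustomization from above with pruning turned off (a sketch; this disables garbage collection for everything the Kustomization applies, not only the operator-managed secrets):

```yaml
# Workaround: same Kustomization, but with prune disabled so flux
# never garbage collects anything under this Kustomization.
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: app
spec:
  interval: 10m0s
  path: ./cluster/application
  prune: false # was: true — stops flux from deleting the operator-created secrets
  force: true
  sourceRef:
    kind: GitRepository
    name: cluster-repository
  validation: client
```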

tagging @pattersongp

Steps to reproduce

  1. Install flux
  2. Install the RabbitMQ (we use the Helm chart) and ECK (deployed directly from the manifests) operators
  3. Deploy the resources mentioned above
  4. Trigger reconciliation of the rabbitmqcluster in some way

Expected behavior

Flux would not garbage collect the secrets generated by the operator for the RabbitmqCluster resource.

Kubernetes version / Distro / Cloud provider

1.21 and 1.22 (tested on both)

Flux version

image: ghcr.io/fluxcd/kustomize-controller:v0.13.3

Git provider

Gitlab

Container Registry provider

Gitlab

Additional context

No response


stefanprodan commented 2 years ago

This was fixed many months ago; kustomize-controller is now at v0.26. See here for how to update: https://github.com/fluxcd/flux2/discussions/1916

Maximebb commented 2 years ago

Oh, are you telling me that going ahead with our plan to upgrade flux would've fixed it? :D

Thanks for the fast response!

stefanprodan commented 2 years ago

In v1beta2 I've rewritten the garbage collector specifically for issues like this: some controllers decided to copy the kustomize.toolkit.fluxcd.io/checksum annotation to their own resources, making Flux think those objects were in the repo at some point. In v1beta2, Flux no longer looks for annotations; instead it keeps its own inventory of the objects it manages and runs garbage collection only against that inventory.
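To illustrate, migrating the Kustomization from the issue to the v1beta2 API is mostly a matter of bumping the apiVersion (a sketch; note that the `validation` field from the original manifest was deprecated in v1beta2, where server-side apply handles validation):

```yaml
# Sketch: the same Kustomization on the v1beta2 API, where garbage
# collection is driven by the controller's own inventory (stored in
# .status.inventory) rather than by checksum annotations on objects.
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./cluster/application
  prune: true # safe again: operator-created secrets are not in the inventory
  force: true
  sourceRef:
    kind: GitRepository
    name: cluster-repository
```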