fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Kustomize controller does not detect changes on a resource #3552

Open schmidt-i opened 1 year ago

schmidt-i commented 1 year ago

Describe the bug

Changes to a HelmRelease manifest in a Git repo are neither applied by the kustomize-controller nor detected by flux diff.

Steps to reproduce

  1. Have a git repo that is configured as a GitRepository source
  2. Have a Kustomization configured that creates a HelmRelease (see the sketch below this list)
  3. Change the HelmRelease values section and remove a value from the list (in our case a multiline value)
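
A minimal sketch of such a setup, for context; the Git URL and repository name below are placeholders, while the Kustomization name blueprint and the flux-system namespace match what is used later in this report:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-repo              # placeholder name
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.example.com/org/my-repo.git   # placeholder URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: blueprint
  namespace: flux-system
spec:
  interval: 1m
  path: ./
  prune: true
  sourceRef:
    kind: GitRepository
    name: my-repo            # must match the GitRepository above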

Expected behavior

Changes are applied by the kustomize controller and the helm release is reconciled.

Screenshots and recordings

No response

OS / Distro

Linux

Flux version

v0.35.0

Flux check

► checking prerequisites
✔ Kubernetes 1.24.6 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.25.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.26.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.22.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.29.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.27.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.30.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed

Git provider

GitHub (Enterprise)

Container Registry provider

No response

Additional context

The change in the HelmRelease is a removal of a multiline yaml configuration from the values section.

~/k8s$ flux diff kustomization blueprint --path .
✓  Kustomization diffing...

Even flux diff shows no difference between the configuration in Git and the applied configuration.
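
One quick way to double-check what the kustomize-controller would apply is to render the Kustomization locally with the Flux CLI (a sketch, assuming a CLI version that has flux build kustomization and that the repo is checked out in the current directory):

$ flux build kustomization blueprint --path . | grep customCardTemplate

If the value was really removed in Git, this prints nothing.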

Currently configured resource in the cluster:

$ kubectl get helmreleases.helm.toolkit.fluxcd.io  -n blueprint prometheus-msteams -o yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  creationTimestamp: "2022-03-02T16:24:49Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 4
  labels:
    kustomize.toolkit.fluxcd.io/name: blueprint
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: prometheus-msteams
  namespace: blueprint
  resourceVersion: "223641439"
  uid: cb59fdb7-f414-4615-bc8b-a311858525a0
spec:
  chart:
    spec:
      chart: prometheus-msteams
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: prometheus-msteams
        namespace: blueprint
      version: 1.3.1
  dependsOn:
  - name: prometheus-operator
  install:
    remediation:
      retries: 3
  interval: 1m
  releaseName: prometheus-msteams
  values:
    customCardTemplate: '{{ define "teams.card" }} { "@type": "MessageCard", "@context":
      "http://schema.org/extensions", "themeColor": "{{- if eq .Status "resolved"
      -}}2DC72D {{- else if eq .Status "firing" -}} {{- if eq .CommonLabels.severity
      "critical" -}}8C1A1A {{- else if eq .CommonLabels.severity "warning" -}}FFA500
      {{- else -}}808080{{- end -}} {{- else -}}808080{{- end -}}", "summary": "{{-
      if eq .CommonAnnotations.summary "" -}} {{- if eq .CommonAnnotations.message
      "" -}} {{- js .CommonLabels.cluster | reReplaceAll "_" " " | reReplaceAll "-"
      " " | reReplaceAll `\''` "''" -}} {{- else -}} {{- js .CommonAnnotations.message
      | reReplaceAll "_" " " | reReplaceAll "-" " " | reReplaceAll `\''` "''" -}}
      {{- end -}} {{- else -}} {{- js .CommonAnnotations.summary | reReplaceAll "_"
      " " | reReplaceAll "-" " " | reReplaceAll `\''` "''" -}} {{- end -}}", "title":
      "Prometheus Alert ({{ .Status }})", "sections": [ {{$externalUrl := .ExternalURL}}
      {{- range $index, $alert := .Alerts }}{{- if $index }},{{- end }} { "activityTitle":
      "[{{ js $alert.Annotations.description |  reReplaceAll "_" " " | reReplaceAll
      `\''` "''" }}]({{ $externalUrl }})", "facts": [ {{- range $key, $value := $alert.Annotations
      }} { "name": "{{ $key }}", "value": "{{ js $value | reReplaceAll "_" " " | reReplaceAll
      `\''` "''" }}" }, {{- end -}} {{$c := counter}}{{ range $key, $value := $alert.Labels
      }}{{if call $c}},{{ end }} { "name": "{{ $key }}", "value": "{{ js $value |
      reReplaceAll "_" " " | reReplaceAll `\''` "''" }}" } {{- end }} ], "markdown":
      true } {{- end }} ] } {{ end }}'
    metrics:
      serviceMonitor:
        enabled: true
        scrapeInterval: 30s
    replicaCount: 2
    resources:
      limits:
        cpu: 30m
  valuesFrom:
  - kind: ConfigMap
    name: prometheus-msteams-config-values
    optional: true
status:
  conditions:
  - lastTransitionTime: "2023-01-31T08:18:01Z"
    message: Release reconciliation succeeded
    reason: ReconciliationSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-10-05T12:01:05Z"
    message: Helm upgrade succeeded
    reason: UpgradeSucceeded
    status: "True"
    type: Released
  helmChart: blueprint/blueprint-prometheus-msteams
  lastAppliedRevision: 1.3.1
  lastAttemptedRevision: 1.3.1
  lastAttemptedValuesChecksum: 4c81287381ac4d31719d9a83a6711baed5b92daf
  lastReleaseRevision: 3
  observedGeneration: 4

Configuration in the GitRepo:

$ kubectl kustomize . > /tmp/resource.yaml
$ grep -A 26 -B 9 "chart: prometheus-msteams" /tmp/resource.yaml 
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: prometheus-msteams
  namespace: blueprint
spec:
  chart:
    spec:
      chart: prometheus-msteams
      sourceRef:
        kind: HelmRepository
        name: prometheus-msteams
        namespace: blueprint
      version: 1.3.1
  dependsOn:
  - name: prometheus-operator
  install:
    remediation:
      retries: 3
  interval: 1m
  releaseName: prometheus-msteams
  values:
    metrics:
      serviceMonitor:
        enabled: true
        scrapeInterval: 30s
    replicaCount: 2
    resources:
      limits:
        cpu: 30m
  valuesFrom:
  - kind: ConfigMap
    name: prometheus-msteams-config-values
    optional: true
---

As you can see, the customCardTemplate value is no longer present. However, the kustomize-controller does not detect any change here.


stefanprodan commented 1 year ago

Can you please post the output of these commands here:

  • flux version
  • kubectl get helmreleases.helm.toolkit.fluxcd.io -n blueprint prometheus-msteams --show-managed-fields -o yaml

schmidt-i commented 1 year ago

Sure

Can you please post the output of these commands here:

  • flux version

    $ flux version
    flux: v0.35.0
    helm-controller: v0.25.0
    image-automation-controller: v0.26.0
    image-reflector-controller: v0.22.0
    kustomize-controller: v0.29.0
    notification-controller: v0.27.0
    source-controller: v0.30.0
  • kubectl get helmreleases.helm.toolkit.fluxcd.io -n blueprint prometheus-msteams --show-managed-fields -o yaml

    
    $ kubectl get helmreleases.helm.toolkit.fluxcd.io -n blueprint prometheus-msteams --show-managed-fields -o yaml
    apiVersion: helm.toolkit.fluxcd.io/v2beta1
    kind: HelmRelease
    metadata:
      creationTimestamp: "2022-06-24T15:23:00Z"
      finalizers:
      - finalizers.fluxcd.io
      generation: 4
      labels:
        kustomize.toolkit.fluxcd.io/name: blueprint
        kustomize.toolkit.fluxcd.io/namespace: flux-system
      managedFields:
      - apiVersion: helm.toolkit.fluxcd.io/v2beta1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:labels:
              f:kustomize.toolkit.fluxcd.io/name: {}
              f:kustomize.toolkit.fluxcd.io/namespace: {}
          f:spec:
            f:chart:
              f:spec:
                f:chart: {}
                f:sourceRef:
                  f:kind: {}
                  f:name: {}
                  f:namespace: {}
                f:version: {}
            f:dependsOn: {}
            f:install:
              f:remediation:
                f:retries: {}
            f:interval: {}
            f:releaseName: {}
            f:values: {}
            f:valuesFrom: {}
        manager: kustomize-controller
        operation: Apply
        time: "2022-10-11T16:08:39Z"
      - apiVersion: helm.toolkit.fluxcd.io/v2beta1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"finalizers.fluxcd.io": {}
        manager: helm-controller
        operation: Update
        time: "2022-06-24T15:23:00Z"
      - apiVersion: helm.toolkit.fluxcd.io/v2beta1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            f:conditions: {}
            f:helmChart: {}
            f:lastAppliedRevision: {}
            f:lastAttemptedRevision: {}
            f:lastAttemptedValuesChecksum: {}
            f:lastReleaseRevision: {}
            f:observedGeneration: {}
        manager: helm-controller
        operation: Update
        subresource: status
        time: "2023-01-19T20:56:50Z"
      name: prometheus-msteams
      namespace: blueprint
      resourceVersion: "136463080"
      uid: d76ace06-4b6f-442a-bbac-dd22738adf9c
    spec:
      chart:
        spec:
          chart: prometheus-msteams
          reconcileStrategy: ChartVersion
          sourceRef:
            kind: HelmRepository
            name: prometheus-msteams
            namespace: blueprint
          version: 1.3.1
      dependsOn:
      - name: prometheus-operator
      install:
        remediation:
          retries: 3
      interval: 1m
      releaseName: prometheus-msteams
      values:
        customCardTemplate: '{{ define "teams.card" }} { "@type": "MessageCard", "@context": "http://schema.org/extensions", "themeColor": "{{- if eq .Status "resolved" -}}2DC72D {{- else if eq .Status "firing" -}} {{- if eq .CommonLabels.severity "critical" -}}8C1A1A {{- else if eq .CommonLabels.severity "warning" -}}FFA500 {{- else -}}808080{{- end -}} {{- else -}}808080{{- end -}}", "summary": "{{- if eq .CommonAnnotations.summary "" -}} {{- if eq .CommonAnnotations.message "" -}} {{- js .CommonLabels.cluster | reReplaceAll "_" " " | reReplaceAll "-" " " | reReplaceAll `\''` "''" -}} {{- else -}} {{- js .CommonAnnotations.message | reReplaceAll "_" " " | reReplaceAll "-" " " | reReplaceAll `\''` "''" -}} {{- end -}} {{- else -}} {{- js .CommonAnnotations.summary | reReplaceAll "_" " " | reReplaceAll "-" " " | reReplaceAll `\''` "''" -}} {{- end -}}", "title": "Prometheus Alert ({{ .Status }})", "sections": [ {{$externalUrl := .ExternalURL}} {{- range $index, $alert := .Alerts }}{{- if $index }},{{- end }} { "activityTitle": "[{{ js $alert.Annotations.description |  reReplaceAll "_" " " | reReplaceAll `\''` "''" }}]({{ $externalUrl }})", "facts": [ {{- range $key, $value := $alert.Annotations }} { "name": "{{ $key }}", "value": "{{ js $value | reReplaceAll "_" " " | reReplaceAll `\''` "''" }}" }, {{- end -}} {{$c := counter}}{{ range $key, $value := $alert.Labels }}{{if call $c}},{{ end }} { "name": "{{ $key }}", "value": "{{ js $value | reReplaceAll "_" " " | reReplaceAll `\''` "''" }}" } {{- end }} ], "markdown": true } {{- end }} ] } {{ end }}'
        metrics:
          serviceMonitor:
            enabled: true
            scrapeInterval: 30s
        replicaCount: 2
        resources:
          limits:
            cpu: 30m
      valuesFrom:
      - kind: ConfigMap
        name: prometheus-msteams-config-values
        optional: true
    status:
      conditions:
      - lastTransitionTime: "2023-01-19T20:56:50Z"
        message: Release reconciliation succeeded
        reason: ReconciliationSucceeded
        status: "True"
        type: Ready
      - lastTransitionTime: "2022-10-11T16:10:09Z"
        message: Helm upgrade succeeded
        reason: UpgradeSucceeded
        status: "True"
        type: Released
      helmChart: blueprint/blueprint-prometheus-msteams
      lastAppliedRevision: 1.3.1
      lastAttemptedRevision: 1.3.1
      lastAttemptedValuesChecksum: 4c81287381ac4d31719d9a83a6711baed5b92daf
      lastReleaseRevision: 3
      observedGeneration: 4

stefanprodan commented 1 year ago

So if you commit the HR without customCardTemplate, it doesn't get removed? If so, does the Kustomization report any errors? You can check with flux get kustomization.

schmidt-i commented 1 year ago

Exactly: the customCardTemplate was removed in Git, but the removal is not applied to the cluster. There are no errors on the Kustomizations:

$ flux get kustomization
NAME                REVISION        SUSPENDED   READY   MESSAGE                         
blueprint           7.8.0/2b0a442   False       True    Applied revision: 7.8.0/2b0a442 
blueprint-extras    7.8.0/2b0a442   False       True    Applied revision: 7.8.0/2b0a442 

The kustomize-controller logs report the HelmRelease as unchanged: "HelmRelease/blueprint/prometheus-msteams":"unchanged",

stefanprodan commented 1 year ago

Hmm, but you're using a tag (7.8.0); does it contain the customCardTemplate removal?

schmidt-i commented 1 year ago

Yes. This release contains the patch where the value was removed. That was the first thing I checked.

n0rad commented 1 year ago

Hello, we have been hit by the same problem on different resources across multiple clusters from different providers for a few weeks now. Some values keys in HRs are not removed by the reconciliation.

We are currently running v0.38.2, on v1.24.8-gke.2000 and v1.23.6

The weird thing is that even if we kubectl apply the resource directly to get around the problem, the key is still not removed and kubectl replies with unchanged (except for the last-applied-configuration annotation). I suppose it may be linked to managed fields, but I don't know that mechanism well enough.

Comparing the managed fields on one resource that has the problem, only the 2 fields we are trying to remove are missing from the list.
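
For anyone who wants to check the same thing, one way to inspect which fields the kustomize-controller currently owns on an affected HelmRelease is sketched below; the release name and namespace are placeholders, and jq is assumed to be available:

$ kubectl get helmreleases.helm.toolkit.fluxcd.io my-release -n my-namespace \
    --show-managed-fields -o json \
  | jq '.metadata.managedFields[] | select(.manager == "kustomize-controller") | .fieldsV1'

Comparing this output between a cluster where the removal works and one where it does not can show whether spec.values is tracked as a single f:values: {} entry or expanded per sub-key, which is the difference described further down in this thread.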

makkes commented 1 year ago

I wasn't able to reproduce the issue on a kind cluster with Kubernetes 1.24.6 and Flux 0.35.0 from scratch so I suspect that a sequence of changes put the cluster in a state where this happens.

@schmidt-i are you able to reproduce the issue even on a fresh cluster?

choppedpork commented 1 year ago

I've just hit what looks like an identical issue today - do let me know if I should create a separate issue if it sounds like a different problem!

My scenario is that I've got a PrometheusRule with a couple of groups in it, something along these lines:

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cloud-admission-ctl-alerts-short-span
  namespace: cloud-admission-controller
  labels:
    target: alertmanager
spec:
  groups:
    - name: cloud-admission-controller
      rules:
        - alert: CloudAdmissionCtlDown
          (...)
    - name: cloud-admission-controller-probe
      rules:
        - alert: CloudAdmissionCtlProbeFailed
          (...)
        - alert: CloudAdmissionCtlProbeStale
          (...)
        - alert: CloudAdmissionCtlProbeHugelyStale
          (...)

I am decommissioning the CloudAdmissionCtlDown alert slowly across a fleet of clusters so I've created a json patch in kustomize for a few clusters like so:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
(...)
patches:
  - target:
      group: monitoring.coreos.com
      version: v1
      kind: PrometheusRule
      name: cloud-admission-ctl-alerts-short-span
    patch: |-
      - op: remove
        path: /spec/groups/0

This patch is replicated in four separate places: one cluster and three accounts (for every account there's an overlay which is included by the clusters in that account). The result of this being applied by Flux is quite surprising because:

a) even though a local kustomize build correctly drops that element, most of the clusters did not notice a change
b) the behaviour is inconsistent - two of the clusters did actually apply it correctly!

After a while of scratching my head and trying different things (mostly making changes to the PrometheusRule in one of the affected clusters), I tried removing the group I wanted removed via a manual kubectl edit of the PrometheusRule - which not only worked, but Flux did not revert that change! So this does indeed look like something is causing the change to be ignored, but only in certain circumstances.

@makkes's last question might actually be relevant here, because the two clusters where the patch worked are pretty new (only built a couple of weeks ago) and thus have only had one version of Flux 2 deployed to them with no subsequent upgrades, and have never had any Flux 1 components deployed to them. The clusters where I'm experiencing the issue have had (and still have) Flux 1 deployed and have been through a few Flux 2 upgrades.

FWIW I'm deploying Flux2 using the community helm chart.

To make things even more interesting, just as I was writing this up I thought I should try making another edit to this PrometheusRule in the cluster where I've done the manual edit before (by adding a fake group with a fake alert) - to my surprise the next reconciliation has correctly removed the edit.

I'm in a position where I can actually leave things as they are for a few days so please do let me know if there's any further debugging I could do to triage this issue further.

Thanks!

jkotiuk commented 1 year ago

I'm hitting a similar issue, running Flux 0.40.2.

I've seen this on a few HelmReleases already: when I remove a key in the Git repository, it does not get deleted in the cluster. The key is still visible/deployed in the HelmRelease (and in helm get values).

I'm not sure whether it is the source-controller that caches this key, or the helm- or kustomize-controller. I've tried to force reconciliation with flux reconcile hr --with-source, but nothing changed. If I remove the key from the HelmRelease definition directly, Flux will not restore it. I'm wondering where those keys are cached; I've killed all Flux components, so they should pick up a clean state, but the key was not removed.

We have 8 clusters and each cluster shares the same config; the behavior is random, as on some clusters the key is removed properly.

I've also tried setting upgrade.preserveValues: false in the HelmRelease and then changing some random value, but that didn't remove the old keys.

I'd like to know if there is any workaround that will force Helm to reinstall with clean values without removing the resources themselves.

An example of a key removed from the kube-prometheus-stack Helm values:

values:
  prometheus:
    prometheusSpec:
      image:
        tag: v2.41.0

After removing the image block, Prometheus keeps deploying the old version instead of v2.42.0.
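
One possible workaround, in line with the manual edits mentioned by other commenters, is to remove the stale key from the live HelmRelease yourself and let the next reconciliation carry on from there. This is only a sketch; the release name, namespace and path below are placeholders matching the kube-prometheus-stack example above:

$ kubectl patch helmreleases.helm.toolkit.fluxcd.io kube-prometheus-stack -n monitoring \
    --type=json \
    -p '[{"op": "remove", "path": "/spec/values/prometheus/prometheusSpec/image"}]'

Other comments in this thread report that Flux does not re-add keys removed this way, so the live object at least ends up matching Git again.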

williambrode commented 1 year ago

We are on v0.38.3 and are seeing exactly the same problem as described in the first post. It's extremely concerning because it breaks the entire contract that Flux 2 has (that it will apply the changes in the config). It would be much better if it at least surfaced an error somewhere.

usmonster commented 1 year ago

It's extremely concerning because it breaks the entire contract that Flux 2 has (that it will apply the changes in the config).

This is pretty serious. Is there a maintainer we can ping or a path to escalation? Thanks!

karrth commented 1 year ago

We are also seeing this on 0.41.2, and 2.0.0-rc.3. Values removed from a helm chart in the git repo are not getting removed from the deployed helm release.

caramcc commented 1 year ago

We're also seeing this on 2.0.0-rc.5.

Like jkotiuk mentioned, manually editing the helmrelease resource to remove the keys was a viable workaround — flux did not try to restore the previous values.

ebachle commented 1 year ago

I've also seen this issue on 0.41.2 recently.

My git diff looks like this (it's part of a patch):

diff --git a/clusters/dev-sandbox-redux/flux-components/vault/values.yaml b/clusters/dev-sandbox-redux/flux-components/vault/values.yaml
index 52f453ba..ff8d1e4f 100644
--- a/clusters/dev-sandbox-redux/flux-components/vault/values.yaml
+++ b/clusters/dev-sandbox-redux/flux-components/vault/values.yaml
@@ -6,7 +6,7 @@ metadata:
 spec:
   chart:
     spec:
-      version: "v0.19.0"
+      version: "v0.20.1"
   values:
     global:
       tlsDisable: false # Enable HTTPS (uses certificates from Cert-manager)
@@ -15,8 +15,6 @@ spec:
         - name: dockerhub
     server:
       repository: "public.ecr.aws/hashicorp/vault"
-      image:
-        tag: "1.9.3"
       extraArgs: "-config=/config/vault-config/config.hcl" # Get configuration from K8s secret (provisioned by Terraform)
       extraVolumes:
         - type: secret
@@ -76,7 +74,6 @@ spec:
     injector:
       agentImage:
         repository: "public.ecr.aws/hashicorp/vault"
-        tag: "1.9.2"
       replicas: 2
       annotations:
         cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

When I run flux build kustomization --path=./clusters/dev-sandbox-redux/flux-components/vault/ vault I get:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  labels:
    kustomize.toolkit.fluxcd.io/name: vault
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: vault
  namespace: hashicorp
spec:
  chart:
    spec:
      chart: vault
      sourceRef:
        kind: HelmRepository
        name: vault
        namespace: flux-system
      version: v0.20.1
  install:
    remediation:
      retries: 3
  interval: 1h0m0s
  releaseName: vault
  upgrade:
    crds: CreateReplace
  values:
    global:
      imagePullSecrets:
      - name: quay
      - name: dockerhub
      tlsDisable: false
    injector:
      agentImage:
        repository: public.ecr.aws/hashicorp/vault
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      replicas: 2
    server:
      authDelegator:
        enabled: false
      dataStorage:
        enabled: false
      extraArgs: -config=/config/vault-config/config.hcl
      extraVolumes:
      - name: vault-service-cluster-zone-tls
        type: secret
      - name: vault-config
        path: /config
        type: secret
      ha:
        config: |
          storage "consul" {
            path = "vault"
            address = "HOST_IP:8500"
          }
          telemetry {
            dogstatsd_addr = "HOST_IP:8125"
          }
        disruptionBudget.maxUnavailable: 2
        enabled: true
        replicas: 5
      ingress:
        activeService: false
        annotations:
          external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
          kubernetes.io/ingress.class: nginx-internal
          nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        enabled: true
        hosts: "...redacted..."
      repository: public.ecr.aws/hashicorp/vault
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
      service:
        annotations: {}
      updateStrategyType: RollingUpdate

So it's definitely not there in the build. But when I apply it, it doesn't prune the previously explicitly set image tag values from the HelmRelease.

They're also not detected by flux diff kustomization --path=./clusters/dev-sandbox-redux/flux-components/vault/ vault

.  Kustomization diffing...: running dry-run
.. Kustomization diffing...: processing inventory
✓  Kustomization diffing...

► HelmRelease/hashicorp/vault drifted

metadata.generation
  ± value change
    - 5
    + 6

spec.chart.spec.version
  ± value change
    - v0.19.0
    + v0.20.1

We end up having to manually edit the HelmRelease to remove these left-behind fields.

A copy of my current HelmRelease (before any changes) is here:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  creationTimestamp: "2023-03-29T19:17:31Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 5
  labels:
    kustomize.toolkit.fluxcd.io/name: vault
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: vault
  namespace: hashicorp
  resourceVersion: "1585885902"
  uid: b2397151-653c-46e9-8a70-fd7ee27e0977
spec:
  chart:
    spec:
      chart: vault
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: vault
        namespace: flux-system
      version: v0.19.0
  install:
    remediation:
      retries: 3
  interval: 1h0m0s
  releaseName: vault
  upgrade:
    crds: CreateReplace
  values:
    global:
      imagePullSecrets:
      - name: quay
      - name: dockerhub
      tlsDisable: false
    injector:
      agentImage:
        repository: public.ecr.aws/hashicorp/vault
        tag: 1.9.2
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      replicas: 2
    server:
      authDelegator:
        enabled: false
      dataStorage:
        enabled: false
      extraArgs: -config=/config/vault-config/config.hcl
      extraVolumes:
      - name: vault-service-cluster-zone-tls
        type: secret
      - name: vault-config
        path: /config
        type: secret
      ha:
        config: |
          storage "consul" {
            path = "vault"
            address = "HOST_IP:8500"
          }
          telemetry {
            dogstatsd_addr = "HOST_IP:8125"
          }
        disruptionBudget.maxUnavailable: 2
        enabled: true
        replicas: 5
      image:
        tag: 1.9.3
      ingress:
        activeService: false
        annotations:
          external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
          kubernetes.io/ingress.class: nginx-internal
          nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        enabled: true
        hosts: "...redacted..."
      repository: public.ecr.aws/hashicorp/vault
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
      service:
        annotations: {}
      updateStrategyType: RollingUpdate
status:
  conditions:
  - lastTransitionTime: "2023-05-31T11:40:19Z"
    message: Release reconciliation succeeded
    reason: ReconciliationSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-05-17T20:38:27Z"
    message: Helm upgrade succeeded
    reason: UpgradeSucceeded
    status: "True"
    type: Released
  helmChart: flux-system/hashicorp-vault
  lastAppliedRevision: 0.19.0
  lastAttemptedRevision: 0.19.0
  lastAttemptedValuesChecksum: 5635480e74c3e338b6741546341da74945b593ea
  lastReleaseRevision: 14
  observedGeneration: 5

I'm also happy to help debug this if that's of any use.

Thanks for looking into it!

kingdonb commented 1 year ago

Is the patch diff part of a Flux kustomization file?

https://github.com/fluxcd/flux2/pull/4062

You could be hitting this issue, which is fixed after 2.0.1 @ebachle - could you take a look and see if it sounds like the same issue you're seeing?

The original report is from a very old version; unless we have an active reporter after 2.0.1, I think we should close it. If you can read the description of the linked PR, confirm the location of the patch, check briefly whether you think you should be using a kustomization file, try that, and let us know if it solves your issue, then I think we can close this.
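
For reference, "using a kustomization file" here presumably means committing an explicit kustomize file next to the manifests rather than relying on the one the controller generates; a minimal sketch, with helmrelease.yaml as a placeholder file name:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - helmrelease.yaml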

ebachle commented 1 year ago

Hey @kingdonb, this is going to be a bit of a long answer, but here's what I eventually found out.

No guarantees on this, but I'm almost certain the version we installed the HelmRelease with was kustomize-controller:v0.21.1. We're currently running kustomize-controller:v0.35.1, so it's possible the fix landed somewhere in between.

What I ultimately concluded is that the issue lies in which fields Kubernetes thinks the kustomize-controller manages in the manifest.

This is the relevant managedFields portion of the manifest before the change:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  creationTimestamp: "2023-03-29T19:17:31Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 5
  labels:
    kustomize.toolkit.fluxcd.io/name: vault
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  managedFields:
  - apiVersion: helm.toolkit.fluxcd.io/v2beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:kustomize.toolkit.fluxcd.io/name: {}
          f:kustomize.toolkit.fluxcd.io/namespace: {}
      f:spec:
        f:chart:
          f:spec:
            f:chart: {}
            f:sourceRef:
              f:kind: {}
              f:name: {}
              f:namespace: {}
            f:version: {}
        f:install:
          f:remediation:
            f:retries: {}
        f:interval: {}
        f:releaseName: {}
        f:upgrade:
          f:crds: {}
        f:values: {}
    manager: kustomize-controller
    operation: Apply
    time: "2023-05-17T20:38:24Z"
  - apiVersion: helm.toolkit.fluxcd.io/v2beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"finalizers.fluxcd.io": {}
    manager: helm-controller
    operation: Update
    time: "2023-03-29T19:17:31Z"
  - apiVersion: helm.toolkit.fluxcd.io/v2beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:helmChart: {}
        f:lastAppliedRevision: {}
        f:lastAttemptedRevision: {}
        f:lastAttemptedValuesChecksum: {}
        f:lastReleaseRevision: {}
        f:observedGeneration: {}
    manager: helm-controller
    operation: Update
    subresource: status
    time: "2023-05-31T11:40:19Z"
  name: vault
  namespace: hashicorp
  resourceVersion: "1585885902"
  uid: b2397151-653c-46e9-8a70-fd7ee27e0977
spec:

When I apply a change that doesn't remove a field (really just a random change, additive or mutating, that forces the kustomize-controller to re-apply the resource), this is what it becomes:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  creationTimestamp: "2023-03-29T19:17:31Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 6
  labels:
    kustomize.toolkit.fluxcd.io/name: vault
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  managedFields:
  - apiVersion: helm.toolkit.fluxcd.io/v2beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:kustomize.toolkit.fluxcd.io/name: {}
          f:kustomize.toolkit.fluxcd.io/namespace: {}
      f:spec:
        f:chart:
          f:spec:
            f:chart: {}
            f:sourceRef:
              f:kind: {}
              f:name: {}
              f:namespace: {}
            f:version: {}
        f:install:
          f:remediation:
            f:retries: {}
        f:interval: {}
        f:releaseName: {}
        f:upgrade:
          f:crds: {}
        f:values:
          f:global:
            .: {}
            f:imagePullSecrets: {}
            f:tlsDisable: {}
          f:injector:
            .: {}
            f:agentImage:
              .: {}
              f:repository: {}
            f:annotations:
              .: {}
              f:cluster-autoscaler.kubernetes.io/safe-to-evict: {}
            f:replicas: {}
          f:server:
            .: {}
            f:authDelegator:
              .: {}
              f:enabled: {}
            f:dataStorage:
              .: {}
              f:enabled: {}
            f:extraArgs: {}
            f:extraVolumes: {}
            f:ha:
              .: {}
              f:config: {}
              f:disruptionBudget.maxUnavailable: {}
              f:enabled: {}
              f:replicas: {}
            f:ingress:
              .: {}
              f:activeService: {}
              f:annotations:
                .: {}
                f:external-dns.alpha.kubernetes.io/cloudflare-proxied: {}
                f:kubernetes.io/ingress.class: {}
                f:nginx.ingress.kubernetes.io/ssl-passthrough: {}
              f:enabled: {}
              f:hosts: {}
            f:repository: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:service:
              .: {}
              f:annotations:
                .: {}
                f:ad.datadoghq.com/service.check_names: {}
                f:ad.datadoghq.com/service.init_configs: {}
                f:ad.datadoghq.com/service.instances: {}
            f:updateStrategyType: {}
    manager: kustomize-controller
    operation: Apply
    time: "2023-07-19T15:31:24Z"
  - apiVersion: helm.toolkit.fluxcd.io/v2beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"finalizers.fluxcd.io": {}
    manager: helm-controller
    operation: Update
    time: "2023-03-29T19:17:31Z"
  - apiVersion: helm.toolkit.fluxcd.io/v2beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:helmChart: {}
        f:lastAppliedRevision: {}
        f:lastAttemptedRevision: {}
        f:lastAttemptedValuesChecksum: {}
        f:lastReleaseRevision: {}
        f:observedGeneration: {}
    manager: helm-controller
    operation: Update
    subresource: status
    time: "2023-07-19T15:31:24Z"
  name: vault
  namespace: hashicorp
  resourceVersion: "1749511002"
  uid: b2397151-653c-46e9-8a70-fd7ee27e0977
spec:

Namely, beforehand the values field is only tracked at the top level, while afterwards each field below it is tracked individually:

        f:values: {}

After that point I'm able to modify the value of my image field in a separate commit and all works as expected.
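
A possible way to spot-check which of the two states a given cluster is in (a sketch; it assumes jq is available and uses the same vault release as above):

$ kubectl get helmreleases.helm.toolkit.fluxcd.io vault -n hashicorp \
    --show-managed-fields -o json \
  | jq '.metadata.managedFields[] | select(.manager == "kustomize-controller") | .fieldsV1."f:spec"."f:values"'

An output of {} corresponds to the "before" state shown above, while an expanded map of f: keys corresponds to the "after" state.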

I've reviewed the changelog of the kustomize-controller from v0.21.1 to v0.35.1 and haven't seen anything that stands out as to why those values weren't stored as sub-objects of that managed field.

The other possibility is that it's something that changed between our upgrades to Kubernetes 1.22 and 1.23 in this time frame, but I'd be hard pressed to pin that down either.

I'd be curious if there are any ideas, but I did regardless want to share my findings in case anyone else finds themselves in this pickle.

I'm also not sure if there's a change that could be made to force this addition of new managed fields before applying a change, but that feels like a rather risky change in general, especially as releases after the GA may not have this issue.

Deep details

Some details on the exact change I made... I updated the chart version in one PR:

diff --git a/clusters/arryn-staging-redux/flux-components/vault/values.yaml b/clusters/arryn-staging-redux/flux-components/vault/values.yaml
index 7b375891..bca9bea2 100644
--- a/clusters/arryn-staging-redux/flux-components/vault/values.yaml
+++ b/clusters/arryn-staging-redux/flux-components/vault/values.yaml
@@ -6,7 +6,7 @@ metadata:
 spec:
   chart:
     spec:
-      version: "v0.19.0"
+      version: "v0.20.1"
   values:
     global:
       tlsDisable: false # Enable HTTPS (uses certificates from Cert-manager)

This resulted in the following diff from flux diff:

.  Kustomization diffing...: running dry-run
.. Kustomization diffing...: processing inventory
✓  Kustomization diffing...
► PriorityClass/global-cluster-critical drifted

metadata.labels.kustomize.toolkit.fluxcd.io/name
  ± value change
    - consul
    + vault

► Namespace/hashicorp drifted

metadata.labels.kustomize.toolkit.fluxcd.io/name
  ± value change
    - consul
    + vault

► HelmRelease/hashicorp/vault drifted

metadata.generation
  ± value change
    - 8
    + 9

spec.chart.spec.version
  ± value change
    - v0.19.0
    + v0.20.1

After that point the fields are managed as expected.

Then I made a separate PR to remove the image fields I no longer want to differ from the default values. And this is once the values field was managed per sub-field:

diff --git a/clusters/arryn-staging-redux/flux-components/vault/values.yaml b/clusters/arryn-staging-redux/flux-components/vault/values.yaml
index 4fcd2b0a..ff8d1e4f 100644
--- a/clusters/arryn-staging-redux/flux-components/vault/values.yaml
+++ b/clusters/arryn-staging-redux/flux-components/vault/values.yaml
@@ -15,8 +15,6 @@ spec:
         - name: dockerhub
     server:
       image:
         repository: "public.ecr.aws/hashicorp/vault"
-        tag: "1.9.3"
       extraArgs: "-config=/config/vault-config/config.hcl" # Get configuration from K8s secret (provisioned by Terraform)
       extraVolumes:
         - type: secret
@@ -76,7 +74,6 @@ spec:
     injector:
       agentImage:
         repository: "public.ecr.aws/hashicorp/vault"
-        tag: "1.9.2"
       replicas: 2
       annotations:
         cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

At that point my flux diff is:

.  Kustomization diffing...: running dry-run
.. Kustomization diffing...: processing inventory
✓  Kustomization diffing...
► HelmRelease/hashicorp/vault drifted

metadata.generation
  ± value change
    - 9
    + 10

spec.values.injector.agentImage
  - one map entry removed:
    tag: 1.9.2

spec.values.server
  - one map entry removed:
      tag: 1.9.3

And all things seem managed as expected, including the removal/update of the field.

kingdonb commented 1 year ago

I appreciate you sharing your findings! I just wanted to make sure I've read the conclusion correctly: you found that the upgrade did resolve the issue, though it sounds like you may have still had to force a change somehow to see the updated result in the end.

There were definitely updates in the kustomize-controller in later versions that affected how server-side apply reconciles sub-structures; I'm not sure of the exact versions that included these changes. So long as you're able to work with the current GA state, and since it sounds like you have (had) a repro of the issue on a version matching the report, then, if I understood all that correctly, I believe based on your update we can close this issue.

Thanks again for reporting back @schmidt-i. Have I got that right?

chimisu commented 4 months ago

Hi, has this issue been resolved in 2.0.1? I'm also facing this bug in Flux 0.28.5.

makkes commented 4 months ago

Hi, has this issue been resolved in 2.0.1? I'm also facing this bug in Flux 0.28.5.

We haven't gotten any more feedback on this issue for a couple of months now. 0.28.5 is more than 2 years old so if you would like to help here, you could upgrade to the latest Flux version and see if the issue goes away.

karrth commented 4 months ago

I'm still seeing this issue on v2.2.3