fluxcd / helm-controller

The GitOps Toolkit Helm reconciler, for declarative Helming
https://fluxcd.io
Apache License 2.0
408 stars 163 forks source link

Re-applying unchanged HelmRelease YAML triggers helm upgrade action #403

Closed simingweng closed 11 months ago

simingweng commented 2 years ago

Here are the steps to reproduce the my observation.

  1. create a local KinD cluster as target for testing kind create cluster
  2. install FluxCD components flux install
  3. create a helm repository kubectl apply -f https://github.com/fluxcd/helm-controller/raw/main/config/samples/source_v1beta1_helmrepository.yaml
  4. create a helm release resource https://github.com/fluxcd/helm-controller/raw/main/config/samples/helm_v2beta1_helmrelease_helmrepository.yaml
  5. when the helm release becomes ready, the status shows lastReleaseRevision: 1
  6. now re-apply the same helm release kubectl apply -f https://github.com/fluxcd/helm-controller/raw/main/config/samples/helm_v2beta1_helmrelease_helmrepository.yaml
  7. it triggers a reconciliation, and a helm upgrade is performed, and results in lastReleaseRevision: 2
LAST SEEN   TYPE     REASON              OBJECT                               MESSAGE
17m         Normal   info                helmrelease/podinfo-helmrepository   HelmChart 'default/default-podinfo-helmrepository' is not ready
17m         Normal   info                helmrelease/podinfo-helmrepository   Helm install has started
17m         Normal   ScalingReplicaSet   deployment/podinfo-helmrepository    Scaled up replica set podinfo-helmrepository-7d5dc94c75 to 1
17m         Normal   info                helmrelease/podinfo-helmrepository   Helm install succeeded
7m25s       Normal   info                helmrelease/podinfo-helmrepository   Helm test succeeded
7m30s       Normal   info                helmrelease/podinfo-helmrepository   Helm upgrade has started
7m29s       Normal   info                helmrelease/podinfo-helmrepository   Helm upgrade succeeded

Is this behavior by design? Although the helm upgrade action doesn't modify any meaningful state in the cluster, does it make sense to advance the helm revision in such a scenario?

flux: v0.25.1
helm-controller: v0.15.0
kustomize-controller: v0.19.0
notification-controller: v0.20.1
source-controller: v0.20.1
simingweng commented 2 years ago

Here is the comparison between the two generations of the same resource from Kubernetes API server point of https://www.diffchecker.com/krOGkCeF, it seems the unnecessary reconciliation is caused by the interval field.

The interval field is of type metav1.Duration which embeds time.Duration. When an interval is specified as 5m in the YAML, it's unmarshaled then marshaled as 5m0s. Therefore, when the same YAML is applied again, it causes a spec change from 5m0s to 5m, resulting a new generation of the resource and a reconciliation.

Interestingly, once the same YAML is applied for the second time, the desired 5m string representation is preserved, therefore any further re-applying no longer produces new generation of resources. I guess it's due to the difference between Create and Update operation of the resource, but I'm not sure.

Having found the above, the workaround to avoid such an unnecessary helm upgrade action is to always include the seconds field in interval as well as any field of type metav1.Duration. It is important in our use case because we need to reliably know whether an upgrade has happened or not.

kingdonb commented 2 years ago

@simingweng I don't believe that 5m0s to 5m is the cause of a trouble here.

Not directly related to your issue, but this is hot and it's coming out in the next release of Flux, have you seen this yet:

$ flux diff ks 75-deis --path ./apps/hephy/
► HelmRelease/deis/hephy drifted

metadata.generation
  ± value change
    - 7
    + 8

spec.interval
  ± value change
    - 15m0s
    + 15m

This was just merged and so I thought to use it to test what you described, as it should be in the next Flux release; you can see that 15m0s and 15m do get applied as different values in the cluster with Server Side Apply. So one can infer, I think, that setting a new spec.interval also has the effect of reconciling the HelmRelease for an upgrade.

I was able to replicate something like the behavior that you described, although I'm not sure I can provide much in the way of analysis of your findings, I can tell you that the Kustomize controller behavior is about to see some big changes in Flux 0.26 and you very well might see a positive effect difference in the behavior there when it is released, likely soon. (Watch fluxcd/flux2#2308 for more details.)

Helm Controller has some quirks around timeout and helm --wait behavior due to the unpredictable behavior of clusters at runtime that results occasionally in a helmrelease that has failed and/or is stuck. In order to get the HelmRelease un-stuck, we generally prescribe making an update to the HelmRelease in the source and it triggers another reconcile/upgrade action.

I always thought that users had to make a change to the inputs that Helm considers in order to trigger the upgrade, so: values, chart (template, defaults), or secrets or configmaps listed in valuesFrom. Turns out Helm Controller isn't quite that picky, and any update to the HelmRelease itself will do. (That's helpful information, so thanks very much for pointing it out!) I had been telling folks that they can add random values to spec.values as a nonce, but that's a little bit untoward, I'd much rather edit something which can be clearly shown to have no effect on the outcome, like spec.interval.

So you understand, Helm Controller is not continuously upgrading the Helm Release in pure "continuous reconciling" fashion as one might anticipate according to the definition of GitOps, due to the difficult combination of behavior of Helm's lifecycle hooks, secret generators, and 3-way merging patch-update.

Anyway this is a long story, but YSK that users cannot generally trigger HelmRelease to reconcile an upgrade when there are no changes to apply by simply running flux reconcile helmrelease <theHelmRel> because Helm Controller will only do an upgrade when there is an input that has changed. Helm Controller cannot detect drift due to another long story (#186) so at each spec.interval it really only does checking for changes in inputs that should trigger an upgrade.

So, it may actually be desired behavior that changing interval reconciles an upgrade even though it clearly does not need to; although I can't say for sure that things will change one way or another, look for flux diff and other advancements in 0.26 to improve the overall visibility in and around the areas that this question targets. 👍

simingweng commented 2 years ago

@kingdonb Thanks for looking into it, but I'm not sure I'm 100% following you when you said it would actually be "desired behavior". I completely understood and agreed that, if there is indeed a change to the interval field, it should trigger a reconciliation because the "spec" has changed. However, in my case, I never made any change to my "HelmRelease" YAML file, the "interval" field in the file is always "5m".

The problem here is that when I kubectl apply such a YAML for the first time, the persisted value of the "interval" field of this "HelmRelease" resource in the API server has been mutated to "5m0s" due to the way "time.Duration" Go structure is marshaled. Now the server side has an "interval" of string "05m0s", and my YAML still has "interval" of string "5m".

Now if I apply my YAML one more time, I would assume nothing would have happened, since nothing has changed in my YAML at all. However, because of the above, the server thinks I'm changing "5m0s" to "5m", which is never intended from client point of view.

This is not helm-controller's fault by any means, and actually I don't think there's anything helm-controller can do here to mitigate. (or maybe sanitize the "interval" value to carry seconds always, in a mutating webhook?)

I just want to raise my findings here so that it no longer surprises other developer when they see the unnecessary and silent helm upgrade action when they apply the same "HelmRelease" resource twice in a row.

kingdonb commented 2 years ago

@simingweng I'm not sure I understood all of the details of your report, but you might be describing one of the kustomize-controller issues with strange behavior from the new Server-Side Apply, issues that are in process to (🤞) fix for Flux 0.26.

Your reply makes total sense to me; the behavior we described of helm-controller is desired, the Kustomize controller is not. Flux should reconcile a change only once, and it should be able to tell if two documents resolve to the same cluster state so that a notification is not raised in a loop, and the cluster is not "configured" repeatedly when there is no change. (We are agreed on all that.)

I would check if you see improvements from the next release of Flux, but it might also be worth opening a separate issue in Kustomize Controller. I'm just not sure if it would be a duplicate report over there, I haven't seen one similar, but that doesn't mean it's not the same issue as others we have observed. (Thanks again for your report!)

devopstales commented 1 year ago

Any update on this?

kingdonb commented 1 year ago

@devopstales Are you reproducing this issue on a current version of Flux? I don't think it's desired behavior, but is this causing you some issue? (Can you say a little bit more? This is an 18 month old issue, I doubt anyone is following it up now, but if you'll let us know about your workflow and how this is problematic, what version of Flux you are currently seeing an issue with, we can tell if it's something for Flux devs to pursue or not...)