fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Upgrading to Flagger built on main (commit id #133fdec) causes canary rollouts when no change in deployments #1673

Open jdgeisler opened 3 months ago

jdgeisler commented 3 months ago

Describe the bug

I've been working on an MR for issue https://github.com/fluxcd/flagger/issues/1646 and ran into the following bug while testing Flagger in my personal Kubernetes cluster. I've also reached out in the Slack channel here.

When running my own Docker image built from main (commit id #133fdec), I am seeing canary rollouts triggered even though there were no changes to my canary deployment specs. As soon as Flagger was upgraded to this image, the canaries detected a new revision and began the analysis.

{"level":"info","ts":"2024-06-25T15:01:58.687Z","caller":"controller/controller.go:307","msg":"Synced fortio/fortio-server-deployment-2"}
{"level":"info","ts":"2024-06-25T15:02:08.802Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up fortio-server-deployment-2.fortio","canary":"fortio-server-deployment-2.fortio"}
{"level":"info","ts":"2024-06-25T15:02:38.791Z","caller":"controller/events.go:33","msg":"Starting canary analysis for fortio-server-deployment-2.fortio","canary":"fortio-server-deployment-2.fortio"}

To confirm, I also compared the deployment spec 1:1 and nothing had changed. This should mean the calculated hash stays the same, but for some reason the lastAppliedSpec hash in the canary was different.

No matter what, the Docker image built from the v1.37 tag results in this hash in the canary:
lastAppliedSpec: 5454669676
lastPromotedSpec: 5454669676

Then, after upgrading to the image built from main, it results in this new hash in the canary:
lastAppliedSpec: 5475b95f98
lastPromotedSpec: 5475b95f98
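
For reference, both status fields can be read straight from the Canary resource; a minimal example, assuming the fortio namespace and the canary name from the logs above:

```sh
kubectl -n fortio get canary fortio-server-deployment-2 \
  -o jsonpath='{.status.lastAppliedSpec}{"\n"}{.status.lastPromotedSpec}{"\n"}'
```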

For a sanity check, I also built a custom image from the last tag, v1.37, and confirmed that the canary analysis is not triggered after upgrading to it. I also confirmed that the hash remains the same.

To Reproduce

  1. Build a Docker image from main (commit id #133fdec)
  2. Upgrade Flagger to the new image in a Kubernetes cluster with canary deployments
  3. See that the canary analysis is triggered (for example by checking the canary status, as sketched below)
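
One way to observe step 3, assuming the fortio namespace from the logs above (exact output columns may vary by Flagger version):

```sh
# List the canaries and watch for the Progressing status once analysis starts
kubectl -n fortio get canaries
# example output:
# NAME                         STATUS        WEIGHT   LASTTRANSITIONTIME
# fortio-server-deployment-2   Progressing   0        2024-06-25T15:02:38Z
```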

Expected behavior

Upgrading Flagger should not trigger canary rollouts when nothing has changed in the canary deployments.

Additional context

aryan9600 commented 3 months ago

hey, this is probably because of https://github.com/fluxcd/flagger/pull/1638. The upgrade to the k8s 1.30 libraries means that sidecars are now used via .initContainers[].restartPolicy. Thus, when Flagger restarts, it reads this field and produces a different checksum. It's an unavoidable side effect, triggered not by changes in Flagger itself but by the upstream k8s libraries. The release notes for the next version will include a warning about this.

cc: @stefanprodan
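
For illustration only, a minimal Go sketch (not Flagger's actual hashing code) of the effect described above: once the upgraded client libraries add a field such as RestartPolicy to the container struct, the dumped representation that feeds the checksum changes even though the user's manifest did not.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Simplified container shape as seen by the pre-1.30 libraries.
type containerPre130 struct {
	Name string
}

// Simplified container shape as seen by the 1.30 libraries: the new
// RestartPolicy field is part of the struct even when it is never set.
type containerV130 struct {
	Name          string
	RestartPolicy *string
}

// checksum mimics hashing a dumped object representation, the way a
// spec checksum might be derived (hypothetical, for illustration).
func checksum(obj interface{}) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", obj)
	return fmt.Sprintf("%x", h.Sum32())
}

func main() {
	fmt.Println(checksum(containerPre130{Name: "app"})) // hash before the library bump
	fmt.Println(checksum(containerV130{Name: "app"}))   // different hash, same user input
}
```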

jdgeisler commented 3 months ago

Hey @aryan9600, thank you for the follow-up.

We have hundreds of workloads using Flagger, so upgrading would trigger all of the canaries at once, which is not ideal. Is there any way Flagger's hash calculation could avoid this, so that a dependency upgrade doesn't trigger a false rollout?

aryan9600 commented 2 months ago

I can't think of any clean way to avoid this.

stefanprodan commented 2 months ago

You could pause all canaries with .spec.suspend and enable them one by one when there is a change to the actual deployment. If you use Flux and everything is in Git, the commit that bumps the app version could also remove the suspend field.
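
A sketch of that approach, reusing the canary name from the logs above; .spec.suspend is the only field that matters here:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: fortio-server-deployment-2   # example canary from this issue
  namespace: fortio
spec:
  # Pause this canary across the Flagger upgrade; drop the field (or set it
  # to false) when the next real change to the deployment should be analyzed.
  suspend: true
  # ...rest of the canary spec unchanged
```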

lcooper40 commented 2 months ago

Hi @stefanprodan @aryan9600

Is there nothing that can be done about this issue?

I've got the exact same problem with hundreds of canaries that can't all be started at once, as they'd burst a rate limit on the external metrics provider used for the analysis.

I've tried setting suspend: true on a canary; we're using ArgoCD, which doesn't treat the suspend field as something it needs to remove when resyncing.

The only option we have at the moment is to release the canaries in batches following an upgrade, which would be a very time-consuming process to repeat regularly, as we tend to take the latest release as soon as it becomes available.

snuggie12 commented 4 weeks ago

I'm not sure if there are other obvious causes, but we've experienced this with plenty of other upgrades in the past. Maybe they also contained similar dependency updates like the 1.30 one, but I wanted to provide that data point in case it's helpful.

Is there anything that could be done on the controller side to perform something similar to https://github.com/fluxcd/flagger/issues/1673#issuecomment-2232900287?