Describe the bug
If a release is aborted due to a failed AnalysisRun (or a manual abort), and abortScaleDownDelaySeconds is set to 0 (to keep the canary ReplicaSet alive indefinitely), then the canary Service can end up in an inconsistent state. The Rollouts controller starts to flip the pod-template-hash on the canary Service back and forth. In our case (ingress-nginx traffic routing), you can observe the flapping with curl using the canary header: sometimes the request is served by the live release, sometimes by the canary release.
The bug also seems to negatively affect the argocd-application-controller (compute shoots up, log spam), and the Argo Dashboard UI slows to a crawl.
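For anyone who wants to confirm the flapping from the client side, here is a rough sketch of the curl loop. The URL and the header name are placeholders, not our real values; it assumes the canary Ingress uses ingress-nginx's canary-by-header annotation, where a header value of `always` forces the request onto the canary backend, and that the app returns something in the body that identifies the release.

```shell
#!/usr/bin/env bash
# Hypothetical values: substitute your own endpoint and canary header.
URL="${URL:-https://sre-demo-app.example.com/version}"
CANARY_HEADER="${CANARY_HEADER:-X-Canary: always}"

# Summarise a stream of response bodies: one line per distinct body,
# prefixed with how many times it was seen, most frequent first.
tally() {
  sort | uniq -c | sort -rn
}

# Send n requests with the canary header and print one body per line.
probe() {
  local n="$1"
  for _ in $(seq "$n"); do
    curl -s --max-time 5 -H "$CANARY_HEADER" "$URL"
    echo
  done
}

# Usage (needs network access to the ingress):
#   probe 20 | tally
```

With a healthy canary Service, every response should come from the canary release. While the bug is active, the tally shows a mix of live and canary responses.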
Worth noting we are using https://github.com/argoproj-labs/rollout-extension to embed the Rollouts UI into the Argo Dashboard due to the lack of auth/rbac in the Rollouts dashboard.
Additionally we use the ApplicationSet controller. Auto-sync is disabled.
To Reproduce
Given the following rollout:
Abort the rollout during the pause step.
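The abort and the resulting selector flapping can be watched from the cluster side with something like the sketch below. The canary Service name and namespace are assumptions (replace them with your own), and it assumes the controller manages the Service selector through the rollouts-pod-template-hash key. The abort command requires the kubectl argo rollouts plugin.

```shell
#!/usr/bin/env bash
# Assumed names: adjust to your Rollout, canary Service, and namespace.
ROLLOUT="${ROLLOUT:-sre-demo-app-unstable}"
CANARY_SVC="${CANARY_SVC:-sre-demo-app-unstable-canary}"  # hypothetical
NAMESPACE="${NAMESPACE:-default}"                          # hypothetical

# Count how many times consecutive input lines differ from each other.
count_flips() {
  awk 'NR > 1 && $0 != prev { flips++ } { prev = $0 } END { print flips + 0 }'
}

# Print the canary Service's selector hash once per second, n times.
sample_selector() {
  local n="$1"
  for _ in $(seq "$n"); do
    kubectl get service "$CANARY_SVC" -n "$NAMESPACE" \
      -o jsonpath='{.spec.selector.rollouts-pod-template-hash}{"\n"}'
    sleep 1
  done
}

# Usage (needs cluster access):
#   kubectl argo rollouts abort "$ROLLOUT" -n "$NAMESPACE"
#   sample_selector 60 | count_flips
```

A flip count of 0 after the abort means the selector is stable. Anything higher means the controller is rewriting the selector while the Rollout sits in the aborted state.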
Expected behavior
This is up for debate, but the current behavior is certainly not it. Personally, I would like to see the canary Service continue to route traffic to the canary deployment, so that the aborted Rollout can be investigated further.
Screenshots
Version
quay.io/argoproj/argo-rollouts:v1.5.1
Logs
argocd-application-controller example logs when in this state:
kubectl logs -n argocd deployment/argo-rollouts | grep rollout=sre-demo-app-unstable
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.