Open staffanselander opened 2 days ago
I'll be working on a patch which will allow us to delay traffic switching until the "old" replicaset is fully available. I'd be happy to contribute it upstream.
From my first "naive" look at the source code, I would add an additional condition after: https://github.com/argoproj/argo-rollouts/blob/7938e84db626d49d72d553ec6c1e33887fbf1ebb/rollout/service.go#L269-L278
No idea what the best naming for the configuration would be though:
.spec.rolloutWindow.requireAvailability: full
.spec.rolloutWindow.trafficSwitchWhenAvailability: full | partial
Please make some recommendations here :')
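For illustration only, the extra condition could look roughly like the sketch below. The `AvailabilityMode` type, its `full`/`partial` values, and the `canSwitchTraffic` helper are hypothetical names chosen to mirror the options above; none of them exist in Argo Rollouts, and the real check would have to read replica counts from the replicaset status.

```go
package main

import "fmt"

// AvailabilityMode is a hypothetical setting; Argo Rollouts has no such type.
type AvailabilityMode string

const (
	// AvailabilityPartial allows the switch once at least one replica is available.
	AvailabilityPartial AvailabilityMode = "partial"
	// AvailabilityFull waits until every desired replica is available.
	AvailabilityFull AvailabilityMode = "full"
)

// canSwitchTraffic reports whether traffic may be routed to a replicaset,
// given its desired and currently available replica counts.
func canSwitchTraffic(mode AvailabilityMode, desired, available int32) bool {
	if mode == AvailabilityFull {
		return desired > 0 && available >= desired
	}
	// Partial mode: a single available replica is enough.
	return available > 0
}

func main() {
	fmt.Println(canSwitchTraffic(AvailabilityPartial, 5, 1)) // true
	fmt.Println(canSwitchTraffic(AvailabilityFull, 5, 1))    // false
	fmt.Println(canSwitchTraffic(AvailabilityFull, 5, 5))    // true
}
```

With `partial`, the switch happens as soon as one replica is available; with `full`, it is held back until the available count reaches the desired count.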
Is your rollout configured with dynamicStableScale: true?
I would actually also check this code path: https://github.com/argoproj/argo-rollouts/blob/53c4f12d66620e9224d3810489b94cfe8f35b054/rollout/canary.go#L379 IsActive seems to possibly not be doing enough of a check.
Is your rollout configured with dynamicStableScale: true?
No, dynamicStableScale is using the default value of false.
I would actually also check this code path:
IsActive seems to possibly not be doing enough of a check.
Thank you for pointing that out, I'll have a look there as well.
I wonder if #3878 will solve this issue as well. I'll check it out.
Describe the bug
We had an incident last Sunday. A team rolled out a new release using the canary strategy provided by Argo Rollouts.
The canary finished successfully and eventually transitioned to stable. Afterwards, the team discovered a bug and decided to roll back the release.
As the previous deployment was within the "rollback window", traffic was switched as soon as a single replica in the "new" replicaset became available. However, the replicaset was still scaling up to match the number of replicas of the previous replicaset, so it could not handle the load and stopped responding.
The service in question has a fairly long and nondeterministic start-up time, which makes this issue more visible.
To Reproduce
1: Create a Rollout resource using the canary strategy
spec.rollbackWindow.revisions: 5
spec.revisionHistoryLimit: 5
trafficRouting: trafficRouting.traefik.weightedTraefikServiceName: xxx
2: Roll out a new change
Make a change which starts a canary deployment, and wait until it's fully promoted and the old replicaset is scaled down.
3: Roll back the change
A "rollback", or a modification which aligns with the old replicaset.
Note:
Expected behavior
I would expect the replicaset to become fully available before traffic is switched back to the "old" replicaset. Or rather, have an option which would allow this behaviour.
Version
v1.7.2
Logs
Logs are from a local environment where the issue was later reproduced.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.