argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0
2.78k stars 872 forks source link

traffic is switched before replicaset is fully available when using `rollbackWindow` #3941

Open staffanselander opened 2 days ago

staffanselander commented 2 days ago

Describe the bug

We had an incident last Sunday. A team rolled out a new release using the canary strategy provided by Argo Rollouts.

The canary finished successfully and eventually transitioned to stable. Afterwards, the team discovered a bug and decided to roll back the release.

As the previous deployment was within the "rollback window", traffic was switched as soon as a single replica in the "new" replicaset became available. However, the replicaset were still scaling up to match the number of replicas of the previous replicaset, thus not being able to handle the load and stopped responding.

The service in question has a fairly high and undetermenistic start-up time which makes this issue more visible.

To Reproduce

1: Create a Rollout resource using the canary strategy

2: Rollout a new change

A change which starts a canary deployment and wait until it's fully promoted and the old replicaset is scaled down.

3: Rollback the change

A "rollback" or modifications which aligns with the old replicaset.

Note:

Expected behavior

I would expect the replicaset to become fully available before traffic is switched back to the "old" replicaset. Or rather, have an option which would allow this behaviour.

Version

Logs

Logs are from a local environment where the issue was later on reproduced.

time="2024-11-13T12:00:33Z" level=info msg="Rollback within the window: 0 (5)" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Rollout completed update to revision 16 (57bd56cdd8): Rollback within window" event_reason=RolloutCompleted namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"podinfo\", Name:\"backend-podinfo\", UID:\"08eb9dd2-8504-4540-bb84-2177f0a22206\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"71866\", FieldPath:\"\"}): type: 'Normal' reason: 'RolloutCompleted' Rollout completed update to revision 16 (57bd56cdd8): Rollback within window"
time="2024-11-13T12:00:33Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":6,\"conditions\":[{\"lastTransitionTime\":\"2024-11-13T07:17:01Z\",\"lastUpdateTime\":\"2024-11-13T07:17:01Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-11-13T11:36:53Z\",\"lastUpdateTime\":\"2024-11-13T11:36:53Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-11-13T11:54:18Z\",\"lastUpdateTime\":\"2024-11-13T11:54:18Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-11-13T11:57:31Z\",\"lastUpdateTime\":\"2024-11-13T12:00:33Z\",\"message\":\"ReplicaSet \\\"backend-podinfo-57bd56cdd8\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2024-11-13T12:00:33Z\",\"lastUpdateTime\":\"2024-11-13T12:00:33Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"}],\"message\":null,\"phase\":\"Healthy\",\"readyReplicas\":6,\"stableRS\":\"57bd56cdd8\"}}" generation=22 namespace=podinfo resourceVersion=71866 rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="persisted to informer" generation=22 namespace=podinfo resourceVersion=71978 rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Reconciliation completed" generation=22 namespace=podinfo resourceVersion=71866 rollout=backend-podinfo time_ms=31.628003000000003
time="2024-11-13T12:00:33Z" level=info msg="Start processing" resource=podinfo/backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Processing completed" resource=podinfo/backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Started syncing rollout" generation=22 namespace=podinfo resourceVersion=71978 rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Cleaning up 6 old replicasets from revision history limit 5" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Looking to cleanup old replica sets" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Trying to cleanup replica set \"backend-podinfo-7b6f4759c9\"" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Enqueueing parent of podinfo/backend-podinfo-7b6f4759c9: Rollout podinfo/backend-podinfo"
time="2024-11-13T12:00:33Z" level=info msg="Switched selector for service 'backend-podinfo-stable' from '5bffc65f84' to '57bd56cdd8'" event_reason=SwitchService namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Reconciling TrafficRouting with type 'Traefik'" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="syncing service" namespace=podinfo rollout=backend-podinfo service=backend-podinfo-stable
time="2024-11-13T12:00:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"podinfo\", Name:\"backend-podinfo\", UID:\"08eb9dd2-8504-4540-bb84-2177f0a22206\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"71978\", FieldPath:\"\"}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'backend-podinfo-stable' from '5bffc65f84' to '57bd56cdd8'"
time="2024-11-13T12:00:33Z" level=info msg="Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:100,ServiceName:backend-podinfo-canary,PodTemplateHash:57bd56cdd8,},Stable:WeightDestination{Weight:0,ServiceName:backend-podinfo-stable,PodTemplateHash:5bffc65f84,},Additional:[]WeightDestination{},Verified:nil,}" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:backend-podinfo-canary,PodTemplateHash:57bd56cdd8,},Stable:WeightDestination{Weight:100,ServiceName:backend-podinfo-stable,PodTemplateHash:57bd56cdd8,},Additional:[]WeightDestination{},Verified:nil,}" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Traffic weight updated from 100 to 0" event_reason=TrafficWeightUpdated namespace=podinfo rollout=backend-podinfo

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

staffanselander commented 2 days ago

I'll be working on a patch which will allow us to delay traffic switching until the "old" replicaset is fully available. I'd be happy to contribute it upstream.

From my first "naive" look at the source code, I would add an additional condition after: https://github.com/argoproj/argo-rollouts/blob/7938e84db626d49d72d553ec6c1e33887fbf1ebb/rollout/service.go#L269-L278

No idea what the best naming for the configuration would be though:

.spec.rolloutWindow.requireAvailability: full
.spec.rolloutWindow.trafficSwitchWhenAvailability: full | partial

Please make some recommendations here :')

zachaller commented 1 day ago

Is your rollout configured with dynamicStableScale: true?

zachaller commented 1 day ago

I would actually also check this code path: https://github.com/argoproj/argo-rollouts/blob/53c4f12d66620e9224d3810489b94cfe8f35b054/rollout/canary.go#L379 IsActive seems to possibly not be doing enough of a check.

staffanselander commented 1 day ago

Is your rollout configured with dynamicStableScale: true?

No, dynamicStableScale is using the default value of false.

I would actually also check this code path:

https://github.com/argoproj/argo-rollouts/blob/53c4f12d66620e9224d3810489b94cfe8f35b054/rollout/canary.go#L379

IsActive seems to possibly not be doing enough of a check.

Thank you for pointing that out, I'll have a look there as-well.

staffanselander commented 17 hours ago

I wonder if #3878 will solve this issue as-well. I'll check it out.