argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

intermediate replicaset scaledown during ALB traffic shaping causes 503 responses #3617

Open schlags opened 3 months ago

schlags commented 3 months ago

Checklist:

Describe the bug

We are using the ArgoCD Image Updater to set image parameters for our application.

We are using the ALB ingress class, with the pingPong strategy

We are running Rollouts v1.6.6

We have readiness-gate-injection enabled on the pods in our namespace where this takes place

We have observed, very frequently, that if a new image is set for the rollout while it is still progressing, clients whose traffic was shaped to the canary at that moment receive 503 responses.

HTTP 503: The target groups for the load balancer have no registered targets. https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-503-issues

It typically plays out as follows.

Assume image A is stable, image B is currently canarying, and image C is set while image B is still progressing.

  1. image A is stable and receiving 100% of traffic
  2. image B is spun up
  3. 25% of traffic is configured to go to image B
    • (Note: we have a pause step of 15 seconds here before running Analysis because we have observed it takes about this long for traffic to finally show up).
  4. image C gets discovered by the image updater and set for the rollout
  5. The log "scaling down intermediate RS" is written
    • (There is a comment here, // It is safe to scale the intermediate RS down, since no traffic is directed to it., but the following steps call that into question.)
  6. 25% of responses from the ALB return 503 (i.e., every response from that target group).
  7. After some time (less than a minute), traffic returns to a normal state of no 503 errors.

I am wondering if we are missing something in our traffic shaping configuration that would address this downtime when a rollout is preempted.
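For context, a minimal sketch of the Rollout strategy shape we are running (resource names, replica count, image, and the analysis template name are illustrative rather than our exact manifest):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-myapp-production
  namespace: myapp
spec:
  replicas: 30
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:imageA   # the image updater bumps this tag
  strategy:
    canary:
      pingPong:
        pingService: myapp-ping-production
        pongService: myapp-pong-production
      trafficRouting:
        alb:
          ingress: myapp-ingress
          rootService: myapp-root-production   # illustrative root service name
          servicePort: 80
      steps:
        - setWeight: 25
        - pause: {duration: 15s}   # observed delay before traffic actually reaches the canary
        - analysis:
            templates:
              - templateName: myapp-success-rate   # illustrative analysis template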

To Reproduce

  1. Set up a ping-pong canary flow that shifts a certain percentage of traffic to the canary
  2. Set an image
  3. Once it reaches the step where some percentage of traffic is directed to the new pods, set a newer image
  4. Observe that traffic previously being served by the canary is now responding with 503
  5. Observe that this lasts less than a minute before the traffic error rate returns to normal.

Expected behavior

All ingress rules should be confirmed to be reset to the state at the beginning of the canary before a new replicaset gets set as the service selector. (Otherwise the ingress sends that percentage of traffic to a service/target group with no assigned IPs, hence the 503 responses.)
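For illustration, this is roughly the shape of the ALB actions annotation that Argo Rollouts manages on the ingress while the canary holds 25% (service and action names are illustrative). If the pong service's selector is swapped to a replicaset with no ready pods before this weight is reset to 0, the 25% slice keeps forwarding to a target group with no registered targets:

metadata:
  annotations:
    # Weight: 25 points at the pong service; this is the target group that empties out
    # when the pong selector moves to the not-yet-scaled replicaset.
    alb.ingress.kubernetes.io/actions.myapp-root-production: |
      {
        "Type": "forward",
        "ForwardConfig": {
          "TargetGroups": [
            {"ServiceName": "myapp-ping-production", "ServicePort": "80", "Weight": 75},
            {"ServiceName": "myapp-pong-production", "ServicePort": "80", "Weight": 25}
          ]
        }
      }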

Towards the end of the log output below, you can see the service selector being swapped to the new replicaset before the desiredWeight is restored to 0. I believe this is a bug for that reason, but I'm open to ideas on how I'm approaching this incorrectly if this is the intended behavior.

time="2024-06-06T20:19:43Z" level=info msg="Switched selector for service 'myapp-pong-production' from '868f5bb4ff' to '85466d7d56'" event_reason=SwitchService namespace=myapp rollout=rollout-myapp-production

This comes first

time="2024-06-06T20:19:43Z" level=info msg="Updating Ingress `myapp-ingress` to desiredWeight '0'" event_reason=PatchingALBIngress namespace=weaver rollout=rollout-myapp-production

This comes second. EDIT: I see now that the timestamp on this log line is identical to the previous one (even though it was notably further down in my log view). Regardless of these potentially happening at the same time, it seems like the logic here still needs to wait for the ingress controller to fully reconcile the change with AWS before proceeding.

Shouldn't the ingress be set to desired weight 0 before the service selector moves to a replicaset that hasn't even scaled up yet?

Screenshots

[Screenshot: 504 error responses (totaling thousands) and log events for scaling down the intermediate replicaset.]

Version

v1.6.6

Logs

There are 2,494 log lines available between the first rollout and the preemption event. I'm pasting the ones I figure are most important below, but I could be excluding something relevant. Let me know if you need more information (I have access to these logs for a retention period of 14 days from when this incident was opened).


time="2024-06-06T20:18:11Z" level=info msg="Created AnalysisRun for step '8'" analysisrun=rollout-myapp-production-868f5bb4ff-8382-8 namespace=myapp rollout=rollout-myapp-production
...
time="2024-06-06T20:18:11Z" level=info msg="Step Analysis Run 'rollout-myapp-production-868f5bb4ff-8382-8' Status New: 'Running' Previous: ''" event_reason=AnalysisRunRunning namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Started syncing rollout" generation=19894 namespace=myapp resourceVersion=1685226121 rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Pod template change detected (new: 85466d7d56, old: 868f5bb4ff)" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Assuming 85466d7d56 for new replicaset pod hash" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Pod template change detected (new: 85466d7d56, old: 868f5bb4ff)" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Patched: {\"status\":{\"canary\":{\"currentBackgroundAnalysisRunStatus\":null,\"currentStepAnalysisRunStatus\":null},\"conditions\":[{\"lastTransitionTime\":\"2024-06-06T19:43:42Z\",\"lastUpdateTime\":\"2024-06-06T19:43:42Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},...}}" generation=19894 namespace=myapp resourceVersion=1685226121 rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Updating replica set 'rollout-myapp-production-85466d7d56' revision from 0 to 8383" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Created ReplicaSet rollout-myapp-production-85466d7d56" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:31Z" level=info msg="Enqueueing parent of myapp/rollout-myapp-production-85466d7d56: Rollout myapp/rollout-myapp-production"
time="2024-06-06T20:19:31Z" level=info msg="Enqueueing parent of myapp/rollout-myapp-production-85466d7d56: Rollout myapp/rollout-myapp-production"
time="2024-06-06T20:19:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"myapp\", Name:\"rollout-myapp-production\", UID:\"496260c1-8b29-4a7f-aee6-3bfefa9b7f26\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"1685226160\", FieldPath:\"\"}): type: 'Normal' reason: 'RolloutUpdated' Rollout updated to revision 8383"
time="2024-06-06T20:19:32Z" level=info msg="Rollout updated to revision 8383" event_reason=RolloutUpdated namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="Created ReplicaSet rollout-myapp-production-85466d7d56 (revision 8383)" event_reason=NewReplicaSetCreated namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=error msg="roCtx.reconcile err failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"rollout-myapp-production\": the object has been modified; please apply your changes to the latest version and try again" generation=19894 namespace=myapp resourceVersion=1685226143 rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="Reconciliation completed" generation=19894 namespace=myapp resourceVersion=1685226143 rollout=rollout-myapp-production time_ms=528.5647319999999
time="2024-06-06T20:19:32Z" level=error msg="rollout syncHandler error: failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"rollout-myapp-production\": the object has been modified; please apply your changes to the latest version and try again" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="rollout syncHandler queue retries: 6 : key \"myapp/rollout-myapp-production\"" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="Started syncing rollout" generation=19895 namespace=myapp resourceVersion=1685226200 rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="Synced revision on ReplicaSet 'rollout-myapp-production-85466d7d56' to '8383'" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="Rollout updated to revision 8383" event_reason=RolloutUpdated namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:32Z" level=info msg="Enqueueing parent of myapp/rollout-myapp-production-868f5bb4ff: Rollout myapp/rollout-myapp-production" {NOTE: A good 30 of identical ones of these in logs}
time="2024-06-06T20:19:33Z" level=error msg="roCtx.reconcile err error updating replicaset in syncEphemeralMetadata: Operation cannot be fulfilled on replicasets.apps \"rollout-myapp-production-868f5bb4ff\": the object has been modified; please apply your changes to the latest version and try again" generation=19895 namespace=myapp resourceVersion=1685226200 rollout=rollout-myapp-production
time="2024-06-06T20:19:33Z" level=info msg="Reconciliation completed" generation=19895 namespace=myapp resourceVersion=1685226200 rollout=rollout-myapp-production time_ms=747.9723889999999
...
time="2024-06-06T20:19:43Z" level=info msg="Cleaning up 5 old replicasets from revision history limit 4" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Looking to cleanup old replica sets" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Trying to cleanup replica set \"rollout-myapp-production-688f5f7678\"" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="syncing service" namespace=myapp rollout=rollout-myapp-production service=myapp-pong-production
time="2024-06-06T20:19:43Z" level=info msg="Switched selector for service 'myapp-pong-production' from '868f5bb4ff' to '85466d7d56'" event_reason=SwitchService namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Reconciling TrafficRouting with type 'ALB'" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="updating ALB Ingress" desiredWeight=0 ingress= namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Updating Ingress `myapp-ingress` to desiredWeight '0'" event_reason=PatchingALBIngress namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="updating ALB Ingress" desiredWeight=0 ingress= namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:25,ServiceName:,PodTemplateHash:868f5bb4ff,},Stable:WeightDestination{Weight:75,ServiceName:,PodTemplateHash:58565d575,},Additional:[]WeightDestination{},Verified:nil,}" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:,PodTemplateHash:85466d7d56,},Stable:WeightDestination{Weight:100,ServiceName:,PodTemplateHash:58565d575,},Additional:[]WeightDestination{},Verified:nil,}" namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Traffic weight updated from 25 to 0" event_reason=TrafficWeightUpdated namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:43Z" level=info msg="Canceling the analysis run 'rollout-myapp-production-868f5bb4ff-8382-8'" analysisrun=rollout-myapp-production-868f5bb4ff-8382-8 namespace=myapp rollout=rollout-myapp-production
time="2024-06-06T20:19:44Z" level=info msg="Scaled up ReplicaSet rollout-myapp-production-85466d7d56 (revision 8383) from 0 to 30" event_reason=ScalingReplicaSet namespace=myapp rollout=rollout-myapp-production
...
time="2024-06-06T20:19:44Z" level=info msg="Scaled down ReplicaSet rollout-myapp-production-868f5bb4ff (revision 8382) from 30 to 0" event_reason=ScalingReplicaSet namespace=myapp rollout=rollout-myapp-production

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

schlags commented 2 months ago

Unless I'm mistaken, this fix sounds like it's addressing a similar type of case where the rollout controller setting traffic to 100% isn't allowing enough time for it to verify the weight behind the scenes with the ALB.

If that was happening with the default behavior at the end of a rollout, I feel like it tracks that this issue, where the default behavior of a preempted rollout shows the same gap, would need a similar fix.
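For reference, a sketch of the relevant fragment of the controller Deployment with AWS TargetGroup verification turned on via the --aws-verify-target-group flag (deployment, namespace, and container names assume the default install manifests, and the controller needs AWS permissions to describe the load balancer and target groups). Whether that verification also covers the preemption path described in this issue is exactly the open question:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts
  namespace: argo-rollouts
spec:
  template:
    spec:
      containers:
        - name: argo-rollouts
          args:
            - --aws-verify-target-group   # controller confirms target group weights/registration with AWS before proceeding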