argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Using canary w/ dynamic stable scale causes failed requests at end of rollout #3681

Open Bennett-Lynch opened 4 days ago

Bennett-Lynch commented 4 days ago

Describe the bug

Hello,

I recently enabled Istio-based traffic routing and the dynamicStableScale feature on my application. The feature works as described, except that it causes a brief period of failed requests at the end of the rollout.

My spec is roughly as follows:

spec:
  strategy:
    canary:
      trafficRouting: { istio: { ... } }
      # Scale down stable replica set as the traffic weight increases to canary
      dynamicStableScale: true
      # Run background analysis throughout rollout
      analysis: ...
      steps:
        # Create 1 canary replica for testing purposes
        - setCanaryScale: { replicas: 1 }
        # Run in-line analysis against canary replica
        - analysis: ...
        # Returns to the default behavior of matching the canary traffic weight
        - setCanaryScale: { matchTrafficWeight: true }
        # Scale up the canary
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 75
        - pause: { duration: 5m }
        - setWeight: 100
        # Pause post-deployment to allow background analysis to evaluate 100% weight
        - pause: { duration: 5m }

What I'm repeatedly observing is that as soon as the canary weight is increased to 100%, the last stable replica is scaled down and clients of the service encounter "no healthy upstream" Istio errors for approximately 30 seconds. The errors appear to occur at a rate proportional to the most recent weight increase, i.e. about 25% of requests in the example above (the final jump from 75% to 100%). I suspect the 30-second duration is partly due to Istio propagation delay, but propagation is usually complete within a few seconds, so that alone doesn't explain it.

Is this intended behavior? Is there anything I can configure to allow my stable endpoint sufficient time to drain before fully scaling it down?

More specifically, I want the behavior of dynamicStableScale (so that I am not running 2x my baseline number of replicas), but I want to ensure that the stable replica set is left with at least 1 healthy/ready replica for a draining period after its traffic weight is set to 0%.

Configuration options I've looked at but which don't seem to quite fit (see the sketch after this list for where each would sit in the spec):

  1. minReadySeconds (defaults to 0, so this may help, but what if the canary reaches its full replica count in an earlier step due to rounding?)
  2. maxUnavailable (defaults to 1, so does not appear to be used after going to 100%)
  3. scaleDownDelaySeconds (default is 30s, but errors appear immediately after going to 100%)
  4. minPodsPerReplicaSet (defaults to 1, so does not appear to be used after going to 100%)
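
For reference, here is roughly where each of these options would sit in the Rollout spec (a sketch only; the values shown are illustrative, not my actual settings):

spec:
  # (1) minReadySeconds is set at the top level of the Rollout spec, as with a Deployment
  minReadySeconds: 30
  strategy:
    canary:
      trafficRouting: { istio: { ... } }
      dynamicStableScale: true
      # (2) maxUnavailable (default 1)
      maxUnavailable: 1
      # (3) scaleDownDelaySeconds (default 30s), yet the errors start immediately at 100%
      scaleDownDelaySeconds: 30
      # (4) minPodsPerReplicaSet (default 1)
      minPodsPerReplicaSet: 1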

Version

1.7.0


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

gghcode commented 4 days ago

This looks like the same issue as yours.

Bennett-Lynch commented 3 days ago

Thanks @gghcode. It does seem related. Have you tried testing without a pause step after the 100% weight, as @mybliss mentioned? Not that that would be a viable workaround, but it may provide more clues as to what is causing this.

Bennett-Lynch commented 3 days ago

I was able to prevent the errors by removing the final pause step. I don't consider this a viable workaround, since the final pause step is needed to evaluate the deployment at 100% weight. However, it may be helpful information for diagnosing the underlying issue.
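
For clarity, the steps that did not produce errors end like this (identical to the spec above, just without the trailing pause):

      steps:
        # ... earlier steps unchanged ...
        - setWeight: 75
        - pause: { duration: 5m }
        - setWeight: 100
        # (final pause removed, so background analysis no longer evaluates the 100% weight)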