Bennett-Lynch opened this issue 4 days ago
Describe the bug
Hello,
I recently enabled Istio-based traffic routing and the `dynamicStableScale` feature on my application. The feature works as described, except that it causes a brief period of failed requests at the end of the rollout. My spec is roughly as follows:
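(Simplified sketch; the names, replica count, and pause durations below are placeholders rather than my exact values.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                    # placeholder
spec:
  replicas: 4                     # placeholder; the steps below move in 25% increments
  strategy:
    canary:
      dynamicStableScale: true
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc     # placeholder
      steps:
        - setWeight: 25
        - pause: {duration: 5m}   # placeholder duration
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 75
        - pause: {duration: 5m}
        - setWeight: 100
        - pause: {duration: 5m}   # the final pause step, needed to evaluate at 100%
```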
What I'm repeatedly observing is that as soon as the canary weight is increased to 100%, the last stable replica is scaled down and clients of the service encounter "no healthy upstream" Istio errors for approximately 30 seconds. I believe the errors occur at a rate proportional to the most recent weight increase, or about 25% in the example above. I suspect the 30 seconds may be partly due to Istio propagation delay, but propagation usually completes within a few seconds.
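To illustrate what I think is happening (host and subset names below are placeholders): at the final step the VirtualService is updated to weights like these, but any proxy still holding the previous 75/25 split keeps sending a share of traffic to the stable subset, whose pods have already been scaled away.

```yaml
# Illustrative VirtualService route after the final weight increase
# (placeholder names). Proxies that have not yet received this update
# continue routing ~25% of traffic to the now-empty stable subset.
http:
  - route:
      - destination:
          host: my-service
          subset: stable
        weight: 0
      - destination:
          host: my-service
          subset: canary
        weight: 100
```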
Is this intended behavior? Is there anything I can configure to give the stable ReplicaSet sufficient time to drain before it is fully scaled down?
More specifically, I want the behavior of `dynamicStableScale` (so that I am not running 2x my baseline number of replicas), but I want to ensure that the stable ReplicaSet is left with at least 1 healthy/ready replica for a draining period after its traffic weight is set to 0%.

Configuration options I've looked at that don't seem to quite fit (see the sketch after this list):

- `minReadySeconds` (defaults to 0, so this may help, but what if the canary reaches full replica size in a previous step due to rounding?)
- `maxUnavailable` (defaults to 1, so does not appear to be used after going to 100%)
- `scaleDownDelaySeconds` (default is 30s, but the errors appear immediately after going to 100%)
- `minPodsPerReplicaSet` (defaults to 1, so does not appear to be used after going to 100%)
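For reference, here is where each of those options lives in the spec (the values shown are just the defaults noted above):

```yaml
spec:
  minReadySeconds: 0              # Rollout-level, as on a Deployment
  strategy:
    canary:
      dynamicStableScale: true
      maxUnavailable: 1
      scaleDownDelaySeconds: 30
      minPodsPerReplicaSet: 1
```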
Version
1.7.0
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

Comment:
Thanks @gghcode. It does seem related. Have you tried testing without a pause step after 100% weight, as @mybliss mentioned? Not that that's a viable workaround, but it may provide some more clues as to what is causing this to happen.

Reply:
I was able to prevent the errors by removing the final pause step. I don't consider this a viable workaround, since the final pause step is needed to evaluate the deployment at 100% weight. However, it may be helpful information for diagnosing the underlying issue.