This issue occurs when I am also using Traffic Management to send only 10% of the traffic to the Canary pods. Without the traffic management feature, the maxSurge and maxUnavailable fields work as expected.
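For context, here is a minimal sketch of the kind of Rollout being described, with both the maxSurge/maxUnavailable fields and Istio-based traffic routing set; the names, image, and VirtualService are placeholders, not taken from this issue:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                    # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2          # placeholder image
  strategy:
    canary:
      # Honored for a plain (replica-based) canary, but effectively
      # ignored once trafficRouting is configured:
      maxSurge: 1
      maxUnavailable: 1
      canaryService: canary-svc   # assumed Service names
      stableService: stable-svc
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc     # placeholder VirtualService
            routes:
            - primary
      steps:
      - setWeight: 10             # send only 10% of traffic to the canary
      - pause: {}
```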
This is actually an intentional difference when traffic routing is used with a mesh/ingress. When trafficRouting is in use, we scale up the canary stack to the same count as the stable stack, similar to the blue-green strategy.
The reason for this is that when a mesh or ingress like Istio is being used, users are now able to shift traffic much more rapidly and sporadically (e.g. going from 1% to 99% in an extreme example).
During design, it was an important requirement to be able to abort a rollout and go from XX% traffic back down to 0% instantly, without that being delayed or prevented by external factors such as ReplicaSet and pod orchestration. So the decision was made to keep the canary stack the same size as the stable stack (i.e. simply keep the replica counts in sync) for the duration of the update.
Note that there is an important distinction between mesh-based canary and non-mesh canary. In non-mesh canary, weights are achieved using replica counts, and there is no other way to achieve percentage-based weights. Because non-mesh canary is essentially a controlled rollingUpdate, the requirement there is to support maxSurge and maxUnavailable, and readiness probes play a much more important role in that strategy.
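For contrast, a hedged sketch of the non-mesh form, where the weight is approximated by replica counts and maxSurge/maxUnavailable drive the rollout:

```yaml
  strategy:
    canary:
      # No trafficRouting: the weight is achieved by the ratio of canary
      # to stable replicas, so the update behaves like a controlled
      # rollingUpdate and readiness probes gate each replacement.
      maxSurge: 1
      maxUnavailable: 0
      steps:
      - setWeight: 25             # ~25% of the replicas become canary pods
      - pause: {}
```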
@amitbitcse - is your ultimate goal that we do not scale up the canary stack until after the experiment at step 0 is successful?
One option I see is that the rollout controller could lazily/intelligently delay the scale-up of the canary stack until it reaches the first non-zero traffic weight. In your example, it would essentially scale up the canary stack from 0 to 4 replicas only after step 0 completed (i.e. it reached the end of the steps).
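To make that concrete, a hedged illustration with a steps list like the one being discussed (step 0 is an experiment; the template name and weights are placeholders):

```yaml
      steps:
      # step 0: run the experiment before any traffic is shifted
      - experiment:
          templates:
          - name: canary-experiment   # placeholder template name
            specRef: canary
      # Under the proposed behavior, the canary ReplicaSet would remain at
      # 0 replicas until this first non-zero weight is reached, and only
      # then scale 0 -> 4 to match the stable stack.
      - setWeight: 10
      - pause: {}
```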
@jessesuen - The Canary stack scale-up happens only after the experiment is successful, which is as expected.
The limitation with the current implementation is the additional hardware resources required to accommodate all the Canary pods. If the Canary pods go to a Pending state due to hardware limitations, the baseline pods would never be terminated.
I have an idea, explained below, for moving from a Blue-Green-style scale-up to a Rolling-Update-style scale-up of the Canary pods.
Let's assume the prerequisites below:
Current Behavior:
Proposed Changes:
An app that does cache warming or other resource-intensive tasks at startup might cause a thundering herd against caches/databases if a bunch of pods start at once. Even when going from 0% to 100% of traffic (e.g. using promote-full), it would be nice to allow pods to start incrementally instead of all at once. It's not exactly the same as maxSurge/maxUnavailable, but it seems related.
@amitpd is this still an issue for you now that https://github.com/argoproj/argo-rollouts/issues/1029 has been merged?
@kostis-codefresh Nope. The #1029 fix works for the Canary behaviour I was looking for.
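For anyone landing here later: assuming this refers to the per-step canary scale control that exists in current Argo Rollouts (the setCanaryScale step), a minimal sketch of keeping the canary stack small instead of scaling it to full size up front might look like:

```yaml
      steps:
      - setCanaryScale:
          replicas: 1               # start with a single canary pod
      - setWeight: 10
      - pause: {duration: 10m}
      - setCanaryScale:
          matchTrafficWeight: true  # let the canary scale follow the weight
      - setWeight: 50
      - pause: {}
```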
Has there been any resolution to this? As it stands today, there is no way to use Argo Rollouts with Canary and Traffic Management on a resource-constrained system.
I need a behavior where, during each step, the stable-set pods are reduced and removed (that is, maxUnavailable is used) before the canary creates new pods (maxSurge is respected). Otherwise I will be stuck in a deadlock where the new canary pods can never be created (no node has capacity for them) and the stable set never scales down.
What have others done in cases like this?
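Not an authoritative answer, but one knob in current Argo Rollouts that targets the total-footprint problem is dynamicStableScale, which scales the stable ReplicaSet down as traffic shifts to the canary (it requires trafficRouting, and the cluster still needs headroom for the overlap at each step). A minimal sketch, with placeholder names:

```yaml
  strategy:
    canary:
      dynamicStableScale: true      # scale stable down as canary weight rises
      canaryService: canary-svc     # assumed Service names
      stableService: stable-svc
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc       # placeholder
      steps:
      - setWeight: 20
      - pause: {}
      - setWeight: 50
      - pause: {}
```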
Pre-requisites: Deploy the Rollout below
Steps to reproduce:
Actual Behavior:
Expected/Desired Behavior: