MarkSRobinson opened this issue 2 years ago
@MarkSRobinson I noticed a similar behavior and asked it as a question here - https://github.com/argoproj/argo-rollouts/discussions/2237
So with a traffic-routed canary, the maxSurge configuration is not used at all, as seen in https://github.com/argoproj/argo-rollouts/blob/master/utils/replicaset/canary.go#L318-L358
I do know there have been some bugs related to dynamicStableScale that might correlate with this, but I will need to dig into it a bit more.
I do think it makes sense to support surge options when dynamicStableScale is enabled, but today that is not supported.
We need to do a better job of documenting this, but maxSurge and maxUnavailable are not intended to be honored when using traffic routing (even with dynamicStableScale: true), because any dramatic shift in traffic weights could cause a Rollout to overwhelm a ReplicaSet running too few pods.
Here is an example scenario of why it cannot be honored:
```yaml
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 100
      trafficRouting:
        # any traffic router
```
When we reach the setWeight: 100 step, we go from 0% traffic to 100% traffic (with nothing in between). Before we can shift 100% of the traffic to the canary, the canary ReplicaSet must be fully scaled up to 100% (10 pods). Any value less than 10 means we would be sending too much traffic to too few pods. So at the time we shift from 0 to 100 on the canary, we must be running 20 pods. This implies any value of maxSurge < 100% could not be accommodated.
The above scenario is the extreme example, but the same problem will exist with smaller shifts in traffic weights.
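To make the arithmetic concrete, here is a small, purely illustrative Go sketch (not the actual logic in utils/replicaset/canary.go) of how many canary pods must be ready before a given weight can be shifted, and how many pods are therefore running at the moment of the first shift from 0%:

```go
package main

import "fmt"

// requiredCanary returns how many canary pods must be ready before weightPercent
// of traffic can safely be shifted to the canary, rounding up so that no canary
// pod ever receives more traffic than a stable pod would.
func requiredCanary(replicas, weightPercent int) int {
	return (replicas*weightPercent + 99) / 100
}

func main() {
	replicas := 10
	for _, weight := range []int{25, 50, 100} {
		canary := requiredCanary(replicas, weight)
		// At the moment of the first shift the stable ReplicaSet is still serving
		// 100% of traffic, so it keeps all of its pods; both sets run in parallel.
		total := replicas + canary
		fmt.Printf("setWeight: %d%% -> canary pods needed: %d, total pods at shift time: %d\n",
			weight, canary, total)
	}
}
```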
I should note that maxSurge and maxUnavailable are honored properly when trafficRouting is not used, because the Rollout then behaves like a Deployment.
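For comparison, a basic canary without trafficRouting, where maxSurge and maxUnavailable do take effect, might look like this (field values are illustrative only, not from this issue):

```yaml
spec:
  replicas: 10
  strategy:
    canary:
      maxSurge: 25%       # allow a few extra pods above the desired 10 while updating
      maxUnavailable: 0   # never dip below 10 available pods
      steps:
        - setWeight: 20
        - pause: {}
```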
It might be worth covering the use case for services that don't have ingresses. I'm thinking of a service that pulls from Kinesis. This is common enough that it'd be worth highlighting in the documentation.
It would be great to have an override traffic rollout strategy where we dynamically scale down stable and scale up canary in an orchestrated process, thereby capping our total running replicas (e.g. when one has a stateful application that expects n replicas).
This obviously runs the risk of overloading stable whilst waiting for replicas to come up, but may be useful in specific situations! :)
I.e.:
- Steady state: 4 Stable (100% traffic), 0 Canary (0% traffic)
- Reconciliation 1: 3 Stable (100% traffic), 0 Canary (0% traffic)
- Reconciliation 2: 3 Stable (75% traffic), 1 Canary (25% traffic)
- etc.
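As a thought experiment, here is a minimal Go sketch of the reconciliation sequence described above, assuming the total replica cap equals the desired replica count (purely illustrative; no such strategy option exists in Argo Rollouts today):

```go
package main

import "fmt"

// step models one reconciliation pass of the capped-replica idea sketched above:
// scale stable down one pod, bring up a canary pod in the freed slot, and only
// then shift that pod's share of traffic onto the canary.
type step struct {
	stable, canary, canaryWeight int
}

// cappedRollout produces the sequence for a fixed total-replica cap equal to
// the desired replica count.
func cappedRollout(replicas int) []step {
	steps := []step{{stable: replicas, canary: 0, canaryWeight: 0}} // steady state
	for i := 1; i <= replicas; i++ {
		// Free a slot by scaling stable down before adding the next canary pod.
		steps = append(steps, step{stable: replicas - i, canary: i - 1, canaryWeight: (i - 1) * 100 / replicas})
		// Use the freed slot for a canary pod, then move its traffic share over.
		steps = append(steps, step{stable: replicas - i, canary: i, canaryWeight: i * 100 / replicas})
	}
	return steps
}

func main() {
	for n, s := range cappedRollout(4) {
		fmt.Printf("reconciliation %d: stable=%d (%d%% traffic), canary=%d (%d%% traffic)\n",
			n, s.stable, 100-s.canaryWeight, s.canary, s.canaryWeight)
	}
}
```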
This issue is stale because it has been open 60 days with no activity.
Describe the bug
When using a Rollout with maxSurge: 0%, maxUnavailable: 100%, and dynamicStableScale: true, Argo will run old and new pods at the same time. We've had some internal bug reports of services that must run exactly one pod at a time briefly running two pods.
To Reproduce
Using this snippet of a Rollout object:
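A minimal sketch of a Rollout combining the options described above (the names, replica count, and choice of traffic router are assumptions for illustration, not the reporter's actual manifest):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout            # hypothetical name
spec:
  replicas: 20                     # assumed; matches the ~40 pods seen during the overlap
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: app
          image: example/app:v1    # changing this tag triggers the rollout
  strategy:
    canary:
      canaryService: example-canary   # hypothetical Services
      stableService: example-stable
      maxSurge: 0%
      maxUnavailable: 100%
      dynamicStableScale: true
      trafficRouting:
        istio:                        # any traffic router; Istio is just an example
          virtualService:
            name: example-vsvc
            routes:
              - primary
      steps:
        - setWeight: 100
```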
When using this Rollout object, Argo will run two different versions of the pods for a time. Changing the image tag will briefly run 40 pods until the old ones are cleaned up.
Expected behavior
Argo should completely spin down the old pods before spinning up new pods. Effectively this should mimic the Recreate strategy.
Screenshots
You can see we're running 40 pods here briefly
Version
v1.2.2
Logs
Gist: https://gist.github.com/MarkSRobinson/eb3cdf3fa2e850df00fabc42cc0a0140
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.