argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Argo doesn't support maxSurge: 0% correctly #2239

Open MarkSRobinson opened 2 years ago

MarkSRobinson commented 2 years ago

Describe the bug

When using a Rollout with maxSurge: 0%, maxUnavailable: 100%, and dynamicStableScale: true, Argo runs old and new pods at the same time. We've had internal bug reports from services that must run exactly one pod at a time briefly running two pods.

To Reproduce

Using this snippet of a rollout object:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  replicas: 20
  revisionHistoryLimit: 2
  strategy:
    canary:
      analysis:
        args:
          - name: service
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['app.kubernetes.io/name']
          - name: commit
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['k8s.plaid.io/commit-sha']
          - name: stable-hash
            valueFrom:
              podTemplateHashValue: Stable
          - name: latest-hash
            valueFrom:
              podTemplateHashValue: Latest
        templates:
          - templateName: echoserver-worker-analysis
      canaryMetadata:
        labels:
          k8s.plaid.io/deployment-role: canary
      canaryService: echoserver-worker-canary
      dynamicStableScale: true
      maxSurge: 0%
      maxUnavailable: 100%
      stableMetadata:
        labels:
          k8s.plaid.io/deployment-role: stable
      stableService: echoserver-worker-stable
      steps:
        - setWeight: 100
        - pause:
            duration: 3m0s
      trafficRouting:
        smi:
          rootService: echoserver-worker
          trafficSplitName: echoserver-worker-traffic-split

When using this Rollout object, Argo will run two different versions of the pods for a time. Changing the image tag will briefly run 40 pods until the old ones are cleaned up.

Expected behavior

Argo should completely spin down the pods before spinning up new pods. Effectively this should mimic the Recreate strategy.

Screenshots

You can see here that we briefly run 40 pods.


Version

v1.2.2

Logs

Gist: https://gist.github.com/MarkSRobinson/eb3cdf3fa2e850df00fabc42cc0a0140


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

Vignesh-Au commented 2 years ago

@MarkSRobinson I noticed a similar behavior and asked it as a question here - https://github.com/argoproj/argo-rollouts/discussions/2237

zachaller commented 2 years ago

So with a traffic-routed canary, the maxSurge configuration is not used at all, as seen in https://github.com/argoproj/argo-rollouts/blob/master/utils/replicaset/canary.go#L318-L358

I do know there have been some bugs related to dynamicStableScale that might correlate with this, but I will need to dig into it a bit more.

I do think it makes sense to support surge options when dynamicStableScale is enabled, but today that is not supported.

jessesuen commented 2 years ago

We need to do a better job of documenting this, but maxSurge and maxUnavailable are not intended to be honored when using traffic routing (even with dynamicStableScale: true), because any dramatic shift in traffic weights could cause a Rollout to overwhelm a ReplicaSet running too few pods.

Here is an example scenario of why it cannot be honored:

spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 100
      trafficRouting:
        # any traffic router
When we reach the setWeight: 100 step, we are going from 0% traffic to 100% traffic (with nothing in between). Before we can shift 100% of traffic to the canary, the canary ReplicaSet must be fully scaled up to 100% (10 pods). With any value less than 10, we would be sending too much traffic to too few pods. So at the moment we shift from 0 -> 100 to the canary, we must run 20 pods. This implies that any value of maxSurge < 100% could not be accommodated.

The above scenario is the extreme example, but the same problem will exist with smaller shifts in traffic weights.
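The arithmetic above can be sketched as a small calculation (a hedged illustration only; the function name is invented and this is not Argo's actual code):

```python
import math

def pods_at_shift(replicas, new_weight_pct):
    """Total pods running at the moment traffic shifts to the canary.

    Before the weight shift, the canary ReplicaSet must already be scaled
    to carry its new weight, while the stable ReplicaSet is still at full
    size (with dynamicStableScale it only scales down *after* the shift).
    """
    canary = math.ceil(replicas * new_weight_pct / 100)
    stable = replicas  # stable still serves its old weight until the shift
    return canary + stable

# replicas: 10 with setWeight: 100 -> 10 canary + 10 stable = 20 pods,
# an effective surge of 100% regardless of any maxSurge setting.
```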

jessesuen commented 2 years ago

I should note that maxSurge and maxUnavailable are honored properly when trafficRouting is not used, because the Rollout then behaves like a Deployment.
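Based on that comment, workloads that must never exceed their replica count could drop the trafficRouting block so maxSurge/maxUnavailable are honored. A hedged sketch (the metadata name is illustrative, not from the original report):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: echoserver-worker   # illustrative name
spec:
  replicas: 20
  strategy:
    canary:
      # No trafficRouting block: basic canary mode, where maxSurge and
      # maxUnavailable behave as they do for a Deployment. With these
      # values the rollout approximates the Recreate strategy.
      maxSurge: 0%
      maxUnavailable: 100%
      steps:
        - setWeight: 100
        - pause:
            duration: 3m0s
```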

MarkSRobinson commented 2 years ago

It might be worth covering the use case of services that don't have ingresses; I'm thinking of a service that pulls from Kinesis. This is common enough that it'd be worth highlighting in the documentation.

rhyswilliamsza commented 2 years ago

It would be great to have an override traffic rollout strategy where we dynamically scale down stable and scale up canary in an orchestrated process, thereby capping the total running replicas (e.g. when one has a stateful application that expects n replicas).

This obviously runs the risk of overloading stable whilst waiting for replicas to come up, but may be useful in specific situations! :)

I.e.:

- Steady state: 4 Stable (100% traffic), 0 Canary (0% traffic)
- Reconciliation 1: 3 Stable (100% traffic), 0 Canary (0% traffic)
- Reconciliation 2: 3 Stable (75% traffic), 1 Canary (25% traffic)
- etc.
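The proposed reconciliation sequence can be simulated with a short sketch (purely illustrative; the function name and step logic are invented here, not part of Argo):

```python
def capped_reconcile(replicas, step=1):
    """Simulate the proposed capped orchestration: each reconciliation
    either frees capacity (scale stable down while it briefly absorbs the
    full load) or fills it (scale canary up and shift weight proportionally).
    Total pods never exceed the steady-state replica count."""
    states = [(replicas, 0, 0)]  # (stable, canary, canary_weight_pct)
    stable, canary = replicas, 0
    while canary < replicas:
        stable -= step  # free a slot first; stable is momentarily overloaded
        states.append((stable, canary, round(100 * canary / replicas)))
        canary += step  # scale canary into the freed slot, then shift weight
        states.append((stable, canary, round(100 * canary / replicas)))
    return states
```

For `replicas=4` this reproduces the sequence above: (4, 0, 0%) at steady state, then (3, 0, 0%), then (3, 1, 25%), and so on until (0, 4, 100%), with stable + canary never exceeding 4.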

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.