argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Canary using trafficRouting does not respect maxSurge/maxUnavailable #430

Closed: amitpd closed this issue 2 years ago

amitpd commented 4 years ago

Prerequisites: Deploy the Rollout below

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - image: myapp:v1
        name: myapp
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          timeoutSeconds: 5
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      canaryService: myapp-canary
      stableService: myapp
      trafficRouting:
        nginx:
          stableIngress: myapp-ingress
      steps:
      - experiment:
          duration: "2m"
          templates:
          - name: baseline
            specRef: stable
            replicas: 1
          - name: canary
            specRef: canary
          analyses:
          - name: success-rate
            templateName: success-rate
            args:
            - name: service-name
              value: myapp

Steps to reproduce:

Apply the Rollout above, then trigger a canary update by changing the pod template (e.g. update the image tag).

Actual Behavior:

The canary ReplicaSet is scaled up to the full replica count; maxSurge and maxUnavailable are not respected.

Expected/Desired Behavior:

maxSurge and maxUnavailable are respected during the canary update, as they are when trafficRouting is not used.

amitpd commented 4 years ago

This issue occurs when I am also using traffic management to send only 10% of traffic to the canary pods. Without the traffic management feature, the maxSurge and maxUnavailable fields work as expected.
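
For illustration, this is roughly what "using traffic management to send 10% of traffic to the canary" looks like in the strategy; the setWeight/pause steps below are illustrative and not copied from the manifest above:

  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      canaryService: myapp-canary
      stableService: myapp
      trafficRouting:
        nginx:
          stableIngress: myapp-ingress
      steps:
      # Route 10% of ingress traffic to the canary pods via the canary ingress.
      - setWeight: 10
      # Hold at 10% before continuing to the next step.
      - pause:
          duration: 2m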

jessesuen commented 4 years ago

This is actually an intentional difference when traffic routing is used with a mesh/ingress. When trafficRouting is in use, we scale the canary stack up to the same replica count as the stable stack, similar to the blue-green strategy.

The reason for this is that when a mesh or ingress like Istio is used, users are now able to shift traffic much more rapidly and sporadically (e.g. go from 1% to 99% in an extreme example).

During design, it was an important requirement to be able to abort a rollout and go from XX% traffic back down to 0% instantly, without being delayed or prevented by external factors such as ReplicaSet and pod orchestration. So the decision was made to keep the canary stack the same size as the stable stack (i.e. simply keep the replica counts in sync) for the duration of the update.

Note that there is an important distinction between mesh-based and non-mesh canary. In the non-mesh case, canary weights are achieved using replica counts, since there is no other way to achieve percentage-based weights. Because non-mesh canary is essentially a controlled rollingUpdate, the requirement there is to support maxSurge and maxUnavailable, and readiness probes play a much more important role in that strategy.
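
By contrast, here is a minimal sketch of the non-mesh (replica-count based) canary described above; there is no trafficRouting block, the pod template is omitted for brevity, and the values are illustrative:

  strategy:
    canary:
      # Without trafficRouting, the weight is approximated by the replica
      # ratio, so maxSurge/maxUnavailable constrain the update much like a
      # rolling update.
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
      # With replicas: 4, this scales the canary to 1 pod (1 of 4 ~= 25%).
      - setWeight: 25
      - pause: {}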

jessesuen commented 4 years ago

@amitbitcse - is your ultimate goal that we do not scale up the canary stack until after the experiment at step 0 is successful?

One option I see is for the rollout controller to lazily/intelligently delay the scale-up of the canary stack until it reaches the first step with a non-zero traffic weight. In your example, it would scale the canary stack from 0 to 4 replicas only after step 0 completed (i.e. it reached the end of the steps).
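
To illustrate that option, an annotated steps fragment; the setWeight step is hypothetical and not part of the manifest above:

      steps:
      # Step 0: the experiment creates its own baseline/canary pods, so the
      # rollout's canary ReplicaSet could remain at 0 replicas while it runs.
      - experiment:
          duration: "2m"
          templates:
          - name: canary
            specRef: canary
      # Hypothetical first step with a non-zero weight: under the proposal,
      # the canary ReplicaSet would only be scaled up (0 -> 4) at this point.
      - setWeight: 10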

amitpd commented 4 years ago

@jessesuen - The canary stack scale-up happens only after the experiment is successful, which is as expected.

The limitation of the current implementation is the additional hardware resources required to accommodate all the canary pods. If the canary pods go into a Pending state due to hardware limits, the baseline pods will never be terminated.

I have an idea, explained below, for moving from a blue-green-style scale-up to a rolling update when scaling up the canary pods.

Let's assume the following prerequisites:

  1. Using the Nginx Ingress Controller (haven't thought yet about how a service mesh would behave)
  2. Two Ingress objects created, one for baseline and another for canary
  3. Two Service objects created, one for baseline and another for canary

Current Behavior:

  1. When a Rollout object is created, an additional label selector with the key "rollouts-pod-template-hash" and the value of the baseline pod-template hash is added to both Service objects. At this point, both ingresses serve traffic to the baseline pods.
  2. When the Rollout object is updated (say the image tag is changed), the canary starts and the value of the "rollouts-pod-template-hash" key on the canary Service is updated to the canary pod-template hash (see the Service sketch after this list). Now the canary ingress serves traffic only to the canary pods.
  3. Once the canary steps are successful, the canary pods are scaled up to the same number of replicas as the baseline pods.
  4. Once all the canary pods are healthy, the value of the "rollouts-pod-template-hash" key on the baseline Service is updated to the canary pod-template hash and all baseline pods are terminated. Now both ingresses serve traffic to the canary pods.
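
To make steps 1-2 concrete, this is roughly what the canary Service looks like once the controller injects the selector; the hash value and ports are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
spec:
  selector:
    app: myapp
    # Injected by the rollout controller; pins this Service to the pods of
    # the canary ReplicaSet. The hash below is a made-up placeholder.
    rollouts-pod-template-hash: 6f64c5d9b4
  ports:
  - port: 80
    targetPort: 8080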

Proposed Changes:

  1. The Service objects should not be updated with the additional label selector when a new Rollout object is created. Both ingresses would still serve traffic to the baseline pods.
  2. The additional label selector with the key "rollouts-pod-template-hash" should be added only when the canary steps start. The baseline Service would then point to the baseline pods and the canary Service to the canary pods.
  3. Once all the canary steps have executed, remove the additional "rollouts-pod-template-hash" label selector from both Services and start scaling up the canary pods with a rolling-update strategy. With the "rollouts-pod-template-hash" selector gone, both Services would serve traffic to baseline and canary pods while the rolling update is in progress.

bpoland commented 3 years ago

An app that does cache warming or other resource-intensive tasks at startup might cause a thundering herd against caches/databases if a bunch of pods start at once. Even when going from 0% to 100% of traffic (e.g. using promote-full), it would be nice to allow pods to start incrementally instead of all at once. It's not exactly the same as maxSurge/maxUnavailable, but it seems related.

kostis-codefresh commented 2 years ago

@amitpd is this still an issue for you now that https://github.com/argoproj/argo-rollouts/issues/1029 has been merged?

amitpd commented 2 years ago

@kostis-codefresh Nope. The #1029 fix works for the Canary behaviour I was looking for.
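
If the referenced fix is the dynamicStableScale canary option (an assumption here; the linked issue is not quoted in this thread), a minimal sketch of enabling it looks like this:

  strategy:
    canary:
      # Assumed to be the behaviour delivered by the fix referenced above:
      # the stable ReplicaSet is scaled down as the canary's traffic weight
      # increases, so the cluster never needs capacity for two full stacks.
      dynamicStableScale: true
      canaryService: myapp-canary
      stableService: myapp
      trafficRouting:
        nginx:
          stableIngress: myapp-ingress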

szittrower commented 6 months ago

Has there been any resolution to this? As it stands today, there is no way to use Argo Rollouts with canary and traffic management on a resource-constrained system.

I need the stable pods to be reduced and removed during each step (i.e. maxUnavailable is used) before the canary creates new pods (so maxSurge is respected). Otherwise I am stuck in a deadlock where the new canary pods can never be created (no node available to schedule onto) and the stable set never scales down.

What have others done in cases like this?