argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Fully Promoting a 0-Replica Canary Rollout with Traffic Routing Causes a Panic #3686

Open zimmertr opened 4 months ago

zimmertr commented 4 months ago

Describe the bug

My team uses Argo Rollouts with the canary strategy and traffic routing. Our strategy typically looks something like this:

canary:
  canaryService: example-service-canary
  stableService: example-service
  dynamicStableScale: true
  steps:
    - setCanaryScale:
        replicas: 1
    - pause: {}
    - setCanaryScale:
        matchTrafficWeight: true
    - setWeight: 20
    - setWeight: 40
    - setWeight: 60
    - setWeight: 80
    - setWeight: 100
  trafficRouting:
    istio:
      virtualServices:
        - name: example-service-backend
          routes:
            - example-service-stable
        - name: example-service-ingress
          routes:
            - example-service-stable

We also occasionally have to scale the number of replicas down to 0. When this is done and a change is then made to the Rollout spec, a new ReplicaSet containing the changes is created with 1 replica, because of setCanaryScale.replicas: 1. This is not ideal, but it is not a bug. IMO it would be nice if there were a feature to disable the canary behavior when a Rollout has its replica count set to 0, or when it is annotated with some sort of suspended boolean, but that is not my reason for creating this issue.

If you then Resume the Rollout steps, the Rollout completes successfully and then scales the replicas back down to zero.

However, if you instead Promote-Full the Rollout, an integer divide by zero panic is thrown. The panic is recovered, but the Rollout gets permanently stuck in the Progressing state, and the Rollout controller continuously logs the recovered panics.
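For context, the "caught exception" here is Go's panic/recover mechanism: integer division by zero panics at runtime, and the controller's worker recovers it and requeues the Rollout, which is why the logs repeat forever. A minimal, self-contained illustration of that pattern (not the actual controller code):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// divideWeight mimics the controller's weight calculation. A zero divisor
// triggers a runtime panic, which the deferred recover converts into an
// error -- the same recover-and-requeue pattern the Rollouts worker uses.
func divideWeight(maxWeight, available, replicas int32) (weight int32, err error) {
	defer func() {
		if r := recover(); r != nil {
			debug.PrintStack() // the controller logs a stack trace like the one below
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	weight = (maxWeight * available) / replicas
	return weight, nil
}

func main() {
	if _, err := divideWeight(100, 0, 0); err != nil {
		fmt.Println(err) // prints: recovered from panic: runtime error: integer divide by zero
	}
}
```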


Sync begins

time="2024-06-28T17:10:15Z" level=info msg="Started syncing rollout" generation=9 namespace=example-app resourceVersion=1218236008 rollout=example-service
time="2024-06-28T17:10:15Z" level=info msg="delaying service switch from 588865bc5f to f45bd9bbb: ReplicaSet has zero availability" namespace=example-app rollout=example-service service=example-service
time="2024-06-28T17:10:15Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=example-app rollout=example-service
time="2024-06-28T17:10:15Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=example-app rollout=example-service
time="2024-06-28T17:10:15Z" level=info msg="Reconciliation completed" generation=9 namespace=example-app resourceVersion=1218236008 rollout=example-service time_ms=1.3598189999999999

Exception occurs

time="2024-06-28T17:10:15Z" level=error msg="Recovered from panic: runtime error: integer divide by zero" namespace=example-app rollout=example-service
goroutine 43159 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:151 +0x49
panic({0x2c65f20?, 0x5358630?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/argoproj/argo-rollouts/rollout.(rolloutContext).reconcileTrafficRouting(0xc003fb2400)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:206 +0x113d
github.com/argoproj/argo-rollouts/rollout.(rolloutContext).rolloutCanary(0xc003fb2400)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:57 +0x1d3
github.com/argoproj/argo-rollouts/rollout.(rolloutContext).reconcile(0xc003fb2400)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xd4
github.com/argoproj/argo-rollouts/rollout.(Controller).syncHandler(0xc0006c8380, {0x39446d0, 0xc0004a0780}, {0xc00467c0c0, 0x1b})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:437 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:155 +0x7d
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x3957450?, 0xc00046ebe0}, {0x31bf658, 0x7}, 0xc0035dfe70, {0x39446d0?, 0xc0004a0780}, 0xc000a31860?, {0x2a9bfa0, 0xc00112a930})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:159 +0x3ed
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x39446d0, 0xc0004a0780}, {0x3957450, 0xc00046ebe0}, {0x31bf658, 0x7}, 0x0?, 0xc0013ab860?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:178 +0xb5
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:106
github.com/argoproj/argo-rollouts/rollout.(Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:358 +0xb2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xa01223201?, {0x3912ba0, 0xc001abd170}, 0x1, 0xc000915260)
	/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000ab6fb0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(0x8dc6c5?, 0x0?, 0xc000ab6f10?)
	/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:161 +0x1e
created by github.com/argoproj/argo-rollouts/rollout.(Controller).Run in goroutine 43032
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:357 +0xa7

Exception is recovered

time="2024-06-28T17:10:15Z" level=error msg="rollout syncHandler error: Recovered from Panic" namespace=example-app rollout=example-service
time="2024-06-28T17:10:15Z" level=info msg="rollout syncHandler queue retries: 2 : key \"example-app/example-service\"" namespace=example-app rollout=example-service
time="2024-06-28T17:10:15Z" level=error msg="Recovered from Panic" error=""


After reading the trace, it is clear that the panic occurs on this line in reconcileTrafficRouting, where desiredWeight is calculated when dynamicStableScale is in use: the code divides by *c.rollout.Spec.Replicas, which is 0. There should likely be a nested conditional, or some other error handling, to manage this case. Presumably desiredWeight should instead be set to 100? Or simply left unmodified from the existing state?

} else if c.rollout.Status.PromoteFull {
    // on a promote full, desired stable weight should be 0 (100% to canary),
    // But we can only increase canary weight according to available replica counts of the canary.
    // we will need to set the desiredWeight to 0 when the newRS is not available.
    if c.rollout.Spec.Strategy.Canary.DynamicStableScale {
        // panics with "integer divide by zero" when *c.rollout.Spec.Replicas == 0
        desiredWeight = (weightutil.MaxTrafficWeight(c.rollout) * c.newRS.Status.AvailableReplicas) / *c.rollout.Spec.Replicas
    } else if c.rollout.Status.Canary.Weights != nil {
        desiredWeight = c.rollout.Status.Canary.Weights.Canary.Weight
    }

    err := reconciler.RemoveManagedRoutes()
    if err != nil {
        return err
    }
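One possible shape of a fix, extracted here as a minimal standalone sketch (hypothetical function and parameter names, not a patch against the actual file): guard the dynamicStableScale branch against a zero replica count before dividing. Falling back to the max traffic weight matches the "set desiredWeight to 100" suggestion, though leaving the weight untouched is the other option on the table.

```go
package main

import "fmt"

// desiredPromoteFullWeight is a sketch (not the actual Argo Rollouts code) of
// guarding the dynamicStableScale branch. maxWeight stands in for
// weightutil.MaxTrafficWeight(c.rollout), available for
// c.newRS.Status.AvailableReplicas, and replicas for *c.rollout.Spec.Replicas.
func desiredPromoteFullWeight(maxWeight, available, replicas int32) int32 {
	if replicas == 0 {
		// Nothing to scale proportionally against; on a full promotion of a
		// scaled-down rollout, fall back to routing everything to the new
		// side rather than dividing by zero. Whether the right fallback is
		// maxWeight or the existing weight is exactly the open question here.
		return maxWeight
	}
	return (maxWeight * available) / replicas
}

func main() {
	fmt.Println(desiredPromoteFullWeight(100, 0, 0)) // 100
	fmt.Println(desiredPromoteFullWeight(100, 2, 4)) // 50
}
```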

To Reproduce

  1. Configure a canary Rollout to use Traffic Routing with Istio and dynamicStableScale
  2. Set the number of Replicas for the Rollout to 0
  3. Modify an aspect of the Rollout spec to trigger a Rollout
  4. Promote-Full the Rollout

Expected behavior

The Rollout would simply promote the change, ideally without modifying the Istio weights at all.

Version

v1.7.0


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

zimmertr commented 4 months ago

If someone is willing to coach me on the preferred solution and anything else I should know for contributing to a Go project for the first time, I would be happy to try and create a PR to resolve this.

zachaller commented 4 months ago

I wonder if it would make sense to just not reconcile at all if there are zero replicas?

We have code here where we check for nil and bump to one. I wonder what the behavior would be if we instead checked for zero and bailed on the reconcile. It might be a bit odd to do that, because what does it mean if you change the pod spec? Do you want a new ReplicaSet with a zero count? Maybe you do. In that case we could perhaps just reconcile ReplicaSets and not run the steps, etc. I would be curious what tests, if any, break with a change like that.

zimmertr commented 4 months ago

I think not reconciling until replicas > 0 makes sense. We would, of course, still want a new ReplicaSet to be created with any changes made while scaled to 0.
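The idea being discussed could be sketched like this (names are invented for illustration and are not the real controller API): when a Rollout is scaled to zero, sync only the ReplicaSet template, so spec changes still produce a new zero-replica ReplicaSet, and skip the canary steps and traffic-routing machinery entirely.

```go
package main

import "fmt"

// rollout is a stand-in for the controller's rollout context; only the
// replica count matters for this sketch.
type rollout struct {
	replicas int32
}

// reconcile bails out before canary steps and traffic routing run when the
// rollout is scaled to zero, while still syncing ReplicaSets so spec changes
// are picked up.
func reconcile(r rollout) string {
	if r.replicas == 0 {
		return "synced ReplicaSets only (canary steps skipped)"
	}
	return "ran full canary reconcile"
}

func main() {
	fmt.Println(reconcile(rollout{replicas: 0})) // synced ReplicaSets only (canary steps skipped)
	fmt.Println(reconcile(rollout{replicas: 3})) // ran full canary reconcile
}
```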

In our case, we are attempting to be multi-region on our cloud provider for disaster recovery purposes. Services in the non-active region cannot yet tolerate failures to connect to other resources, like databases, that are unavailable while the region is not active. This is why we scale them to zero replicas. It is inconvenient that canary replicas get provisioned in this situation, as the pods just get stuck in CrashLoopBackOff. To mitigate this, we currently have to completely delete the Rollout (actually, the entire Argo CD Application) from Kubernetes instead.

It was while trying to solve this that I discovered this bug.