Open zimmertr opened 5 months ago
If someone is willing to coach me on the preferred solution and anything else I should know for contributing to a Go project for the first time, I would be happy to try and create a PR to resolve this.
I wonder if it would make sense to just not reconcile at all if there are zero replicas?
We have code here where we check for nil and bump it to one. I wonder what the behavior would be like if we, say, checked for zero and bailed on the reconcile. It might be a bit odd to do that, because what does it mean if you change the pod spec? Do you want a new ReplicaSet with a zero count? Maybe you do. In that case we could maybe just reconcile ReplicaSets and not run steps, etc. I would be curious what tests, if any, break with a change like that.
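Roughly something like this sketch of "check for zero and bail". Purely illustrative, the helper name is made up and this is not the actual controller code; the nil case mirrors today's default-to-one behavior:

```go
package main

import "fmt"

// shouldSkipCanaryReconcile is an illustrative helper (not the real controller
// code): today a nil spec.replicas gets bumped to one, whereas this treats an
// explicit zero as "skip the canary steps / reconcile entirely".
func shouldSkipCanaryReconcile(specReplicas *int32) bool {
	return specReplicas != nil && *specReplicas == 0
}

func main() {
	zero := int32(0)
	fmt.Println(shouldSkipCanaryReconcile(&zero)) // true: bail out of the reconcile
	fmt.Println(shouldSkipCanaryReconcile(nil))   // false: nil keeps the default-to-one behavior
}
```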
I think not reconciling until replicas > 0 makes sense. We would of course want a new ReplicaSet to be created with any changes made since scaling to 0, however.
In our case, we are attempting to be multi-region on our cloud provider for disaster recovery purposes. Services in the non-active region cannot yet tolerate failures to connect to resources, like databases, that are unavailable while that region is not active. This is why we scale them to zero replicas. It is inconvenient for us that canary replicas are provisioned anyway, as the pods just get stuck in CrashLoopBackOff. To mitigate this, we have to completely delete the Rollout (actually, the entire Argo CD Application) from Kubernetes instead.
It was while trying to solve this that I discovered this bug.
Describe the bug
My team uses Argo Rollouts with the canary strategy and traffic routing. Our strategy typically looks something like this:
We also have a case where we occasionally have to scale the number of replicas down to 0. When this is done and a change is made to the Rollout spec, a new ReplicaSet containing the changes is created with 1 replica because of setCanaryScale.replicas: 1. This is not ideal, but it is not a bug. IMO it would be nice if there were a feature to disable the canary behavior when a Rollout has its replica count set to 0, or is otherwise annotated with some sort of suspended boolean. But that is not my reason for creating this issue.
If you then Resume the Rollout steps, the Rollout completes successfully and then scales the replicas back down to zero. However, if you instead Promote-Full the Rollout, an integer divide by zero exception is thrown. The exception is caught, but the Rollout effectively gets permanently stuck in this progressing state, and the Rollout Controller logs continuously record the caught exceptions:
Sync begins
Exception occurs
Exception is recovered
After reading the trace, it is clear that the exception is caused on this line, where desiredWeight is calculated when dynamicStableScale is in use. It attempts to divide by zero when the number of replicas is 0. There should likely be another nested conditional, or some sort of error handling, that manages this case. Presumably the desiredWeight should then instead be set to 100? Or just not be modified from the existing state?
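To illustrate what I mean, here is a standalone sketch of that proportional weight calculation with a guard added. The function name and the fallback value of 0 are my own assumptions, not the actual trafficrouting code:

```go
package main

import "fmt"

// computeDesiredWeight is an illustrative sketch of the calculation described
// above: the canary weight scales with its available replicas relative to
// spec.replicas. The extra branch guards the division when the Rollout has
// been scaled to zero.
func computeDesiredWeight(canaryAvailable, specReplicas int32) int32 {
	if specReplicas == 0 {
		// With no desired replicas there is nothing to shift traffic to, so
		// keep all weight on the stable side instead of dividing by zero.
		// (Arguably 100, or leaving the weight untouched, is also defensible.)
		return 0
	}
	return 100 * canaryAvailable / specReplicas
}

func main() {
	fmt.Println(computeDesiredWeight(1, 4)) // 25
	fmt.Println(computeDesiredWeight(0, 0)) // 0, instead of an integer divide by zero
}
```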
To Reproduce
1. Create a Rollout that uses the canary strategy with traffic routing and dynamicStableScale.
2. Scale the Rollout down to 0 replicas.
3. Make a change to the Rollout spec so a new canary ReplicaSet is created.
4. Promote-Full the Rollout.
Expected behavior
The Rollout would simply promote the change and probably not modify Istio weights at all.
Version
v1.7.0
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.