argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/

Sudden reconcile failures/progress failure #3418

Open meeech opened 6 months ago

meeech commented 6 months ago

So this is a fun one :D

time="2024-02-27T21:41:41Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-v1-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-v1-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again" generation=33083 namespace=default resourceVersion=2609627010 rollout=service-x-v1

and these (which I assume are why we start seeing a lot of the cancelled notifs):

- `time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-v1-==64c6df4d6f==\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=service-x-v1`
- `time="2024-02-27T21:55:44Z" level=warning msg="ReplicaSet \"service-x-v1-==64c6df4d6f==\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=service-x-v1`

This went on for hours. Here is a sample - no real pattern timing-wise:

...
Argo Rollouts Notifications
APP
7:46 PM
:red-build: service-x-v1 canary deployment was cancelled.
7:46
:red-build: service-x-v1 canary deployment was cancelled.
7:50
:red-build: service-x-v1 canary deployment was cancelled.
7:51
:red-build: service-x-v1 canary deployment was cancelled.
7:51
:red-build: service-x-v1 canary deployment was cancelled.

Argo Rollouts Notifications
APP
8:05 PM
:red-build: service-x-v1 canary deployment was cancelled.
8:05
:red-build: service-x-v1 canary deployment was cancelled.

Argo Rollouts Notifications
APP
8:16 PM
:red-build: service-x-v1 canary deployment was cancelled.
8:16
:red-build: service-x-v1 canary deployment was cancelled.
...

And then, 11 hours after the first message, it just stopped - it started at 16:55 and ended at 03:55. To be clear, there was no rollout in progress. We have more than 100 rollouts in the cluster, but this was the only one where this happened.

Can't reproduce. It just 'happened'.
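(For reference: the "Operation cannot be fulfilled ... the object has been modified; please apply your changes to the latest version and try again" error above is Kubernetes' standard optimistic-concurrency conflict, returned when an update is submitted with a stale resourceVersion - i.e. something else wrote the object in between. Controllers usually recover by re-reading the object and retrying. A minimal sketch of that pattern using client-go's retry helper; the clientset wiring, namespace, and names are purely illustrative:)

```go
// Minimal sketch of the usual conflict-retry pattern with client-go.
// The namespace, ReplicaSet name, and replica count are illustrative only.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func scaleReplicaSet(ctx context.Context, clientset kubernetes.Interface, namespace, name string, replicas int32) error {
	// RetryOnConflict re-runs the closure whenever the API server answers
	// with a 409 Conflict ("the object has been modified"), re-reading the
	// latest resourceVersion on each attempt.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rs, err := clientset.AppsV1().ReplicaSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = clientset.AppsV1().ReplicaSets(namespace).Update(ctx, rs, metav1.UpdateOptions{})
		return err
	})
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	if err := scaleReplicaSet(context.Background(), clientset, "default", "service-x-v1-64c6df4d6f", 3); err != nil {
		fmt.Println("scale failed:", err)
	}
}
```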


Version

1.6.6

Logs

In the body; it was just those messages over and over.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

zachaller commented 6 months ago

Did the controller restart around the time it was fixed?

meeech commented 6 months ago

As far as I can tell, it didn't.

Digging more, here is a bit more info, which is weird:

These seem to come grouped together, and are also happening throughout - before, during, and after the bad notifs.

time="2024-02-28T08:56:57Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again" 
time="2024-02-28T08:56:57Z" level=error msg="failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-02-28T08:56:57Z" level=error msg="rollout syncHandler error: failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again"

Also seeing these in the logs:

03:51:09.660    time="2024-02-28T08:51:09Z" level=warning msg="Removing abort condition from fully promoted rollout" namespace=default rollout=service-x

The first one lines up directly with the first 'wrong' notification received, and the last one in the logs lines up with the last 'wrong' notif received.

and same for

time="2024-02-28T08:55:43Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config:  : trigger 'on-rollout-aborted' is not configured"

Which is weird, since this trigger is configured. These "Failed to run trigger" messages are in the log throughout the time this was going on.

The "Removing abort condition..." and all the "failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent" messages do stop - when a new rollout for that service starts hours later.

However, the "Failed to run trigger, trigger: on-rollout-aborted, ..." messages stop at 3:55 - exactly when the last message to the Slack channel was sent. And looking through the logs and comparing with the timestamps in Slack, these messages only show up in the logs when I have those bad notifs in Slack about the cancelled rollouts.

Looking at when the notifs start, this coincides with the first 'false alarm' notif:

16:55:33.475 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config:  : trigger 'on-rollout-aborted' is not configured"
16:55:33.479 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack pipelines-eng-alerts}, namespace config:  : trigger 'on-rollout-aborted' is not configured"
16:55:33.701 time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-64c6df4d6f\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=orb-service-compile-v1

So both of the triggers it says aren't configured are, in fact, configured.

But beyond that, nothing is standing out.

Laakso commented 4 months ago

I also stumbled upon this. Do you happen to have an HPA? The only way I was able to fix this was to restart the Argo Rollouts controller.

"level":"info","msg":"rollout syncHandler queue retries: 80 : key \"helloworld/helloworld\"","namespace":"helloworld","rollout":"helloworld","time":"2024-04-30T12:09:18Z"} time="2024-04-30T12:09:18Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset helloworld-866f6f5d58: Operation cannot be fulfilled on replicasets.apps \"helloworld-866f6f5d58\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"

meeech commented 4 months ago

@Laakso Yes, we had an HPA on this service.
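(For reference: Argo Rollouts exposes the scale subresource on the Rollout resource, so an HPA's scaleTargetRef can point at the Rollout itself rather than at the ReplicaSets the controller manages. A sketch of such a target reference, expressed with the k8s.io/api/autoscaling/v2 Go types; the names and replica bounds are made up for illustration:)

```go
// Illustrative HPA object whose scaleTargetRef points at a Rollout.
// Names and replica bounds are invented for the example.
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	minReplicas := int32(3)
	hpa := autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "service-x-v1", Namespace: "default"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			// Target the Rollout object itself (it supports the scale
			// subresource), not the ReplicaSets it owns.
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "argoproj.io/v1alpha1",
				Kind:       "Rollout",
				Name:       "service-x-v1",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 12,
		},
	}
	// Print the equivalent manifest for readability.
	out, err := yaml.Marshal(hpa)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```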

Laakso commented 4 months ago

I suspect it has something to do with this issue. It is hard to verify, though.

wizardist commented 2 months ago

We have encountered this issue recently.

We had the ReplicaSet reporting 3 replicas (and there were 3 pods), the HPA reporting 6 replicas, and the Rollout reporting 4 replicas.

Our way to remediate was to manually scale up the ReplicaSet to 12 replicas. The rollout controller got unstuck immediately.

Controller version 1.6.6.
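(A programmatic equivalent of that manual scale-up, sketched with client-go's scale subresource client; the commenter most likely just used kubectl, and the namespace, ReplicaSet name, and replica count below are illustrative:)

```go
// Sketch: bump a ReplicaSet's replica count via the scale subresource,
// roughly what a manual "scale up the ReplicaSet" remediation does.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	// Read the current scale, bump it, and write it back.
	// Namespace, name, and the target of 12 replicas are illustrative.
	rsClient := clientset.AppsV1().ReplicaSets("default")
	scale, err := rsClient.GetScale(ctx, "service-x-v1-64c6df4d6f", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 12
	if _, err := rsClient.UpdateScale(ctx, "service-x-v1-64c6df4d6f", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("replicaset scaled to 12")
}
```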

wizardist commented 2 months ago

I think it should be addressed by #3559 in 1.7.0. What do you think, @zachaller?

zachaller commented 2 months ago

> I think it should be addressed by #3559 in 1.7.0. What do you think, @zachaller?

Yea, I think it should.

jbrardport commented 2 months ago

@wizardist I'm seeing an issue similar to what you described.

I manually scaled up the ReplicaSet, but the Controller still seems stuck. Any advice?

Edit: opened an issue (https://github.com/argoproj/argo-rollouts/issues/3709). Also, I'm on v1.7.1.

gopkri24 commented 1 month ago

> I suspect it has something to do with this issue. It is hard to verify, though.

I can confirm I had a similar issue once I bumped up minReplicas in the HPA.