meeech opened 6 months ago
Did the controller restart around the time it was fixed?
As far as I can tell, it didn't.
Digging more, here's a bit more info, which is weird:
These seem to be in a cluster together, and are also happening throughout - before, during, and after the bad notifs.
time="2024-02-28T08:56:57Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again"
time="2024-02-28T08:56:57Z" level=error msg="failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-02-28T08:56:57Z" level=error msg="rollout syncHandler error: failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again"
Also seeing these in the logs:
03:51:09.660 time="2024-02-28T08:51:09Z" level=warning msg="Removing abort condition from fully promoted rollout" namespace=default rollout=service-x
The first one lines up directly with the first 'wrong' notification received, and the last one in the logs lines up with the last 'wrong' notif received.
And the same for:
time="2024-02-28T08:55:43Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config: : trigger 'on-rollout-aborted' is not configured"
Which is weird, since this trigger is configured. These "Failed to run trigger" messages are in the logs throughout the time this was going on.
The "Removing abort condition..." messages and all the "failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent" messages do stop - when a new rollout for that service starts hours later.
However, the "Failed to run trigger, trigger: on-rollout-aborted, ..." messages stop at 3:55 - exactly when the last message to the Slack channel was sent. Looking through the logs and comparing with timestamps in Slack, these messages only show up when I have those bad notifs in Slack about the cancelled rollouts.
Looking at when the notifs start, this coincides with the first 'false' alarm notif:
16:55:33.475 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config: : trigger 'on-rollout-aborted' is not configured"
16:55:33.479 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack pipelines-eng-alerts}, namespace config: : trigger 'on-rollout-aborted' is not configured"
16:55:33.701 time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-64c6df4d6f\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=orb-service-compile-v1
So both of those triggers it says aren't configured are, in fact, configured.
But beyond that, nothing is standing out.
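For reference, "configured" here means the trigger and its template exist as keys in the controller's argo-rollouts-notification-configmap, which the notifications engine looks up by name. A minimal client-go sketch of what that configuration looks like (the trigger condition, template body, and namespace are illustrative, not copied from our setup):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The notifications engine resolves a trigger via the "trigger.<name>" key
	// (and the templates it sends) in this ConfigMap; when that lookup fails,
	// the controller logs "trigger '...' is not configured".
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "argo-rollouts-notification-configmap",
			Namespace: "argo-rollouts", // wherever the controller runs
		},
		Data: map[string]string{
			// Condition and template body are illustrative.
			"trigger.on-rollout-aborted": "- send: [rollout-aborted]\n  when: rollout.Status.Abort",
			"template.rollout-aborted":   "message: Rollout {{.rollout.metadata.name}} has been aborted.",
		},
	}

	if _, err := clientset.CoreV1().ConfigMaps(cm.Namespace).Create(
		context.TODO(), cm, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```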
I also stumbled upon this. Do you happen to have an HPA? The only way I was able to fix this was to restart the Argo Rollouts controller.
"level":"info","msg":"rollout syncHandler queue retries: 80 : key \"helloworld/helloworld\"","namespace":"helloworld","rollout":"helloworld","time":"2024-04-30T12:09:18Z"} time="2024-04-30T12:09:18Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset helloworld-866f6f5d58: Operation cannot be fulfilled on replicasets.apps \"helloworld-866f6f5d58\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
@Laakso Yes, we had an HPA on this service.
I am suspecting it has something to do with this issue. It is hard to verify though.
We have encountered this issue recently.
We had ReplicaSet reporting 3 replicas (and there were three pods). HPA was reporting 6 replicas. Rollout was reporting 4 replicas.
Our way to remediate was to manually scale up the ReplicaSet to 12 replicas. The rollout controller got unstuck immediately.
Controller version 1.6.6.
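For anyone wanting to try that remediation, one way to do the manual scale-up is through the ReplicaSet's scale subresource, as in this client-go sketch (namespace, ReplicaSet name, and the target of 12 replicas are illustrative; kubectl scale achieves the same thing):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	rsClient := clientset.AppsV1().ReplicaSets("default")

	// Read the current scale of the stuck ReplicaSet, then bump it.
	scale, err := rsClient.GetScale(context.TODO(), "helloworld-866f6f5d58", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 12
	if _, err := rsClient.UpdateScale(context.TODO(), "helloworld-866f6f5d58", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("replicaset scaled; watch whether the rollout controller gets unstuck")
}
```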
I think it should be addressed by #3559 in 1.7.0. What do you think, @zachaller?
Yea, I think it should.
@wizardist I'm seeing an issue similar to what you described.
I manually scaled up the ReplicaSet, but the Controller still seems stuck. Any advice?
edit: opened an issue: https://github.com/argoproj/argo-rollouts/issues/3709. Also, I'm on v1.7.1.
I am suspecting it has something to do with this issue. It is hard to verify though.
I can confirm I hit a similar issue once I bumped up minReplicas in the HPA.
So this is a fun one :D
and these (which I assume is why we start seeing a lot of these cancelled notifs)
This went on for hours. Here is a sample - no real pattern timing-wise:
And then, 11 hours after the first message, it just stopped. It started at 16:55 and ended at 03:55. To be clear - there was no rollout in progress. We have more than 100 rollouts in the cluster, but this was the only one where this started happening.
Can't reproduce. It just 'happened'.
Checklist:
Version
1.6.6
Logs
In the body; it was just those messages over and over.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.