pdeva closed this issue 10 months ago
Hi @zachaller, I am still seeing the "the object has been modified; please apply your changes to the latest version"
error with replicasets using 1.6.2, which includes the fix from #3091 :(
@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?
Yeah, our rollout got stuck and I saw this message over and over. The behaviour we saw was: we set setWeight: 1 and nothing seemed to be happening; then setWeight: 2, and still the new RS was stuck at 0 replicas (the Rollout was marked as "Progressing").
I'm also seeing a LOT of these in the controller logs with a very similar setup to @bpoland, and we're also seeing rollouts getting stuck unless we manually retry the failures.
Do you guys have any type of policy agent that modifies the replicasets, or possibly the pod spec in the replicaset? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the replicaset. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.
@pdeva Are you able to reliably reproduce this? Also, your image shows a bunch of issues with the VirtualService, but then you also show a log line on the ReplicaSet, so it could be something else also modifying the VirtualService.
@zachaller We have policy agents but it seemed to work fine on v1.3.0 which we just upgraded from.
I managed to find a rollout stuck in progress because it seemed like it wasn't updating the replica count in the new replica set.
As a follow up we rolled back to v1.3.0 and everything started working again
We have a linkerd injector which adds a container, maybe that is related? Similar to @mclarke47 though, we have not experienced this previously (currently trying to upgrade from 1.4.1)
We are also seeing this happen a lot more. Yesterday HPA increased the number of replicas, but the Rollout did not bring up more pods. The Rollout object itself had the correct number set; it's just that the new pods weren't coming up. Killing the Argo Rollouts controller always fixes these stuck cases.
It's definitely happening a lot more with the 1.6 version than before.
Question, would something like HPA modifying the number of replicas count as something that modifies the replicaset and might cause this issue?
Here's an example. We started seeing these messages at 2023-11-14T22:10:00Z:
time="2023-11-14T22:10:00Z" level=error msg="roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" generation=670 namespace=public resourceVersion=432345210 rollout=public-collector-saas
time="2023-11-15T00:44:36Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" namespace=public rollout=public-collector-saas
And they continued; this is the last one, at 2023-11-15T00:44:37Z:
time="2023-11-15T00:44:37Z" level=error msg="Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
That's two hours. And it only started working again when we killed the argo controller pod. Would it be possible to include in the message what changed? Perhaps that would give some clue as to why this is happening?
This is the first message (referencing the same replicaset) after the controller restarted:
time="2023-11-15T00:44:38Z" level=info msg="Enqueueing parent of public/public-collector-saas-56577cf9c-15: Rollout public/public-collector-saas"
Can you please explain a bit more about these conflicts that cause the "the object has been modified" errors? What is a common cause? How is the controller meant to deal with them? Presumably nothing was modifying this replicaset for 2 hours straight... is the idea that the controller modifies it, and then something else also modifies it (maybe reverting something), and that's what the controller notices?
This is now happening to us daily so anything we can do to help figure this out, please let us know. We are on 1.6.2.
Thank you Dan
btw, we also run gatekeeper, but it only has one mutating webhook which has to do with HPA, this is what it looks like:
apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: hpa-assign-default-scale-down
spec:
  applyTo:
    - groups: ["autoscaling"]
      kinds: ["HorizontalPodAutoscaler"]
      versions: ["v2beta2"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["HorizontalPodAutoscaler"]
  location: "spec.behavior.scaleDown"
  parameters:
    pathTests:
      - subPath: "spec.behavior.scaleDown"
        condition: MustNotExist
    assign:
      value:
        # wait this long for the largest recommendation and then scale down to that
        stabilizationWindowSeconds: 600 # default = 300
        policies:
          # Only take down ${value}% per ${periodSeconds}
          - periodSeconds: 300 # default = 60
            type: Percent
            value: 10
So in theory this shouldn't be touching the replicaset at all. The other webhooks are constraints that have to do with labels and annotations, nothing that would mess with a running pod.
fwiw, this is what the API is showing related to that stuck replicaset:
Finally, this is when the issue started, the view from the api controller:
Maybe this is helpful: this is the last successful Update by the argo controller, followed by changes from other components, and then the failed updates from argo starting:
So the last event that happened before the errors started was HPA taking down one replica. Maybe that was the trigger, and what changed in the replicaset since argo last saw it, but somehow it didn't manage to reconcile that properly.
This is the HPA view:
It looks like it was trying to increase the number of replicas. I wonder if this is Argo and HPA fighting it out then?
Note that while HPA shows current and desired replicas = 124, the actual number of replicas was 112. So this is similar to what I saw a couple of days ago where HPA said "bring up more replicas" and argo did not.
I assume the "current replicas" comes from the controller (argo in this case). And I can confirm that I did see the Rollout object have the correct desired number of pods, while the number of running pods was smaller.
I want to just comment that I think we are also seeing some issues with one of our clusters in regards to this, so I'm spending some time looking into it.
Do any of you use notifications within your rollouts specs? Trying to see if there is a correlation with notifications updating the replicaset spec.
yes.
We don't use notifications currently.
No notification use here
I have faced similar problems in our cluster.
When a Rollout is updated, both the old and new ReplicaSets keep running and the Rollout gets stuck. I could see the following message in the status of the Rollout:
old replicas are pending termination
Here is a snippet of kubectl get replicaset. Hash 676f9f555d is new, and 7b7cdd9847 is old.
worker-676f9f555d-25xd9 1/1 Running 0 3h22m
worker-676f9f555d-dkmlt 1/1 Running 0 3h22m
worker-7b7cdd9847-d4xl6 1/1 Running 0 5h8m
I deleted the old ReplicaSet and then Rollout status became Healthy.
When a Rollout is updated, it becomes Degraded even if all new pods are running. I could see the following message in the status of the Rollout:
ProgressDeadlineExceeded: ReplicaSet "poller-64d95bc44b" has timed out progressing.
I could refresh the status of the Rollout by restarting the argo-rollouts-controller.
Thank you for fixing this (hopefully for good)! What is the eta for when this might make it into a release?
Just released it, can you try it out?
It will still log the conflict but rollouts should not get stuck anymore.
I also probably found the root cause of the conflicts, just not sure how to deal with it yet. They also should not cause any issues, because they do get retried, and we have had this code for a while now: https://github.com/argoproj/argo-rollouts/issues/3218
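For anyone wondering what "they do get retried" looks like in practice: the usual client-go pattern (e.g. the retry.RetryOnConflict helper) is to re-read the latest object and recompute the change on each attempt, rather than re-sending the same stale write. A self-contained sketch in plain Go (in-memory stand-in for the API server; names are illustrative, not argo-rollouts code):

```go
package main

import (
	"errors"
	"fmt"
)

// apiObject is an in-memory stand-in for the API server: writes based on a
// stale resourceVersion are rejected with a conflict error.
type apiObject struct {
	resourceVersion int
	replicas        int
}

var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

func (o *apiObject) update(basedOnVersion, replicas int) error {
	if basedOnVersion != o.resourceVersion {
		return errConflict
	}
	o.replicas = replicas
	o.resourceVersion++
	return nil
}

// updateWithRetry mirrors the shape of client-go's retry.RetryOnConflict:
// each attempt re-reads the latest object and recomputes the change, so a
// conflict caused by a concurrent writer resolves on the next try.
func updateWithRetry(o *apiObject, mutate func(current int) int, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		version := o.resourceVersion // fresh read each attempt
		if err = o.update(version, mutate(o.replicas)); !errors.Is(err, errConflict) {
			return err // nil on success, or any non-conflict error
		}
	}
	return err
}

func main() {
	rs := &apiObject{resourceVersion: 1, replicas: 3}
	_ = rs.update(rs.resourceVersion, 4) // another writer (e.g. HPA) got there first

	// The re-read-and-retry update still lands despite the concurrent write.
	err := updateWithRetry(rs, func(current int) int { return current + 1 }, 3)
	fmt.Println(err, rs.replicas) // prints: <nil> 5
}
```

With this pattern in place, the conflict log lines are on their own mostly noise; the bug reported in this issue was that a stuck Rollout sometimes did not recover even though such retries exist.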
I have updated argo-rollouts to v1.6.3 and this problem seems resolved. Thank you very much for the quick fix.
We've also upgraded and haven't seen the issue again since. Thanks!
Hi. We are still seeing this in 1.6.4. In this case, a new rollout was triggered and showed up in the UI, but did not start rolling out. I clicked the promote button and then it went ahead.
hey folks, we're also seeing this on the latest Argo Rollouts version (the unreleased 1.7.x).
In our case, we have a process which annotates (and labels) Rollout objects on each application deploy, and we suspect the issue happens when:
1. Deployment is modified (triggering a new rollout) and a new ReplicaSet is created with the same labels/annotations as the original Rollout (foo: bar)
2. Rollout progresses
3. Rollout object is modified with a new label (foo: foobar) and Deployment is modified again (triggering a new rollout)
In some cases, the Argo Rollouts controller seems to lock up and stop reporting any data for that particular Rollout (usually one out of the 15 we're running at a time), and the solution is to restart the controller to force it to reconcile the Rollout to the latest version it should be at.
@NaurisSadovskis Did you also see the issue on 1.5? The logs would be different and would not log the error, but could you check whether rollouts got stuck there too?
I updated the controller to v1.6.4 and this problem occurs again. As a workaround, we run a CronJob to restart the controller every day.
@NaurisSadovskis would you be able to test a version with this patch: https://github.com/argoproj/argo-rollouts/pull/3272
@zachaller Updated and the problem persists. More specifically, the controller is active but gets stuck rolling out the new ReplicaSet. @int128's solution of restarting the controller solves this again.
Just experienced this. 1.6.6. Definitely seems related to HPA; the replicaset was scaled up at the time.
edit: argocd 2.10.2, on EKS 1.28. Restarting the controller fixed it.
Does it make sense to reopen this?
We're also seeing this issue, using latest release.
rollouts v1.6.6: still have the same issue; later it times out progressing.
Good morning, this case should be reopened, since it continues to happen with the new version of Argo Rollouts, v1.6.6. In our case, from time to time we have to restart Argo Rollouts for it to be fixed. Logs attached.
This error happens with hpa and without hpa.
@zachaller can we reopen this issue? We are also continuing to hit it
We are facing the same issue on v1.6.6
Hey, coming here to say that we face the exact same issue.
When an HPA scales a rollout, it can come into conflict with the argo-rollouts controller.
I think this is the same issue so maybe we should keep posting over there: https://github.com/argoproj/argo-rollouts/issues/3316
Unfortunately, we still need to restart the controllers of our clusters every hour.
Here is an example of CronJob.
# Restart argo-rollouts every hour to avoid the following error:
# https://github.com/argoproj/argo-rollouts/issues/3080#issuecomment-1835809731
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argo-rollouts-periodically-restart
spec:
  schedule: "0 * * * *" # every hour
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 # 3 days
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: argo-rollouts-periodically-restart
          containers:
            - name: kubectl-rollout-restart
              image: public.ecr.aws/bitnami/kubectl:1.29.1
              command:
                - kubectl
                - --v=3
                - --namespace=argo-rollouts
                - rollout
                - restart
                - deployment/argo-rollouts
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - "ALL"
              resources:
                limits:
                  memory: 64Mi
                requests:
                  cpu: 10m
                  memory: 64Mi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-rollouts-periodically-restart
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-rollouts-periodically-restart
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-rollouts-periodically-restart
subjects:
  - kind: ServiceAccount
    name: argo-rollouts-periodically-restart
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-rollouts-periodically-restart
rules:
  # https://stackoverflow.com/a/68980720
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["argo-rollouts"]
    verbs: ["get", "patch"]
We see the same issue on 1.6.6, with the rollout using HPA. Restarting the argo-rollouts pods made the issue go away for the application. Logs from the rollouts controller:
time="2024-07-10T11:08:14Z" level=info msg="Started syncing rollout" generation=25660 namespace=my-application-namespace resourceVersion=4042592220 rollout=my-application
time="2024-07-10T11:08:14Z" level=info msg="No TrafficRouting Reconcilers found" namespace=my-application-namespace rollout=my-application
time="2024-07-10T11:08:14Z" level=error msg="roCtx.reconcile err failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset my-application-5767f75c6d: Operation cannot be fulfilled on replicasets.apps \"my-application-5767f75c6d\": the object has been modified; please apply your changes to the latest version and try again" generation=25660 namespace=my-application-namespace resourceVersion=4042592220 rollout=my-application
time="2024-07-10T11:08:14Z" level=info msg="Reconciliation completed" generation=25660 namespace=my-application-namespace resourceVersion=4042592220 rollout=my-application time_ms=16.470487
time="2024-07-10T11:08:14Z" level=error msg="rollout syncHandler error: failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset my-application-5767f75c6d: Operation cannot be fulfilled on replicasets.apps \"my-application-5767f75c6d\": the object has been modified; please apply your changes to the latest version and try again" namespace=my-application-namespace rollout=my-application
time="2024-07-10T11:08:14Z" level=info msg="rollout syncHandler queue retries: 672 : key \"my-application-namespace/my-application\"" namespace=my-application-namespace rollout=my-application
time="2024-07-10T11:08:14Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset my-application-5767f75c6d: Operation cannot be fulfilled on replicasets.apps \"my-application-5767f75c6d\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
Rollouts v1.7.1 should fix this. If people still see this on v1.7.x, please report it.
Good afternoon, I am testing it with 1.7.1. At the moment there are few retries, and I have not yet had to restart the argo-rollouts pods.
time="2024-07-10T15:12:44Z" level=info msg="rollout syncHandler queue retries: 2 : key \"my-namespace/my-app\"" namespace=my-namespace rollout=my-app
Checklist:
Describe the bug
Updates to services in Argo Rollouts are suddenly failing with this message for no reason. The only change we made was the image tag of the Rollout.
To Reproduce
It fails and gets into this state when multiple Rollout image tags are updated at once. If we then do a rollout retry one service at a time, each service succeeds.
Expected behavior
The Rollout should succeed. It has no reason to fail, since the only thing changed is the updated image tag.
Screenshots
Version 1.6.0
Logs
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.