argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Rollout failing with msg "the object has been modified; please apply your changes to the latest version" #3080

Closed. pdeva closed this issue 10 months ago.

pdeva commented 1 year ago


Describe the bug Updates to services in Argo Rollouts are suddenly failing with this message for no apparent reason. The only change we made was updating the image tag of the Rollout.

To Reproduce It fails and gets into this state when multiple Rollout image tags are updated at once. If we then do a rollout retry one service at a time, each service succeeds.

Expected behavior The Rollout should succeed; there is no reason for it to fail since the only thing changed is the image tag.

Screenshots

(two screenshots attached, 2023-10-05 at 8:01 PM and 8:05 PM)

Version 1.6.0

Logs

roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps "pg-query-65bc4849f5": the object has been modified; please apply your changes to the latest version and try again
# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

bpoland commented 11 months ago

Hi @zachaller I am still seeing this "the object has been modified; please apply your changes to the latest version" error with ReplicaSets using 1.6.2, which includes the fix from #3091 :(

zachaller commented 11 months ago

@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?
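For background, Kubernetes writes use optimistic concurrency: every object carries a metadata.resourceVersion, and an update submitted with a stale version is rejected with exactly this "the object has been modified" conflict. Controllers normally handle it by re-reading the object and retrying the write. Below is a minimal client-go sketch of that general retry pattern; the function, namespace, and ReplicaSet name are placeholders for illustration, not the Rollouts controller's actual code path.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

// scaleReplicaSet re-reads the ReplicaSet and retries the update whenever the
// API server reports a resourceVersion conflict.
func scaleReplicaSet(ctx context.Context, c kubernetes.Interface, ns, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always fetch the latest copy so the update carries a fresh resourceVersion.
		rs, err := c.AppsV1().ReplicaSets(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = c.AppsV1().ReplicaSets(ns).Update(ctx, rs, metav1.UpdateOptions{})
		return err // a conflict here simply triggers another attempt
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := scaleReplicaSet(context.Background(), client, "default", "pg-query-65bc4849f5", 2); err != nil {
		fmt.Println("update failed after retries:", err)
	}
}

As long as the retry happens against a freshly read copy, a conflict on its own is harmless; the reports later in this thread are about cases where the controller appears to stop making progress afterwards.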

bpoland commented 11 months ago

@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?

Yeah our rollout got stuck and I saw this message over and over. The behaviour we saw was:

  1. The rollout was in a healthy state (or so it said)
  2. We updated the image on the Rollout (we are using workload referencing, in case that matters)
  3. Rollouts created a new ReplicaSet with the updated image, but the new RS had 0 replicas (even though our first step is setWeight: 1) and nothing seemed to be happening
  4. I manually promoted to the next step which is a pause and then to the next step setWeight: 2 -- still the new RS was stuck at 0 replicas (Rollout was marked as "Progressing")
  5. I tried manually scaling up the new RS to 1 replica. The new pod started but the steps did not progress
  6. I checked rollout controller logs and saw it complaining about an RS for an old revision. I manually deleted that RS and then the rollouts controller immediately picked up with the next step and the issue was resolved

mclarke47 commented 11 months ago

I'm also seeing a LOT of these in the controller logs with a very similar setup to @bpoland, and we're also seeing rollouts getting stuck unless we manually retry the failures.

zachaller commented 11 months ago

Do you guys have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.

zachaller commented 11 months ago

@pdeva Are you able to reliably reproduce this? Also, your image shows a bunch of issues with the VirtualService, but you also show a log line about the ReplicaSet, so it could be something else that is also modifying the VirtualService.

mclarke47 commented 11 months ago

@zachaller We have policy agents but it seemed to work fine on v1.3.0 which we just upgraded from.

I managed to find a rollout stuck in progress because it seemed like it wasn't updating the replica count in the new replica set.

(three screenshots attached, 2023-11-13 around 8:31 AM)

controller logs

As a follow up we rolled back to v1.3.0 and everything started working again

bpoland commented 11 months ago

Do you guys have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.

We have a linkerd injector which adds a container, maybe that is related? Similar to @mclarke47 though, we have not experienced this previously (currently trying to upgrade from 1.4.1)

DanTulovsky commented 11 months ago

We are also seeing this happen a lot more. Yesterday HPA increased the number of replicas, but the Rollout did not bring up more pods. The Rollout object itself had the correct number set; it's just that the new pods weren't coming up. Killing the Argo Rollouts controller always fixes these stuck cases.

It's definitely happening a lot more with the 1.6 version than before.

DanTulovsky commented 11 months ago

Question, would something like HPA modifying the number of replicas count as something that modifies the replicaset and might cause this issue?

DanTulovsky commented 11 months ago

Here's an example. We started seeing these messages at 2023-11-14T22:10:00Z:

time="2023-11-14T22:10:00Z" level=error msg="roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" generation=670 namespace=public resourceVersion=432345210 rollout=public-collector-saas

time="2023-11-15T00:44:36Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" namespace=public rollout=public-collector-saas

And they continued; this is the last one, at 2023-11-15T00:44:37Z:

time="2023-11-15T00:44:37Z" level=error msg="Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"

That's two hours. And it only started working again when we killed the argo controller pod. Would it be possible to include in the message what changed? Perhaps that will lead to some clue as to why this is happening?

This is the first message (referencing the same replicaset) after the controller restarted:

time="2023-11-15T00:44:38Z" level=info msg="Enqueueing parent of public/public-collector-saas-56577cf9c-15: Rollout public/public-collector-saas"

Can you please explain a bit more about these conflicts that cause the "the object has been modified" errors? What is a common cause? How is the controller meant to deal with them? Presumably nothing was modifying this replicaset for 2 hours straight... is the idea that the controller modifies it, and then something else also modifies it (maybe reverting something), and that's what the controller notices?

This is now happening to us daily so anything we can do to help figure this out, please let us know. We are on 1.6.2.

Thank you Dan
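On the question above of what actually changed: the conflict error only says that the ReplicaSet's resourceVersion moved between the controller's read and its write, not which client moved it. One way to narrow that down is to look at the object's managed fields, which record a "field manager" name, operation, and timestamp for each client that has written to it. A small sketch using client-go, reusing the namespace and ReplicaSet name from the logs above purely as placeholders:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Substitute the namespace and name of the ReplicaSet that keeps conflicting.
	rs, err := client.AppsV1().ReplicaSets("public").Get(context.Background(),
		"public-collector-saas-56577cf9c", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Each managed-fields entry identifies a writer ("field manager"), the kind of
	// operation, and when it happened, which is usually enough to spot what is
	// racing with the rollouts controller.
	for _, mf := range rs.ManagedFields {
		fmt.Printf("manager=%s operation=%s time=%v\n", mf.Manager, mf.Operation, mf.Time)
	}
}

The same information should also be visible with kubectl by adding --show-managed-fields to a kubectl get -o yaml on the ReplicaSet.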

DanTulovsky commented 11 months ago

By the way, we also run Gatekeeper, but it only has one mutating webhook, which has to do with HPA. This is what it looks like:

apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: hpa-assign-default-scale-down
spec:
  applyTo:
    - groups: ["autoscaling"]
      kinds: ["HorizontalPodAutoscaler"]
      versions: ["v2beta2"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["HorizontalPodAutoscaler"]
  location: "spec.behavior.scaleDown"
  parameters:
    pathTests:
      - subPath: "spec.behavior.scaleDown"
        condition: MustNotExist
    assign:
      value:
        # wait this long for the largest recommendation and then scale down to that
        stabilizationWindowSeconds: 600  # default = 300
        policies:
          # Only take down ${value}% per ${periodSeconds}
          - periodSeconds: 300  # default = 60
            type: Percent
            value: 10

So in theory this shouldn't be touching the replicaset at all. The other webhooks are constraints that have to do with labels and annotations, nothing that would mess with a running pod.

DanTulovsky commented 11 months ago

FWIW, this is what the API is showing related to that stuck ReplicaSet:

(screenshot attached)

DanTulovsky commented 11 months ago

Finally, this is when the issue started, the view from the api controller:

(screenshot attached)

DanTulovsky commented 11 months ago

Maybe this is helpful: this is the last successful update by the Argo controller, followed by changes from other components, and then the failed updates from Argo starting:

(screenshot attached)

DanTulovsky commented 11 months ago

So the last event before the errors started was HPA taking down one replica. Maybe that was the trigger, and that's what changed in the ReplicaSet since Argo last saw it, but somehow it didn't manage to reconcile it properly.

This is the HPA view:

(screenshot attached)

It looks like it was trying to increase the number of replicas. I wonder if this is Argo and HPA fighting it out then?

DanTulovsky commented 11 months ago

Note that while HPA shows current and desired replicas = 124, the actual number of replicas was 112. So this is similar to what I saw a couple of days ago where HPA said "bring up more replicas" and argo did not.

I assume the "current replicas" comes from the controller (Argo in this case). And I can confirm that I did see the Rollout object have the correct desired number of pods, while the number of running pods was smaller.

zachaller commented 11 months ago

I just want to comment that we are also seeing some issues with one of our clusters in regard to this, so I'm spending some time looking into it.

zachaller commented 11 months ago

Do any of you use notifications within your Rollout specs? Trying to see if there is a correlation with notifications updating the ReplicaSet spec.

DanTulovsky commented 11 months ago

yes.


bpoland commented 11 months ago

We don't use notifications currently.

mclarke47 commented 11 months ago

No notification use here

int128 commented 11 months ago

I have faced similar problems in our cluster.

1. Rollout is stuck during a canary update

When a Rollout is updated, both the old and new ReplicaSets are running and then the Rollout gets stuck. I could see the following message in the status of the Rollout:

old replicas are pending termination

Here is a snippet of kubectl get pods; hash 676f9f555d is new, and 7b7cdd9847 is old.

worker-676f9f555d-25xd9                      1/1     Running     0          3h22m
worker-676f9f555d-dkmlt                      1/1     Running     0          3h22m
worker-7b7cdd9847-d4xl6                      1/1     Running     0          5h8m

I deleted the old ReplicaSet and then Rollout status became Healthy.

2. Rollout status becomes Degraded even if pods are running

When a Rollout is updated, it goes into a Degraded status even though all new pods are running. I could see the following message in the status of the Rollout:

ProgressDeadlineExceeded: ReplicaSet "poller-64d95bc44b" has timed out progressing.

I could refresh the status of the Rollout by restarting the argo-rollouts controller.

DanTulovsky commented 10 months ago

Thank you for fixing this (hopefully for good)! What is the eta for when this might make it into a release?

zachaller commented 10 months ago

Just released it, can you try it out?

It will still log the conflict but rollouts should not get stuck anymore.

zachaller commented 10 months ago

I also probably found the root cause of the conflicts, just not sure how to deal with it yet. They also should not cause any issues, because they do get retried and we have had that code for a while now: https://github.com/argoproj/argo-rollouts/issues/3218

int128 commented 10 months ago

I have updated argo-rollouts to v1.6.3 and this problem seems resolved. Thank you very much for the quick fix.

bpoland commented 10 months ago

We've also upgraded and haven't seen the issue again since. Thanks!

DanTulovsky commented 10 months ago

Hi. We are still seeing this in 1.6.4. In this case, a new rollout was triggered and showed up in the UI, but did not start rolling out. I clicked the promote button and then it went ahead.

(two screenshots attached)

NaurisSadovskis commented 10 months ago

hey folks, we're also seeing this on the latest Argo Rollouts version (the unreleased 1.7.x).

In our case, we have a process which annotates (and labels) Rollout objects on each application deploy and we suspect the issue happens when:

  1. Underlying Deployment is modified (triggering a new rollout) and a new ReplicaSet is created with the same labels/annotations as the original Rollout (foo: bar)
  2. Rollout progresses
  3. The original Rollout object is modified with a new label (foo: foobar) and Deployment is modified (again, triggering a new rollout)
  4. This triggers the errors we've seen above.

In some cases, the Argo Rollouts controller seems to lock up and stop reporting any data for that particular Rollout (usually one out of the 15 we're running at a time), and the solution is to restart the controller to force it to reconcile the Rollout to the latest version.

zachaller commented 10 months ago

@NaurisSadovskis Did you also see the issue on 1.5? The logs would be different and would not include the error, but could you check whether rollouts got stuck there as well?

int128 commented 10 months ago

I updated the controller to v1.6.4 and this problem occurs again. As a workaround, we run a CronJob to restart the controller every day.

zachaller commented 10 months ago

@NaurisSadovskis would you be able to test a version with this patch: https://github.com/argoproj/argo-rollouts/pull/3272

NaurisSadovskis commented 9 months ago

@zachaller Updated, and the problem persists. More specifically, the controller is active, but it gets stuck rolling out the new ReplicaSet. @int128's workaround of restarting the controller resolves it again.

ajhodgson commented 7 months ago

Just experienced this on 1.6.6. It definitely seems related to HPA; the ReplicaSet was scaled up at the time.

edit: argocd 2.10.2, on EKS 1.28. Restarting the controller fixed it.

mclarke47 commented 7 months ago

Does it make sense to reopen this?

dodwmd commented 7 months ago

We're also seeing this issue using the latest release.

Danny5487401 commented 7 months ago

Rollouts v1.6.6 still has the same issue. (screenshot attached)

Later it times out progressing.

marcusio888 commented 7 months ago

Good morning. This issue should be reopened, since it continues to happen with the new version of Argo Rollouts, v1.6.6. In our case, we have to restart Argo Rollouts from time to time for it to recover. Logs attached.

(screenshot attached, 2024-03-18 09:52)

This error happens both with and without HPA.

bpoland commented 7 months ago

@zachaller can we reopen this issue? We are also continuing to hit it

omer2500 commented 5 months ago

We are facing the same issue on v1.6.6

DimKanoute commented 5 months ago

Hey, coming here to say that we face the exact same issue.

When HPA scales a Rollout, it comes into conflict with the argo-rollouts controller.
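Worth noting that HPA normally scales its target through the scale subresource (the Rollout here) rather than writing the ReplicaSet directly, so the writer that conflicts on the ReplicaSet is not necessarily HPA itself. Whatever the second writer is, the error itself is ordinary optimistic-concurrency behaviour. A hedged sketch of the race, with placeholder namespace and ReplicaSet names:

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	rsClient := client.AppsV1().ReplicaSets("my-namespace") // placeholder namespace

	// Two writers start from the same copy (and thus the same resourceVersion).
	first, err := rsClient.Get(ctx, "my-app-5767f75c6d", metav1.GetOptions{}) // placeholder name
	if err != nil {
		panic(err)
	}
	second := first.DeepCopy()

	// Writer A updates first; the API server bumps the object's resourceVersion.
	up := *first.Spec.Replicas + 1
	first.Spec.Replicas = &up
	if _, err := rsClient.Update(ctx, first, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Writer B still holds the old resourceVersion, so its update is rejected with
	// the exact error seen in the controller logs.
	down := *second.Spec.Replicas - 1
	second.Spec.Replicas = &down
	if _, err := rsClient.Update(ctx, second, metav1.UpdateOptions{}); err != nil {
		fmt.Println("conflict:", apierrors.IsConflict(err), err)
	}
}

A healthy controller simply retries the losing write against the latest copy; the problem described in this thread is that the retry loop sometimes appears to stall until the controller is restarted.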

bpoland commented 5 months ago

I think this is the same issue so maybe we should keep posting over there: https://github.com/argoproj/argo-rollouts/issues/3316

int128 commented 4 months ago

Unfortunately, we still need to restart the controllers of our clusters every hour.

Here is an example of CronJob.

# Restart argo-rollouts every hour to avoid the following error:
# https://github.com/argoproj/argo-rollouts/issues/3080#issuecomment-1835809731
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argo-rollouts-periodically-restart
spec:
  schedule: "0 * * * *" # every hour
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 # 3 days
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: argo-rollouts-periodically-restart
          containers:
            - name: kubectl-rollout-restart
              image: public.ecr.aws/bitnami/kubectl:1.29.1
              command:
                - kubectl
                - --v=3
                - --namespace=argo-rollouts
                - rollout
                - restart
                - deployment/argo-rollouts
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - "ALL"
              resources:
                limits:
                  memory: 64Mi
                requests:
                  cpu: 10m
                  memory: 64Mi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-rollouts-periodically-restart
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-rollouts-periodically-restart
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-rollouts-periodically-restart
subjects:
  - kind: ServiceAccount
    name: argo-rollouts-periodically-restart
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-rollouts-periodically-restart
rules:
  # https://stackoverflow.com/a/68980720
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["argo-rollouts"]
    verbs: ["get", "patch"]
tasdikrahman commented 3 months ago

We see the same issue on 1.6.6, with the Rollout using HPA. Restarting the argo-rollouts pods made the issue go away for the application. Logs from the rollouts controller:

time="2024-07-10T11:08:14Z" level=info msg="Started syncing rollout" generation=25660 namespace=my-application-namespace resourceVersion=4042592220 rollout=my-application
time="2024-07-10T11:08:14Z" level=info msg="No TrafficRouting Reconcilers found" namespace=my-application-namespace rollout=my-application
time="2024-07-10T11:08:14Z" level=error msg="roCtx.reconcile err failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset my-application-5767f75c6d: Operation cannot be fulfilled on replicasets.apps \"my-application-5767f75c6d\": the object has been modified; please apply your changes to the latest version and try again" generation=25660 namespace=my-application-namespace resourceVersion=4042592220 rollout=my-application
time="2024-07-10T11:08:14Z" level=info msg="Reconciliation completed" generation=25660 namespace=my-application-namespace resourceVersion=4042592220 rollout=my-application time_ms=16.470487
time="2024-07-10T11:08:14Z" level=error msg="rollout syncHandler error: failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset my-application-5767f75c6d: Operation cannot be fulfilled on replicasets.apps \"my-application-5767f75c6d\": the object has been modified; please apply your changes to the latest version and try again" namespace=my-application-namespace rollout=my-application
time="2024-07-10T11:08:14Z" level=info msg="rollout syncHandler queue retries: 672 : key \"my-application-namespace/my-application\"" namespace=my-application-namespace rollout=my-application
time="2024-07-10T11:08:14Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset my-application-5767f75c6d: Operation cannot be fulfilled on replicasets.apps \"my-application-5767f75c6d\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
zachaller commented 3 months ago

Rollouts v1.7.1 should fix this. If people still see this on v1.7.x, please report it.

marcusio888 commented 3 months ago

Good afternoon. I am testing it with 1.7.1; at the moment there are few retries, and I have not yet had to restart the Argo Rollouts pods.

time="2024-07-10T15:12:44Z" level=info msg="rollout syncHandler queue retries: 2 : key \"my-namespace/my-app\"" namespace=my-namespace rollout=my-app