argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Incorrect OutOfSync status after updating image and diff showing "containers: null" #16799

Open FredrikAugust opened 5 months ago

FredrikAugust commented 5 months ago


Describe the bug

We update a values.yaml file in a GitHub Action which then is provided to a Helm chart to set the image field of a Rollout from Argo Rollouts. We then run a sync operation on the Application (which stems from an ApplicationSet) which updates the running ReplicaSet to deploy with the new version.
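For context, the shape of the setup is roughly the following (file layout, image names, and field names are illustrative, not our exact chart):

```yaml
# values.yaml -- the field the GitHub Action bumps on each build (placeholder names)
image:
  repository: ghcr.io/example/app
  tag: "v1.2.3"

# templates/rollout.yaml -- the Helm template that consumes it:
#   apiVersion: argoproj.io/v1alpha1
#   kind: Rollout
#   spec:
#     template:
#       spec:
#         containers:
#           - name: app
#             image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```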

After this sync is done, the application is still reported as OutOfSync, and by clicking the diff we can see that Argo CD reports that spec.containers should be set to null.

(Screenshot from 2024-01-09: the app diff reporting containers: null.)

It reports the live manifest as having containers: null and the desired manifest as missing the containers key entirely. However, clicking the actual Rollout in Argo CD and inspecting its Live and Desired manifests shows that both are correct, so frankly I can't tell where Argo CD got these values from.

Running sync once more makes the diff go away, but it comes back after the next image update.

In addition to this, Argo CD fails to mark other Applications from the same ApplicationSet as OutOfSync when they are, to a human eye, clearly out of sync.

I've manually updated the values file to image tag y, while tag x is currently deployed, i.e. live=x and desired=y. I can verify this (again) by looking at the Desired and Live tabs on the Rollout. The diff tab, however, shows no differences. Neither a hard refresh nor a normal refresh helps.

I'm reporting these two seemingly separate bugs as one because they appear to share a common cause: a "misaligned state", for lack of a better term.

Clicking "sync" again syncs the application correctly to the actual desired version, but puts us back in the containers: null state.

Sometimes it also appears to get stuck on "Refreshing", but clicking refresh manually puts it back in sync. This seems to affect all applications.

To Reproduce

I'm unsure how to reproduce it, as it only recently started occurring, but we're using a Git generator ApplicationSet to generate a set of fairly common applications. We're not using ignoreDifferences or RespectIgnoreDifferences (although we previously had those and ApplyOutOfSyncOnly enabled).
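A minimal sketch of that kind of ApplicationSet, with placeholder repo URL and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps
spec:
  generators:
    - git:
        repoURL: https://github.com/example/deployments.git  # placeholder
        revision: HEAD
        directories:
          - path: apps/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deployments.git  # placeholder
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      # note: no ignoreDifferences, RespectIgnoreDifferences, or ApplyOutOfSyncOnly
```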

Expected behavior

Argo CD should correctly report applications as in or out of sync, and the diff should be consistent with the desired/live manifests.

This is currently breaking our CD pipeline, so if there's anything I can do to assist, let me know.

Version

argocd: v2.9.3+6eba5be
helm:   v3.13.2+g2a2fb3b

Logs

"Refreshing app status (controller refresh requested), level (0)" application=argocd/kvist
"Refreshing app status (normal refresh requested), level (3)" application=argocd/init-notes
"Comparing app state (cluster: https://kubernetes.default.svc, namespace: init)" application=argocd/init-notes
"No status changes. Skipping patch" application=argocd/kvist
"Reconciliation completed" application=argocd/kvist dest-name= dest-namespace=argocd dest-server="https://kubernetes.default.svc" fields.level=0 patch_ms=0 set
"getRepoObjs stats" application=argocd/init-notes build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=314 unmarshal_ms=314 version_ms=0
"Updated health status: Progressing -> Healthy" application=init-notes dest-namespace=init dest-server="https://kubernetes.default.svc" reason=ResourceUpdated
"Refreshing app status (controller refresh requested), level (0)" application=argocd/kvist
"Update successful" application=argocd/init-notes
"Reconciliation completed" application=argocd/init-notes dedup_ms=0 dest-name= dest-namespace=init dest-server="https://kubernetes.default.svc" diff_ms=14 fiel
"No status changes. Skipping patch" application=argocd/kvist
"Reconciliation completed" application=argocd/kvist dest-name= dest-namespace=argocd dest-server="https://kubernetes.default.svc" fields.level=0 patch_ms=0 set
FredrikAugust commented 5 months ago

I just saw this issue over at argo-rollouts

https://github.com/argoproj/argo-rollouts/issues/3281

Reading

In some scenarios rollouts-controller receives ownership of spec.template.spec.containers field and blocks other components from updating the rollout with server-side apply

We're using Argo Rollouts, and it appears that this might be the case. I see this block under managedFields:

  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:rollout.argoproj.io/revision: {}
      f:spec:
        f:template:
          f:spec:
            f:containers: {}
    manager: rollouts-controller
    operation: Update
    time: "2024-01-10T09:26:56Z"

I've uploaded the entire rollout to help debug: https://gist.github.com/FredrikAugust/fe1bafb02e78f0cd8d309c460679fe85

FredrikAugust commented 5 months ago

I just set up an entirely new cluster from scratch using

argo cd       : v2.9.5+f943664
argo rollouts : v1.6.4+a312af9 

And the issue still persists, so it shouldn't be related to stale state or the like.

FredrikAugust commented 4 months ago

Is there anyone who could assist me in debugging this, or point me to where the problem might be? It's still happening on all our clusters.

diranged commented 4 months ago

We're seeing this exact same issue on new clusters as well....

diranged commented 4 months ago

By chance, who here is running Argo CD in a remote-cluster model vs. a local in-cluster deployment?

FredrikAugust commented 4 months ago

In-cluster here.

diranged commented 4 months ago

@FredrikAugust Interesting... we're seeing the problem primarily on remote-cluster setups... we have been migrating to a remote-cluster model over the last few weeks, and that's when we saw this creep up.

FredrikAugust commented 3 months ago

After syncing twice, the live tab of the Rollout shows the correct values. In case that is of help.

diranged commented 3 months ago

The only "fix" we have right now is to turn on selfHeal - which at least just resyncs as soon as Rollouts wipes out the field.. but I really hope we can get some traction on this issue at some point. :/
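For reference, that workaround is just the automated sync policy on the Application spec (a sketch; tune prune to your needs):

```yaml
syncPolicy:
  automated:
    selfHeal: true   # re-syncs automatically when live state drifts from Git,
                     # e.g. when the rollouts-controller wipes the containers field
    prune: false
```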

FredrikAugust commented 3 months ago

> The only "fix" we have right now is to turn on selfHeal - which at least just resyncs as soon as Rollouts wipes out the field.. but I really hope we can get some traction on this issue at some point. :/

Do you know if this is a problem caused by Argo Rollouts?

taer commented 4 days ago

We're hitting this pretty consistently now.

Remote Argo CD, Kustomize image changes. Argo CD Helm chart version 7.1.1; Argo Workflows Helm chart version 2.35.3.
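(For context, the image change is a plain Kustomize images override, roughly like this; image name and tag are placeholders:)

```yaml
# kustomization.yaml -- illustrative image override
images:
  - name: ghcr.io/example/app
    newTag: "abc123"
```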

taer commented 1 minute ago

Actually, this is a bit nastier than just an incorrect OutOfSync status.

I just pushed a manifest adding a requests block under the resources of my container, but the diff was stuck in the containers: null state described above (screenshot omitted).

The real diff is

@@ -99,8 +98,10 @@
           timeoutSeconds: 1
         resources:
           limits:
-            cpu: "2"
+            cpu: 2
             memory: 1G
+          requests:
+            cpu: 200m
       enableServiceLinks: false
       nodeSelector:
         kubernetes.io/arch: arm64

but that's hidden behind this bad sync.

Why is the diff ignoring the containers block? Is it caused by the managed-fields handling? I do have this on my Application so that Argo CD and Rollouts don't fight during the canary release:

ignoreDifferences:
  - group: '*'
    kind: '*'
    managedFieldsManagers:
      - rollouts-controller

Is that too broad?
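For what it's worth, the same exclusion can be scoped to just Rollouts rather than every group and kind; a narrower sketch:

```yaml
ignoreDifferences:
  - group: argoproj.io
    kind: Rollout
    managedFieldsManagers:
      - rollouts-controller
```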