
serverSideDiff error when removing block from spec.containers[] in Helm chart #20792

Open thecosmicfrog opened 2 days ago

thecosmicfrog commented 2 days ago


Describe the bug

I am seeing an error in Argo CD when upgrading a Helm chart from one version to another. The only difference between the Helm chart versions is that the new version removes the resources block from spec.template.spec.containers[0] in the Deployment object. I have noticed that removing other blocks (e.g. env) results in the same issue (so it is not just a resources problem).
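Concretely, the only diff between the two chart versions is the removal of this block from the first container (values as shown in the manifests further down):

resources:
  requests:
    cpu: 100m
    memory: 32Mi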

The specific error is:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)




To Reproduce

I have built a Helm chart to reproduce this, with two versions (0.0.1 and 0.0.2). Upgrading from one version to the other will not work, as you will see.
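For reference, the Application manifest I apply looks roughly like this (the repoURL is illustrative; the compare-options annotation is likewise an assumption, since server-side diff can instead be enabled controller-wide):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-app
  namespace: argocd
  annotations:
    # assumption: server-side diff enabled per-app; it may be enabled
    # controller-wide instead
    argocd.argoproj.io/compare-options: ServerSideDiff=true
spec:
  project: default
  source:
    repoURL: https://thecosmicfrog.github.io/helm-charts  # illustrative URL
    chart: test-app
    targetRevision: 0.0.1  # bump to 0.0.2 to trigger the error
  destination:
    server: https://kubernetes.default.svc
    namespace: sandbox-aaron
  syncPolicy:
    automated: {}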


Expected behavior

The new chart version should install without error, as it is such a straightforward change (the resources block removed from containers[0]).


Actual behavior

Sync Status enters an Unknown state with the new chart version, and App Conditions displays 1 Error. That error is:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Seemingly, the only way to complete the update is to sync manually, which works without issue. We have Auto Sync enabled, so I'm not sure why that does not resolve the issue.


Version

2024/11/13 14:23:55 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
argocd: v2.13.0+347f221
  BuildDate: 2024-11-04T15:31:13Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.2
  Compiler: gc
  Platform: darwin/arm64
WARN[0001] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web.
argocd-server: v2.13.0+347f221
  BuildDate: 2024-11-04T12:09:06Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.1
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.4.3 2024-07-19T16:40:33Z
  Helm Version: v3.15.4+gfa9efb0
  Kubectl Version: v0.31.0
  Jsonnet Version: v0.20.0


Logs

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Let me know if the above is enough information to reproduce the issue.

Thanks for your time - Aaron

andrii-korotkov-verkada commented 1 day ago

Can you share the Helm charts, please? In particular, the container spec.

thecosmicfrog commented 1 day ago

@andrii-korotkov-verkada Yes, of course. You can find the chart here: https://github.com/thecosmicfrog/helm-charts/tree/main/charts/test-app

There are also corresponding Git tags for 0.0.1 and 0.0.2.

andrii-korotkov-verkada commented 1 day ago

When this happens, can you look at the current vs. desired manifests, please? It's probably a bug in removing webhook fields for comparison (unless you actually have a webhook in the cluster that does this), but worth a shot.

thecosmicfrog commented 1 day ago

@andrii-korotkov-verkada The diff seems to be in a bit of a "confused" state from what I can see. See screenshots below.

Screenshot 2024-11-14 at 14 49 02
Screenshot 2024-11-14 at 14 49 32
Screenshot 2024-11-14 at 14 46 44

I hope that helps. Let me know what additional information I can provide.

andrii-korotkov-verkada commented 1 day ago

Can you copy-paste the whole manifests, please? Sorry for bugging you; I'm looking for anything out of line and need the full manifests to check for that.

thecosmicfrog commented 1 day ago

@andrii-korotkov-verkada No problem at all! Here are the Deployment manifests.

Live Manifest (with managed fields shown):

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  creationTimestamp: '2024-11-13T17:14:51Z'
  generation: 3
  labels:
    app.kubernetes.io/instance: test-app
    argocd.argoproj.io/instance: test-app
  managedFields:
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            f:app.kubernetes.io/instance: {}
            f:argocd.argoproj.io/instance: {}
        f:spec:
          f:minReadySeconds: {}
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:template:
            f:metadata:
              f:labels:
                f:app.kubernetes.io/instance: {}
            f:spec:
              f:containers:
                k:{"name":"http-echo"}:
                  .: {}
                  f:env:
                    k:{"name":"PORT"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"VERSION"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:livenessProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
                  f:name: {}
                  f:ports:
                    k:{"containerPort":5678,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:readinessProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
                  f:resources:
                    f:requests:
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    f:allowPrivilegeEscalation: {}
                    f:capabilities:
                      f:drop: {}
                    f:privileged: {}
                  f:startupProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
              f:hostIPC: {}
              f:hostNetwork: {}
              f:hostPID: {}
              f:securityContext:
                f:fsGroup: {}
                f:runAsGroup: {}
                f:runAsNonRoot: {}
                f:runAsUser: {}
                f:seccompProfile:
                  f:type: {}
              f:terminationGracePeriodSeconds: {}
      manager: argocd-controller
      operation: Apply
      time: '2024-11-14T14:43:51Z'
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:availableReplicas: {}
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
      manager: kube-controller-manager
      operation: Update
      subresource: status
      time: '2024-11-14T14:44:34Z'
  name: test-app
  namespace: sandbox-aaron
  resourceVersion: '200923886'
  uid: 7b7349f2-9000-4fe3-a443-9eb4e1a1a659
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - env:
            - name: VERSION
              value: 0.0.1
            - name: PORT
              value: '5678'
          image: hashicorp/http-echo
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: http-echo
          ports:
            - containerPort: 5678
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: 100m
              memory: 32Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
          startupProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 80
status:
  availableReplicas: 2
  conditions:
    - lastTransitionTime: '2024-11-14T06:03:11Z'
      lastUpdateTime: '2024-11-14T06:03:11Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-11-13T17:14:51Z'
      lastUpdateTime: '2024-11-14T14:44:34Z'
      message: ReplicaSet "test-app-74dfc69c76" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2



Desired Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: test-app
    argocd.argoproj.io/instance: test-app
  name: test-app
  namespace: sandbox-aaron
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - env:
            - name: VERSION
              value: 0.0.2
            - name: PORT
              value: '5678'
          image: hashicorp/http-echo
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /
              port: 5678
          name: http-echo
          ports:
            - containerPort: 5678
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: 5678
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
          startupProbe:
            httpGet:
              path: /
              port: 5678
      hostIPC: false
      hostNetwork: false
      hostPID: false
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 80

Let me know if you'd like me to re-post with the managed fields hidden, or anything else. Thanks!

andrii-korotkov-verkada commented 1 day ago

Hm, I don't see anything obviously wrong at the moment.

thecosmicfrog commented 1 day ago

> Hm, I don't see anything obviously wrong at the moment.

Indeed. Notably, setting IncludeMutationWebhook=true in the SyncOptions appears to resolve the issue, but that shouldn't be necessary for such a simple change (removing a block from containers[]), which is why I'm hesitant to set that flag.
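If I recall the diff-strategies docs correctly, this is set via the compare-options annotation on the Application, roughly like so:

metadata:
  annotations:
    # enable server-side diff and include mutation webhook effects in it
    argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true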

Are you able to reproduce on your side? I believe the instructions and charts I provided should be enough to do so, but please advise if I can provide anything else.

andrii-korotkov-verkada commented 1 day ago

One more thing - do you have mutating webhooks set up in the cluster?

thecosmicfrog commented 1 day ago

We have three MutatingWebhooks:

As I understand it, all are part of Amazon EKS and the AWS Load Balancer Controller.

thecosmicfrog commented 56 minutes ago

Hi @andrii-korotkov-verkada. I have some additional information which should help you in finding the root cause of this.

I use Argo Rollouts for most of my applications (so I use Rollout objects instead of Deployment). But the original error was triggered by an app using a Deployment, so that is what I used in my reproduction Helm charts.

Out of curiosity, I decided to see if the same error would trigger when using a Rollout. I figured it would, since Rollout is mostly a drop-in replacement for Deployment, but to my surprise, it works without issue!

Please see my latest chart versions:

See the code for 0.3.1 and 0.3.2 here. I had to add two very basic Service objects, since Argo Rollouts requires them, but you can likely just ignore them. A rough sketch of the Rollout follows.
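The Rollout itself is essentially the Deployment from above with a different apiVersion and kind plus a rollout strategy. Roughly (the blue-green strategy and the Service names here are assumptions; see the chart for the real spec):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  strategy:
    # assumption: blue-green, which is why the two Service objects are needed
    blueGreen:
      activeService: test-app-active   # hypothetical Service name
      previewService: test-app-preview # hypothetical Service name
  template:
    # same pod template as in the Deployment manifests above
    metadata:
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - name: http-echo
          image: hashicorp/http-echo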

The chart artifacts are built and uploaded as before, so you can simply update spec.source.targetRevision in the application.yaml file I provided in the original post and run kubectl apply.

I hope this helps.

Thanks - Aaron