
serverSideDiff error when removing block from spec.containers[] in Helm chart #20792

Open thecosmicfrog opened 2 days ago

thecosmicfrog commented 2 days ago


Describe the bug

I am seeing an error in Argo CD when upgrading a Helm chart from one version to another. The only difference between the Helm chart versions is that the new version removes the resources block from spec.template.spec.containers[0] in the Deployment object. I have noticed that removing other blocks (e.g. env) results in the same issue (so it is not just a resources problem).
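Concretely, the only diff between the two chart versions is the removal of this block from the first container (values as shown in the manifests further down):

resources:
  requests:
    cpu: 100m
    memory: 32Mi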

The specific error is:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)




To Reproduce

I have built a Helm chart to reproduce this, with two versions (0.0.1 and 0.0.2). Upgrading from one version to the other will not work, as you will see.
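For reference, the Application manifest I apply looks roughly like this (the repoURL is illustrative; the compare-options annotation is likewise an assumption, since server-side diff can instead be enabled controller-wide):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-app
  namespace: argocd
  annotations:
    # assumption: server-side diff enabled per-app; it may be enabled
    # controller-wide instead
    argocd.argoproj.io/compare-options: ServerSideDiff=true
spec:
  project: default
  source:
    repoURL: https://thecosmicfrog.github.io/helm-charts  # illustrative URL
    chart: test-app
    targetRevision: 0.0.1  # bump to 0.0.2 to trigger the error
  destination:
    server: https://kubernetes.default.svc
    namespace: sandbox-aaron
  syncPolicy:
    automated: {}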


Expected behavior

The new chart version should install without error, as it is such a straightforward change (the resources block removed from containers[0]).


Actual behavior

Sync Status enters an Unknown state with the new chart version, and App Conditions displays 1 Error. That error is:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Seemingly, the only way to complete the update is to sync manually, which works without issue. We have Auto Sync enabled, so I'm not sure why that does not resolve the issue.


Version

2024/11/13 14:23:55 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
argocd: v2.13.0+347f221
  BuildDate: 2024-11-04T15:31:13Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.2
  Compiler: gc
  Platform: darwin/arm64
WARN[0001] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web.
argocd-server: v2.13.0+347f221
  BuildDate: 2024-11-04T12:09:06Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.1
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.4.3 2024-07-19T16:40:33Z
  Helm Version: v3.15.4+gfa9efb0
  Kubectl Version: v0.31.0
  Jsonnet Version: v0.20.0


Logs

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Let me know if the above is enough information to reproduce the issue.

Thanks for your time - Aaron

andrii-korotkov-verkada commented 1 day ago

Can you share the Helm charts, please? In particular, the container spec.

thecosmicfrog commented 1 day ago

@andrii-korotkov-verkada Yes, of course. You can find the chart here: https://github.com/thecosmicfrog/helm-charts/tree/main/charts/test-app

There are also corresponding Git tags for 0.0.1 and 0.0.2.

andrii-korotkov-verkada commented 1 day ago

When this happens, can you look at the current vs. desired manifests, please? It's probably a bug in removing webhook fields for comparison (unless you actually have a webhook in the cluster that does this), but worth a shot.

thecosmicfrog commented 1 day ago

@andrii-korotkov-verkada The diff seems to be in a bit of a "confused" state from what I can see. See screenshots below.

Screenshot 2024-11-14 at 14 49 02
Screenshot 2024-11-14 at 14 49 32
Screenshot 2024-11-14 at 14 46 44

I hope that helps. Let me know what additional information I can provide.

andrii-korotkov-verkada commented 1 day ago

Can you copy-paste the whole manifests, please? Sorry for bugging you; I'm looking for anything out of line and need the full manifests to check for that.

thecosmicfrog commented 1 day ago

@andrii-korotkov-verkada No problem at all! Here are the Deployment manifests.

Live Manifest (with managed fields shown):

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  creationTimestamp: '2024-11-13T17:14:51Z'
  generation: 3
  labels:
    app.kubernetes.io/instance: test-app
    argocd.argoproj.io/instance: test-app
  managedFields:
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            f:app.kubernetes.io/instance: {}
            f:argocd.argoproj.io/instance: {}
        f:spec:
          f:minReadySeconds: {}
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:template:
            f:metadata:
              f:labels:
                f:app.kubernetes.io/instance: {}
            f:spec:
              f:containers:
                k:{"name":"http-echo"}:
                  .: {}
                  f:env:
                    k:{"name":"PORT"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"VERSION"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:livenessProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
                  f:name: {}
                  f:ports:
                    k:{"containerPort":5678,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:readinessProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
                  f:resources:
                    f:requests:
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    f:allowPrivilegeEscalation: {}
                    f:capabilities:
                      f:drop: {}
                    f:privileged: {}
                  f:startupProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
              f:hostIPC: {}
              f:hostNetwork: {}
              f:hostPID: {}
              f:securityContext:
                f:fsGroup: {}
                f:runAsGroup: {}
                f:runAsNonRoot: {}
                f:runAsUser: {}
                f:seccompProfile:
                  f:type: {}
              f:terminationGracePeriodSeconds: {}
      manager: argocd-controller
      operation: Apply
      time: '2024-11-14T14:43:51Z'
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:availableReplicas: {}
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
      manager: kube-controller-manager
      operation: Update
      subresource: status
      time: '2024-11-14T14:44:34Z'
  name: test-app
  namespace: sandbox-aaron
  resourceVersion: '200923886'
  uid: 7b7349f2-9000-4fe3-a443-9eb4e1a1a659
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - env:
            - name: VERSION
              value: 0.0.1
            - name: PORT
              value: '5678'
          image: hashicorp/http-echo
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: http-echo
          ports:
            - containerPort: 5678
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: 100m
              memory: 32Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
          startupProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 80
status:
  availableReplicas: 2
  conditions:
    - lastTransitionTime: '2024-11-14T06:03:11Z'
      lastUpdateTime: '2024-11-14T06:03:11Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-11-13T17:14:51Z'
      lastUpdateTime: '2024-11-14T14:44:34Z'
      message: ReplicaSet "test-app-74dfc69c76" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2



Desired Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: test-app
    argocd.argoproj.io/instance: test-app
  name: test-app
  namespace: sandbox-aaron
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - env:
            - name: VERSION
              value: 0.0.2
            - name: PORT
              value: '5678'
          image: hashicorp/http-echo
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /
              port: 5678
          name: http-echo
          ports:
            - containerPort: 5678
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: 5678
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
          startupProbe:
            httpGet:
              path: /
              port: 5678
      hostIPC: false
      hostNetwork: false
      hostPID: false
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 80

Let me know if you'd like me to re-post with the managed fields hidden, or anything else. Thanks!

andrii-korotkov-verkada commented 1 day ago

Hm, I don't see anything obviously wrong at the moment.

thecosmicfrog commented 1 day ago

> Hm, I don't see anything obviously wrong at the moment.

Indeed. Notably, setting IncludeMutationWebhook=true in the SyncOptions appears to resolve the issue, but that shouldn't be necessary for such a simple change (removing a block from containers[]), which is why I'm hesitant to set that flag.
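If I recall the diff-strategies docs correctly, this is set via the compare-options annotation on the Application, roughly like so:

metadata:
  annotations:
    # enable server-side diff and include mutation webhook effects in it
    argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true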

Are you able to reproduce on your side? I believe the instructions and charts I provided should be enough to do so, but please advise if I can provide anything else.

andrii-korotkov-verkada commented 1 day ago

One more thing - do you have mutating webhooks set up in the cluster?

thecosmicfrog commented 1 day ago

We have three MutatingWebhooks:

As I understand it, all are part of Amazon EKS and the AWS Load Balancer Controller.

thecosmicfrog commented 56 minutes ago

Hi @andrii-korotkov-verkada. I have some additional information which should help you in finding the root cause of this.

I use Argo Rollouts for most of my applications (so I use Rollout objects instead of Deployment). But the original error was triggered by an app using a Deployment, so that is what I used in my reproduction Helm charts.

Out of curiosity, I decided to see if the same error would trigger when using a Rollout. I figured it would, since Rollout is mostly a drop-in replacement for Deployment, but to my surprise, it works without issue!

Please see my latest chart versions:

See the code for 0.3.1 and 0.3.2 here. I had to add two very basic Service objects, since Argo Rollouts requires them, but you can likely just ignore them. A rough sketch of the Rollout follows.
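The Rollout itself is essentially the Deployment from above with a different apiVersion and kind plus a rollout strategy. Roughly (the blue-green strategy and the Service names here are assumptions; see the chart for the real spec):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  strategy:
    # assumption: blue-green, which is why the two Service objects are needed
    blueGreen:
      activeService: test-app-active   # hypothetical Service name
      previewService: test-app-preview # hypothetical Service name
  template:
    # same pod template as in the Deployment manifests above
    metadata:
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - name: http-echo
          image: hashicorp/http-echo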

The chart artifacts are built and uploaded as before, so you can simply update spec.source.targetRevision in the application.yaml file I provided in the original post and run kubectl apply.

I hope this helps.

Thanks - Aaron