argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Server Side diff failing for fluent-bit #17568

Open andrewjamesbrown opened 7 months ago

andrewjamesbrown commented 7 months ago

Describe the bug

One of our ArgoCD instances is showing the following error when upgrading fluent-bit 0.30.4 -> 0.44.0:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource DaemonSet/fluent-bit: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

To Reproduce

We use a Kyverno policy to rewrite images so they are pulled from a local ECR cache instead of directly from Docker Hub. When upgrading the fluent-bit Helm chart from 0.30.4 to 0.44.0, we get the error above. We are using ArgoCD v2.10.3+0fd6344.

Version

v2.10.3+0fd6344

bryanhorstmann commented 7 months ago

Just ran into a similar issue with kube-prometheus-stack. I had to set controller.diff.server.side: "false" in order to unblock myself. ArgoCD server: v2.10.0+2175939

I ran into this after deleting a list from my values file.
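
For anyone looking for where this setting lives: a minimal sketch, assuming Argo CD runs in the `argocd` namespace and that the key is merged into the existing `argocd-cmd-params-cm` ConfigMap. The namespace is an assumption; the key and value are the ones quoted above.

```yaml
# Sketch only: merge this key into the existing ConfigMap rather than replacing it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd   # assumption: default install namespace
data:
  # "false" makes the controller fall back to the client-side diff
  controller.diff.server.side: "false"
```

The application controller usually has to be restarted before it picks up a change to this ConfigMap.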

0xDones commented 7 months ago

I'm having the same issue.

I had to set controller.diff.server.side: "true" to fix another error I was getting with SyncOptions.ServerSideApply=true on my applications, but now I'm getting this new error.

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: 
serverSideDiff error: error removing non config mutations for resource StatefulSet/loki-backend: 
error reverting webhook removed fields in predicted live resource: errors: 
.spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value) 
.spec.template.spec.containers: element 1: associative list with keys has an element that omits key field "name" (and doesn't have default value)
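
For context, the per-application option mentioned above is normally set under `syncPolicy.syncOptions`. The sketch below shows only that placement; the application name, repository URL, chart version and namespaces are placeholders, not values from this report.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki                 # hypothetical name
  namespace: argocd          # assumption: default install namespace
spec:
  project: default
  source:
    repoURL: https://example.com/charts   # placeholder
    chart: loki                            # placeholder
    targetRevision: 1.2.3                  # placeholder
  destination:
    server: https://kubernetes.default.svc
    namespace: logging       # placeholder
  syncPolicy:
    syncOptions:
      - ServerSideApply=true   # server-side apply for this application
```
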
algo7 commented 6 months ago

Same issue with

{
    "Version": "v2.10.5+335875d",
    "BuildDate": "2024-03-28T15:02:45Z",
    "GitCommit": "335875d13e018bed6e03873f4742582582964745",
    "GitTreeState": "clean",
    "GoVersion": "go1.21.3",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v5.2.1 2023-10-19T20:13:51Z",
    "HelmVersion": "v3.14.3+gf03cc04",
    "KubectlVersion": "v0.26.11",
    "JsonnetVersion": "v0.20.0"
}

and csi-driver-nfs-v4.6.0

ptr1120 commented 6 months ago

I am now having the same issue with our custom deployment and server-side diff activated, after updating ArgoCD v2.10.4 -> v2.10.6:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource StatefulSet/xxx: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

This happens even though my container 0 has a name.

How can I fix that?

atlasoft commented 6 months ago

Just ran into this and checking "apply only" allowed me to sync the application again.

Skaronator commented 5 months ago

Just ran into this as well with the loki-distributed, mimir-distributed and grafana-agent Helm charts. We had server-side apply and server-side diff enabled for a few weeks, and now it just broke. I only modified CPU/memory resource limits/requests. Nothing else changed.

The "apply only" mentioned above didn't help either.

Running the latest version, 2.11.0.

Skaronator commented 5 months ago

I found a workaround for my issue. I deleted the affected StatefulSet/Deployment with the Orphan option, which means it doesn't delete any pods. Then I ran an ArgoCD sync again, which re-created the StatefulSet/Deployment resources.

Skaronator commented 5 months ago

Looks like my issue is more related to ignoreDifferences not working with ServerSideDiff. There is already an open issue: #17362

Edit: never mind, removing all ignoreDifferences didn't fix it.

Edit 2: We switched back to client-side diff and apply, and only use server-side apply for specific resources (e.g. very large Grafana dashboards).

STollenaar commented 4 months ago

I think I found the issue related to server-side diff. It is an issue inside the gitops-engine repo. Basically, when a nested value is changed, it breaks the map traversal used for these server-side comparisons. I tried debugging it and applied a band-aid fix, though I don't know if it is even correct: https://github.com/argoproj/gitops-engine/commit/c25fd94b5dfcaf7d0a8020c2410a1d7629637b67#diff-00282c65a618a9ea64cdb99da5137663dc5773f2c3fd8c37ed2e9a99f3d67f09L254
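
As a rough illustration of what the error message complains about (an assumption about the intermediate state, not something taken from the gitops-engine code): `containers` is a list whose elements are matched by their `name` key, so an element that has lost its `name` at some step of the comparison can no longer be matched. The image values are placeholders.

```yaml
# Hypothetical intermediate state, for illustration only.
spec:
  template:
    spec:
      containers:
        - image: example.com/cache/fluent-bit:1.0   # element 0: no `name`, cannot be matched by key
        - name: sidecar                              # element 1: key field present
          image: example.com/cache/sidecar:1.0
```

This would also be consistent with the reports above where the manifest itself does give every container a name.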

algo7 commented 2 months ago

Any update on this?

gmauleon commented 2 months ago

Having the same problem: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

I modified some container environment variables in my spec. ArgoCD 2.11.3 with server-side diff activated. Syncing manually fixed the problem.

But the main worry in our case is that it's kind of a silent error: the app flaps from synced to unknown for a couple of minutes and then back to synced for an hour, every hour, until we discovered it and synced manually. That last bug might be due to the fact that we use ApplicationSet with progressive sync, though.

adberger commented 2 months ago

> Having the same problem: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)
>
> I modified some container environment variables in my spec. ArgoCD 2.11.3 with server-side diff activated. Syncing manually fixed the problem.
>
> But the main worry in our case is that it's kind of a silent error: the app flaps from synced to unknown for a couple of minutes and then back to synced for an hour, every hour, until we discovered it and synced manually. That last bug might be due to the fact that we use ApplicationSet with progressive sync, though.

We also have this problem (Unknown Error State) and we don't use ApplicationSets with Progressive Sync.

gmauleon commented 2 months ago

Update: the error happens in a portion of the code that reverts webhook mutations in the diffs, as stated by @STollenaar.

So on our side, adding IncludeMutationWebhook=true to the already present compare-options annotation, like so: argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true, bypasses the error.

Not sure what other problems can arise down the line by setting this option though...
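
For reference, a minimal sketch of where that annotation sits on an Application. The application name and namespace are hypothetical, and the spec is omitted; the annotation value is the one quoted above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app          # hypothetical name
  namespace: argocd     # assumption: default install namespace
  annotations:
    # Keep server-side diff on, but include webhook mutations in the diff
    argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true
# spec omitted
```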

adberger commented 2 months ago

> Update: the error happens in a portion of the code that reverts webhook mutations in the diffs, as stated by @STollenaar.
>
> So on our side, adding IncludeMutationWebhook=true to the already present compare-options annotation, like so: argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true, bypasses the error.
>
> Not sure what other problems can arise down the line by setting this option though...

Is the Unknown Error state also gone?

gmauleon commented 2 months ago

Yes, the unknown state in this case was caused by the errors while computing diffs, so there is definitely a problem in the code that "ignores the webhooks" in server-side diffs, but so far it's a good workaround.

Including webhook mutations in diffs will probably cause some unwanted differences though, depending on what webhooks you have in your clusters. In our case, just ignoring /metadata/generation across the board did the trick.

      ignoreDifferences:
      - group: '*'
        jsonPointers:
        - /metadata/generation
        kind: '*'
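
Put together, the fragment above sits under `spec` of the Application. A sketch with a hypothetical application name, and with `source` and `destination` left out:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app          # hypothetical name
  namespace: argocd     # assumption: default install namespace
  annotations:
    argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true
spec:
  project: default
  # source and destination omitted
  ignoreDifferences:
    - group: '*'
      kind: '*'
      jsonPointers:
        - /metadata/generation   # ignored for every resource kind
```
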
sstarcher commented 2 months ago

I'm seeing a similar issue with kube-prometheus-stack:

ComparisonError: Failed to compare desired state to live state: failed to perform pre-diff normalization: error building typed results: error creating typedConfig: .spec.containers[1].ports: element 0: associative list with keys has an element that omits key field "protocol" (and doesn't have default value)