argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ServerSideApply fails with "conversion failed" #11136

Closed Dbzman closed 9 months ago

Dbzman commented 1 year ago

Checklist:

Describe the bug

Using ServerSideApply, configured in an Application via Sync Options, fails with

error calculating structured merge diff: error calculating diff: error while running updater.Apply: converting (v1.CronJob) to (v1beta1.CronJob): unknown conversion

Using it only with the "Sync" button, without having it configured for the app, works, though.

To Reproduce

Expected behavior

ServerSideApply should work in both cases (app config + manual sync).

Screenshots

Application configuration which breaks: (screenshot: Bildschirmfoto 2022-11-01 um 13 49 03)

Using it only with the Sync button works: (screenshot: Bildschirmfoto 2022-11-01 um 13 50 44)
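For readers who can't see the screenshots: the failing configuration corresponds roughly to the following Application spec (a sketch; the app name and repo details are placeholders, only the sync option matters):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app        # placeholder
spec:
  # ... source/destination omitted ...
  syncPolicy:
    syncOptions:
      - ServerSideApply=true   # enabling SSA at the app level triggers the error
```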

Version

argocd: v2.5.0+b895da4
  BuildDate: 2022-10-25T14:40:01Z
  GitCommit: b895da457791d56f01522796a8c3cd0f583d5d91
  GitTreeState: clean
  GoVersion: go1.18.7
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.5.0+b895da4
  BuildDate: 2022-10-25T14:40:01Z
  GitCommit: b895da457791d56f01522796a8c3cd0f583d5d91
  GitTreeState: clean
  GoVersion: go1.18.7
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.1+g9f88ccb
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0
blakepettersson commented 1 year ago

Which version of k8s are you using?

Dbzman commented 1 year ago

We use 1.22.14.

blakepettersson commented 1 year ago

Could be the case that this is something that needs to be handled when doing the diff in gitops-engine, but I'm not familiar enough with SSA to say for sure. @leoluz?

(potentially related to #11139?)

leoluz commented 1 year ago

@Dbzman Please inspect your Argo CD controller logs and see if you find an entry with this message:

error creating gvk parser: ...

If so, can you provide the full message in the log?

Dbzman commented 1 year ago

@leoluz We didn't see any of those errors. Our log level is set to info; not sure if the error would show at that level. We also observed that this doesn't happen consistently across apps, even though they all use the same API version (batch/v1).

Dbzman commented 1 year ago

We noticed a very strange behavior here. We saved the affected CronJob manifest locally, deleted it on Kubernetes and re-created it again. (so it's the exact same manifest, just re-created) After that, Argo was able to sync the application. One thing is that those CronJobs were created with an older api version in the past, but we upgraded them to batch/v1 long ago and also in Kubernetes it shows as batch/v1. Don't know why re-creation helps in that case.

leoluz commented 1 year ago

We noticed a very strange behavior here. We saved the affected CronJob manifest locally, deleted it on Kubernetes and re-created it again. (so it's the exact same manifest, just re-created) After that, Argo was able to sync the application.

Thanks for the additional info. That actually makes sense. What is strange to me is that from your error message it seems that Argo CD is trying to convert from v1.CronJob to v1beta1.CronJob. Not sure why it is trying to go with an older version. That would only make sense if you are applying a CronJob with v1beta1.

I'll try to reproduce this error locally anyways.

Dbzman commented 1 year ago

Thanks for checking. Indeed, it's really weird that it tries to convert to an older version.

We had this issue on 60 of our 400 apps. Yesterday we fixed them all with the above mentioned workaround. Today all of those 60 apps show the error again. So it seems that it has nothing to do with old manifests that were upgraded.

leoluz commented 1 year ago

@Dbzman just confirming: are the steps to reproduce still valid with your latest findings?

Dbzman commented 1 year ago

@leoluz I would say yes.

mile-misan commented 1 year ago

Using version 2.5.1 and having similar issues:

error calculating structured merge diff: error calculating diff: error while running updater.Apply: converting (v1beta1.PodDisruptionBudget) to (v1.PodDisruptionBudget): unknown conversion

error calculating structured merge diff: error calculating diff: error while running updater.Apply: converting (v2beta2.HorizontalPodAutoscaler) to (v1.HorizontalPodAutoscaler): unknown conversion

llavaud commented 1 year ago

Same here with 2.5.2: error calculating structured merge diff: error calculating diff: error while running updater.Apply: converting (v2beta1.HorizontalPodAutoscaler) to (v1.HorizontalPodAutoscaler): unknown conversion

agaudreault commented 1 year ago

Same behavior with 2.5.2: ComparisonError: error calculating structured merge diff: error calculating diff: error while running updater.Apply: converting (v1.Ingress) to (v1beta1.Ingress): unknown conversion

Adding Ingress here in case someone hits the issue with that resource.

leoluz commented 1 year ago

Just to provide some direction for users that might get into this error, the current workaround is disabling SSA in the failing resources by adding the annotation: argocd.argoproj.io/sync-options: ServerSideApply=false. For example, if the error is related to Ingress conversion then add the annotation to your Ingress resource.
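As a concrete sketch of that workaround (the resource name is a placeholder), the annotation goes on the affected resource's manifest:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress   # placeholder
  annotations:
    # Opt this one resource out of SSA while it remains enabled app-wide
    argocd.argoproj.io/sync-options: ServerSideApply=false
```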

chrisduong commented 1 year ago

Just to provide some direction for users that might get into this error, the current workaround is disabling SSA in the failing resources by adding the annotation: argocd.argoproj.io/sync-options: ServerSideApply=false. For example, if the error is related to Ingress conversion then add the annotation to your Ingress resource.

Hi @leoluz, I added the annotation but it didn't work; still the same problem (HorizontalPodAutoscaler case).

pseymournutanix commented 1 year ago

FWIW, the same occurs with CronJob resources on 2.5.5 as well.

msw-kialo commented 1 year ago

We run into similar issues when enabling SSA for our apps. However, the issue isn't consistent between clusters/apps (the same app/resource might work on one but not the other).

What is strange to me is that from your error message it seems that Argo CD is trying to convert from v1.CronJob to v1beta1.CronJob. Not sure why it is trying to go with an older version. That would only make sense if you are applying a CronJob with v1beta1.

@leoluz I believe managedFields are to blame. They include an apiVersion field that might reference an older (beta) version.

Managed fields of an affected `Ingress` resource:

```yaml
metadata:
  managedFields:
  - apiVersion: networking.k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:alb.ingress.kubernetes.io/actions.ssl-redirect: {}
          f:alb.ingress.kubernetes.io/certificate-arn: {}
          f:alb.ingress.kubernetes.io/listen-ports: {}
          f:alb.ingress.kubernetes.io/scheme: {}
          f:alb.ingress.kubernetes.io/ssl-policy: {}
          f:alb.ingress.kubernetes.io/target-type: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
    manager: kubectl
    operation: Update
    time: "2021-05-28T16:20:40Z"
  - apiVersion: networking.k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
    manager: controller
    operation: Update
    time: "2021-08-02T09:10:54Z"
  - apiVersion: networking.k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:ingressClassName: {}
    manager: argocd-application-controller
    operation: Update
    time: "2021-08-02T09:18:03Z"
  - apiVersion: networking.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          v:"group.ingress.k8s.aws/argo-ingresses": {}
      f:status:
        f:loadBalancer:
          f:ingress: {}
    manager: controller
    operation: Update
    time: "2022-03-21T15:25:24Z"
  - apiVersion: networking.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:alb.ingress.kubernetes.io/group.name: {}
          f:alb.ingress.kubernetes.io/load-balancer-attributes: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        f:rules: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-08-15T11:22:05Z"
  name: argocd
  namespace: argocd
  resourceVersion: "206036857"
  uid: 3df56465-962b-42bb-9075-e61740b636cc
```

Managed fields of the corresponding resource (same name / namespace) on a different cluster (just different cluster / app age):

```yaml
metadata:
  managedFields:
  - apiVersion: networking.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:alb.ingress.kubernetes.io/actions.ssl-redirect: {}
          f:alb.ingress.kubernetes.io/certificate-arn: {}
          f:alb.ingress.kubernetes.io/group.name: {}
          f:alb.ingress.kubernetes.io/listen-ports: {}
          f:alb.ingress.kubernetes.io/scheme: {}
          f:alb.ingress.kubernetes.io/ssl-policy: {}
          f:alb.ingress.kubernetes.io/target-type: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
      f:spec:
        f:ingressClassName: {}
        f:rules: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2022-05-05T15:11:18Z"
  - apiVersion: networking.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"group.ingress.k8s.aws/argo-ingresses": {}
      f:status:
        f:loadBalancer:
          f:ingress: {}
    manager: controller
    operation: Update
    time: "2022-05-05T15:11:20Z"
  - apiVersion: networking.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:alb.ingress.kubernetes.io/load-balancer-attributes: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-08-15T11:21:51Z"
```

It also explains why recreating works - it clears the managedFields.

Sadly, this does not yet help me resolve the issue without recreating the resources (I haven't found a way to clear/edit the managedFields).
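To spot affected resources without eyeballing the full YAML, here is a small sketch (a hypothetical helper, not part of Argo CD or kubectl) that flags managedFields entries whose apiVersion differs from the resource's current apiVersion, given the dict you'd get from parsing `kubectl get <kind> <name> -o json`:

```python
def stale_managed_field_versions(resource: dict) -> list[dict]:
    """Return managedFields entries whose apiVersion differs from the
    resource's current apiVersion (candidates for the conversion error)."""
    current = resource.get("apiVersion")
    return [
        {"manager": entry.get("manager"), "apiVersion": entry.get("apiVersion")}
        for entry in resource.get("metadata", {}).get("managedFields", [])
        if entry.get("apiVersion") != current
    ]

# Trimmed-down example mirroring the first Ingress above
ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {
        "managedFields": [
            {"manager": "kubectl", "apiVersion": "networking.k8s.io/v1beta1"},
            {"manager": "controller", "apiVersion": "networking.k8s.io/v1"},
        ]
    },
}
print(stale_managed_field_versions(ingress))
# [{'manager': 'kubectl', 'apiVersion': 'networking.k8s.io/v1beta1'}]
```

Any non-empty result means at least one field manager still tracks the resource under an older version, which matches the conversion errors reported in this thread.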

jetersen commented 1 year ago

I believe managedFields are to blame. They include an apiVersion field that might reference an older (beta) version.

This is not a "might"; this is definitively the issue. 😓

jetersen commented 1 year ago

@leoluz perhaps this will help: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/555-server-side-apply/README.md

Links that are useful in the readme are: https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/obsolete_versions_test.go

chrisduong commented 1 year ago

Do we know why Argo CD does not respect ".Capabilities.APIVersions" but instead uses the managedFields (assuming that is the reason; I don't know which component does this internally) to decide which API group to use?

zswanson commented 1 year ago

We are seeing this in 2.8 with HPA, ClusterRole, ClusterRoleBinding and Roles. The clusters have all been properly upgraded and the resource manifests updated, but they were created back when these beta API versions still existed in k8s; those versions have since been removed.

Amr-Aly commented 1 year ago

We're seeing the same issue with ClusterRole, ClusterRoleBinding.

zswanson commented 1 year ago

The K8s docs note that you can clear managed fields with a JSON patch. We've been using that to get past this issue, but it's really tiresome. It would be great if Argo could somehow handle it. The errors in the Argo CD sync panel aren't helpful enough because they don't say which resource had the conversion error.

kubectl patch KIND NAME --type json -p '[{"op":"replace","path":"/metadata/managedFields","value":[{}]}]'

@msw-kialo fyi ^

leoluz commented 9 months ago

The ServerSide Diff feature is merged and available in Argo CD 2.10-RC1. If enabled, it should address this and other diff problems when ServerSide Apply is used.

I am closing this for now; feel free to reopen if the issue persists.

pgier commented 5 months ago

Ran into a similar issue failing to calculate diff for ClusterRole

"converting (v1.ClusterRole) to (v1beta1.ClusterRole):"

Enabling server side diff on the application resolved the issue for me.
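For reference, a sketch of enabling server-side diff per Application (annotation name taken from the Argo CD 2.10 docs; verify it against your version, and the app name is a placeholder):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app   # placeholder
  annotations:
    # Ask the API server to compute the diff instead of the client-side merge
    argocd.argoproj.io/compare-options: ServerSideDiff=true
```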

sacherus commented 2 months ago

We are using 2.10.5 and hit this problem when we enable server-side apply. I deleted all the HPAs, but it didn't help.

ComparisonError: Failed to compare desired state to live state: failed to calculate diff: error calculating structured merge diff: error calculating diff: error while running updater.Apply: converting (v2.HorizontalPodAutoscaler) to (v1.HorizontalPodAutoscaler): unknown conversion

mdtoro-wyn commented 1 month ago

Deleting the old HPAs and an old Secret solved the issue in my case.