[Bug]: Conversion Webhook for containerservice.azure.upbound.io/v1beta1 sometimes fails

b-deam commented 4 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

Affected Resource(s)

kubernetesclusters.containerservice.azure.upbound.io/v1beta1

Resource MRs required to reproduce the bug

No response

Steps to Reproduce

Create a composition that creates a KubernetesCluster managed resource via a Composition Function (we're using https://github.com/crossplane-contrib/function-cue)

What happened?

Occasionally the XR Synced status switches to False temporarily due to conversion webhook error(s). This status does not propagate upwards to the claim.

Relevant Error Output Snippet

Warning  ComposeResources  2m15s (x9 over 16m)    defined/compositeresourcedefinition.apiextensions.crossplane.io  cannot compose resources: cannot apply composed resource "aks_cluster": failed to prune fields: failed add back owned items: failed to convert pruned object at version containerservice.azure.upbound.io/v1beta2: conversion webhook for containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster returned invalid metadata: invalid metadata of type <nil> in input object

And the crossplane beta trace output:

$ crossplane beta trace -n claim-namespace myaks dev-aks-cluster-1 -o wide
NAME                                                                                RESOURCE                                    SYNCED   READY   STATUS
AKS/dev-aks-cluster-1 (dev-azure-eastus2)                                                                                     True     True    Available
└─ XAKS/dev-aks-cluster-1-mnxjr                                                                                               False    True    ReconcileError: cannot compose resources: cannot apply composed resource "aks_cluster": failed to prune fields: failed add back owned items: failed to convert pruned object at version containerservice.azure.upbound.io/v1beta2: conversion webhook for containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster returned invalid metadata: invalid metadata of type <nil> in input object
   ├─ XAKSNodepoolSet/dev-aks-cluster-1-nodepool-set                              nodepool_set                                True     True    Available
   │  ├─ KubernetesClusterNodePool/dev-aks-cluster-1-generalnp                    ng_generalnp                              True     True    Available
   │  └─ NodePoolCalculation/dev-aks-cluster-1-calc                               node_pool_calculation                       True     True    Available
   ├─ XPermission/dev-aks-cluster-1-ca-permission                                 cluster_autoscaler_permission               True     True    Available
   │  ├─ RoleAssignment/dev-aks-cluster-1-cluster-autoscaler-assignment           role_assignment                             True     True    Available
   │  ├─ RoleDefinition/dev-aks-cluster-1-cluster-autoscaler-definition           role_definition                             True     True    Available
   │  ├─ FederatedIdentityCredential/dev-aks-cluster-1-cluster-autoscaler-fedid   federated_identity                          True     True    Available
   │  └─ UserAssignedIdentity/dev-aks-cluster-1-cluster-autoscaler-identity       identity                                    True     True    Available
   ├─ KubernetesCluster/dev-aks-cluster-1-mnxjr                                   aks_cluster                                 True     True    Available
   ├─ EventHubNamespace/dev-aks-cluster-1-test                                     queue-dev-aks-cluster-1-test               True     True    Available
   ├─ EventHubNamespace/dev-aks-cluster-1-test-data                              queue-dev-aks-cluster-1-test-data        True     True    Available

Crossplane Version

1.14.3

Provider Version

1.3.0

Kubernetes Version

v1.28.9

Kubernetes Distribution

AKS

Additional Info

Hi, we were previously hitting https://github.com/crossplane-contrib/provider-upjet-azure/issues/645 and the workaround/fix was to upgrade our MR:

- kubernetesclusters.containerservice.azure.upbound.io/v1beta1
+ kubernetesclusters.containerservice.azure.upbound.io/v1beta2

Since upgrading we're noticing that occasionally (and seemingly randomly) the XR Synced status flips to False due to a conversion webhook failure. This condition only lasts for tens of seconds before Synced then reverts back to True. Sometimes this can take over an hour to reoccur. This status change does not propagate upwards to the claim (as shown in the crossplane beta trace output).

Unfortunately, I don't have a minimal reproduction available: the environment and configuration showing this behaviour are very complex (multiple pipeline steps, etc.), and I haven't had much luck reproducing it in an environment running in kind.

There are no interesting logs in the Crossplane, Provider, or Function pods (even with --debug enabled), or the AKS control plane.

My limited debugging led me to this Kubernetes bug: https://github.com/kubernetes/kubernetes/issues/117356 - which seems plausible as I can see that the CRD only stores v1beta1: https://github.com/crossplane-contrib/provider-upjet-azure/blob/65f8cdcc27553672715bc0b9429b3c2f88af9baa/package/crds/containerservice.azure.upbound.io_kubernetesclusters.yaml#L11089

Any ideas on where to look next?

b-deam commented 4 months ago

FWIW we saw this exact same bug with the Kubernetes provider and Objects that was solved by moving our resources to v1alpha2 (which is the stored version) from v1alpha1: https://github.com/crossplane-contrib/provider-kubernetes/blob/5bfb71a932d71ada6e29b7bce4f2b4b8162f8ef9/package/crds/kubernetes.crossplane.io_objects.yaml#L865

To me, that suggests that we are indeed hitting https://github.com/kubernetes/kubernetes/issues/117356.

I'd say that if the KubernetesCluster stored version were moved to v1beta2, we wouldn't see this issue.
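For anyone wanting to check which version a CRD stores versus which version their manifests use, here is a minimal sketch in Python. It only reads the standard CRD fields `spec.versions[].served`, `spec.versions[].storage` and `status.storedVersions`. The function names and the trimmed-down sample document are hypothetical and only mirror the situation described above (v1beta1 stored, resources written at v1beta2).

```python
import json


def summarize_crd_versions(crd: dict) -> dict:
    """Summarize which versions a CRD serves, stores, and has ever stored."""
    versions = crd.get("spec", {}).get("versions", [])
    return {
        "served": [v["name"] for v in versions if v.get("served")],
        "storage": [v["name"] for v in versions if v.get("storage")],
        "stored_versions": crd.get("status", {}).get("storedVersions", []),
    }


def needs_conversion(api_version: str, crd: dict) -> bool:
    """Return True if objects written at api_version differ from the storage version."""
    version = api_version.split("/")[-1]
    return version not in summarize_crd_versions(crd)["storage"]


# Hypothetical, trimmed-down CRD document for illustration only.
sample_crd = json.loads("""
{
  "spec": {"versions": [
    {"name": "v1beta1", "served": true, "storage": true},
    {"name": "v1beta2", "served": true, "storage": false}
  ]},
  "status": {"storedVersions": ["v1beta1"]}
}
""")

summary = summarize_crd_versions(sample_crd)
print(summary)
print(needs_conversion("containerservice.azure.upbound.io/v1beta2", sample_crd))
```

On a real cluster the input could come from `kubectl get crd kubernetesclusters.containerservice.azure.upbound.io -o json`. A `True` result only means the API server has to call the conversion webhook for that version; it does not by itself confirm the Kubernetes bug linked above.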

b-deam commented 4 months ago

We recreated our claim (by deleting it and all associated XRs, MRs etc.) and the conversion webhook errors haven't reappeared for a number of days.

This chart shows the count of webhook conversion failures over a ~7 day period. You can see that the failures have stopped entirely around the time we deleted/recreated the claim.

It's worth noting that we had originally upgraded the MR from v1beta1 to v1beta2, so I'm not sure if recreating it at v1beta2 has anything to do with the lack of failures, but that seems unlikely to me as this error only appeared weeks after the initial claim creation.

If we see the error reappear, I'll update the issue.

b-deam commented 3 months ago

Just adding here that we've seen this reappear ~3 weeks later. No obvious correlation between the errors reappearing and changes to our composition etc.

felfa01 commented 1 month ago

I hit something similar today. When running kubectl get kubernetescluster I get the following error:

Error from server: conversion webhook for containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster failed: cannot convert from the spoke version "v1beta1" to the hub version "v1beta2": cannot apply the PavedConversion for the "azurerm_kubernetes_cluster" object: failed to convert the source map in mode "toEmbeddedObject" with source API version "v1beta1", target API version "v1beta2": value at the field path windowsProfile must be []any, not "map[string]interface {}"

This is my KubernetesCluster resource:

apiVersion: containerservice.azure.upbound.io/v1beta2
kind: KubernetesCluster
metadata:
  name: cluster
  labels:
    app.kubernetes.io/name: cluster
  annotations:
    crossplane.io/external-name: cluster
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
spec:
  managementPolicies: ["Observe"]
  forProvider:
    resourceGroupName: cluster
  providerConfigRef:
    name: azure-workload-identity

Interestingly, I ran into something similar a few weeks back. At the time I changed the apiVersion from v1beta1 to v1beta2, which seemed to fix the problem. Today, however, I saw it happen again in one of many clusters, so apparently it was not fixed.
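One possible reading of the windowsProfile error quoted above (an assumption on my part, not something confirmed in this thread) is that the webhook was handed an object labelled v1beta1 whose windowsProfile was already in the v1beta2 object form, so a list-to-object conversion had nothing valid to unwrap. The Python sketch below only illustrates that shape mismatch. The function `to_embedded_object` and the field contents are hypothetical and are not the provider's actual implementation.

```python
def to_embedded_object(source: dict, field: str) -> dict:
    """Turn a single-element list at `field` into a plain object.

    Illustrative only: mimics the shape of a list-to-object conversion,
    not the provider's real conversion code.
    """
    converted = dict(source)  # shallow copy so the input is left untouched
    if field not in converted:
        return converted
    value = converted[field]
    if not isinstance(value, list):
        raise TypeError(
            f"value at the field path {field} must be a list, "
            f"not {type(value).__name__}"
        )
    if value:
        converted[field] = value[0]  # unwrap the single element
    else:
        del converted[field]  # an empty list carries no object
    return converted


# Hypothetical field contents; only the shapes matter here.
list_form = {"windowsProfile": [{"adminUsername": "azureuser"}]}
object_form = {"windowsProfile": {"adminUsername": "azureuser"}}

# The list form converts cleanly to the object form.
print(to_embedded_object(list_form, "windowsProfile"))

# The object form is rejected, similar in spirit to the error above.
try:
    to_embedded_object(object_form, "windowsProfile")
except TypeError as err:
    print(f"conversion failed: {err}")
```

If this reading is right, the failure depends on which shape the webhook receives for a given request, which would be consistent with the error appearing only intermittently.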

Crossplane Version 1.17.1

Provider Version 1.7.0

Kubernetes Version v1.29.9

Kubernetes Distribution AKS

Restarting the Crossplane provider pod appears to make the problem go away temporarily. This is useful if you manage KubernetesCluster resources with ArgoCD, since this conversion webhook error puts the whole Application into an Unknown state, essentially blocking any further reconciliation.
