b-deam opened this issue 4 months ago
FWIW, we saw this exact same bug with the Kubernetes provider and `Object` resources; it was solved by moving our resources from `v1alpha1` to `v1alpha2` (which is the stored version): https://github.com/crossplane-contrib/provider-kubernetes/blob/5bfb71a932d71ada6e29b7bce4f2b4b8162f8ef9/package/crds/kubernetes.crossplane.io_objects.yaml#L865
To me, that suggests that we are indeed hitting https://github.com/kubernetes/kubernetes/issues/117356. I'd say that if the `KubernetesCluster` stored version were moved to `v1beta2`, we wouldn't see this issue.
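If it's useful to anyone else, this is roughly how one can check what the cluster is actually storing; just a sketch, using the `KubernetesCluster` CRD name from this provider:

```sh
# Version the API server currently writes objects at (the storage version):
kubectl get crd kubernetesclusters.containerservice.azure.upbound.io \
  -o jsonpath='{.spec.versions[?(@.storage==true)].name}'

# Versions that existing objects in etcd may still be stored at:
kubectl get crd kubernetesclusters.containerservice.azure.upbound.io \
  -o jsonpath='{.status.storedVersions}'
```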
We recreated our claim (by deleting it and all associated XRs, MRs etc.) and the conversion webhook errors haven't reappeared for a number of days.
This chart shows the count of webhook conversion failures over a ~7 day period. You can see that the failures have stopped entirely around the time we deleted/recreated the claim.
It's worth noting that we had originally upgraded the MR from `v1beta1` to `v1beta2`, so I'm not sure if recreating it at `v1beta2` has anything to do with the lack of failures, but that seems unlikely to me as this error only appeared weeks after the initial claim creation.
If we see the error reappear, I'll update the issue.
Just adding here that we've seen this reappear ~3 weeks later. No obvious correlation between the errors reappearing and changes to our composition etc.
I hit something similar today. When running `kubectl get kubernetescluster` I get the following error:

```
Error from server: conversion webhook for containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster failed: cannot convert from the spoke version "v1beta1" to the hub version "v1beta2": cannot apply the PavedConversion for the "azurerm_kubernetes_cluster" object: failed to convert the source map in mode "toEmbeddedObject" with source API version "v1beta1", target API version "v1beta2": value at the field path windowsProfile must be []any, not "map[string]interface {}"
```
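In case it helps narrow things down, kubectl can request a specific API version via the fully-qualified resource name, which should show whether only the converted read path fails; a rough sketch, using the resource name from my manifest below:

```sh
# Read the object at each served version explicitly; whichever request needs
# the v1beta1 <-> v1beta2 conversion is the one that should surface the
# webhook error (resource/name here match my manifest below):
kubectl get kubernetesclusters.v1beta1.containerservice.azure.upbound.io cluster -o yaml
kubectl get kubernetesclusters.v1beta2.containerservice.azure.upbound.io cluster -o yaml
```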
This is my `KubernetesCluster` resource:

```yaml
apiVersion: containerservice.azure.upbound.io/v1beta2
kind: KubernetesCluster
metadata:
  name: cluster
  labels:
    app.kubernetes.io/name: cluster
  annotations:
    crossplane.io/external-name: cluster
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
spec:
  managementPolicies: ["Observe"]
  forProvider:
    resourceGroupName: cluster
  providerConfigRef:
    name: azure-workload-identity
```
What is interesting is that I did run into something similar a few weeks back, and what I did then was to change the apiVersion from `v1beta1` to `v1beta2`, which seemingly fixed the problem. But today I saw it happening again in one of many clusters, so apparently it was not fixed.
Crossplane Version: 1.17.1
Provider Version: 1.7.0
Kubernetes Version: v1.29.9
Kubernetes Distribution: AKS
Restarting the Crossplane provider pod appears to make the problem go away temporarily. This is useful in cases where you manage `KubernetesCluster` resources with Argo CD, since this conversion webhook error causes the full Application to go into `Unknown` state, essentially blocking any additional reconciliation.
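For reference, this is roughly what that restart looks like for us; the namespace, provider package name, and pod name are from our install, so treat this as a sketch:

```sh
# Find the family provider pod that (in our setup) serves the conversion webhook
# (exact name depends on the installed provider revision):
kubectl -n crossplane-system get pods | grep provider-azure-containerservice

# Delete it; the Deployment recreates the pod and the errors stop for a while:
kubectl -n crossplane-system delete pod <provider-azure-containerservice-pod>
```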
Is there an existing issue for this?

Affected Resource(s)

kubernetesclusters.containerservice.azure.upbound.io/v1beta1

Resource MRs required to reproduce the bug

No response

Steps to Reproduce

Create a `KubernetesCluster` managed resource via a Composition Function (we're using https://github.com/crossplane-contrib/function-cue).

What happened?

Occasionally the XR `Synced` status switches to `False` temporarily due to conversion webhook error(s). This status does not propagate upwards to the claim.

Relevant Error Output Snippet
And the `crossplane beta trace` output:
Crossplane Version
1.14.3
Provider Version
1.3.0
Kubernetes Version
v1.28.9
Kubernetes Distribution
AKS
Additional Info
Hi, we were previously hitting https://github.com/crossplane-contrib/provider-upjet-azure/issues/645 and the workaround/fix was to upgrade our MR from `v1beta1` to `v1beta2`.
Since upgrading we're noticing that occasionally (and seemingly randomly) the XR `Synced` status flips to `False` due to a conversion webhook failure. This condition only lasts for tens of seconds before `Synced` reverts back to `True`. Sometimes this can take over an hour to reoccur. This status change does not propagate upwards to the claim (as shown in the `crossplane beta trace` output).

I unfortunately don't have a minimal reproduction available, as the environment/configuration displaying this behaviour is very complex (multiple pipeline steps etc.) and I haven't had much luck reproducing it with an environment running in `kind`.

There are no interesting logs in the Crossplane, Provider, or Function pods (even with `--debug` enabled), or in the AKS control plane.

My limited debugging led me to this Kubernetes bug: https://github.com/kubernetes/kubernetes/issues/117356 - which seems plausible as I can see that the CRD only stores `v1beta1`: https://github.com/crossplane-contrib/provider-upjet-azure/blob/65f8cdcc27553672715bc0b9429b3c2f88af9baa/package/crds/containerservice.azure.upbound.io_kubernetesclusters.yaml#L11089

Any ideas on where to look next?