crossplane-contrib / provider-upjet-azure

Official Azure Provider for Crossplane by Upbound.
Apache License 2.0

[Bug]: AKS Cluster constantly in updating state #686

Closed: speer closed this issue 5 months ago

speer commented 5 months ago

Is there an existing issue for this?

Affected Resource(s)

containerservice.azure.upbound.io/v1beta1 - KubernetesCluster

Resource MRs required to reproduce the bug

apiVersion: containerservice.azure.upbound.io/v1beta1
kind: KubernetesCluster
metadata:
  name: kc-1
spec:
  deletionPolicy: Delete
  forProvider:
    apiServerAccessProfile:
    - {}
    automaticChannelUpgrade: patch
    azureActiveDirectoryRoleBasedAccessControl:
    - adminGroupObjectIds:
      - groupid
      azureRbacEnabled: false
      managed: true
    azurePolicyEnabled: true
    defaultNodePool:
    - customCaTrustEnabled: true
      enableAutoScaling: false
      enableHostEncryption: false
      enableNodePublicIp: false
      fipsEnabled: false
      kubeletDiskType: OS
      maxPods: 110
      name: systempool
      nodeCount: 3
      onlyCriticalAddonsEnabled: true
      orchestratorVersion: "1.28"
      osDiskSizeGb: 100
      osDiskType: Ephemeral
      osSku: AlpineLinux
      type: VirtualMachineScaleSets
      ultraSsdEnabled: false
      vmSize: Standard_D4s_v3
      vnetSubnetId: subnetid
      workloadRuntime: OCIContainer
      zones:
      - "3"
    dnsPrefixPrivateCluster: kc-1
    identity:
    - identityIds:
      - miid
      type: UserAssigned
    imageCleanerEnabled: true
    imageCleanerIntervalHours: 48
    keyVaultSecretsProvider:
    - secretRotationEnabled: false
    kubeletIdentity:
    - clientId: clientid
      objectId: objectid
      userAssignedIdentityId: id
    kubernetesVersion: "1.28"
    linuxProfile:
    - adminUsername: azureuser
      sshKey:
      - keyData: ssh-rsa ... azurecloud
    localAccountDisabled: true
    location: westeurope
    microsoftDefender:
    - logAnalyticsWorkspaceId: workspaceid
    networkProfile:
    - dnsServiceIp: serviceip
      ebpfDataPlane: cilium
      loadBalancerSku: standard
      networkPlugin: azure
      networkPluginMode: Overlay
      outboundType: userDefinedRouting
      podCidr: podcidr
      serviceCidr: servicecidr
    oidcIssuerEnabled: true
    omsAgent:
    - logAnalyticsWorkspaceId: laworkspace
      msiAuthForMonitoringEnabled: true
    privateClusterEnabled: true
    privateClusterPublicFqdnEnabled: false
    privateDnsZoneId: privatednszone
    publicNetworkAccessEnabled: false
    resourceGroupName: rg-1
    roleBasedAccessControlEnabled: true
    runCommandEnabled: true
    skuTier: Standard
    supportPlan: KubernetesOfficial
    workloadIdentityEnabled: true
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: pc-aks
---
apiVersion: azure.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: pc-aks
spec:
  clientID: clientid
  credentials:
    source: UserAssignedManagedIdentity
  subscriptionID: subscriptionid
  tenantID: tenantid

Steps to Reproduce

Whenever we create a KubernetesCluster with the newest provider version 1.0.0 and azurePolicyEnabled: true and/or keyVaultSecretsProvider configured (see the MR above), the AKS cluster ends up in a constant "Updating" state.

What happened?

Since we upgraded to provider version 1.0.0, our Kubernetes clusters remain in a constant "Updating" state on the Azure side. Checking the Activity Log on the clusters, we notice a constant (every few minutes) ping-pong of properties.addonProfiles.azurepolicy.identity and properties.addonProfiles.azureKeyvaultSecretsProvider.identity getting added => removed => added => removed, and so on. Both the add and the remove actions are performed by the identity Crossplane uses, so no other party is involved.
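One way to watch this ping-pong from the CLI (a sketch; rg-1 is the resource group from the MR above, and the one-hour offset is arbitrary):

# List recent Activity Log entries for the cluster's resource group,
# including who performed each operation; the add/remove pairs show
# up every few minutes, all with the same caller.
az monitor activity-log list \
  --resource-group rg-1 \
  --offset 1h \
  --query "[].{time: eventTimestamp, operation: operationName.localizedValue, caller: caller}" \
  --output table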

On the Kubernetes side, the KubernetesCluster resource does not show these changes and stays SYNCED=True, READY=True, with no modifications in the status field.
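Checking the MR the whole time only ever shows it healthy (illustrative output, column layout from memory):

kubectl get kubernetescluster.containerservice.azure.upbound.io kc-1
NAME   SYNCED   READY   EXTERNAL-NAME   AGE
kc-1   True     True    kc-1            2d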

Relevant Error Output Snippet

No response

Crossplane Version

1.15.1

Provider Version

1.0.0

Kubernetes Version

v1.27.9

Kubernetes Distribution

AKS

Additional Info

No response

speer commented 5 months ago

Screenshots of the Azure Activity Log.

[screenshot: the recurring add operation]

[screenshot: the recurring remove operation]

speer commented 5 months ago

After turning on debug logging on the provider, we discovered the following recurring entry:

2024-03-27T14:51:44Z    DEBUG   provider-azure  Diff detected   {"uid": "6f83c9a5-f60f-4b4d-98e3-ad823d179417", "name": "kc-1", "gvk": "containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster", "instanceDiff": "*terraform.InstanceDiff{mu:sync.Mutex{state:0, sema:0x0}, Attributes:map[string]*terraform.ResourceAttrDiff{\"default_node_pool.0.upgrade_settings.#\":*terraform.ResourceAttrDiff{Old:\"1\", New:\"0\", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, \"default_node_pool.0.upgrade_settings.0.max_surge\":*terraform.ResourceAttrDiff{Old:\"10%\", New:\"\", NewComputed:false, NewRemoved:true, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}}, Destroy:false, DestroyDeposed:false, DestroyTainted:false, RawConfig:cty.NilVal, RawState:cty.NilVal, RawPlan:cty.NilVal, Meta:map[string]interface {}(nil)}"}
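(For anyone trying to reproduce this: one way to get these provider debug logs, assuming the provider is installed as a Crossplane package on Crossplane 1.14 or newer, is a DeploymentRuntimeConfig that passes --debug to the provider runtime. A minimal sketch, with an example config name:)

apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: debug   # example name; reference it from the Provider via spec.runtimeConfigRef.name
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
          - name: package-runtime   # the provider container injected by Crossplane
            args:
            - --debug               # upjet-based providers accept --debug for verbose logging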

Reading the diff: Azure fills in default_node_pool.0.upgrade_settings with max_surge: "10%" (apparently the service-side default), while our spec omits upgradeSettings entirely, so the provider keeps trying to remove the block on every reconcile. Once we explicitly set upgradeSettings[0].maxSurge to 10% in the resource, the recurring changes within addonProfiles disappeared from the Activity Log in the Azure Portal.
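For reference, the workaround in the manifest looks like this (only the relevant defaultNodePool fields shown; everything else stays as in the MR above):

    defaultNodePool:
    - name: systempool
      # ...other fields unchanged...
      upgradeSettings:
      - maxSurge: "10%"   # match the value Azure reports so no diff is detected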

It looks like the Activity Log in the Azure Portal is not really reliable in the diff it shows, as maxSurge has nothing in common with addonProfiles...