crossplane-contrib / provider-upjet-azure

Official Azure Provider for Crossplane by Upbound.
Apache License 2.0

[Bug]: Pipeline issues - AKS cluster (and other Azure resources) not created reliably #809

Open drew0ps opened 1 month ago

drew0ps commented 1 month ago

Is there an existing issue for this?

Affected Resource(s)

containerservice.azure.upbound.io/v1beta1 - KubernetesCluster

Resource MRs required to reproduce the bug

No response

Steps to Reproduce

What happened?

This run uses examples/containerservice/v1beta1/kubernetescluster.yaml, unchanged from main: https://github.com/crossplane-contrib/provider-upjet-azure/actions/runs/10682720219

Relevant Error Output Snippet

No response

Crossplane Version

main, whatever is used by the uptests pipeline

Provider Version

main

Kubernetes Version

No response

Kubernetes Distribution

No response

Additional Info

I have been trying to merge my PR, but the uptests keep failing even though manual tests succeed with the newly added resource. Because the resource relies on an AKS cluster being present in the cloud, I ran the pipeline with AKS defined in the CR and noticed instability: several unsuccessful runs where the cluster was stuck in the Creating state throughout the 30-minute run.

drew0ps commented 1 month ago

I ran another test against /test-examples="examples/containerservice/v1beta1/kubernetescluster.yaml": https://github.com/crossplane-contrib/provider-upjet-azure/actions/runs/10698538178/job/29658186863 The cluster stayed in the Creating state for 20 minutes, after which the pipeline failed.

turkenf commented 1 month ago

Hi @drew0ps, thank you for raising the issue.

I don't think this issue is limited to the pipeline/uptest. I just tried in a local kind cluster, and the resource has still not been created after more than 20 minutes, without any errors reported.

NAME                                                              SYNCED   READY   EXTERNAL-NAME   AGE
kubernetescluster.containerservice.azure.upbound.io/example-xft   True     False   example-xft     24m

status:
  atProvider: {}
  conditions:

drew0ps commented 1 month ago

Hi @turkenf, thank you for the response and checking.

I suspect the problem is specific to this CR, since I also checked locally with the monolith (make run) and succeeded there with a different CR (from my k8s extensions story), although I run Rancher Desktop:

apiVersion: kubernetesconfiguration.azure.upbound.io/v1beta1
kind: KubernetesClusterExtension
metadata:
  annotations:
    meta.upbound.io/example-id: kubernetesconfiguration/v1beta1/kubernetesclusterextension
  labels:
    testing.upbound.io/example-name: example
  name: example
spec:
  forProvider:
    clusterId: subscriptions/mysub/resourceGroups/example/providers/Microsoft.ContainerService/managedClusters/example
    extensionType: microsoft.flux
    releaseTrain: Stable
    configurationSettings:
      retentionPolicy: "1"
  providerConfigRef:
    name: default

---

apiVersion: containerservice.azure.upbound.io/v1beta1
kind: KubernetesCluster
metadata:
  annotations:
    meta.upbound.io/example-id: kubernetesconfiguration/v1beta1/kubernetesclusterextension
  labels:
    testing.upbound.io/example-name: example
  name: example
spec:
  forProvider:
    defaultNodePool:
    - name: default
      nodeCount: 1
      vmSize: Standard_D4ds_v5
    dnsPrefix: example-aks
    identity:
    - type: SystemAssigned
    location: North Europe
    resourceGroupNameSelector:
      matchLabels:
        testing.upbound.io/example-name: example

---

apiVersion: azure.upbound.io/v1beta1
kind: ResourceGroup
metadata:
  annotations:
    meta.upbound.io/example-id: kubernetesconfiguration/v1beta1/kubernetesclusterextension
  labels:
    testing.upbound.io/example-name: example
  name: example
spec:
  forProvider:
    location: North Europe
k get kubernetesclusterextension.kubernetesconfiguration.azure.upbound.io/example  kubernetescluster.containerservice.azure.upbound.io/example resourcegroup.azure.upbound.io/example
NAME                                                                          SYNCED   READY   EXTERNAL-NAME   AGE
kubernetesclusterextension.kubernetesconfiguration.azure.upbound.io/example   True     True    example         17m

NAME                                                          SYNCED   READY   EXTERNAL-NAME   AGE
kubernetescluster.containerservice.azure.upbound.io/example   True     True    example         17m

NAME                                     SYNCED   READY   EXTERNAL-NAME   AGE
resourcegroup.azure.upbound.io/example   True     True    example         17m
Conditions:
    Last Transition Time:  2024-09-04T09:08:20Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2024-09-04T09:12:41Z
    Reason:                Available
    Status:                True
    Type:                  Ready
    Last Transition Time:  2024-09-04T09:12:13Z
    Reason:                Success
    Status:                True
    Type:                  LastAsyncOperation

The differences are:
- Location: it's North Europe on mine and West Europe on the non-working one
- vmSize is different
- The following options are missing from my CR:
  tags:
    Environment: Production
  apiServerAccessProfile:

In this run the AKS cluster creation succeeded: https://github.com/crossplane-contrib/provider-upjet-azure/actions/runs/10699008353/job/29659653711 Target: /test-examples="examples/kubernetesconfiguration/v1beta1/kubernetesclusterextension.yaml"

turkenf commented 1 month ago

I am experiencing the same issue with the examples you gave, so this may be caused by our subscriptions 🤔

drew0ps commented 1 month ago

Hm, that could be, but it does not explain why the AKS cluster creation succeeds in this run on subscription "2895a7df-ae9f-41b8-9e78-3ce4926df838":

    logger.go:42: 09:59:01 | case/0-apply |     conditions:
    logger.go:42: 09:59:01 | case/0-apply |     - lastTransitionTime: "2024-09-04T09:44:10Z"
    logger.go:42: 09:59:01 | case/0-apply |       reason: Available
    logger.go:42: 09:59:01 | case/0-apply |       status: "True"
    logger.go:42: 09:59:01 | case/0-apply |       type: Ready
    logger.go:42: 09:59:01 | case/0-apply |     - lastTransitionTime: "2024-09-04T09:39:14Z"
    logger.go:42: 09:59:01 | case/0-apply |       reason: ReconcileSuccess
    logger.go:42: 09:59:01 | case/0-apply |       status: "True"
    logger.go:42: 09:59:01 | case/0-apply |       type: Synced
    logger.go:42: 09:59:01 | case/0-apply |     - lastTransitionTime: "2024-09-04T09:39:14Z"
    logger.go:42: 09:59:01 | case/0-apply |       reason: Success
    logger.go:42: 09:59:01 | case/0-apply |       status: "True"
    logger.go:42: 09:59:01 | case/0-apply |       type: LastAsyncOperation

This succeeds after about 5 minutes. The extension install still fails, though, which ultimately blocks my PR. All of this works in my own Azure environment.

drew0ps commented 3 weeks ago

Quick update here: the AKS cluster does not install because there is no free-tier AKS in West Europe anymore. In North Europe it still fails, but most probably for a different reason. I could reproduce it with make e2e, though, and there I get the following event in the Azure Activity Log, as I mentioned in my PR:

The PutAgentPoolHandler.PUT request limit has been exceeded for agentpool /subscriptions/mysub/resourcegroups/example-rg-kce/providers/Microsoft.ContainerService/managedClusters/example-kc-kce/agentPools/default, please retry again in 36 seconds

The same message comes up every time, so retrying doesn't help.

drew0ps commented 2 weeks ago

For some reason the AKS cluster creation enters an update loop trying to remove a field called maxSurge. However, I found that if the field is added to the manifests this doesn't happen. It still seems to be a bug, although there is a workaround:

spec:
  forProvider:
    defaultNodePool:
    - name: default
      nodeCount: 1
      vmSize: Standard_D2_v2
      upgradeSettings:
      - maxSurge: '10%'
...
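
For anyone who wants to try the workaround on a complete manifest, here is a minimal sketch that combines the fragment above with the KubernetesCluster example posted earlier in this thread. It assumes the remaining fields (dnsPrefix, identity, location, resourceGroupNameSelector) carry over unchanged from that earlier example and that a matching ResourceGroup with the same label exists; it has not been verified against the pipeline.

```yaml
apiVersion: containerservice.azure.upbound.io/v1beta1
kind: KubernetesCluster
metadata:
  labels:
    testing.upbound.io/example-name: example
  name: example
spec:
  forProvider:
    defaultNodePool:
    - name: default
      nodeCount: 1
      vmSize: Standard_D2_v2
      # Workaround from this thread: set maxSurge explicitly so the
      # provider does not keep issuing updates that try to remove it.
      upgradeSettings:
      - maxSurge: '10%'
    # Fields below are assumed unchanged from the earlier example.
    dnsPrefix: example-aks
    identity:
    - type: SystemAssigned
    location: North Europe
    resourceGroupNameSelector:
      matchLabels:
        testing.upbound.io/example-name: example
```

If the update loop is the cause, the PutAgentPoolHandler.PUT throttling events should stop appearing in the Azure Activity Log once maxSurge is set explicitly.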