crossplane-contrib/provider-upjet-aws

Official AWS Provider for Crossplane by Upbound.
https://marketplace.upbound.io/providers/upbound/provider-aws
Apache License 2.0

[Bug]: NodeGroup MR is marked READY=true even if readinessCheck has not been satisfied #1228

Open adamhouse opened 6 months ago

adamhouse commented 6 months ago

Affected Resource(s)

eks.aws.upbound.io/v1beta1 -- NodeGroup

Resource MRs required to reproduce the bug

    - name: eks-workers
      base:
        apiVersion: eks.aws.upbound.io/v1beta1
        kind: NodeGroup
        spec:
          deletionPolicy: Delete
          forProvider:
            clusterNameSelector:
              matchControllerRef: true
              matchLabels:
                name: cluster-main
              policy:
                resolution: Required
                resolve: Always
            nodeRoleArnSelector:
              matchControllerRef: true
              matchLabels:
                name: worker-role
              policy:
                resolution: Required
                resolve: Always
            amiType: AL2_x86_64
            scalingConfig:
              - maxSize: 3
                minSize: 1
                desiredSize: 3
      patches:
        - type: PatchSet
          patchSetName: aws-pc
        - type: PatchSet
          patchSetName: region
        - fromFieldPath: spec.vpcConfig.subnets
          toFieldPath: spec.forProvider.subnetIds
        - fromFieldPath: status.worker.launchTemplate.id
          toFieldPath: spec.forProvider.launchTemplate[0].id
          policy:
            fromFieldPath: Required
        - fromFieldPath: status.worker.launchTemplate.version
          toFieldPath: spec.forProvider.launchTemplate[0].version
          transforms:
            - type: convert
              convert:
                toType: string
          policy:
            fromFieldPath: Required
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.status
          toFieldPath: status.worker.status
      readinessChecks:
        - type: MatchString
          fieldPath: status.atProvider.status
          matchString: "ACTIVE"
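
For context, my understanding is that this readinessCheck gates how the composite resource counts the NodeGroup as ready. The XR-side view can be cross-checked with something like the following (the XR kind and name are placeholders, not values from this Composition):

    kubectl get <xr-kind> <xr-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'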

Steps to Reproduce

If access permits, throttle API requests to IAM in the AWS account you're testing in. This should cause the NodeGroup creation to fail.
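
If throttling isn't feasible, a node group can also be pushed into CREATE_FAILED by giving it a node role that is missing a required managed policy (an alternative repro I'm sketching here, not the method used above; the role name is a placeholder):

    # Nodes launched without AmazonEKSWorkerNodePolicy typically fail to
    # join the cluster, and the managed node group ends up CREATE_FAILED.
    aws iam detach-role-policy \
      --role-name <worker-role-name> \
      --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy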

What happened?

If AWS is unable to create the NodeGroup above (resulting in a status of CREATE_FAILED on the AWS side, which is passed through to the resource at status.atProvider.status), Crossplane still marks the MR as READY=true despite a readinessCheck that explicitly requires the string ACTIVE.
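
For anyone reproducing this: the READY column is derived from the MR's Ready condition, so it's worth capturing the full condition alongside the field the readinessCheck reads (resource name as in the output below):

    kubectl get nodegroup <name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
    kubectl get nodegroup <name> -o jsonpath='{.status.atProvider.status}'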

Relevant Error Output Snippet

Example MR.

Resource marked as READY=true

9:09:27 ~ kg nodegroup tgplat-sbx-4-bjfqh-mcxhq
NAME                       READY   SYNCED   EXTERNAL-NAME              AGE
tgplat-sbx-4-bjfqh-mcxhq   True    True     tgplat-sbx-4-bjfqh-mcxhq   9m19s

Value of status.atProvider.status != ACTIVE as defined by readinessCheck.

9:09:32 ~ kg nodegroup tgplat-sbx-4-bjfqh-mcxhq -oyaml | yq -r '.status'
atProvider:
  amiType: AL2_x86_64
  arn: <REMOVED>
  capacityType: ON_DEMAND
  clusterName: tgplat-sbx-4-bjfqh-q544l
  diskSize: 0
  id: tgplat-sbx-4-bjfqh-q544l:tgplat-sbx-4-bjfqh-mcxhq
  launchTemplate:
    - id: lt-06de386050da58058
      name: terraform-20240318135907606600000007
      version: "1"
  nodeRoleArn: <REMOVED>
  releaseVersion: 1.28.5-20240307
  resources:
    - remoteAccessSecurityGroupId: ""
  scalingConfig:
    - desiredSize: 3
      maxSize: 3
      minSize: 1
  status: CREATE_FAILED
  subnetIds:
    - <REMOVED>
    - <REMOVED>
    - <REMOVED>
  tags:
    crossplane-kind: nodegroup.eks.aws.upbound.io
    crossplane-name: tgplat-sbx-4-bjfqh-mcxhq
    crossplane-providerconfig: tgplat-sbx-4-bjfqh-xlc5d
  tagsAll:
    crossplane-kind: nodegroup.eks.aws.upbound.io
    crossplane-name: tgplat-sbx-4-bjfqh-mcxhq
    crossplane-providerconfig: tgplat-sbx-4-bjfqh-xlc5d
  updateConfig:
    - maxUnavailable: 1
      maxUnavailablePercentage: 0
  version: "1.28"

Crossplane Version

v1.15.0-up.1

Provider Version

v1.1.0

Kubernetes Version

v1.28.6-eks-508b6b3

Kubernetes Distribution

EKS

Additional Info

No response

turkenf commented 5 months ago

Hi @adamhouse,

Thank you for raising this issue, could you please try with provider version 1.4.0 and let us know?
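
A minimal upgrade sketch, assuming the EKS family provider is installed under the package name below (adjust to match your install):

    kubectl patch provider.pkg.crossplane.io upbound-provider-aws-eks \
      --type merge \
      -p '{"spec":{"package":"xpkg.upbound.io/upbound/provider-aws-eks:v1.4.0"}}'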

adamhouse commented 3 months ago

@turkenf I'm still seeing the same behavior on provider version 1.8.0, but I'm not sure whether this is a real "problem" or whether I'm simply misunderstanding the intended use of readinessChecks.

My NodeGroup as defined in my Composition:

          - name: eks-worker
            base:
              apiVersion: eks.aws.upbound.io/v1beta1
              kind: NodeGroup
              spec:
                deletionPolicy: Delete
                forProvider:
                  amiType: AL2_x86_64
                  clusterNameSelector:
                    matchControllerRef: true
                    matchLabels:
                      name: cluster-main
                    policy:
                      resolution: Required
                      resolve: Always
                  nodeRoleArnSelector:
                    matchControllerRef: true
                    matchLabels:
                      name: worker-role
                    policy:
                      resolution: Required
                      resolve: Always
                  scalingConfig:
                    - desiredSize: 3
                      maxSize: 3
                      minSize: 1
            patches:
              - type: PatchSet
                patchSetName: aws-pc
              - type: PatchSet
                patchSetName: region
              - type: FromCompositeFieldPath
                fromFieldPath: spec.vpcConfig.subnets
                toFieldPath: spec.forProvider.subnetIds
              - type: FromCompositeFieldPath
                fromFieldPath: status.worker.launchTemplate.id
                toFieldPath: spec.forProvider.launchTemplate[0].id
                policy:
                  fromFieldPath: Required
              - type: FromCompositeFieldPath
                fromFieldPath: status.worker.launchTemplate.version
                toFieldPath: spec.forProvider.launchTemplate[0].version
                policy:
                  fromFieldPath: Required
                transforms:
                  - type: convert
                    convert:
                      toType: string
              - type: ToCompositeFieldPath
                fromFieldPath: status.atProvider.status
                toFieldPath: status.worker.status
            readinessChecks:
              - type: MatchString
                fieldPath: status.atProvider.status
                matchString: ACTIVE

The corresponding Managed Resource:

9:44:25 ~ kg nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk
NAME                      SYNCED   READY   EXTERNAL-NAME             AGE
vpc-cni-ue1-9c7t5-sq5nk   True     True    vpc-cni-ue1-9c7t5-sq5nk   22m

The MR's status.atProvider output:

9:44:38 ~ kg nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk -oyaml | yq -r .status.atProvider
amiType: AL2_x86_64
arn: <REMOVED>
capacityType: ON_DEMAND
clusterName: vpc-cni-ue1-sbx
diskSize: 0
id: vpc-cni-ue1-sbx:vpc-cni-ue1-9c7t5-sq5nk
launchTemplate:
  id: lt-0d97494af0df766d2
  name: terraform-20240628142005627000000003
  version: "1"
nodeRoleArn: <REMOVED>
releaseVersion: 1.29.3-20240625
scalingConfig:
  desiredSize: 3
  maxSize: 3
  minSize: 1
status: CREATING
subnetIds:
  - <REMOVED>
  - <REMOVED>
  - <REMOVED>
tags:
  crossplane-kind: nodegroup.eks.aws.upbound.io
  crossplane-name: vpc-cni-ue1-9c7t5-sq5nk
  crossplane-providerconfig: vpc-cni-ue1-9c7t5-pmrlf
tagsAll:
  crossplane-kind: nodegroup.eks.aws.upbound.io
  crossplane-name: vpc-cni-ue1-9c7t5-sq5nk
  crossplane-providerconfig: vpc-cni-ue1-9c7t5-pmrlf
updateConfig:
  maxUnavailable: 1
  maxUnavailablePercentage: 0
version: "1.29"

Note that the value of status.atProvider.status is CREATING, which conflicts with the readinessCheck I have defined. Given this scenario, should I expect the MR to show READY=false?
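
To narrow down which component is setting READY here, I'd compare the MR's own Ready condition with how the composition machinery sees the resource; a sketch, assuming a Crossplane CLI recent enough to have the beta trace subcommand:

    kubectl get nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
    crossplane beta trace nodegroup.eks.aws.upbound.io vpc-cni-ue1-9c7t5-sq5nk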

github-actions[bot] commented 1 week ago

This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

mitchelldavis44 commented 1 week ago

/fresh bump