crossplane-contrib / provider-upjet-gcp

Official GCP Provider for Crossplane by Upbound.
Apache License 2.0

[Bug]: NodePool's `queuedProvisioning` field doesn't take effect #570

Closed vfarcic closed 1 month ago

vfarcic commented 1 month ago

### Is there an existing issue for this?

### Affected Resource(s)

container.gcp.upbound.io/v1beta2 - NodePool

### Resource MRs required to reproduce the bug

_No response_

### Steps to Reproduce

Create a node pool attached to a GKE cluster.

### What happened?

Since one of the more recent releases of the provider, after a node pool is created the provider continuously tries to set `queued_provisioning`. The same setup worked in older versions of the provider. According to the docs, there is no parameter that can be used to configure queued provisioning.

### Relevant Error Output Snippet

From events:

  Warning  CannotUpdateExternalResource     2s (x4 over 12s)     managed/container.gcp.upbound.io/v1beta1, kind=nodepool  async update failed: refuse to update the external resource because the following update requires replacing it: cannot change the value of the argument "queued_provisioning.#" from "" to ""


### Crossplane Version

v1.16.0

### Provider Version

v1.5.0

### Kubernetes Version

_No response_

### Kubernetes Distribution

_No response_

### Additional Info

_No response_
miraai commented 1 month ago

Hello! I am facing the same issue, using the same versions. I use the v1beta1 version of the resource. In v1beta2 the field actually exists, see here.

But I would like to know how to resolve this in v1beta1. If it's a change related to v1beta2, why would we see it in v1beta1?

My error output for reference:

Warning  CannotUpdateExternalResource  1s (x2 over 1s)  managed/container.gcp.upbound.io/v1beta1, kind=nodepool  async update failed: refuse to update the external resource because the following update requires replacing it: cannot change the value of the argument "queued_provisioning.#" from "" to ""

Update: same thing happening on v1beta2 :/

vfarcic commented 1 month ago

My bad... I thought that the marketplace docs always show the latest version, which is true for the provider version but false for the API version. The default is v1beta1 even though the latest is v1beta2.

miraai commented 1 month ago

Did you manage to make it work? I keep getting the same error even on v1beta2. I did set `queuedProvisioning.enabled: false`, though.

mergenci commented 1 month ago

@vfarcic I believe that the marketplace lists the storage version on the provider page. All versions of a resource can be found in the version selection box on the resource page. It would be clearer, in my humble opinion, to list both the storage version and the latest version on the provider page.

@miraai I haven't tested, but I don't expect queuedProvisioning to work unless you're using the default value. Here's why: queuedProvisioning was a beta field in Terraform provider v5.19.0, and beta fields in the Terraform provider are excluded from our APIs. We upgraded to Terraform provider v5.28.0, in which queuedProvisioning is no longer a beta field. You can see that the generated API began to include queuedProvisioning at that time. Our storage version is still v1beta1, though. Therefore, unless we implement a custom conversion webhook, the field will be lost when the API server converts the object from v1beta2 to v1beta1 for storage.

I don't know how we are supposed to handle cases like this. I'll investigate deeper.
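To make the version mismatch above concrete, here is a minimal sketch of how the field appears in the v1beta2 API (the resource names are placeholders, not taken from this thread):

```yaml
# Hypothetical example. queuedProvisioning exists only in the generated
# v1beta2 schema; because the storage version is v1beta1 and there is no
# conversion webhook, the API server drops the field when it stores the
# object, which is consistent with the behavior described above.
apiVersion: container.gcp.upbound.io/v1beta2
kind: NodePool
metadata:
  name: example-node-pool        # placeholder name
spec:
  forProvider:
    cluster: example-cluster     # placeholder cluster reference
    nodeCount: 1
    queuedProvisioning:
      enabled: true              # silently lost on conversion to v1beta1
```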

miraai commented 1 month ago

@mergenci Totally understand, but no matter what I do, whether it's v1beta1 or v1beta2, and whether I set queuedProvisioning or omit it, I still get the error and my node pool status is always ReconcileError. :/

The cluster and node pool work properly; everything is connected as it should be. Do you have any advice on how I can deal with this in the meantime? Using the default node pool does not trigger this error, but I need autoscaling for some clusters.

Thank you!
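(Editorial note, not advice given in this thread: one possible stopgap is pausing reconciliation. Crossplane managed resources honor the `crossplane.io/paused` annotation, which stops the provider from reconciling the resource, and therefore from retrying the failing update, while leaving the external node pool running. Be aware that drift detection and correction are paused as well. A sketch with a placeholder name:)

```yaml
apiVersion: container.gcp.upbound.io/v1beta2
kind: NodePool
metadata:
  name: example-node-pool   # placeholder name
  annotations:
    # Stops reconciliation of this managed resource, so the failing
    # queued_provisioning update is no longer retried. Remove the
    # annotation (or set it to "false") to resume reconciliation.
    crossplane.io/paused: "true"
```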

mergenci commented 1 month ago

@miraai I see 🤔 At the moment, I don't have anything in mind that might help. Can you post example manifests here, so that it's easier for us to reproduce your issue?

miraai commented 1 month ago

Sure thing!

These are the resources; I'm just using this composition for testing:

resources:
  - name: cluster
    base:
      apiVersion: container.gcp.upbound.io/v1beta2
      kind: Cluster
      spec:
        providerConfigRef:
          name: gcp-provider-config
        writeConnectionSecretToRef:
          namespace: crossplane
          name: gkecluster
        forProvider:
          location: us-central1
          initialNodeCount: 1
          removeDefaultNodePool: true
          deletionProtection: false
          enableIntranodeVisibility: true
          network: idp-network
          management:
            autoRepair: true
            autoUpgrade: true
    patches:
    - fromFieldPath: spec.name
      toFieldPath: metadata.name
    - fromFieldPath: spec.name
      toFieldPath: spec.writeConnectionSecretToRef.name
      transforms:
      - type: string
        string:
          fmt: '%s-secrets'
    - fromFieldPath: spec.claimRef.namespace
      toFieldPath: spec.writeConnectionSecretToRef.namespace
    - fromFieldPath: spec.parameters.version
      toFieldPath: spec.forProvider.minMasterVersion
    - type: ToCompositeFieldPath
      fromFieldPath: metadata.name
      toFieldPath: status.clusterName
    - type: ToCompositeFieldPath
      fromFieldPath: status.conditions[0].reason
      toFieldPath: status.controlPlaneStatus
    connectionDetails:
    - fromConnectionSecretKey: kubeconfig
  - name: nodepool
    base:
      apiVersion: container.gcp.upbound.io/v1beta2
      kind: NodePool
      spec:
        providerConfigRef:
          name: gcp-provider-config
        forProvider:
          nodeLocations:
            - us-central1-a
            - us-central1-b
            - us-central1-c
          clusterSelector:
            matchControllerRef: true
          nodeConfig:
            oauthScopes:
              - https://www.googleapis.com/auth/cloud-platform
          autoscaling:
            maxNodeCount: 3
            minNodeCount: 1
          management:
            autoRepair: true
            autoUpgrade: true
    patches:
    - fromFieldPath: spec.name
      toFieldPath: metadata.name
    - fromFieldPath: spec.claimRef.namespace
      toFieldPath: spec.writeConnectionSecretToRef.namespace
    - fromFieldPath: spec.parameters.version
      toFieldPath: spec.forProvider.version
    - fromFieldPath: spec.parameters.minNodeCount
      toFieldPath: spec.forProvider.initialNodeCount
    - fromFieldPath: spec.parameters.minNodeCount
      toFieldPath: spec.forProvider.autoscaling.minNodeCount
    - fromFieldPath: spec.parameters.nodeSize
      toFieldPath: spec.forProvider.nodeConfig.machineType
      transforms:
        - type: map
          map:
            small: n2-standard-4
            medium: n2-standard-8
            large: n2-standard-16
    - type: ToCompositeFieldPath
      fromFieldPath: status.conditions[0].reason
      toFieldPath: status.nodePoolStatus

The composite resource, in case it matters:

kind: XCompositeCluster
metadata:
  name: np-cluster
spec:
  name: np-cluster
  compositionSelector:
    matchLabels:
      provider: google
  parameters:
    version: 1.29.4-gke.1043004
    nodeSize: small
    minNodeCount: 1

We are still in the early stages of adopting Crossplane for GCP, but it would be great if I did not have to worry about this going forward. It could, of course, be that I am missing something small that's causing this :D

mergenci commented 1 month ago

Regarding my above comment that

> beta fields in the Terraform provider are excluded from our APIs

@ulucinar rightfully asked how beta fields are represented in the Terraform schema. It really seems like there is no such thing as a beta field in the schema. The PR that introduced the queuedProvisioning field has no implementation at all! The field certainly wasn't in the resource's schema at the time it was introduced, so there might be another mechanism for handling such fields, even though that sounds rather contrived. queuedProvisioning was implemented in a later PR.

mergenci commented 1 month ago

@miraai, this issue is similar to https://github.com/crossplane-contrib/provider-upjet-gcp/issues/561. We will address it as I explained in my comment there. We might need a special case for this resource, because we have to handle the beta queuedProvisioning field for v1beta1. I'm not sure how we can do so, but I'm fairly confident that v1beta2 will work as intended after the fix.
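For background on the fix being discussed, the conversion path is configured on the CRD itself. This is the standard Kubernetes CRD conversion stanza, shown here with placeholder service details rather than the provider's actual webhook configuration:

```yaml
# Sketch of the standard CRD conversion stanza (placeholder names).
# Without a webhook, the API server uses "strategy: None", which only
# rewrites apiVersion and therefore drops fields that exist in v1beta2
# but not in the v1beta1 storage version.
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: provider-gcp-webhook   # placeholder service name
          namespace: crossplane-system # placeholder namespace
          path: /convert
```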

withnale commented 1 month ago

Do we know when a fix is likely? Upgrading to the latest provider breaks all of our existing node pools.

mergenci commented 1 month ago

@withnale, I'm not able to share a timeline. Conversion webhook issues, like this issue, are our top priority at the moment.

miraai commented 1 month ago

@mergenci Thank you for looking into this! I will stick with the default node pool for the PoC phase we are in, so hopefully this gets resolved by the time we move past the PoC. Will we get notified here when it is done, or should I follow the other issue as well?

jeanduplessis commented 1 month ago

@miraai you can keep an eye on the releases section for this provider or the #announcements channel in the Crossplane Slack workspace to get notified when the fix is released.