GoogleCloudPlatform / cloud-foundation-toolkit

The Cloud Foundation toolkit provides GCP best practices as code.
Apache License 2.0
969 stars 459 forks

Unable to provision a GKE cluster when using releaseChannel with Deployment Manager #546

Closed patrickmslatteryvt closed 4 years ago

patrickmslatteryvt commented 4 years ago

So I cloned the latest changes to the GKE template that added the releaseChannel setting from: https://github.com/GoogleCloudPlatform/cloud-foundation-toolkit/pull/539

But it does not seem to work at all.

To troubleshoot it, I created a very minimal DM config based on: https://raw.githubusercontent.com/GoogleCloudPlatform/cloud-foundation-toolkit/master/dm/templates/gke/examples/gke_regional_private.yaml but using our lab project's network settings:

# Test of the GKE (Google Kubernetes Engine) template.

imports:
  - path: templates/gke/gke.py
    name: gke.py

resources:
  - name: regional-test-cluster
    type: gke.py
    properties:
      region: us-east4
      cluster:
        name: regional-test-cluster
#        releaseChannel:
#          channel: REGULAR
        network: mi9-com-lab-002-net
        subnetwork: kube-10-230-51-0m26
        loggingService: "logging.googleapis.com/kubernetes"
        monitoringService: "monitoring.googleapis.com/kubernetes"
        privateClusterConfig:
          enablePrivateNodes: true
          masterIpv4CidrBlock: 10.230.255.0/28
        ipAllocationPolicy:
          useIpAliases: true
          createSubnetwork: false
          clusterSecondaryRangeName: k8s-pods
          servicesSecondaryRangeName: k8s-services
        nodePools:
          - name: default
            initialNodeCount: 1
            config:
              localSsdCount: 0
              oauthScopes:
                - https://www.googleapis.com/auth/compute
                - https://www.googleapis.com/auth/devstorage.read_only
                - https://www.googleapis.com/auth/logging.write
                - https://www.googleapis.com/auth/monitoring
        locations:
          - us-east4-a
          - us-east4-b
          - us-east4-c

If I deploy it as above, it creates a cluster. The cluster creation eventually times out due to a network issue, but it does actually create the 3 cluster nodes. (That network issue is unrelated and I'm not going to chase it down for this test; the core thing is that the cluster is actually created.)

But if I enable the commented-out releaseChannel lines above, I get this error immediately when I try to deploy:

gcloud deployment-manager \
  deployments create \
  --preview \
  --automatic-rollback-on-error \
  "${CLUSTER_NAME}-gke-deploy" \
  --config="./deployments/${CONFIG_FILE}"

gcloud deployment-manager \
  deployments \
  update "${CLUSTER_NAME}-gke-deploy"

The fingerprint of the deployment is Uyktd1Hwyh5z8Df5uGLxeA==
Waiting for update [operation-1580176511650-59d2983a8ca59-4fe769ce-dec0d160]...failed.
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1580176511650-59d2983a8ca59-4fe769ce-dec0d160]: errors:
- code: RESOURCE_ERROR
  location: /deployments/mi9-la2-ge4-007-gke-deploy/resources/mi9-la2-ge4-007
  message: '{"ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Master
    version 1.15.7-gke.23 must be set to the default REGULAR releaseChannel version
    1.14.8-gke.33.","status":"INVALID_ARGUMENT","statusMessage":"Bad Request","requestPath":"https://container.googleapis.com/v1beta1/projects/mi9-com-lab-002-7d72-10570/locations/us-east4/clusters","httpMethod":"POST"}}'

I note that there is a similar issue logged against the Terraform GKE module at: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/383

Any ideas?

Note: This used to work up until mid-December 2019 or so. I had written my own version of the patch for using releaseChannel with Deployment Manager some time back and it worked perfectly. Then one day it just suddenly stopped working... I can only surmise there was a change in the API that isn't obvious to me.

patrickmslatteryvt commented 4 years ago

Google Support suggested this fix to me and it did indeed work: you need to set the initialClusterVersion field to the version currently supported by the release channel. You can get this value from: https://cloud.google.com/kubernetes-engine/docs/release-notes-regular or from the error message that Deployment Manager throws.

+ initialClusterVersion: "1.14.8-gke.33"
  releaseChannel:
    channel: REGULAR

It's an ugly hack but it works...

patrickmslatteryvt commented 4 years ago

One more gotcha: non-default node pools also need the initial cluster version specified:

           - name: nodepool2
+            version: "GKE_VERSION"
             initialNodeCount: MIN_NODE_COUNT

Replace GKE_VERSION with the same version as the initialClusterVersion, and MIN_NODE_COUNT with the node count you want for that pool.
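
For anyone landing here later, this is roughly how the two changes fit together in the test config from the top of this issue. It is only a sketch: the version string is the one from the error message above and will be out of date, the second node pool (nodepool2) and its node count are made up for illustration, and the network, privateClusterConfig, ipAllocationPolicy and locations settings are left out for brevity.

```yaml
# Sketch only. Substitute the release channel's current default version;
# "1.14.8-gke.33" is just the value reported in the error message above.
imports:
  - path: templates/gke/gke.py
    name: gke.py

resources:
  - name: regional-test-cluster
    type: gke.py
    properties:
      region: us-east4
      cluster:
        name: regional-test-cluster
        # Must match the default version of the chosen release channel.
        initialClusterVersion: "1.14.8-gke.33"
        releaseChannel:
          channel: REGULAR
        nodePools:
          - name: default
            initialNodeCount: 1
          # Hypothetical extra pool: non-default pools need the version too.
          - name: nodepool2
            version: "1.14.8-gke.33"
            initialNodeCount: 1
```

Keeping the version in one place (for example a shell variable substituted into the config before deploying) would avoid the two values drifting apart, but that is a matter of taste.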