GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

GKE - Error 400: Node_pool_id must be specified. #653

Open · travisrandolph-bestbuy opened this issue 2 years ago

travisrandolph-bestbuy commented 2 years ago

Bug Description

We have roughly 10 clusters showing the error below. I believe the issue stems from users manually deleting the cluster from the console. The KCC resources aren't removed and the cluster is rebuilt on the next reconciliation. I tried to abandon and delete the resources from KCC and then re-acquire them. The error still persisted after acquisition.
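
For reference, the abandon/re-acquire flow we used looks roughly like this (resource name, namespace, and manifest file name are placeholders):

# Tell KCC to leave the underlying GKE cluster in place when the K8s object is deleted
kubectl annotate containercluster CLUSTER_NAME -n NAMESPACE \
  cnrm.cloud.google.com/deletion-policy=abandon --overwrite

# Remove the KCC resource (the GCP cluster itself is abandoned, not deleted)
kubectl delete containercluster CLUSTER_NAME -n NAMESPACE

# Re-acquire by applying the original manifest again
kubectl apply -f cluster.yaml -n NAMESPACE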

Additional Diagnostic Information

status:
  conditions:
  - lastTransitionTime: "2022-03-16T21:29:03Z"
    message: 'Update call failed: error applying desired state: summary: googleapi:
      Error 400: Node_pool_id must be specified., badRequest'
    reason: UpdateFailed
    status: "False"

Config Connector Version

1.76.0

Config Connector Mode

namespaced

maqiuyujoyce commented 2 years ago

Hi @travisrandolph-bestbuy , thanks for reporting the issue.

I tried to reproduce it using the following resource YAML and steps, but the cluster was recreated successfully without any issue:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: test-1
  namespace: test
spec:
  location: us-central1-a
  initialNodeCount: 1

1. Installed Config Connector 1.76.0 in the namespaced mode.
2. Created the GKE cluster using the snippet above via Config Connector.
3. Deleted the GKE cluster via Cloud Console.
4. Waited for the drift correction of the ContainerCluster resource.
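
Concretely, steps 2-4 map to commands roughly like the following (the manifest file name is a placeholder):

# Step 2: create the cluster through Config Connector
kubectl apply -n test -f containercluster.yaml

# Step 3: delete the underlying GKE cluster directly, simulating the manual console deletion
gcloud container clusters delete test-1 --zone us-central1-a --quiet

# Step 4: watch KCC detect the drift and recreate the cluster
kubectl get containercluster test-1 -n test -w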

Could you provide more details about your scenario? It would be great if you could provide the YAML snippet that results in the "node_pool_id must be specified" issue.

The KCC resources aren't removed and the cluster is rebuilt on the next reconciliation.

What's the status of the cluster (the GCP resource) after rebuild? Is there a similar error in the resource status/command output when you run gcloud container clusters describe?
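
For example, something along these lines (CLUSTER_NAME is a placeholder; use --zone instead of --region for zonal clusters):

gcloud container clusters describe CLUSTER_NAME --region us-central1 --format='value(status)'
gcloud container clusters describe CLUSTER_NAME --region us-central1 --format='yaml(nodePools)'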

I tried to abandon and delete the resources from KCC and then re-acquire them.

Did you acquire the resource using the same YAML? Have you tried to acquire the cluster using the YAML exported by the config-connector tool?
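
For reference, the export tool is typically invoked with the resource's full URL; the exact URL format and flags below are assumptions, so please double-check them against the export documentation:

# Assumed invocation; PROJECT_ID and CLUSTER_NAME are placeholders, and the
# resource URL format is a guess based on the asset-inventory naming scheme.
config-connector export "//container.googleapis.com/projects/PROJECT_ID/locations/us-central1/clusters/CLUSTER_NAME"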

travisrandolph-bestbuy commented 2 years ago

It would be great if you could provide the YAML snippet that results in the "node_pool_id must be specified" issue.

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
  name: {{ REDACTED }}
spec:
  location: us-central1
  nodeLocations:
  - us-central1-b
  - us-central1-c
  initialNodeCount: 1
  releaseChannel:
    channel: STABLE
  notificationConfig:
    pubsub:
      enabled: true
      topicRef:
        name: {{ REDACTED }}
  maintenancePolicy:
    recurringWindow:
      endTime: 2099-01-01T23:00:00Z
      recurrence: FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR
      startTime: 2021-01-01T15:00:00Z
  enableBinaryAuthorization: true
  verticalPodAutoscaling:
    enabled: true
  networkRef:
    name: {{ REDACTED }}
  subnetworkRef:
    name: {{ REDACTED }}
  addonsConfig:
    httpLoadBalancing:
      disabled: false
  masterAuthorizedNetworksConfig:
    cidrBlocks:
    - cidrBlock: {{ REDACTED }}
  privateClusterConfig:
    enablePrivateEndpoint: true
    enablePrivateNodes: true
    masterIpv4CidrBlock: {{ REDACTED }}
  ipAllocationPolicy:
    clusterIpv4CidrBlock: {{ REDACTED }}
    clusterSecondaryRangeName: {{ REDACTED }}
    servicesIpv4CidrBlock: {{ REDACTED }}
    servicesSecondaryRangeName: {{ REDACTED }}
  databaseEncryption:
    state: ENCRYPTED
    keyName: projects/{{ REDACTED }}/locations/us-central1/keyRings/{{ REDACTED }}/cryptoKeys/{{ REDACTED }}
  enableShieldedNodes: true
  workloadIdentityConfig:
    identityNamespace: {{ REDACTED }}.svc.id.goog
  nodeConfig:
    imageType: COS
    metadata:
      disable-legacy-endpoints: "true"
      block-project-ssh-keys: "true"
    oauthScopes:
    - https://www.googleapis.com/auth/cloud-platform
    shieldedInstanceConfig:
      enableSecureBoot: true
      enableIntegrityMonitoring: true
    tags:
    - {{ REDACTED }}
---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: {{ REDACTED }}
spec:
  location: us-central1
  nodeLocations:
  - us-central1-a
  - us-central1-b
  - us-central1-c
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 1
    maxNodeCount: 10
  nodeConfig:
    serviceAccountRef:
      name: {{ REDACTED }}
    imageType: COS_CONTAINERD
    machineType: n2-standard-2
    tags:
    - {{ REDACTED }}
    preemptible: false
    oauthScopes:
    - https://www.googleapis.com/auth/cloud-platform
    metadata:
      disable-legacy-endpoints: "true"
      block-project-ssh-keys: "true"
    shieldedInstanceConfig:
      enableSecureBoot: true
      enableIntegrityMonitoring: true
  management:
    autoRepair: true
    autoUpgrade: true
  clusterRef:
    name: {{ REDACTED }}

What's the status of the cluster (the GCP resource) after rebuild? Is there a similar error in the resource status/command output when you run gcloud container clusters describe?

The status of the cluster is "RUNNING".

Did you acquire the resource using the same YAML? Have you tried to acquire the cluster using the YAML exported by the config-connector tool?

We are just using the same YAML that we originally used to deploy the cluster.

travisrandolph-bestbuy commented 2 years ago

@maqiuyujoyce after further review on our end, we agree that the issue isn't due to deleting the cluster from the console. One thing we've noticed is that some of our clusters' nodeConfig doesn't match what we deployed them with. For instance, we deployed a cluster with imageType: "COS", but the KCC cluster resource has "COS_CONTAINERD". Both node pools in the cluster have an imageType of "COS_CONTAINERD", but that shouldn't change the config of the KCC cluster resource, right?
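
One way to see the mismatch side by side (names are placeholders):

# What the KCC ContainerCluster resource has for the default node config
kubectl get containercluster CLUSTER_NAME -n NAMESPACE -o jsonpath='{.spec.nodeConfig.imageType}'

# What GKE reports for each node pool
gcloud container node-pools list --cluster CLUSTER_NAME --region us-central1 --format='table(name, config.imageType)'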

mbzomowski commented 2 years ago

Hi @travisrandolph-bestbuy, can you tell me which version of GKE you're running on these clusters? This issue could stem from specifying the COS imageType in the YAML, as Docker-based image types are no longer supported in GKE 1.24 and later.
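
For example, the master and node versions can be checked with something like this (CLUSTER_NAME is a placeholder; use --zone for zonal clusters):

gcloud container clusters describe CLUSTER_NAME --region us-central1 \
  --format='value(currentMasterVersion, currentNodeVersion)'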

erik-carlson commented 2 years ago

I'm working with Travis on this - none of the clusters are at 1.24, they're all at 1.20.

I think that the other key piece of information is that we have multiple node pools attached to the cluster. If we remove one of the two node pools from the cluster, then the cluster resource changes to an "updating" status.

Once that happens, the node pools are constantly in an updating status, which I believe is because of the imageType. In this case the imageType of the node pools is COS_CONTAINERD and the cluster's nodeConfig imageType is COS. The cluster will update the imageType to COS, but then KCC reconciles it back to COS_CONTAINERD, then the cluster updates it back again, and on and on. Of course, we can't update the nodeConfig of the cluster resource because it is immutable. This may be an issue for a separate ticket.
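
The churn is easiest to see by watching the resources and their events (names are placeholders):

# Watch the node pool resource status as it flips in and out of Updating
kubectl get containernodepool -n NAMESPACE -w

# Look at recent events for the cluster resource
kubectl get events -n NAMESPACE --field-selector involvedObject.name=CLUSTER_NAME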

erik-carlson commented 2 years ago

Another update on this: once we removed the imageType from the nodeConfig of the cluster resource, we no longer see the "node_pool_id must be specified" error, nor does the node pool go into an endless update loop. To do this we had to abandon the resource and re-acquire it with the new nodeConfig, since nodeConfig on the KCC resource is immutable. This seems to stem from the fact that the nodeConfig of the cluster can be changed from the GCP side, and then KCC can change it back, even though it is "immutable".
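
In other words, using the same abandon/re-acquire flow mentioned earlier in the thread, the only change on re-acquisition was dropping imageType from the cluster's nodeConfig, roughly:

nodeConfig:
  # imageType removed; everything else unchanged
  metadata:
    disable-legacy-endpoints: "true"
    block-project-ssh-keys: "true"
  oauthScopes:
  - https://www.googleapis.com/auth/cloud-platform
  shieldedInstanceConfig:
    enableSecureBoot: true
    enableIntegrityMonitoring: true
  tags:
  - {{ REDACTED }}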

This is no longer blocking us, though it still seems like an issue that KCC can only update a cluster's config when there is a single node pool attached to it.

jcanseco commented 2 years ago

Hi @erik-carlson. Thank you for keeping us up to date. We admittedly do not yet know the root cause, but I am happy to hear that you are no longer blocked on this issue.

That said, I should point out that you should not actually be specifying spec.nodeConfig on ContainerCluster if you intend to remove the default node pool and create custom node pools using ContainerNodePool. We are aware that this is not sufficiently communicated in the docs, though, so we'll file some internal tasks for that.

GKE has this weird behavior where it considers the "first node pool in the list of node pools" to be the default node pool, so your cluster's spec.nodeConfig may end up conflicting with the config of the first node pool in the list, especially whenever the first node pool changes (e.g. when you delete node pools). I suspect this might have some role to play in this issue.
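
A minimal sketch of the recommended shape, then (names are placeholders): the ContainerCluster carries the remove-default-node-pool annotation and no spec.nodeConfig, and all node settings live on the ContainerNodePool.

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
  name: example-cluster
spec:
  location: us-central1
  initialNodeCount: 1
  # no spec.nodeConfig here; node settings belong on the ContainerNodePool below
---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: example-node-pool
spec:
  location: us-central1
  initialNodeCount: 1
  clusterRef:
    name: example-cluster
  nodeConfig:
    machineType: n2-standard-2
    imageType: COS_CONTAINERD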

jravetch commented 1 year ago

We're having a similar issue in that we get the same error when trying to update an existing node pool from COS to COS_CONTAINERD. The cluster has multiple node pools. What is the recommended way to update the image type from COS to containerd on an existing node pool via KCC? This node pool is created by ContainerNodePool.

Update call failed: error applying desired state: summary: googleapi: Error 400: Node_pool_id must be specified., badRequest