GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

Node pool left with conditions error when created without quota #321

Open Scorpiion opened 3 years ago

Scorpiion commented 3 years ago

Describe the bug I believe this is a bug. We created new node pools without enough CPU quota, so we just increased the quota and waited for Config Connector to retry. It did retry, and the node pool came up and ran as it should. However, the GKE UI shows status "ERROR", and in the CLI describe output I still see this error message long after I increased the quota and the node pool was created (days later):

conditions:
- code: GCE_QUOTA_EXCEEDED
  message: "Instance 'gke-PROJECT-ID--shared-gvisor-no-8f60c54d-6q63' creation\
    \ failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."
- message: "Insufficient quota to satisfy the request: Not all instances running in\
    \ IGM after 22.733968139s. Expected 1, running 0, transitioning 1. Current errors:\
    \ [GCE_QUOTA_EXCEEDED]: Instance 'gke-PROJECT-ID--shared-gvisor-no-8f60c54d-6q63'\
    \ creation failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."
- code: GCE_QUOTA_EXCEEDED
  message: "Instance 'gke-PROJECT-ID--shared-gvisor-no-d6494a35-hllh' creation\
    \ failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."
- message: "Insufficient quota to satisfy the request: Not all instances running in\
    \ IGM after 20.972921476s. Expected 1, running 0, transitioning 1. Current errors:\
    \ [GCE_QUOTA_EXCEEDED]: Instance 'gke-PROJECT-ID--shared-gvisor-no-d6494a35-hllh'\
    \ creation failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."
- code: GCE_QUOTA_EXCEEDED
  message: "Instance 'gke-PROJECT-ID--shared-gvisor-no-76aced21-lmvz' creation\
    \ failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."
- message: "Insufficient quota to satisfy the request: Not all instances running in\
    \ IGM after 20.806883191s. Expected 1, running 0, transitioning 1. Current errors:\
    \ [GCE_QUOTA_EXCEEDED]: Instance 'gke-PROJECT-ID--shared-gvisor-no-76aced21-lmvz'\
    \ creation failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."

...

status: ERROR
statusMessage: "europe-north1-c: Insufficient quota to satisfy the request: Not all\
  \ instances running in IGM after 22.733968139s. Expected 1, running 0, transitioning\
  \ 1. Current errors: [GCE_QUOTA_EXCEEDED]: Instance 'gke-PROJECT-ID--shared-gvisor-no-8f60c54d-6q63'\
  \ creation failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1.;\
  \ europe-north1-a: Insufficient quota to satisfy the request: Not all instances\
  \ running in IGM after 20.972921476s. Expected 1, running 0, transitioning 1. Current\
  \ errors: [GCE_QUOTA_EXCEEDED]: Instance 'gke-PROJECT-ID--shared-gvisor-no-d6494a35-hllh'\
  \ creation failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1.;\
  \ europe-north1-b: Insufficient quota to satisfy the request: Not all instances\
  \ running in IGM after 20.806883191s. Expected 1, running 0, transitioning 1. Current\
  \ errors: [GCE_QUOTA_EXCEEDED]: Instance 'gke-PROJECT-ID--shared-gvisor-no-76aced21-lmvz'\
  \ creation failed: Quota 'C2_CPUS' exceeded.  Limit: 0.0 in region europe-north1."

Workaround When I later created a new node pool with the exact same settings (but with quota available from the start), I didn't get any errors. So my workaround was to delete the node pool and recreate it; after that it works and the error is gone.
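For reference, here is a minimal sketch of that workaround via Config Connector; the resource name, namespace, and manifest file below are placeholders I'm assuming for illustration, not values from the issue:

# Delete the errored ContainerNodePool resource; with the default deletion
# policy, Config Connector also deletes the underlying GKE node pool.
kubectl delete containernodepools.container.cnrm.cloud.google.com node-pool-x -n xxxxxxxxxxxxxx

# Re-apply the same spec to recreate the node pool now that quota is available.
kubectl apply -f node-pool-x.yaml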

ConfigConnector Version Run the following command to get the current ConfigConnector version

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}' 
1.27.1

To Reproduce

  1. Set quota to 0 for a CPU type (or pick one that is already 0; I'm not sure whether the same error happens when the quota is nonzero but too small. We hit this because europe-north1 did not previously offer C2 machines)
  2. Wait for the quota change to really propagate (to be safe, wait 30 minutes; I have found that 15 minutes is sometimes not enough)
  3. Create a new node pool (I think it should not matter which cluster it's for; see the example manifest after this list)
  4. Wait until you see the error (should take seconds to a minute) from
    kubectl describe containernodepools.container.cnrm.cloud.google.com -n xxxxxxxxxxxxxx node-pool-x
  5. Increase the quota to a high enough value
  6. Wait X minutes until the node pool is up
  7. Check that the node pool works: deploy some workloads and check the "Node pool details" view in the GKE UI
  8. Note the status ERROR in the GKE UI
  9. Note an error message similar to mine above from
    gcloud container node-pools describe --project=xxxxxx --region=xxxxxxxxx --cluster=xxxxxxxxxxxxxx node-pool-x
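A hypothetical manifest for step 3; all names, the cluster reference, and the machine type are assumptions for illustration, not values from the issue:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: node-pool-x            # placeholder node pool name
  namespace: xxxxxxxxxxxxxx    # namespace watched by Config Connector
spec:
  location: europe-north1      # region where the quota is 0
  clusterRef:
    name: my-cluster           # hypothetical existing ContainerCluster
  initialNodeCount: 1
  nodeConfig:
    machineType: c2-standard-4 # machine type backed by the zeroed C2_CPUS quota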
maqiuyujoyce commented 3 years ago

Hi @Scorpiion, thank you for reporting the issue. Could you confirm whether the behavior is consistent? I tested the scenario with both the "CPU" and "C2_CPUS" quotas on version 1.27.1, and I couldn't reproduce it.

Here are the steps I followed:

  1. Used up the current CPU/C2_CPUS quota in region europe-north1.
  2. Created a new node pool (E2/C2 machineType) in zone europe-north1-a.
  3. Waited until I saw the following error (example is for C2):
    Update call failed: error applying desired state: summary: error creating NodePool: googleapi: Error 403: Insufficient regional quota to satisfy request: resource "C2_CPUS": request requires '4.0' and is short '4.0'. project has a quota of '8.0' with '0.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=<my-project-id>., forbidden, detail: 
  4. Requested a quota increase and waited for approval.
  5. Waited until the Config Connector resource showed UpToDate in its status (see the check after this list).
  6. Deployed a pod to a specific node in the node pool I had just created and verified the pod was running.
  7. Checked the node pool's details in the Cloud Console and via gcloud. No error found.
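One way to perform the check in step 5 from the command line (resource and namespace names are placeholders): Config Connector records its reconciliation state in the resource's Ready condition, whose reason becomes UpToDate once the node pool has been reconciled.

# Print the Ready condition's reason for the node pool resource;
# expect "UpToDate" once reconciliation has succeeded.
kubectl get containernodepools.container.cnrm.cloud.google.com node-pool-x \
  -n xxxxxxxxxxxxxx -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'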