GoogleCloudPlatform / pubsec-declarative-toolkit

The GCP PubSec Declarative Toolkit is a collection of declarative solutions to help you on your Journey to Google Cloud. Solutions are designed using Config Connector and deployed using Config Controller.
Apache License 2.0

Autopilot cluster is auto deleting again after 30 min timeout on failure to transition from "creating" state #857

Open fmichaelobrien opened 4 months ago

fmichaelobrien commented 4 months ago

See either the manual or the scripted GKE cluster creation: https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/blob/gh766-script/solutions/setup.sh#L198C1-L198C174

Screenshot 2024-02-27 at 11 55 09 AM

gcloud anthos config controller create "$CLUSTER" --location "$REGION" --network "$NETWORK" --subnet "$SUBNET" --master-ipv4-cidr-block="172.16.0.128/28" --full-management

Workaround: remove `--full-management`

as in https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/blob/main/docs/advanced-install.md#gke-autopilot---recommended

See also the older issue: https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/464
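
To make the Autopilot vs. Standard toggle explicit, here is a minimal sketch. `kcc_create_cmd` is a hypothetical helper (not part of setup.sh); it only prints the create command from this issue, with or without `--full-management`, so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: print the create command from this issue.
# Nothing is executed; the output is meant to be reviewed and pasted.
kcc_create_cmd() {
  # $1 cluster, $2 region, $3 network, $4 subnet, $5 "autopilot" or "standard"
  cmd="gcloud anthos config controller create \"$1\" --location \"$2\" --network \"$3\" --subnet \"$4\" --master-ipv4-cidr-block=\"172.16.0.128/28\""
  if [ "$5" = "autopilot" ]; then
    # Autopilot is selected by --full-management; Standard omits it.
    cmd="$cmd --full-management"
  fi
  printf '%s\n' "$cmd"
}
```

Example: `kcc_create_cmd kcc us-east4 kcc-vpc kcc-us-east4-sn standard` prints the command without the flag.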

Client still has issues with standard GKE cluster

Reproducing on one of my orgs...

fmichaelobrien commented 4 months ago

Testing US clusters now from Cloud Shell:

michael@cloudshell:~$ gcloud config set project kcc-timeout-cso
Updated property [core/project].
michael@cloudshell:~ (kcc-timeout-cso)$ 
export CLUSTER=kcc
export REGION=us-central1
export NETWORK=kcc-vpc
export SUBNET=kcc-sn
export CIDR_KCC_VPC=192.168.0.0/16
gcloud services enable krmapihosting.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable cloudresourcemanager.googleapis.com
gcloud services enable accesscontextmanager.googleapis.com
gcloud services enable cloudbilling.googleapis.com
gcloud services enable serviceusage.googleapis.com
gcloud services enable servicedirectory.googleapis.com
gcloud services enable dns.googleapis.com
gcloud services enable anthos.googleapis.com
# na only
gcloud compute networks create "$NETWORK" --subnet-mode=custom
# all of google
gcloud compute networks create "default"
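
The nine `gcloud services enable` calls above can be collapsed into one, since the command accepts several service names. A minimal sketch; `enable_kcc_apis`, `KCC_APIS` and `DRY_RUN` are hypothetical names, and by default the command is only printed.

```shell
#!/bin/sh
# The APIs enabled one by one in the transcript above.
KCC_APIS="krmapihosting.googleapis.com container.googleapis.com cloudresourcemanager.googleapis.com accesscontextmanager.googleapis.com cloudbilling.googleapis.com serviceusage.googleapis.com servicedirectory.googleapis.com dns.googleapis.com anthos.googleapis.com"

# Hypothetical helper: DRY_RUN=1 (default) prints the command, DRY_RUN=0 runs it.
enable_kcc_apis() {
  # Word splitting of $KCC_APIS is intentional: one argument per API.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo gcloud services enable $KCC_APIS
  else
    gcloud services enable $KCC_APIS
  fi
}
```

Run `DRY_RUN=0 enable_kcc_apis` after `gcloud config set project ...` to actually enable the services.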

Fix needed for this error (override the org policy at the link below):
  - Location REGION:us-east4 violates constraint constraints/gcp.resourceLocations on the resource projects/kcc-timeout-cso/regions/us-east4/subnetworks/kcc-us-east4-sn.
  https://console.cloud.google.com/iam-admin/orgpolicies/gcp-resourceLocations?orgonly=true&project=kcc-timeout-cso&supportedpurview=organizationId
Screenshot 2024-02-27 at 12 36 47 PM Screenshot 2024-02-27 at 12 38 34 PM Screenshot 2024-02-27 at 12 39 06 PM
 gcloud compute networks subnets create "kcc-us-east4-sn" --network "$NETWORK" --range "$CIDR_KCC_VPC" --region "us-east4" --stack-type=IPV4_ONLY

gcloud anthos config controller create "$CLUSTER" --location "$REGION" --network "$NETWORK" --subnet "$SUBNET" --master-ipv4-cidr-block="172.16.0.128/28" --full-management
Screenshot 2024-02-27 at 12 39 42 PM

The override is at the org level, but KCC will revert it.

Screenshot 2024-02-27 at 12 40 19 PM

The policy change takes about 5 min to propagate.
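
Instead of waiting blindly for the override to propagate, the effective policy can be inspected from the shell. A minimal sketch, assuming the installed gcloud provides `gcloud org-policies describe` with `--effective`; `show_effective_policy` and `DRY_RUN` are hypothetical names, and by default the command is only printed.

```shell
#!/bin/sh
# Hypothetical helper: show the effective value of an org policy constraint
# for a project. DRY_RUN=1 (default) only prints the command.
show_effective_policy() {
  # $1 constraint (e.g. gcp.resourceLocations), $2 project ID
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "gcloud org-policies describe $1 --project=$2 --effective"
  else
    gcloud org-policies describe "$1" --project="$2" --effective
  fi
}
```

For this repro that would be `show_effective_policy gcp.resourceLocations kcc-timeout-cso`, re-run until the new location shows up.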
michael@cloudshell:~ (kcc-timeout-cso)$ gcloud compute networks subnets create "kcc-us-east4-sn" --network "$NETWORK" --range "$CIDR_KCC_VPC" --region "us-east4" --stack-type=IPV4_ONLY
Created [https://www.googleapis.com/compute/v1/projects/kcc-timeout-cso/regions/us-east4/subnetworks/kcc-us-east4-sn].
NAME: kcc-us-east4-sn
REGION: us-east4
NETWORK: kcc-vpc
RANGE: 192.168.0.0/16
STACK_TYPE: IPV4_ONLY
IPV6_ACCESS_TYPE: 
INTERNAL_IPV6_PREFIX: 
EXTERNAL_IPV6_PREFIX: 

michael@cloudshell:~ (kcc-timeout-cso)$ gcloud anthos config controller create "$CLUSTER" --location "us-east4" --network "kcc-vpc" --subnet "kcc-us-east4-sn" --master-ipv4-cidr-block="172.16.0.128/28" --full-management
Create request issued for: [kcc]
Waiting for operation [projects/kcc-timeout-cso/locations/us-east4/operations/operation-1709056068060-612609fd68669-b51f242c-8b2a2484] to complete...working.. 
12:45
fmichaelobrien commented 4 months ago

forgot to override the peering constraint

Screenshot 2024-02-27 at 1 02 06 PM

https://console.cloud.google.com/iam-admin/orgpolicies/compute-restrictVpcPeering?organizationId=734065690346&orgonly=true&supportedpurview=organizationId

Screenshot 2024-02-27 at 1 02 44 PM
ERROR: (gcloud.anthos.config.controller.create) unexpected error occurred while waiting for SLM operation [projects/krmapihosting-slm/locations/us-east4/operations/operation-1709056076620-61260a05921ae-b6751cf1-af711ba3]: errored while waiting for operation: projects/krmapihosting-slm/locations/us-east4/operations/operation-1709056076620-61260a05921ae-b6751cf1-af711ba3: Operation failed with error: 
generic::invalid_argument: terraform apply failed, error: exit status 1, stderr: 

Error: Error waiting for creating GKE cluster: Constraint constraints/compute.restrictVpcPeering violated for project 59969913664. Peering the network projects/gke-prod-us-east4-0839/global/networks/gke-ncca9b3ff9ac9f6c6986-8201-7368-net is not allowed.

  on main_autopilot.tf line 32, in resource "google_container_cluster" "acp_cluster":
  32: resource "google_container_cluster" "acp_cluster" {

, stdout: 
google_container_cluster.acp_cluster: Creating...
google_container_cluster.acp_cluster: Still creating... [10s elapsed]
google_container_cluster.acp_cluster: Still creating... [20s elapsed]
google_container_cluster.acp_cluster: Still creating... [30s elapsed]
google_container_cluster.acp_cluster: Still creating... [40s elapsed]
google_container_cluster.acp_cluster: Still creating... [50s elapsed]
google_container_cluster.acp_cluster: Still creating... [1m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [1m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [1m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [1m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [1m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [1m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [2m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [2m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [2m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [2m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [2m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [2m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [3m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [3m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [3m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [3m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [3m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [3m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [4m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [4m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [4m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [4m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [4m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [4m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [5m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [5m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [5m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [5m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [5m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [5m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [6m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [6m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [6m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [6m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [6m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [6m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [7m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [7m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [7m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [7m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [7m40s elapsed]
google_container_cluster.acp_cluster: Still creating... [7m50s elapsed]
google_container_cluster.acp_cluster: Still creating... [8m0s elapsed]
google_container_cluster.acp_cluster: Still creating... [8m10s elapsed]
google_container_cluster.acp_cluster: Still creating... [8m20s elapsed]
google_container_cluster.acp_cluster: Still creating... [8m30s elapsed]
google_container_cluster.acp_cluster: Still creating... [8m40s elapsed]

Subsequent cleanup succeeded
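
The constraint name is buried in a long terraform log, so it is easy to miss. A minimal sketch that pulls the distinct `constraints/...` names out of saved error text; `violated_constraints` is a hypothetical helper and relies on `grep -o`, which GNU and BSD grep both support.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct org policy constraints named in
# gcloud/terraform error text read from stdin.
violated_constraints() {
  # Match "constraints/<name>" and drop a trailing sentence period.
  grep -o 'constraints/[A-Za-z0-9_.]*[A-Za-z0-9_]' | sort -u
}
```

Example: `violated_constraints < create-error.log`, where `create-error.log` is whatever file the create output was saved to.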
fmichaelobrien commented 4 months ago

13:03

michael@cloudshell:~ (kcc-timeout-cso)$ gcloud anthos config controller create "$CLUSTER" --location "us-east4" --network "kcc-vpc" --subnet "kcc-us-east4-sn" --master-ipv4-cidr-block="172.16.0.128/28" --full-management
Create request issued for: [kcc]
Waiting for operation [projects/kcc-timeout-cso/locations/us-east4/operations/operation-1709056987983-61260d6ab703e-3f497a08-b07ceb66] to complete...working.. 
Screenshot 2024-02-27 at 1 03 48 PM

33%

Screenshot 2024-02-27 at 1 05 41 PM

13:06: 55%

Screenshot 2024-02-27 at 1 06 00 PM

need to get to 83%

Screenshot 2024-02-27 at 1 06 27 PM

At 87% the 15 workloads will populate. 13:12: at 83%

Screenshot 2024-02-27 at 1 11 01 PM

13:13: cluster up, 10 min duration

Screenshot 2024-02-27 at 1 12 14 PM Screenshot 2024-02-27 at 1 12 41 PM
PodUnschedulable
Reason: Cannot schedule pods: node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}. [Learn more](https://cloud.google.com/kubernetes-engine/docs/troubleshooting#PodUnschedulable)
Source: [bootstrap-6dbc584955-9j2v7](https://console.cloud.google.com/kubernetes/pod/us-east4/krmapihost-kcc/krmapihosting-system/bootstrap-6dbc584955-9j2v7?project=kcc-timeout-cso&supportedpurview=project)
Screenshot 2024-02-27 at 1 13 00 PM
fmichaelobrien commented 4 months ago

The script will auto-delete the cluster shortly.

michael@cloudshell:~ (kcc-timeout-cso)$ gcloud anthos config controller create "$CLUSTER" --location "us-east4" --network "kcc-vpc" --subnet "kcc-us-east4-sn" --master-ipv4-cidr-block="172.16.0.128/28" --full-management
Create request issued for: [kcc]
Waiting for operation [projects/kcc-timeout-cso/locations/us-east4/operations/operation-1709056987983-61260d6ab703e-3f497a08-b07ceb66] to complete...working...

13:15

Screenshot 2024-02-27 at 1 14 44 PM

Workloads are coming up; the pod scheduling warning was a red herring.

Screenshot 2024-02-27 at 1 15 13 PM Screenshot 2024-02-27 at 1 16 00 PM

13:16: 6 G

Screenshot 2024-02-27 at 1 16 38 PM Screenshot 2024-02-27 at 1 17 01 PM

13:17

Screenshot 2024-02-27 at 1 17 25 PM

13:19: 8 of 15

Screenshot 2024-02-27 at 1 19 02 PM

13:20: 11 of 15

Screenshot 2024-02-27 at 1 19 53 PM

13:21:

Screenshot 2024-02-27 at 1 20 33 PM

13:22: 12 of 15

Screenshot 2024-02-27 at 1 21 31 PM
Created instance [kcc].
Fetching cluster endpoint and auth data.
kubeconfig entry generated for krmapihost-kcc.
michael@cloudshell:~ (kcc-timeout-cso)$ 

13:22: 14 of 15

Screenshot 2024-02-27 at 1 22 24 PM
| Name | Status | Type | Pods | Namespace | Cluster |
| --- | --- | --- | --- | --- | --- |
| bootstrap | OK | Deployment | 1/1 | krmapihosting-system | krmapihost-kcc |
| cnrm-controller-manager-fbgrg35rhrhz7f5czo3a | OK | Stateful Set | 1/1 | cnrm-system | krmapihost-kcc |
| cnrm-deletiondefender | OK | Stateful Set | 1/1 | cnrm-system | krmapihost-kcc |
| cnrm-resource-stats-recorder | OK | Deployment | 1/1 | cnrm-system | krmapihost-kcc |
| cnrm-unmanaged-detector | OK | Stateful Set | 1/1 | cnrm-system | krmapihost-kcc |
| cnrm-webhook-manager | Does not have minimum availability | Deployment | 2/2 | cnrm-system | krmapihost-kcc |
| config-management-operator | OK | Deployment | 1/1 | config-management-system | krmapihost-kcc |
| configconnector-operator | OK | Stateful Set | 1/1 | configconnector-operator-system | krmapihost-kcc |
| configsync-healthcheck-service | OK | Deployment | 1/1 | configsync-healthcheck-system | krmapihost-kcc |
| gatekeeper-audit | OK | Deployment | 1/1 | gatekeeper-system | krmapihost-kcc |
| gatekeeper-controller-manager | OK | Deployment | 1/1 | gatekeeper-system | krmapihost-kcc |
| krmapihosting-metrics-agent | OK | Daemon Set | 3/3 | krmapihosting-monitoring | krmapihost-kcc |
| otel-collector | OK | Deployment | 1/1 | config-management-monitoring | krmapihost-kcc |
| reconciler-manager | OK | Deployment | 1/1 | config-management-system | krmapihost-kcc |
| resource-group-controller-manager | OK | Deployment | 1/1 | resource-group-system | krmapihost-kcc |
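
The "N of 15" progress notes were read off the console by hand. A rough sketch of getting a similar count from the shell; `count_ready` is a hypothetical helper, and it assumes input lines whose third field is a READY value in `ready/desired` form, which is what I would expect from `kubectl get deployments,statefulsets -A --no-headers` (daemon sets use different columns and are not covered).

```shell
#!/bin/sh
# Hypothetical helper: count workloads whose READY field (3rd column,
# "ready/desired") shows all replicas ready. Reads lines from stdin.
count_ready() {
  awk '{
         split($3, r, "/")          # r[1] = ready, r[2] = desired
         total++
         if (r[2] > 0 && r[1] == r[2]) ok++
       }
       END { printf "%d of %d\n", ok, total }'
}
```

Note that this is not the same as the console status: in the listing above cnrm-webhook-manager shows 2/2 pods while still flagged as lacking minimum availability.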


13:24: 15 of 15

Screenshot 2024-02-27 at 1 23 39 PM

total time 21 min
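
The 21 min figure is the gap between the second create request (1303, i.e. 13:03) and all 15 workloads being up (1324, i.e. 13:24). A small sketch of that arithmetic; `hhmm_to_min` and `elapsed_min` are hypothetical helpers that assume four-digit 24-hour times on the same day, with or without a colon.

```shell
#!/bin/sh
# Hypothetical helper: convert HHMM or HH:MM to minutes since midnight.
hhmm_to_min() {
  t=$(printf '%s' "$1" | tr -d ':')
  h=${t%??}   # first two digits
  m=${t#??}   # last two digits
  # Strip one leading zero so values like 09 are not read as octal.
  h=${h#0}
  m=${m#0}
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Hypothetical helper: minutes elapsed between two same-day times.
elapsed_min() {
  echo $(( $(hhmm_to_min "$2") - $(hhmm_to_min "$1") ))
}
```

`elapsed_min 1303 1324` gives 21, matching the total above.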

Screenshot 2024-02-27 at 1 24 43 PM
michael@cloudshell:~ (kcc-timeout-cso)$ gcloud anthos config controller list
NAME: kcc
LOCATION: us-east4
STATE: RUNNING
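
For watching the state from a second shell while the create is still waiting, here is a minimal polling sketch. `wait_for_running` and `STATE_CMD` are hypothetical names; the default `STATE_CMD` assumes `gcloud anthos config controller describe` exposes the same `state` field that the `list` output above shows, which I have not verified against every gcloud version.

```shell
#!/bin/sh
# Assumption: any command that prints the current instance state on stdout.
: "${STATE_CMD:=gcloud anthos config controller describe kcc --location us-east4 --format=value(state)}"

# Hypothetical helper: poll until the state is RUNNING or attempts run out.
wait_for_running() {
  # $1 max attempts, $2 seconds to sleep between attempts
  attempt=0
  while [ "$attempt" -lt "$1" ]; do
    state=$($STATE_CMD)
    if [ "$state" = "RUNNING" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$2"
  done
  return 1   # never reached RUNNING within the allotted attempts
}
```

`wait_for_running 60 30` covers roughly the 30 min window from the issue title; a non-zero exit means the instance never reported RUNNING in that time.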