fmichaelobrien opened 1 year ago
It occurred for me even on a completely clean GKE cluster today (after setup-kcc.sh, before core-landing-zone package rendering) and on my 6-week-old cluster with the full set of LZ packages up. The older cluster shows "Update webhook to no longer intercept system requests."
Trying standard GKE, partly to reduce spend per https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/492
kubectl is timing out, and kpt won't deploy as expected:

```
michael@cloudshell:~ (kcc-oi-cluster)$ kubectl get pods --all-namespaces
E0928 19:21:26.018017 1006 memcache.go:265] couldn't get current server API group list: Get "https://172.16.0.130/api?timeout=32s": dial tcp 172.16.0.130:443: i/o timeout
```
```
michael@cloudshell:~/kcc-oi/kpt (kcc-oi-cluster)$ kpt live apply core-landing-zone --reconcile-timeout=2m --output=table
W0928 19:25:14.080087 1087 factory.go:66] Failed to query apiserver to check for flow control enablement: %vmaking /livez/ping request: context deadline exceeded
```
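A quick way to see why Cloud Shell times out is to classify the API server address from the error above. This is a sketch, not part of setup-kcc.sh; the address is hard-coded from the error message, and the ranges are the RFC 1918 private blocks:

```shell
# Sketch: classify the API server address from the error above as private
# (RFC 1918) or public; a private endpoint is unreachable from Cloud Shell
# without a bastion host / proxy.
SERVER="https://172.16.0.130"   # from the dial tcp error above
IP="${SERVER#https://}"
case "$IP" in
  10.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|192.168.*)
    ENDPOINT_TYPE="private" ;;
  *)
    ENDPOINT_TYPE="public" ;;
esac
echo "$IP is a $ENDPOINT_TYPE endpoint"
# prints: 172.16.0.130 is a private endpoint
```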
As a workaround I will use a standard cluster and retest. I had avoided standard clusters because of an odd timeout between 15-30 min.
However, my older KCC cluster on kcc.landing.systems is OK even with the webhook error:
```
root_@cloudshell:~ (kcc-kls-cluster3)$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                             READY   STATUS    RESTARTS   AGE
cnrm-system   cnrm-controller-manager-3fo6phebqgg23knqq5qq-0   1/1     Running   0          4d2h
cnrm-system   cnrm-controller-manager-7c4rehlik7xgxc2utq6a-0   1/1     Running   0          4d1h
```
Something is off with my clean KCC env obrien.industries (I will check the diffs on setup-kcc.sh), since my 6-week-old kcc.landing.systems cluster, even with the admission errors, will edit the yaml no problem:
```
root_@cloudshell:~ (kcc-kls-cluster3)$ kubectl edit validatingwebhookconfiguration/gatekeeper-validating-webhook-configuration
Edit cancelled, no changes made.
root_@cloudshell:~ (kcc-kls-cluster3)$
```
Triaging the connection: the older, working server has a public 35.x address in .kube/config:
```
    server: https://35.203.120.71
  name: gke_kcc-kls-cluster3_northamerica-northeast1_krmapihost-kcc-kls3
```
The newer server has a private address:
```
    server: https://172.16.0.130
  name: gke_kcc-oi-cluster_northamerica-northeast1_krmapihost-kcc-oi
```
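The two entries can be compared straight from the kubeconfig. A sketch using a sample file shaped like the two fragments above (against a live cluster, `kubectl config view` with a JSONPath output would do the same):

```shell
# Sketch: pull the server address for each cluster entry out of a kubeconfig.
# The sample file mirrors the two entries quoted above.
cat > /tmp/kubeconfig-sample.yaml <<'EOF'
clusters:
- cluster:
    server: https://35.203.120.71
  name: gke_kcc-kls-cluster3_northamerica-northeast1_krmapihost-kcc-kls3
- cluster:
    server: https://172.16.0.130
  name: gke_kcc-oi-cluster_northamerica-northeast1_krmapihost-kcc-oi
EOF
# Print every server: value - one public, one private
awk '/server:/ {print $2}' /tmp/kubeconfig-sample.yaml
```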
Found the issue above - I forgot to add -p for the public endpoint.
Ran `./setup-kcc.sh -af kcc.env` (without `-p`) per
https://github.com/ssc-spc-ccoe-cei/gcp-tools/commit/941d542e5024144b541136e19700b50cd8eaf895
with:
```
export CLUSTER=kcc-oi2
export REGION=northamerica-northeast1
export PROJECT_ID=kcc-oi2-cluster
export LZ_FOLDER_NAME=kcc-lz-20230928b
export NETWORK=kcc-oi2-vpc
export SUBNET=kcc-oi2-sn
```
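Before rerunning, a small guard can catch an empty variable in kcc.env. A sketch (the variable names and values are copied from the kcc.env above; this wrapper is not part of setup-kcc.sh):

```shell
#!/bin/bash
# Sketch: fail fast if any expected kcc.env variable is empty before
# calling setup-kcc.sh. Values mirror the kcc.env above.
export CLUSTER=kcc-oi2
export REGION=northamerica-northeast1
export PROJECT_ID=kcc-oi2-cluster
export LZ_FOLDER_NAME=kcc-lz-20230928b
export NETWORK=kcc-oi2-vpc
export SUBNET=kcc-oi2-sn
missing=0
for v in CLUSTER REGION PROJECT_ID LZ_FOLDER_NAME NETWORK SUBNET; do
  # ${!v} is bash indirect expansion: the value of the variable named by $v
  if [ -z "${!v}" ]; then echo "missing: $v"; missing=1; fi
done
[ "$missing" -eq 0 ] && echo "kcc.env looks complete"
```

Remembering the `-p` flag on the setup-kcc.sh command line is still a separate, manual step.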
```
michael@cloudshell:~/kcc-oi/github/gcp-tools/scripts/bootstrap (kcc-oi)$ ./setup-kcc.sh -afp kcc.env
```
16:44 - estimate 17:00 for the kcc-oi2 cluster to be up.
```
##INFO - Create Config controller
Create request issued for: [kcc-oi2]
Waiting for operation [projects/kcc-oi2-cluster/locations/northamerica-northeast1/operations/operation-1695933801715-606715bd057e8-f452780e-92d1cb2e] to complete...working..
```
Fix - the rerun with the `-p` flag completes:
```
michael@cloudshell:~/kcc-oi/github/gcp-tools/scripts/bootstrap (kcc-oi)$ ./setup-kcc.sh -afp kcc.env
Waiting for operation [projects/kcc-oi2-cluster/locations/northamerica-northeast1/operations/operation-1695933801715-606715bd057e8-f452780e-92d1cb2e] to complete...done.
Created instance [kcc-oi2].
Fetching cluster endpoint and auth data.
kubeconfig entry generated for krmapihost-kcc-oi2.
##INFO - Config controller get credentials
Fetching cluster endpoint and auth data.
kubeconfig entry generated for krmapihost-kcc-oi2.
##WARNING - configure-kcc-access.sh script should be run once connectivity to the cluster is established using bastion host / proxy.
```
```
michael@cloudshell:~/kcc-oi/github/gcp-tools/scripts/bootstrap (kcc-oi2-cluster)$ kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gk3-krmapihost-kcc-oi2-default-pool-6fc83c0e-ss20   Ready    <none>   9m12s   v1.27.3-gke.100
gk3-krmapihost-kcc-oi2-pool-1-28f0e374-tzw8         Ready    <none>   3m43s   v1.27.3-gke.100
gk3-krmapihost-kcc-oi2-pool-1-ae2f0850-4kmt         Ready    <none>   7m32s   v1.27.3-gke.100
gk3-krmapihost-kcc-oi2-pool-1-c9c2a582-9sdc         Ready    <none>   2m47s   v1.27.3-gke.100
```
Cluster is up with no admission webhook endpoint issue (it has both public and private endpoints).
Triage details:
https://cloud.google.com/kubernetes-engine/docs/troubleshooting/troubleshoot-upgrades
See: no issues with the cluster in #445; new issues since 20230927 in #534.
KCC cluster up via https://github.com/ssc-spc-ccoe-cei/gcp-tools/blob/main/scripts/bootstrap/setup-kcc.sh
TODO: check the webhook visibility fix. This is new since my last cluster was up 6 weeks ago in #445:
> This cluster has an admission webhook installed that is intercepting system critical requests in the last 24 hours. Intercepting these requests can impact availability of the GKE Control Plane. Learn more:
https://cloud.google.com/kubernetes-engine/docs/how-to/optimize-webhooks?&_ga=2.215054544.-699491976.1695837480#unsafe-webhooks
```
gatekeeper-validating-webhook-configuration   Intercepting cluster-scoped system resources
gatekeeper-validating-webhook-configuration   Intercepting resources in the kube-node-lease namespace
gatekeeper-validating-webhook-configuration   Intercepting resources in the kube-system namespace
```
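Per the optimize-webhooks guidance, the usual remediation is a namespaceSelector that keeps the webhook away from system namespaces. A sketch of the shape such an exemption takes - not the exact upstream Gatekeeper fix; the label key is Gatekeeper's exemption label, and exempted namespaces (e.g. kube-system, kube-node-lease) would need to carry it:

```yaml
# Sketch only: fragment of a ValidatingWebhookConfiguration whose
# namespaceSelector skips namespaces labeled for exemption.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
- name: validation.gatekeeper.sh
  namespaceSelector:
    matchExpressions:
    - key: admission.gatekeeper.sh/ignore
      operator: DoesNotExist
```

Note this only addresses the "intercepting namespaced resources" warnings; cluster-scoped resources are matched regardless of namespaceSelector.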
https://cloud.google.com/kubernetes-engine/docs/how-to/optimize-webhooks?&_ga=2.254246863.-699491976.1695837480#no-available-endpoints
Workloads are up.