Open · gecube opened 1 month ago
The envoy proxy pod is again stuck terminating:
```console
$ kubectl get pods -n envoy-gateway-system
NAME                                                      READY   STATUS        RESTARTS   AGE
backend-69fcff487f-zdj8s                                  1/1     Running       0          9m12s
envoy-envoy-gateway-system-eg-5391c79d-565cc65f6b-h4g2s   2/2     Running       0          22s
envoy-envoy-gateway-system-eg-5391c79d-79fdb5d7f-wzlxh    2/2     Terminating   0          9m12s
envoy-gateway-7ff7cffb6c-pwsbm                            1/1     Running       0          5m1s
```
```console
$ kubectl describe pod -n envoy-gateway-system envoy-envoy-gateway-system-eg-5391c79d-79fdb5d7f-wzlxh
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  10m               default-scheduler  Successfully assigned envoy-gateway-system/envoy-envoy-gateway-system-eg-5391c79d-79fdb5d7f-wzlxh to gke-dev2-pool-1-d6247585-5pi9
  Normal   Pulling    10m               kubelet            Pulling image "envoyproxy/envoy:distroless-dev"
  Normal   Pulled     7m31s             kubelet            Successfully pulled image "envoyproxy/envoy:distroless-dev" in 2.099736442s (2m34.66287988s including waiting)
  Normal   Created    7m31s             kubelet            Created container envoy
  Normal   Started    7m31s             kubelet            Started container envoy
  Normal   Pulling    7m31s             kubelet            Pulling image "docker.io/envoyproxy/gateway-dev:latest"
  Normal   Pulled     7m1s              kubelet            Successfully pulled image "docker.io/envoyproxy/gateway-dev:latest" in 248.719122ms (30.285645955s including waiting)
  Normal   Created    7m1s              kubelet            Created container shutdown-manager
  Normal   Started    7m                kubelet            Started container shutdown-manager
  Normal   Killing    75s               kubelet            Stopping container envoy
  Normal   Killing    75s               kubelet            Stopping container shutdown-manager
  Warning  Unhealthy  6s (x7 over 66s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
```
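To see which thresholds are in effect, the configured probes can be dumped straight from the proxy Deployment (the Deployment name below is inferred from the pod names above):

```console
$ kubectl get deployment -n envoy-gateway-system envoy-envoy-gateway-system-eg-5391c79d \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.readinessProbe}{"\n"}{end}'
```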
May be the same as #3220.
The target cluster is GKE:
```console
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.13-gke.1000000
WARNING: version difference between client (1.30) and server (1.27) exceeds the supported minor version skew of +/-1
```
Envoy Gateway is installed from the chart, and the Gateway API manifests are taken from the experimental channel, as described here: https://gateway-api.sigs.k8s.io/guides/#install-experimental-channel
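For reference, installing the experimental channel from that guide is typically a single apply; the release version below is illustrative, not necessarily the one used in this report:

```console
$ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/experimental-install.yaml
```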
Is the GKE cluster running with Autopilot enabled? Is GKE also trying to install the Gateway API manifests at the same time? https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways
@arkodg Good day! No, GKE without Autopilot, and no, GKE is not trying to install any additional controllers. Interestingly, it is now working stably, so it was a transient issue, but it was very... ahem... inconvenient and uncomfortable to experience.
thanks for the update @gecube, sounds like we need to revisit the `livenessProbe` and `readinessProbe` defaults
https://github.com/envoyproxy/gateway/blob/01dd7b11e7768623e801d82d80dc170a8e0572e3/internal/infrastructure/kubernetes/proxy/testdata/deployments/default.yaml#L255
would relaxing these defaults (`periodSeconds` and `failureThreshold`) help? cc @owenhaynes
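For discussion, relaxed probe values on the envoy container might look roughly like this; this is a sketch only, and the path, port, and numbers are assumptions rather than the shipped defaults in the file linked above:

```yaml
# Sketch of relaxed probe settings for the envoy container.
# Path, port, and numeric values are assumptions for discussion,
# not the current defaults.
readinessProbe:
  httpGet:
    path: /ready
    port: 19001
  periodSeconds: 10     # probe less frequently
  failureThreshold: 6   # tolerate more consecutive failures
livenessProbe:
  httpGet:
    path: /ready
    port: 19001
  periodSeconds: 10
  failureThreshold: 6
```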
Yeah, this looks similar to issue #3220, though I never saw errors in the gateway pod like the ones shown here. From my issue, I think it is related to image pulling: the envoy image takes longer to pull and start than the shutdown-manager container.
I increased the failureThreshold on all of the shutdown container's probes using a patch to fix this issue.
thanks @owenhaynes, wanna share your patch here? If many users hit this, maybe we should just commit it in / revisit the defaults.
@owenhaynes Hi! I don't think it is related to image pulling, as the images were already present on the nodes (?). @arkodg It is relatively difficult to reproduce; I would need to set up a brand new environment and check again.
@gecube Yeah, in my instance it happened when nodes had just been started fresh, so no images were pre-pulled.
Shutdown manager patch, @arkodg:
```yaml
provider:
  type: "Kubernetes"
  kubernetes:
    envoyDeployment:
      patch:
        type: StrategicMerge
        value:
          spec:
            template:
              spec:
                containers:
                  - name: shutdown-manager
                    livenessProbe:
                      failureThreshold: 10
                    readinessProbe:
                      failureThreshold: 10
```
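In case it helps others hitting this: a patch like the one above is supplied through an `EnvoyProxy` resource that the `GatewayClass` references via `parametersRef`. A minimal sketch, assuming a GatewayClass named `eg` (the `proxy-config` name and namespace are assumptions):

```yaml
# Minimal sketch; resource name and namespace are assumptions.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: "Kubernetes"
    kubernetes:
      envoyDeployment:
        patch:
          type: StrategicMerge
          value:
            spec:
              template:
                spec:
                  containers:
                    - name: shutdown-manager
                      livenessProbe:
                        failureThreshold: 10
                      readinessProbe:
                        failureThreshold: 10
---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: proxy-config
    namespace: envoy-gateway-system
```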
moving this to v1.1, let's relax the probe values a bit more, as I'm hearing many more users complain about GKE installs being flaky
Good day!
I was playing around with the quickstart. I definitely did not do anything unusual, but suddenly I got the following:
The pods were in the following state:
Manually removing the gateway and envoy-gateway helped.
Expected behaviour:
Better error handling, and a restart of the pod in case of issues.