GoogleContainerTools / skaffold

Easy and Repeatable Kubernetes Development
https://skaffold.dev/
Apache License 2.0

Improve status check handling for GKE Autopilot clusters #6011

Open briandealwis opened 3 years ago

briandealwis commented 3 years ago

Can we improve the status check reporting when deploying to a GKE Autopilot cluster, for example by informing the user that the cluster/node is being scaled up to accommodate the new job?

Waiting for deployments to stabilize...
 - deployment/leeroy-app: 0/3 nodes are available: 1 Insufficient memory, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1623436397}, that the pod didn't tolerate, 2 Insufficient cpu.
    - pod/leeroy-app-c469448b5-wb2db: 0/3 nodes are available: 1 Insufficient memory, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1623436397}, that the pod didn't tolerate, 2 Insufficient cpu.
 - deployment/leeroy-web: 0/3 nodes are available: 1 Insufficient memory, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1623436397}, that the pod didn't tolerate, 2 Insufficient cpu.
    - pod/leeroy-web-99d978f66-9dr2j: 0/3 nodes are available: 1 Insufficient memory, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1623436397}, that the pod didn't tolerate, 2 Insufficient cpu.
[large pause]
 - deployment/leeroy-web is ready. [1/2 deployment(s) still pending]
 - deployment/leeroy-app is ready.

If the pod is not scheduled, we could look at the events to see if there was a TriggeredScaleUp event.
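A minimal sketch of that check, written in Python only to show the decision logic. It is not Skaffold's code: `PodEvent` and `describe_pending_pod` are hypothetical names, and the events are assumed to arrive in chronological order. `TriggeredScaleUp` is the reason named above; `FailedScheduling` is the reason the Kubernetes scheduler attaches to messages like the ones in the log.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PodEvent:
    """Hypothetical stand-in for a Kubernetes event attached to a pod."""
    reason: str   # e.g. "FailedScheduling" or "TriggeredScaleUp"
    message: str


def describe_pending_pod(events: Iterable[PodEvent]) -> str:
    """Return a status line for a pod that has not been scheduled yet.

    Assumes `events` is ordered oldest to newest.
    """
    events = list(events)
    # A TriggeredScaleUp event suggests the autoscaler is already adding
    # capacity, so report that instead of the raw scheduling failure.
    if any(e.reason == "TriggeredScaleUp" for e in events):
        return "waiting for cluster scale-up to finish"
    # Otherwise surface the most recent scheduling failure unchanged.
    failures = [e for e in events if e.reason == "FailedScheduling"]
    if failures:
        return failures[-1].message
    return "pending"
```

With a sketch like this, the repeated "0/3 nodes are available" lines would be replaced by a single scale-up message once the autoscaler has reacted, while clusters that never emit the event keep the existing output.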

tejal29 commented 3 years ago

Sounds like a nice feature. Is there a way we could detect whether it's an Autopilot cluster?

briandealwis commented 3 years ago

I've been told that looking at the pod events should show cluster autoscaling or node auto provisioning events. Both are briefly described here:

https://cloud.google.com/architecture/best-practices-for-running-cost-effective-kubernetes-applications-on-gke

ValentinFunk commented 2 years ago

Is there a way to use this with Autopilot clusters yet? Deployments always fail for me (something about unschedulable), even if I see a TriggeredScaleUp.

ericzzzzzzz commented 1 year ago

Please use the --tolerate-failures-until-deadline flag with Autopilot clusters if this issue occurs: https://github.com/GoogleContainerTools/skaffold/blob/e1014dd0052ce20db690ae606b2f99d2281cd0c4/cmd/skaffold/app/cmd/flags.go#L335-L343
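For readers unfamiliar with the idea, here is a rough sketch of what tolerating failures until a deadline could look like in a status-check loop. It illustrates the behavior the flag name suggests and is not taken from Skaffold's implementation. The function name, the three status strings, and the polling interval are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], str], deadline_s: float,
                     tolerate_failures: bool, poll_s: float = 0.01) -> bool:
    """Poll check() until it reports "ready" or the deadline passes.

    check() returns "ready", "pending", or "failed" (hypothetical states).
    """
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        status = check()
        if status == "ready":
            return True
        if status == "failed" and not tolerate_failures:
            # Fail fast: a transient "unschedulable" report ends the wait.
            return False
        # Either still pending, or a failure we tolerate until the deadline.
        time.sleep(poll_s)
    return False
```

On an Autopilot cluster the early failure would typically be the scheduling error that appears before new nodes are provisioned. With tolerance enabled the loop keeps polling and succeeds once the pod is scheduled, as long as that happens before the deadline.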