GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

GKE node auto-provisioning not scaling down #288

Open Shaked opened 3 years ago

Shaked commented 3 years ago

While using Kubeflow 1.0, 1.2, and 1.3, I have noticed that nodes sometimes do not scale down.

AFAIU this happens because of node auto-provisioning. Nodes are scaled up and in some cases kube-system pods might start running on them, preventing them from scaling down.

One option to consider is to put a taint on the nodepool that you want to be able to scale to 0. That way system pods will not be able to run on those nodes, so they won't block scale-down. Downside is you'll need to add a toleration to all the pods that you want to run on this nodepool (this can be automated with mutating admission webhook). This is a very useful pattern if you have a nodepool with particularly expensive nodes. Alternatively you can create PDBs for all non-daemonset system pods. Note: restarting some system pods can cause various types of disruption to your cluster, which is why CA does not restart them by default (ex. restarting metrics-server will break all HPAs in your cluster for a few minutes). It's up to you to decide which disruptions you're ok with.

https://github.com/kubernetes/autoscaler/issues/2377#issuecomment-618275429
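For illustration, the two options in the quoted comment could look roughly like the manifests below. This is a minimal sketch, not taken from the blueprints: the taint key/value `dedicated=expensive`, the pod name, and the image are hypothetical, and the `k8s-app: metrics-server` label is an assumption about how the system pod is labelled in your cluster. The matching taint would have to be set on the node pool itself. Note that a toleration only permits scheduling onto the tainted pool; a node selector or affinity is still needed to steer the pod there. On clusters older than Kubernetes 1.21 the PDB API version would be `policy/v1beta1`.

```yaml
# Option 1: a workload pod that tolerates the (hypothetical) taint
# dedicated=expensive:NoSchedule placed on the scale-to-zero node pool.
apiVersion: v1
kind: Pod
metadata:
  name: training-job              # hypothetical name
spec:
  tolerations:
    - key: dedicated              # assumed taint key
      operator: Equal
      value: expensive            # assumed taint value
      effect: NoSchedule
  containers:
    - name: main
      image: example.com/training:latest   # placeholder image
---
# Option 2: a PodDisruptionBudget for a non-DaemonSet system pod, so the
# cluster autoscaler is allowed to evict it during scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb        # hypothetical name
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: metrics-server     # assumed label; check the actual pod labels
```

As the quoted comment warns, the second option accepts brief disruption of the system pod in exchange for scale-down.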

Not sure if relevant but maybe these lines require an update?

https://github.com/kubeflow/gcp-blueprints/blob/1d41c6ca7fc904d91dfcfb44e61e42435801e72c/kubeflow/common/cluster/upstream/cluster.yaml#L32-L37

Currently I'm considering disabling node auto-provisioning, although it would be nice to have this working as expected.

Any ideas how to fix this?

Bobgy commented 3 years ago

A known problem is Istio sidecars: https://istio.io/latest/docs/ops/common-problems/injection/#cluster-is-not-scaled-down-automatically

We need to add the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" to all pods that have an Istio sidecar and are safe to evict.
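A minimal sketch of where that annotation goes, assuming a Deployment-managed service. The Deployment name, labels, and image are hypothetical. The annotation must sit on the pod template (`spec.template.metadata.annotations`), not on the Deployment's own metadata, because the cluster autoscaler reads it from the pods.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service           # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        # Tells the cluster autoscaler this pod may be evicted during
        # scale-down. The value must be the quoted string "true".
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: main
          image: example.com/service:latest   # placeholder image
```

Only add this to workloads that really can be restarted on another node without losing state.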

Bobgy commented 3 years ago

We could add this annotation to known services for users. Contributions are welcome!