GoogleCloudPlatform / cluster-toolkit

Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Apache License 2.0

GKE node pool upgrade settings are not configurable #3343

Open chajath opened 8 hours ago

chajath commented 8 hours ago

Describe the bug

In the toolkit, GKE node pool upgrade settings are hardcoded: https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/26fafe0d5a53138b807e490d4d6597544ad658d1/modules/compute/gke-node-pool/main.tf#L62-L66

This prevents us from upgrading nodes in place efficiently. Without a change, each node upgrade can take 9+ minutes, and because nodes are upgraded one at a time, maintaining large node pools becomes unrealistic.

Steps to reproduce

Steps to reproduce the behavior:

  1. Trigger in-place node pool upgrade

Expected behavior

There is an option to allow multiple nodes to be made unavailable at a time, so that the node pool upgrade finishes sooner and downtime is minimized.

Actual behavior

You don't have any option to upgrade more than one node at a time.

ankitkinra commented 6 hours ago

Quick question: you are okay with the SURGE strategy, but want to configure more than one node to be unavailable at a time?

chajath commented 6 hours ago

Yes. In our particular case, being able to set a higher value for max_unavailable while keeping everything else as is would be enough.
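
For illustration, a minimal sketch of what exposing this setting could look like. The variable name `upgrade_max_unavailable`, its default, and the resource label are assumptions made up for this example, not the module's actual interface; the `strategy` and `max_surge` values are placeholders that should mirror whatever is currently hardcoded at the linked lines. Only `max_unavailable` becomes configurable, in line with the request above.

```hcl
# Hypothetical input variable (name and default are assumptions).
variable "upgrade_max_unavailable" {
  description = "Maximum number of nodes that can be unavailable at the same time during a node pool upgrade."
  type        = number
  default     = 1 # placeholder; keep the currently hardcoded value as the default
}

resource "google_container_node_pool" "node_pool" {
  # ... all other arguments of the existing resource stay unchanged ...

  upgrade_settings {
    strategy        = "SURGE" # placeholder; keep the currently hardcoded value
    max_surge       = 0       # placeholder; keep the currently hardcoded value
    max_unavailable = var.upgrade_max_unavailable
  }
}
```

With something like this in place, a blueprint could set `upgrade_max_unavailable` to a larger number so that several nodes are upgraded in parallel, while pools that do not set it keep the existing behavior.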