Azure / terraform-azurerm-aks

Terraform Module for deploying an AKS cluster

Recommended way to change agents_size without downtime? #559

Open Israphel opened 3 months ago

Israphel commented 3 months ago

Description

We deploy our clusters with a default node_pool, using:

agents_pool_name            = "default"
agents_pool_max_surge       = try(each.value.max_surge, "10%")
agents_availability_zones   = ["1", "2", "3"]
agents_type                 = "VirtualMachineScaleSets"
agents_size                 = try(each.value.agents_size, "Standard_D2s_v3")
temporary_name_for_rotation = "tmp"

We're replacing agents_size with the ARM-based equivalent size. We can see the "tmp" node pool being created, but then all the default nodes are drained at once without respecting PDBs, essentially taking down every service:

1s          Normal   Drain             node/aks-default-15731243-vmss000009      Draining node: aks-default-15731243-vmss000009
2s          Normal   Drain             node/aks-default-15731243-vmss00000x      Draining node: aks-default-15731243-vmss00000x
2s          Normal   Drain             node/aks-default-15731243-vmss00000e      Draining node: aks-default-15731243-vmss00000e

Are we doing it the wrong way? How can we change agents_size without such drastic draining?

New or Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

zioproto commented 3 months ago

@Israphel could you please confirm which version of the module you are using?

zioproto commented 3 months ago

@Israphel I understand you are trying to change the agents_size of the system node pool. If you look at the provider documentation, this changes the default_node_pool block of the azurerm_kubernetes_cluster resource.

Please check this documentation:

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster

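For reference, here is a minimal sketch of the resource the module manages under the hood (the cluster name, location, identity block and values are illustrative, not taken from your configuration): agents_size maps to default_node_pool.vm_size, and temporary_name_for_rotation is the attribute that lets the provider rotate the default pool through a temporary one when vm_size changes, instead of recreating the cluster.

# Illustrative sketch only; attribute names follow the provider docs linked above.
resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example"

  default_node_pool {
    name                        = "default"
    vm_size                     = "Standard_D2s_v3" # the module's agents_size ends up here
    node_count                  = 3
    type                        = "VirtualMachineScaleSets"
    zones                       = ["1", "2", "3"]
    temporary_name_for_rotation = "tmp" # used when vm_size changes in place

    upgrade_settings {
      max_surge = "10%" # surge setting applied during Kubernetes version upgrades
    }
  }

  identity {
    type = "SystemAssigned"
  }
}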

The behaviour you see is expected, and I don't think this is something we can work around in the module.

I found this related provider issue:

Feel free to open a new issue upstream at https://github.com/hashicorp/terraform-provider-azurerm/issues if you would like this behaviour to change.

I will keep this issue open in case you have additional questions.

Thanks

Israphel commented 3 months ago

I use 8.0.0

The only way we found was to create a new node_pool, drain all the default nodes, change agents_size, and then drain the temporary node_pool once more. Is this what everyone is doing to prevent downtime?
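For what it's worth, here is a rough sketch of that temporary pool, assuming the cluster comes from this module and that the module version in use exposes the cluster id as an aks_id output (the module reference, pool name and sizes below are illustrative assumptions):

# Short-lived user pool to hold workloads while the default pool is rotated.
# "module.aks.aks_id" and all values here are assumptions for illustration.
resource "azurerm_kubernetes_cluster_node_pool" "migration" {
  name                  = "migrate"
  kubernetes_cluster_id = module.aks.aks_id
  mode                  = "User"
  vm_size               = "Standard_D2s_v3"
  node_count            = 3
  zones                 = ["1", "2", "3"]
}

With that pool up, we cordon and drain the default nodes so the PDBs are honoured, apply the agents_size change, then drain and remove the temporary pool again.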

The problem we see is that when you upgrade Kubernetes this doesn't happen; everything goes smoothly and the PDBs are respected. But changing the instance type just drains everything at once, which is too aggressive.