Israphel closed this issue 1 month ago.
This is triggered in provider v3.106.0 because of this PR: https://github.com/hashicorp/terraform-provider-azurerm/pull/26137
tagging @ms-henglu and @stephybun
Is it correct that changing drain_timeout_in_minutes forces a replacement of the cluster?
- upgrade_settings {
- drain_timeout_in_minutes = 30 -> null # forces replacement
- max_surge = "10%" -> null
- node_soak_duration_in_minutes = 10 -> null
}
I see in the docs that for node pools --drain-timeout can be used with both the add and update commands. The same should hold for the default node pool. Is it only the unsetting to null that forces the resource replacement? Would any other value be accepted?
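For context, the setting in question lives in the provider's default_node_pool block. A minimal sketch, with attribute names taken from the plan output in this thread (availability depends on the provider version; other required cluster arguments are omitted):

```hcl
resource "azurerm_kubernetes_cluster" "example" {
  # ... name, location, resource_group_name, identity, etc. omitted ...

  default_node_pool {
    name    = "default"
    vm_size = "Standard_D2s_v3"

    # drain_timeout_in_minutes support landed on the provider side in
    # v3.106.0 (PR #26137); the diff above shows it being unset to null.
    upgrade_settings {
      max_surge                     = "10%"
      drain_timeout_in_minutes      = 30
      node_soak_duration_in_minutes = 10
    }
  }
}
```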
@Israphel the module does not support this feature yet, as tracked in https://github.com/Azure/terraform-azurerm-aks/issues/530. You should not have changed settings in AKS outside of Terraform, because this caused the Terraform state drift you are facing now.
I am not sure whether you can revert the change with the CLI so that the ARM API returns drain_timeout_in_minutes = null again, which would resolve the Terraform state drift.
As a temporary workaround, I suggest pinning the Terraform provider to v3.105.0 until this module supports the drain_timeout_in_minutes option.
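The pin can go in the required_providers block; a minimal sketch (adjust to match your existing configuration):

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      # Pin below v3.106.0 until the module supports
      # drain_timeout_in_minutes.
      version = "3.105.0"
    }
  }
}
```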
Thanks, went back to 105.
We're changing the default node group instance type and the rotation behaviour is extremely aggressive. Is there any way to make it better without soak-time support, as of today?
@Israphel this PR is now merged: https://github.com/Azure/terraform-azurerm-aks/pull/564
Is it possible for you to pin the module at commit 5858b260a1d6a9d2ee3687a08690e8932ca86af1?
for example:
module "aks" {
  source = "git::https://github.com/Azure/terraform-azurerm-aks.git?ref=5858b260a1d6a9d2ee3687a08690e8932ca86af1"
  [..CUT..]
and then set your configuration for drain_timeout_in_minutes.
This should unblock you until there is a new release that includes the feature.
Please let us know if this works for you. Thanks
Hello. I actually got unblocked by going back to 105, then applying/refreshing, and then I could continue upgrading normally; the state drift was fixed.
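The recovery described above can be sketched as the following command sequence (hypothetical, assuming the provider has been pinned back to v3.105.0 in the configuration):

```hcl
# Re-initialize so the pinned provider version is downloaded:
#   terraform init -upgrade
#
# Refresh state from the real infrastructure without making changes,
# clearing the drift introduced by the out-of-band CLI update:
#   terraform apply -refresh-only
#
# Verify the plan is clean before continuing upgrades:
#   terraform plan
```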
I've tried to reproduce this issue using the following config, @Israphel @zioproto, but I can't reproduce it:
resource "random_id" "prefix" {
byte_length = 8
}
resource "random_id" "name" {
byte_length = 8
}
resource "azurerm_resource_group" "main" {
count = var.create_resource_group ? 1 : 0
location = var.location
name = coalesce(var.resource_group_name, "${random_id.prefix.hex}-rg")
}
locals {
resource_group = {
name = var.create_resource_group ? azurerm_resource_group.main[0].name : var.resource_group_name
location = var.location
}
}
resource "azurerm_virtual_network" "test" {
address_space = ["10.52.0.0/16"]
location = local.resource_group.location
name = "${random_id.prefix.hex}-vn"
resource_group_name = local.resource_group.name
}
resource "azurerm_subnet" "test" {
address_prefixes = ["10.52.0.0/24"]
name = "${random_id.prefix.hex}-sn"
resource_group_name = local.resource_group.name
virtual_network_name = azurerm_virtual_network.test.name
enforce_private_link_endpoint_network_policies = true
}
resource "azurerm_subnet" "pod" {
address_prefixes = ["10.52.1.0/24"]
name = "${random_id.prefix.hex}-pod"
resource_group_name = local.resource_group.name
virtual_network_name = azurerm_virtual_network.test.name
enforce_private_link_endpoint_network_policies = true
}
# resource "azurerm_resource_group" "nodepool" {
# location = local.resource_group.location
# name = "f557-nodepool"
# }
module "aks-eu-north" {
source = "Azure/aks/azurerm"
version = "8.0.0"
prefix = "f557"
resource_group_name = local.resource_group.name
node_resource_group = "f557-nodepool${random_id.name.hex}"
kubernetes_version = "1.29.2"
orchestrator_version = "1.29.2"
oidc_issuer_enabled = true
workload_identity_enabled = true
agents_pool_name = "default"
agents_availability_zones = ["1", "2", "3"]
agents_type = "VirtualMachineScaleSets"
agents_size = try("Standard_B4s_v2", "Standard_D2s_v3")
temporary_name_for_rotation = "tmp"
enable_auto_scaling = true
agents_count = null
agents_min_count = 1
agents_max_count = 8
azure_policy_enabled = true
log_analytics_workspace_enabled = false
log_retention_in_days = 30
network_plugin = "azure"
load_balancer_sku = "standard"
ebpf_data_plane = "cilium"
os_disk_size_gb = 60
rbac_aad = true
rbac_aad_managed = true
rbac_aad_azure_rbac_enabled = true
role_based_access_control_enabled = true
# rbac_aad_admin_group_object_ids = [local.inputs["groups"]["infra"]]
sku_tier = "Standard"
vnet_subnet_id = azurerm_subnet.test.id
pod_subnet_id = azurerm_subnet.pod.id
agents_labels = {}
agents_tags = {}
}
After apply, I updated the soak duration via the Azure CLI:
az aks nodepool update --cluster-name f557-aks --resource-group ba4d95fcea318222-rg --name default --node-soak-duration 5
Then I ran terraform plan
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
# azurerm_subnet.pod will be updated in-place
~ resource "azurerm_subnet" "pod" {
id = "/subscriptions/xxxxxxxxxxxx/resourceGroups/ba4d95fcea318222-rg/providers/Microsoft.Network/virtualNetworks/ba4d95fcea318222-vn/subnets/ba4d95fcea318222-pod"
name = "ba4d95fcea318222-pod"
# (11 unchanged attributes hidden)
- delegation {
- name = "aks-delegation" -> null
- service_delegation {
- actions = [
- "Microsoft.Network/virtualNetworks/subnets/join/action",
] -> null
- name = "Microsoft.ContainerService/managedClusters" -> null
}
}
}
# module.aks-eu-north.azurerm_kubernetes_cluster.main will be updated in-place
~ resource "azurerm_kubernetes_cluster" "main" {
id = "/subscriptions/xxxxxxxxxxxx/resourceGroups/ba4d95fcea318222-rg/providers/Microsoft.ContainerService/managedClusters/f557-aks"
name = "f557-aks"
tags = {}
# (39 unchanged attributes hidden)
~ default_node_pool {
name = "default"
tags = {}
# (33 unchanged attributes hidden)
- upgrade_settings {
- drain_timeout_in_minutes = 0 -> null
- max_surge = "10%" -> null
- node_soak_duration_in_minutes = 0 -> null
}
}
# (6 unchanged blocks hidden)
}
Plan: 0 to add, 2 to change, 0 to destroy.
We were not able to reproduce this issue on our side.
We've also consulted the service team, but we have no idea where this 30 came from. @Israphel, could you please try to give us a minimal example that reproduces this issue?
Try with:
az aks nodepool update --cluster-name f557-aks --resource-group ba4d95fcea318222-rg --name default --max-surge 10% --node-soak-duration 10 --drain-timeout 30
Is there an existing issue for this?
Greenfield/Brownfield provisioning
brownfield
Terraform Version
1.5.5
Module Version
8.0.0
AzureRM Provider Version
3.106.0
Affected Resource(s)/Data Source(s)
azurerm_kubernetes_cluster
Terraform Configuration Files
tfvars variables values
Debug Output/Panic Output
More info
The only thing I did was apply the soak time via the command line, since the module doesn't support it, but I wouldn't expect the whole cluster to be destroyed just for that.
The same issue doesn't occur with provider 3.105.0