Azure / terraform-azurerm-aks

Terraform Module for deploying an AKS cluster

Not able to upgrade AKS cluster using terraform module - Minor version of node pool version 27 is bigger than control plane version #465

Open dunefro opened 1 year ago

dunefro commented 1 year ago


Terraform Version

1.6.2

Module Version

7.4.0

AzureRM Provider Version

3.69.0

Affected Resource(s)/Data Source(s)

module.aks.azurerm_kubernetes_cluster.main

Terraform Configuration Files

resource "azurerm_user_assigned_identity" "cluster" {
  location            = var.location
  name                = var.name
  resource_group_name = var.resource_group_name
}

resource "azurerm_role_assignment" "network_contributor_cluster" {
  scope                = var.vnet_id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.cluster.principal_id
}

module "aks" {
  source                    = "Azure/aks/azurerm"
  version                   = "7.4.0"
  resource_group_name       = var.resource_group_name
  cluster_name              = var.name
  location                  = var.location
  prefix                    = "prefix"
  workload_identity_enabled = var.workload_identity_enabled

  # agent configuration
  # agents_availability_zones = []
  agents_count         = local.intial_node_pool_min_count
  agents_max_count     = local.intial_node_pool_max_count
  agents_min_count     = local.intial_node_pool_min_count
  agents_pool_name     = "initial"
  agents_size          = var.intial_node_pool_instance_type
  agents_tags          = local.tags
  orchestrator_version = var.kubernetes_version

  # autoscaler configuration
  auto_scaler_profile_enabled                          = true
  auto_scaler_profile_expander                         = "random"
  auto_scaler_profile_max_graceful_termination_sec     = "180"
  auto_scaler_profile_max_node_provisioning_time       = "5m"
  auto_scaler_profile_max_unready_nodes                = 0
  auto_scaler_profile_scale_down_delay_after_add       = "2m"
  auto_scaler_profile_scale_down_delay_after_delete    = "30s"
  auto_scaler_profile_scale_down_unneeded              = "1m"
  auto_scaler_profile_scale_down_unready               = "2m"
  auto_scaler_profile_scale_down_utilization_threshold = "0.3"

  # cluster level configurations
  api_server_authorized_ip_ranges            = var.allowed_ip_ranges
  create_role_assignment_network_contributor = false
  enable_auto_scaling                        = true
  enable_host_encryption                     = true
  identity_ids                               = [azurerm_user_assigned_identity.cluster.id]
  identity_type                              = "UserAssigned"
  kubernetes_version                         = var.kubernetes_version

  # storage
  storage_profile_blob_driver_enabled         = var.enable_blob_driver
  storage_profile_disk_driver_enabled         = var.enable_disk_driver
  storage_profile_disk_driver_version         = var.disk_driver_version
  storage_profile_file_driver_enabled         = var.enable_file_driver
  storage_profile_snapshot_controller_enabled = var.enable_snapshot_controller
  storage_profile_enabled                     = var.enable_storage_profile

  # network configuration
  network_plugin             = var.network_plugin
  vnet_subnet_id             = var.subnet_id
  net_profile_dns_service_ip = var.dns_ip
  net_profile_service_cidr   = var.service_cidr
  net_profile_pod_cidr       = var.pod_cidr
  # net_profile_docker_bridge_cidr = "10.244.0.10"

  node_pools = local.node_pools

  oidc_issuer_enabled = var.oidc_issuer_enabled
  os_disk_size_gb     = var.disk_size

  # makes the initial node pool have a taint `CriticalAddonsOnly=true:NoSchedule`
  # helpful in scheduling important workloads 
  only_critical_addons_enabled = true

  private_cluster_enabled = var.private_cluster_enabled

  # rbac 
  rbac_aad                          = false
  role_based_access_control_enabled = false

  tags = local.tags
}

Variables and default values (variables.tf / locals.tf)

################################################################################
# Cluster
################################################################################

variable "name" {
  description = "Name of the cluster"
  type        = string
}
variable "kubernetes_version" {
  description = "Version of the kubernetes engine"
  default     = "1.27"
  type        = string
}

variable "oidc_issuer_enabled" {
  description = "Enable OIDC for the cluster"
  default     = true
  type        = bool
}

variable "disk_size" {
  description = "Disk size of the initial node pool in GB"
  default     = "100"
  type        = string
}

variable "intial_node_pool_instance_type" {
  description = "Instance size of the initial node pool"
  default     = "Standard_D2s_v5"
  type        = string
}

variable "intial_node_pool_spot_instance_type" {
  description = "Instance size of the initial node pool"
  default     = "Standard_D4s_v5"
  type        = string
}

variable "workload_identity_enabled" {
  description = "Enable workload identity in the cluster"
  default     = true
  type        = bool
}

variable "control_plane" {
  description = "Whether the cluster is control plane"
  type        = bool
}

variable "enable_storage_profile" {
  description = "Enable storage profile for the cluster. If disabled `enable_blob_driver`, `enable_file_driver`, `enable_disk_driver` and `enable_snapshot_controller` will have no impact"
  type        = bool
  default     = true
}

variable "enable_blob_driver" {
  description = "Enable blob storage provider"
  type        = bool
  default     = true
}

variable "enable_file_driver" {
  description = "Enable file storage provider"
  type        = bool
  default     = true
}

variable "enable_disk_driver" {
  description = "Enable disk storage provider"
  type        = bool
  default     = true
}

variable "disk_driver_version" {
  description = "Version of disk driver. Supported values `v1` and `v2`"
  type        = string
  default     = "v1"
}
variable "enable_snapshot_controller" {
  description = "Enable snapshot controller"
  type        = bool
  default     = true
}
################################################################################
# Network
################################################################################

variable "vnet_id" {
  description = "Vnet ID for the cluster"
  type        = string
}

variable "subnet_id" {
  description = "Subnet Id for the cluster"
  type        = string
}

variable "network_plugin" {
  description = "Network plugin to use for cluster"
  type        = string
  default     = "kubenet"
}

variable "pod_cidr" {
  description = "CIDR of the pod in cluster"
  default     = "10.244.0.0/16"
  type        = string
}

variable "service_cidr" {
  description = "CIDR of the services in cluster"
  default     = "10.255.0.0/16"
  type        = string
}

variable "dns_ip" {
  description = "IP from service CIDR used for internal DNS"
  default     = "10.255.0.10"
  type        = string
}

variable "allowed_ip_ranges" {
  description = "Allowed IP ranges to connect to the cluster"
  default     = ["0.0.0.0/0"]
  type        = list(string)
}

variable "private_cluster_enabled" {
  description = "Private cluster"
  default     = false
  type        = bool
}

################################################################################
# Generic
################################################################################

variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
}

variable "location" {
  description = "Location of the resource group"
  type        = string
}

variable "tags" {
  description = "A map of tags to add to all resources"
  type        = map(string)
  default     = {}
}

# locals.tf
locals {
  tags = merge(
    {
      "terraform-module" = "terraform-azure-cluster"
      "terraform"        = "true"
      "cluster-name"     = var.name
    },
    var.tags
  )
  intial_node_pool_min_count = var.control_plane ? 2 : 1
  intial_node_pool_max_count = var.control_plane ? 3 : 2
  node_pools = {
    spot = {
      name            = "spotpool"
      node_count      = 1
      max_count       = 20
      min_count       = 1
      os_disk_size_gb = 100
      priority        = "Spot"
      vm_size         = var.intial_node_pool_spot_instance_type

      # mandatory to pass otherwise node pool will be recreated
      enable_auto_scaling     = true
      custom_ca_trust_enabled = false
      enable_host_encryption  = false
      enable_node_public_ip   = false
      eviction_policy         = "Delete"
      orchestrator_version    = var.kubernetes_version
      node_taints = [
        "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
      ]
      tags           = local.tags
      zones          = []
      vnet_subnet_id = var.subnet_id
    }
  }
}

Debug Output/Panic Output

│ Error: updating Default Node Pool Agent Pool (Subscription: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx"
│ Resource Group Name: "tfy-az-prod"
│ Managed Cluster Name: "az-prod-eaus"
│ Agent Pool Name: "initial") performing CreateOrUpdate: agentpools.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="NodePoolMcVersionIncompatible" Message="Node pool version 1.27.3 and control plane version 1.26.6 are incompatible. Minor version of node pool version 27 is bigger than control plane version 26. For more information, please check https://aka.ms/aks/UpgradeVersionRules"
│ 
│   with module.aks.azurerm_kubernetes_cluster.main,
│   on .terraform/modules/aks/main.tf line 17, in resource "azurerm_kubernetes_cluster" "main":
│   17: resource "azurerm_kubernetes_cluster" "main" {

Expected Behaviour

Ideally the control plane should be updated first and then the node pool. To resolve this issue I have to first update the control plane from the portal and then update the node pool from Terraform.

Actual Behaviour

There is no Terraform diff for updating the control plane, and the node pool upgrade causes a version incompatibility error.

Steps to Reproduce

No response

Important Factoids

No response

References

No response

mkilchhofer commented 10 months ago

We at @swisspost have the exact same issue. I analyzed it with some of my colleagues (@hichem-belhocine will ping you, @zioproto, via MS Teams).

We have no auto-update mechanism from the AKS side in place (Auto Upgrade Type = Disabled); we specify the exact patch version for both the control plane and the node pools:

With TF module version 6.8.0 and the upgrade from 1.25 -> 1.26, a plan looked like this: (plan screenshot)

Everything went smoothly. Then we upgraded to module version 7.5.0, where the upgrade is now triggered via another TF provider (azapi) and the control plane variable is ignored via a lifecycle { } block. Ref:

A plan to 1.27 now looks like this (w/ module version 7.5.0): (plan screenshot)

And of course this will fail, as you cannot update the node pool to a newer version than the control plane:

│ Error: updating Default Node Pool Agent Pool (Subscription: "<REDACTED>"
│ Resource Group Name: "rg-aks-dev-m98cn0001"
│ Managed Cluster Name: "aks-dev-m98cn0001"
│ Agent Pool Name: "system") performing CreateOrUpdate: agentpools.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="NodePoolMcVersionIncompatible" Message="Node pool version 1.27.7 and control plane version 1.26.10 are incompatible. Minor version of node pool version 27 is bigger than control plane version 26. For more information, please check https://aka.ms/aks/UpgradeVersionRules"
│ 
│   with module.aks_post.module.aks.azurerm_kubernetes_cluster.main,
│   on .terraform/modules/aks_post.aks/main.tf line 17, in resource "azurerm_kubernetes_cluster" "main":
│   17: resource "azurerm_kubernetes_cluster" "main" {
│ 
zioproto commented 10 months ago

@dunefro I understand you are using var.orchestrator_version = var.kubernetes_version

I see the problem where after PR #336 the cluster update is handled like this:

https://github.com/Azure/terraform-azurerm-aks/blob/3851478e869fd5523899b10e0d654d3c9b631b4c/main.tf#L603-L616
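For readers who don't follow the link: the pattern at those lines is, roughly, a null_resource "version keeper" plus an azapi_update_resource that patches the control plane when var.kubernetes_version changes. The sketch below paraphrases that pattern; the exact resource names, API version, and arguments in the module source may differ.

resource "null_resource" "kubernetes_version_keeper" {
  triggers = {
    version = var.kubernetes_version
  }
}

resource "azapi_update_resource" "aks_cluster_post_create" {
  # The API version string here is illustrative, not necessarily the one the module pins.
  type        = "Microsoft.ContainerService/managedClusters@2023-01-02-preview"
  resource_id = azurerm_kubernetes_cluster.main.id
  body = jsonencode({
    properties = {
      kubernetesVersion = var.kubernetes_version
    }
  })

  lifecycle {
    ignore_changes       = all
    replace_triggered_by = [null_resource.kubernetes_version_keeper.id]
  }
}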

so the Terraform plan will attempt to update just the nodepools, because var.orchestrator_version is not ignored in the lifecycle.

https://github.com/Azure/terraform-azurerm-aks/blob/3851478e869fd5523899b10e0d654d3c9b631b4c/main.tf#L520-L526

@dunefro to unblock your work you could set the following values:

kubernetes_version   = 1.27
orchestrator_version = 1.26

This will update the control plane first to the latest 1.27 patch version. Once Terraform apply completes you can update your values to:

kubernetes_version   = 1.27
orchestrator_version = 1.27

This will finish the upgrade by updating the node pools as well.
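In module terms, the two-step workaround could look like this (a minimal sketch based on the configuration above; every other argument stays exactly as it already is):

# Step 1: upgrade only the control plane, keep the node pools on the old minor.
module "aks" {
  source  = "Azure/aks/azurerm"
  version = "7.4.0"
  # ... all other arguments unchanged ...

  kubernetes_version   = "1.27" # control plane target version
  orchestrator_version = "1.26" # default/system node pool stays behind for now
}

# Step 2, after the first apply has finished: align the node pools.
#   kubernetes_version   = "1.27"
#   orchestrator_version = "1.27"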

@dunefro please confirm if this unblocks your work and you can upgrade successfully.

@lonegunmanb @nellyk for the long term solution I see 2 options:

1) Make it very visible in the module documentation that kubernetes_version and orchestrator_version are two distinct variables, because the control plane must be upgraded first and the node pools afterwards.

2) Support the scenario where the user configures kubernetes_version = orchestrator_version during an upgrade. In this case we should also add var.orchestrator_version to the lifecycle ignore_changes, both in azurerm_kubernetes_cluster_node_pool (extra node pools) and azurerm_kubernetes_cluster (system node pool), and handle those upgrades with a new azapi resource (see the sketch below).
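A rough sketch of what option 2 would mean inside the module (illustrative only, not the module's actual code; the resource names and attribute paths are assumptions):

resource "azurerm_kubernetes_cluster" "main" {
  # ...
  lifecycle {
    ignore_changes = [
      kubernetes_version,
      default_node_pool[0].orchestrator_version, # system node pool version
    ]
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  # ...
  lifecycle {
    ignore_changes = [orchestrator_version] # extra node pools
  }
}

# The actual version bump would then be applied out of band by a new azapi
# update resource, similar to how the control plane upgrade is handled today.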

Please everyone give feedback on what is the best option for you.

zioproto commented 10 months ago

TODO:

test if the following code: https://github.com/Azure/terraform-azurerm-aks/blob/3851478e869fd5523899b10e0d654d3c9b631b4c/main.tf#L603-L616

will also upgrade the system node pool to var.kubernetes_version regardless of the value of var.orchestrator_version.

mkilchhofer commented 10 months ago

Please everyone give feedback on what is the best option for you.

I am not a fan of the approach of upgrading everything with the azapi provider if we do hard pinning and disable auto-patching.

The azapi approach has several caveats, the main one being the non-idempotent behavior. If we roll out only the control plane, nothing seems to happen on the first run. The null_resource with the "version keeper" is updated, but it doesn't trigger the upgrade on the first run. Then, if you trigger another run without changing any code, the upgrade finally starts. IMHO non-idempotent behavior is a no-no in the Terraform world.

I was happy with the approach available in 6.8.

Maybe we need two things:

  • behavior of 6.8 for no autopatch
  • behavior of 7.5 in case autopatch is active

zioproto commented 10 months ago

Maybe we need two things:

  • behavior of 6.8 for no autopatch
  • behavior of 7.5 in case autopatch is active

The challenge is that we cannot have a conditional lifecycle block that depends on the value of var.automatic_channel_upgrade.

We would have to create two independent azurerm_kubernetes_cluster resources and create only one conditionally. That code would be very hard to maintain, because every change would have to be duplicated across both azurerm_kubernetes_cluster resources (see the sketch below).
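To make the maintenance problem concrete: because a lifecycle block cannot contain expressions, the only way to get two behaviors would be to duplicate the whole resource and toggle it with count, roughly like this (illustrative sketch only):

# Variant for hard-pinned versions: Terraform manages the version directly.
resource "azurerm_kubernetes_cluster" "manual_upgrade" {
  count = var.automatic_channel_upgrade == null ? 1 : 0
  # ... full cluster definition duplicated here ...
}

# Variant for automatic upgrades: the version is ignored and patched out of band.
resource "azurerm_kubernetes_cluster" "auto_upgrade" {
  count = var.automatic_channel_upgrade != null ? 1 : 0
  # ... the same full cluster definition duplicated again ...

  lifecycle {
    ignore_changes = [kubernetes_version]
  }
}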

zioproto commented 10 months ago

Next step:

zioproto commented 10 months ago

The azapi approach has several caveats, the main one being the non-idempotent behavior. If we roll out only the control plane, nothing seems to happen on the first run. The null_resource with the "version keeper" is updated, but it doesn't trigger the upgrade on the first run. Then, if you trigger another run without changing any code, the upgrade finally starts. IMHO non-idempotent behavior is a no-no in the Terraform world.

@mkilchhofer I am not able to reproduce the non-idempotent behavior. I created PR #501, where I extend one of our current CI tests to run the upgrades. As soon as I change the value of var.kubernetes_version, the upgrade of the control plane is triggered. Could you please test? Or could you please provide the exact steps to reproduce the non-idempotent behavior? Thanks

mkilchhofer commented 10 months ago

Could you please test? Or could you please provide the exact steps to reproduce the non-idempotent behavior? Thanks

I can do some more testing, but I am now on vacation for one week (⛷️🎿🏔️). I am back in the office on February 5th.

dunefro commented 10 months ago

please confirm if this unblocks your work and you can upgrade successfully.

Yes this works. Thanks @zioproto

propyless commented 9 months ago

@dunefro I understand you are using var.orchestrator_version = var.kubernetes_version

I see the problem where after PR #336 the cluster update is handled like this:

https://github.com/Azure/terraform-azurerm-aks/blob/3851478e869fd5523899b10e0d654d3c9b631b4c/main.tf#L603-L616

so the Terraform plan will attempt to update just the nodepools, because var.orchestrator_version is not ignored in the lifecycle.

https://github.com/Azure/terraform-azurerm-aks/blob/3851478e869fd5523899b10e0d654d3c9b631b4c/main.tf#L520-L526

@dunefro to unblock your work you could set the following values:

kubernetes_version   = 1.27
orchestrator_version = 1.26

This will update the control plane first to the latest 1.27 patch version. Once Terraform apply completes you can update your values to:

kubernetes_version   = 1.27
orchestrator_version = 1.27

This will finish the upgrade by updating the node pools as well.

@dunefro please confirm if this unblocks your work and you can upgrade successfully.

@lonegunmanb @nellyk for the long term solution I see 2 options:

  1. Make it very visible in the module documentation that kubernetes_version and orchestrator_version are two distinct variables, because the control plane must be upgraded first and the node pools afterwards.
  2. Support the scenario where the user configures kubernetes_version = orchestrator_version during an upgrade. In this case we should also add var.orchestrator_version to the lifecycle ignore_changes, both in azurerm_kubernetes_cluster_node_pool (extra node pools) and azurerm_kubernetes_cluster (system node pool), and handle those upgrades with a new azapi resource.

Please everyone give feedback on what is the best option for you.

Hello zioproto. I think both ways of working make sense from a user perspective, since typically you upgrade the control plane before the workers.

From a usability point of view, I prefer being able to set both versions to the version I want and have the module handle the upgrade order, as it once worked in 6.8.0.

However, today there is no documentation about the module's dropped support for handling a seamless upgrade of both the orchestrator and the worker planes, which causes this confusion.

I think it also makes sense to be able to upgrade some node pools to a newer version while others stay a version behind.

For those of us using the module, just adding documentation on the intended upgrade procedure after the 6.8 to 7.5 change is enough to satisfy us.

Edit: For those of us who use PRs to peer review each other's changes, this means we need to make two changes: one for the control plane and a second for the node pools, which is a bit more of a hassle.

zioproto commented 8 months ago

Could you please test? Or could you please provide the exact steps to reproduce the non-idempotent behavior? Thanks

I can do some more testing, but I am now on vacation for one week (⛷️🎿🏔️). I am back in the office on February 5th.

@mkilchhofer friendly ping about this pending issue :) thanks

fleetwoodstack commented 7 months ago

@mkilchhofer friendly ping about this issue :)

mkilchhofer commented 6 months ago

@mkilchhofer friendly ping about this issue :)

I cannot test it anymore. We no longer use the module and manage the required TF resources ourselves.

/cc: @zioproto

fleetwoodstack commented 6 months ago

Can anyone else test it?

dunefro commented 5 months ago

I was testing the entire setup once again, with the current version being 1.28 and the new version being 1.29. As mentioned by @propyless, I started by updating the variable kubernetes_version to 1.29 first.

Somehow I am not able to detect any drift for upgrading the Kubernetes version. If I try to use orchestrator_version = 1.29, drift is detected, but only for the node pools, not the control plane. This is my AKS Terraform file.

I have also tried upgrading to AKS module version 9.0.0 and azurerm version 3.107.0, with the same results.

One more weird thing I have observed in the drift is the node count of the AKS node pool. If I do not specify node_count, but rather specify min_count and max_count for the node pool, then upon each terraform apply, even if there is no manual change from my side, Terraform detects drift that reduces the size of the node pool back to 0. This should not happen if I am using the cluster autoscaler.

It would be great if we could get some help in the official documentation on:

  • how to use the node pool
  • the various scenarios where node pools are managed by the cluster autoscaler or not

zioproto commented 5 months ago
  • how to use the node pool

Would this example, which we already test in the CI, help to demonstrate how to use node pools?

https://github.com/Azure/terraform-azurerm-aks/tree/main/examples/multiple_node_pools

zioproto commented 5 months ago

@lonegunmanb could you please double-check this with me? In the resource azurerm_kubernetes_cluster_node_pool we have an each.value.node_count value.

https://github.com/Azure/terraform-azurerm-aks/blob/2c364a6d206f6c546da70fb6b9bd0dd633a83764/extra_node_pool.tf#L6-L28

We mark node_count as optional:

https://github.com/Azure/terraform-azurerm-aks/blob/2c364a6d206f6c546da70fb6b9bd0dd633a83764/variables.tf#L944-L949

But in the example we have node_count = 1.

https://github.com/Azure/terraform-azurerm-aks/blob/2c364a6d206f6c546da70fb6b9bd0dd633a83764/examples/multiple_node_pools/main.tf#L34-L44

When the cluster autoscaler is enabled, node_count should be null. Is it possible that we have an ambiguity here? I am afraid the value will be 0 instead of null unless explicitly set.

For the system node pool the variable is called agents_count and it must be explicitly set to null:

https://github.com/Azure/terraform-azurerm-aks/blob/2c364a6d206f6c546da70fb6b9bd0dd633a83764/variables.tf#L30-L34

@dunefro what happens if you set node_count to null in the node_pools object? Does it fix the state drift problem?

zioproto commented 5 months ago

@dunefro To try to reproduce your issue, I tested the example examples/multiple_node_pools in this repo with the following change:

 resource "azurerm_virtual_network" "test" {
   address_space       = ["10.52.0.0/16"]
-  location            = local.resource_group.location
+  location            = "eastus"
   name                = "${random_id.prefix.hex}-vn"
   resource_group_name = local.resource_group.name
 }
@@ -36,7 +36,10 @@ locals {
     for i in range(3) : "worker${i}" => {
       name                  = substr("worker${i}${random_id.prefix.hex}", 0, 8)
       vm_size               = "Standard_D2s_v3"
-      node_count            = 1
+      min_count             = 1
+      max_count             = 3
+      enable_auto_scaling   = true
       vnet_subnet_id        = azurerm_subnet.test.id
       create_before_destroy = i % 2 == 0
     }

But I don't get Terraform state drift, and I can run "terraform apply" over and over again without detecting any change.

Could you please provide a code example that shows the problem of Terraform trying to force the node count to 0? Thanks

dunefro commented 5 months ago

what happens if you set node_count to null in the node_pools object? Does it fix the state drift problem?

@zioproto Yes, this gets solved when I set node_count to null; Terraform doesn't detect any drift.
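For reference, applied to the node_pools local from the original report, the fix is simply to make the count explicit as null when autoscaling owns the pool size (sketch):

node_pools = {
  spot = {
    name                = "spotpool"
    node_count          = null # let the cluster autoscaler manage the size
    min_count           = 1
    max_count           = 20
    enable_auto_scaling = true
    # ... all remaining attributes as in the original locals ...
  }
}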