Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Auto upgrade channel fails to upgrade the orchestratorVersion on node pools, leaving an inconsistent state. #3508

Open slzmruepp opened 1 year ago

slzmruepp commented 1 year ago

**Describe the bug**
The stable auto-upgrade channel fails to update orchestratorVersion on the node pool.

**To Reproduce**
Steps to reproduce the behavior:
1. Create an AKS cluster with a system and a user node pool at Kubernetes version N-2.
2. Enable the stable auto-upgrade channel.
3. Wait for the cluster to upgrade.
4. Check the node pools in the portal, with the az CLI, or with kubectl get nodes: the user pool reports the wrong patch version.
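
For concreteness, a minimal CLI sketch of these steps (resource names are taken from the environment output below; 1.24.6 stands in for N-2, and node counts are illustrative):

# Create the cluster with a system pool at N-2 and the stable auto-upgrade channel
az aks create \
  --resource-group rg-project-stg \
  --name aks-project-stg \
  --nodepool-name system \
  --node-count 1 \
  --kubernetes-version 1.24.6 \
  --auto-upgrade-channel stable

# Add the user pool at the same version
az aks nodepool add \
  --resource-group rg-project-stg \
  --cluster-name aks-project-stg \
  --name user \
  --node-count 1 \
  --kubernetes-version 1.24.6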

**Expected behavior**
orchestratorVersion and currentOrchestratorVersion should report the correct deployed version.

**Screenshots**
(Three portal screenshots from 2023-02-24 showing the mismatched node pool versions.)

**Environment (please complete the following information):**

$ az aks show --resource-group rg-project-stg --name aks-project-stg | grep kubernetesVersion
  "kubernetesVersion": "1.24.9",
$ az aks show --resource-group rg-project-stg --name aks-project-stg | grep currentKubernetesVersion
  "currentKubernetesVersion": "1.24.9",
$ az aks nodepool show --resource-group rg-project-stg --cluster-name aks-project-stg --name user | grep orchestratorVersion
  "orchestratorVersion": "1.24.6",
$ az aks nodepool show --resource-group rg-project-stg --cluster-name aks-project-stg --name user | grep currentOrchestratorVersion
  "currentOrchestratorVersion": "1.24.9",


**Additional context**
This blocks further changes to the cluster, especially in Terraform runs: it becomes impossible to apply even non-functional changes such as tags. It is a blocker, and essentially every project with this setup fails. See the error from the Terraform runs here:

│ Error: updating Managed Cluster (Subscription: "XXX"
│ Resource Group Name: "rg-project-master-int"
│ Managed Cluster Name: "aks-project-master-int"): managedclusters.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged" Message="Using managed cluster api, all Agent pools' OrchestratorVersion must be all specified or all unspecified. If all specified, they must be stay unchanged or the same with control plane. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api"
│
│   with module.aks_cluster.azurerm_kubernetes_cluster.aks_cluster,
│   on modules/aks/main.tf line 32, in resource "azurerm_kubernetes_cluster" "aks_cluster":
│   32: resource "azurerm_kubernetes_cluster" "aks_cluster" {

sabbour commented 1 year ago

@kaarthis @chandraneel please take a look

ghost commented 1 year ago

Action required from @Azure/aks-pm

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

mloskot commented 1 year ago

I think, @slzmruepp, you should be able to work around this issue with az aks nodepool upgrade, see https://learn.microsoft.com/en-us/azure/aks/use-multiple-node-pools#upgrade-a-node-pool
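
A sketch of that workaround, reusing the resource names from the issue description (the target version must match the control plane):

az aks nodepool upgrade \
  --resource-group rg-project-stg \
  --cluster-name aks-project-stg \
  --name user \
  --kubernetes-version 1.24.9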

slzmruepp commented 1 year ago

This did not help. The only way I could mitigate it was to update the node pool version through the REST API, by PUTting the proper version with Postman:

Endpoint:

https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourcegroup}}/providers/Microsoft.ContainerService/managedClusters/{{aksname}}/agentPools/{{nodepoolname}}?api-version=2023-01-01

Body (raw):

{
  "properties": {
    "orchestratorVersion": "1.24.9"
  }
}
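
The same call, sketched as an equivalent curl command (a sketch only: it assumes an Azure CLI login to obtain the bearer token, and the {{...}} placeholders must be filled in):

# Obtain an ARM access token via the Azure CLI
TOKEN=$(az account get-access-token --query accessToken -o tsv)

# PUT the desired orchestratorVersion onto the agent pool
curl -X PUT \
  "https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourcegroup}}/providers/Microsoft.ContainerService/managedClusters/{{aksname}}/agentPools/{{nodepoolname}}?api-version=2023-01-01" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"properties": {"orchestratorVersion": "1.24.9"}}'
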
mloskot commented 1 year ago

@slzmruepp I see. I found this issue while investigating an unexpected partial Kubernetes upgrade of my AKS cluster, which I performed using Terraform. It is the second time ever I have run such an upgrade on this cluster, but the first time it has ended with a partial result, so I thought my issue might be related to your issue with the auto-upgrade.

I ran the upgrade from 1.25.5 to 1.26.3, and here is what I mean by a partial upgrade result:

$ az aks show --resource-group ${AKS_CLUSTER_GROUP} --name ${AKS_CLUSTER} --output table

Name                 Location  ResourceGroup    KubernetesVersion  CurrentKubernetesVersion  ProvisioningState
-------------------  --------  ---------------  -----------------  ------------------------  -----------------
aks-xxx-uks-stg-aks  uksouth   rg-aks-xxx-stg   1.26.3             1.26.3                     Succeeded

but for some reason the system node pool was not upgraded:

$ az aks nodepool list --resource-group ${AKS_CLUSTER_GROUP} --cluster-name ${AKS_CLUSTER} --output table

Name     OsType    KubernetesVersion    VmSize            Count    MaxPods    ProvisioningState    Mode
-------  --------  -------------------  ----------------  -------  ---------  -------------------  ------
default  Linux     1.25.5               Standard_D2_v3    1        30         Succeeded            System
w1abc    Windows   1.26.3               Standard_E2as_v5  0        30         Succeeded            User
$ kubectl get nodes

NAME                             STATUS  ROLES  AGE  VERSION
aks-default-36914368-vmss000000  Ready   agent  8d   v1.25.5
aksw1abc000001                   Ready   agent  3d   v1.26.3

After I found your issue, I also confirmed that the orchestrator version of the system node pool was left at 1.25.5.
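
For example, a sketch of that check against the system pool (pool name taken from the listing above):

$ az aks nodepool show --resource-group ${AKS_CLUSTER_GROUP} --cluster-name ${AKS_CLUSTER} --name default --query orchestratorVersion
"1.25.5"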

I decided to try az aks nodepool upgrade and it did the trick for me:

Name     OsType    KubernetesVersion  VmSize            Count    MaxPods  ProvisioningState  Mode
-------  --------  -----------------  ----------------  -------  -------  -----------------  ------
default  Linux     1.26.3             Standard_D2_v3    1        30       Succeeded          System
w1abc    Windows   1.26.3             Standard_E2as_v5  1        30       Succeeded          User

So, I shared it above.

BertelBB commented 1 year ago

Also facing this issue.

My setup:

Updating kubernetes_version from 1.24 -> 1.25 does upgrade AKS to 1.25 but leaves the node pools on version 1.24.

Initially I thought I was misunderstanding how Automatic Channel Upgrades work, so I tried switching from patch to stable, but the node pools are still left untouched during upgrades. The documentation states that orchestrator_version should be left empty if the node pools are to always follow the AKS version. Is it possible that Terraform sets the orchestrator version when it is not explicitly set to null, and that this causes the node pools to be stuck on the older version?

BertelBB commented 1 year ago

Did some digging and I don't think this is a bug or issue in AKS/Azure's API.

AzureRM Terraform provider will always set orchestratorVersion to currentOrchestratorVersion IF the Terraform variable orchestrator_version is unset. See PR when change was introduced: https://github.com/hashicorp/terraform-provider-azurerm/pull/18130

Basically, what I think is happening in my case is this: I have an existing cluster running version 1.24.10. The cluster was deployed using Terraform's azurerm_kubernetes_cluster resource with kubernetes_version = "1.24", default_node_pool.orchestrator_version = null, and automatic_channel_upgrade = "patch". Now I want to upgrade to 1.25. The AzureRM TF provider performs the upgrade of AKS (the control plane), but because orchestrator_version is unset it falls back to the currentOrchestratorVersion returned by Azure's API, which is (correctly) 1.24.10. So from Azure's point of view I am specifically asking to upgrade AKS to 1.25 while keeping the node pools on 1.24.10.
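
A minimal sketch of the configuration described, with hypothetical names and trimmed to the relevant arguments:

resource "azurerm_kubernetes_cluster" "aks_cluster" {
  name                      = "aks-example"
  location                  = "westeurope"
  resource_group_name       = "rg-example"
  dns_prefix                = "aks-example"
  kubernetes_version        = "1.24"   # minor version only; patches come from the channel
  automatic_channel_upgrade = "patch"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D2_v3"
    node_count = 1
    # Left null so the pool should follow the control plane, but per the PR
    # linked above the provider sends the pool's currentOrchestratorVersion
    # instead, effectively pinning it.
    orchestrator_version = null
  }

  identity {
    type = "SystemAssigned"
  }
}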

Hope that makes sense.

sergiomcalzada commented 10 months ago

I got the error even from the portal, when selecting "only upgrade control plane".

microsoft-github-policy-service[bot] commented 2 months ago

@kaarthis, @sdesai345 would you be able to assist?