guy-microsoft closed this issue 3 years ago.
Hi guy-microsoft, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Hi @guy-microsoft, would you be able to let us see the ARM snippet you are using to do the upgrade?
Hi @justindavies, sure. This is the part of the template that is relevant to the cluster upgrade:
{
  "type": "Microsoft.ContainerService/managedClusters",
  "apiVersion": "2021-02-01",
  "name": "[parameters('aksName')]",
  "location": "[parameters('location')]",
  "dependsOn": [
    "[resourceId('Microsoft.KeyVault/vaults', parameters('keyVaultName'))]",
    "[resourceId('Microsoft.OperationalInsights/workspaces', variables('logAnalyticsDefaultName'))]"
  ],
  "sku": {
    "name": "Basic",
    "tier": "Free"
  },
  "properties": {
    "kubernetesVersion": "1.19.11",
    "dnsPrefix": "[parameters('aksName')]",
    "agentPoolProfiles": [
      {
        "name": "hbs",
        "count": 8,
        "vmSize": "Standard_DS2_v2",
        "osDiskSizeGB": 128,
        "osDiskType": "Managed",
        "maxPods": 110,
        "type": "VirtualMachineScaleSets",
        "availabilityZones": [
          "1",
          "2",
          "3"
        ],
        "minCount": 3,
        "maxCount": 100,
        "enableAutoScaling": true,
        "orchestratorVersion": "1.19.11",
        "enableNodePublicIP": false,
        "osType": "Linux",
        "mode": "System",
        "nodeLabels": {
          "scope": "hbs",
          "subScope": "default"
        }
      },
      {
        "name": "nodepool4er",
        "count": 8,
        "vmSize": "Standard_DS2_v2",
        "osDiskSizeGB": 128,
        "osDiskType": "Managed",
        "maxPods": 110,
        "type": "VirtualMachineScaleSets",
        "availabilityZones": [
          "1",
          "2",
          "3"
        ],
        "minCount": 4,
        "maxCount": 100,
        "enableAutoScaling": true,
        "orchestratorVersion": "1.19.11",
        "enableNodePublicIP": false,
        "osType": "Linux",
        "mode": "User",
        "nodeLabels": {
          "scope": "external-resources"
        }
      }
    ],
    "servicePrincipalProfile": {
      "clientId": "[parameters('servicePrincipalAppId')]",
      "secret": "[parameters('servicePrincipalSecret')]"
    },
    "addonProfiles": {
      "kubedashboard": {
        "enabled": true,
        "config": {}
      },
      "omsagent": {
        "enabled": true,
        "config": {
          "loganalyticsworkspaceresourceid": "[resourceId('Microsoft.OperationalInsights/workspaces', variables('logAnalyticsDefaultName'))]"
        }
      }
    },
    "enableRBAC": true,
    "networkProfile": {
      "networkPlugin": "kubenet",
      "networkPolicy": "calico",
      "loadBalancerSku": "Standard"
    },
    "autoScalerProfile": {
      "balance-similar-node-groups": "false",
      "expander": "random",
      "max-empty-bulk-delete": "10",
      "max-graceful-termination-sec": "600",
      "max-total-unready-percentage": "45",
      "new-pod-scale-up-delay": "0s",
      "ok-total-unready-count": "3",
      "scale-down-delay-after-add": "10m",
      "scale-down-delay-after-delete": "10s",
      "scale-down-delay-after-failure": "3m",
      "scale-down-unneeded-time": "10m",
      "scale-down-unready-time": "20m",
      "scale-down-utilization-threshold": "0.5",
      "scan-interval": "10s",
      "skip-nodes-with-local-storage": "false",
      "skip-nodes-with-system-pods": "true"
    },
    "aadProfile": "[if(equals(parameters('env'), 'prod'), variables('aksAadProfile'), json('null'))]"
  }
}
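One detail worth noting about the snippet above: in this API version (2021-02-01), each agent pool profile can also carry an upgradeSettings block that caps how many nodes are surged at once during an upgrade. A pool pinned to upgrading one node at a time would roughly add the following (a sketch only; this block is not part of the original template):

    "upgradeSettings": {
        "maxSurge": "1"
    }

maxSurge accepts either a node count ("1") or a percentage of the pool ("33%"); when it is left unset, AKS's documented default is to surge one node at a time.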
Would you be able to raise a support request? This should work as expected.
Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.
Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.
Please do mention this issue in the case description so our teams can coordinate to help you.
Thank you!
Yes @justindavies, I will do it.
@guy-microsoft Something you should check is whether you have Pod Disruption Budgets defined for your applications. I ran into a similar issue when upgrading AKS to a newer version of Kubernetes, and it turned out that our DevOps team hadn't defined Pod Disruption Budgets for the apps.
With pod disruption budgets in place, the cluster should ensure that there is always the desired number of pods in a ready state before proceeding with planned maintenance operations.
https://docs.microsoft.com/en-us/azure/aks/operator-best-practices-scheduler#voluntary-disruptions
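For concreteness, a minimal PDB matching that advice, keeping at least one replica of an app available while nodes are drained, would look roughly like the following (the name and app label here are placeholders; note that on a 1.19 cluster the PDB API group is still policy/v1beta1, as policy/v1 only shipped with Kubernetes 1.21):

    {
        "apiVersion": "policy/v1beta1",
        "kind": "PodDisruptionBudget",
        "metadata": {
            "name": "my-app-pdb"
        },
        "spec": {
            "minAvailable": 1,
            "selector": {
                "matchLabels": {
                    "app": "my-app"
                }
            }
        }
    }

kubectl apply accepts JSON as well as YAML, so this can be saved to a file and applied with kubectl apply -f.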
Thanks @mikenri, but we did define a Pod Disruption Budget for every application with #replicas > 2 to ensure its minimum number of pods is 1 during the update. It is indeed interesting that there were disruptions in my service despite the Pod Disruption Budgets when all nodes were updated simultaneously.
Case being worked with Microsoft Support; adding the stale label for automatic closure if no other reports are added.
This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. guy-microsoft, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
What happened: Updated the Kubernetes version of both the control plane and the worker nodes using an ARM template. The nodes were updated simultaneously, resulting in disruption to running applications.
What you expected to happen: Nodes should be updated one by one to avoid disruption to running applications. This is also mentioned in this doc.
How to reproduce it (as minimally and precisely as possible): Not consistent. In most cases the upgrade proceeds as expected (one node at a time); however, sometimes the nodes are upgraded simultaneously.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): while upgrading from 1.18 to 1.19
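A quick way to check what surge behavior each pool will actually use during an upgrade is to inspect the pool resource, for example with az aks nodepool show; the JSON it returns includes the pool's upgradeSettings (an illustrative excerpt with assumed values, not output from this cluster):

    {
        "name": "hbs",
        "upgradeSettings": {
            "maxSurge": "1"
        }
    }

If maxSurge is unset, the pool falls back to the platform default of upgrading one node at a time.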