Azure Kubernetes Service
https://azure.github.io/AKS/

Node pools set to 0 do not get upgraded when rolling cluster-wide AKS upgrade #1908

Closed by marcjimz 3 years ago

marcjimz commented 4 years ago

What happened: Hi all - we're noticing that node pools scaled to 0 do not upgrade themselves when we upgrade the cluster to a later version of Kubernetes.

What you expected to happen: We expect the node pool to upgrade without manual intervention. Instead, we have to scale the node pool up to 1 OR delete and recreate the node pool to get it onto the proper version.

How to reproduce it (as minimally and precisely as possible): Scale a node pool to 0, then upgrade your AKS cluster. We have confirmed this across multiple environments.
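The repro steps above can be sketched with the Azure CLI. The resource group, cluster, and pool names below are placeholders, and the target version is illustrative; this assumes a VMSS-backed cluster as discussed later in the thread.

```shell
# Placeholder names (myRG, myAKS, emptypool) and version -- adjust for your environment.

# Scale one node pool down to 0 nodes.
az aks nodepool scale \
  --resource-group myRG --cluster-name myAKS \
  --name emptypool --node-count 0

# Run a cluster-wide upgrade to a later Kubernetes version.
az aks upgrade \
  --resource-group myRG --name myAKS \
  --kubernetes-version 1.17.13 --yes

# List each pool's version; per this report, the 0-node pool
# still shows the old orchestratorVersion.
az aks nodepool list \
  --resource-group myRG --cluster-name myAKS \
  --query "[].{name:name, version:orchestratorVersion}" -o table
```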

Anything else we need to know?:

Environment:

ghost commented 4 years ago

Hi marcjimz, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2. Please abide by the AKS repo Guidelines and Code of Conduct.
3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

paulgmiller commented 4 years ago

Upgrading a 0-node agent pool should work from either the managed cluster (mc) or the agentpool resource. I'm unable to repro locally with my own cluster. Assuming this is a VMSS cluster. Would you be willing to share the cluster's FQDN and the time of the operation that didn't upgrade it?
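The agent-pool-level upgrade path mentioned here can also be invoked directly as a workaround, which is what later per-pool upgrades in this thread amount to. Resource names are placeholders:

```shell
# Upgrade a single (possibly 0-node) pool explicitly; placeholder names and version.
az aks nodepool upgrade \
  --resource-group myRG --cluster-name myAKS \
  --name emptypool --kubernetes-version 1.17.13
```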

palma21 commented 4 years ago

Thanks @marcjimz, I deleted the comment just to avoid having the FQDN stay here. We're looking into it.

paulgmiller commented 4 years ago

Looks like this came in through the Azure portal; we see 4 requests: one for the cluster to update just the control plane, then one for each of 3 agent pools. This might be a problem in the portal's calling code. Trying to confirm. If you can confirm the operation was kicked off from the portal, that would help.

ghost commented 4 years ago

@palma21, @jenetlan, @chandraneel, @abshamsft would you be able to assist?

marcjimz commented 4 years ago

@paulgmiller correct, we typically use the Azure CLI, but this one was done via the portal.

marcjimz commented 4 years ago

> Thanks @marcjimz deleted the comment just to avoid having the FQDN stay here. We're looking into it.

Thanks!

jenetlan commented 4 years ago

Thanks for the information - we'll look into the portal experience and I'll update here with what we find.

paulgmiller commented 4 years ago

So it seems from our instrumentation that in the portal you initiated 1) an upgrade of the cluster control plane and 2) upgrades of 3 of the 4 agent pools, but not the agent pool with 0 nodes.

Maybe the UI/docs aren't clear that step 1 won't upgrade any node pools regardless of size? There are valid reasons for a customer not to upgrade a node pool.
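The distinction being drawn here (the first request upgrades only the control plane, and each pool then needs its own upgrade request) maps to the `--control-plane-only` flag in current Azure CLI versions. A sketch with placeholder names:

```shell
# Upgrade only the control plane; no node pool (empty or not) is touched.
az aks upgrade \
  --resource-group myRG --name myAKS \
  --kubernetes-version 1.17.13 --control-plane-only --yes

# Each pool then needs its own upgrade request -- which, per this thread,
# the portal issued for 3 of the 4 pools but skipped for the 0-node pool.
```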

marcjimz commented 4 years ago

@paulgmiller upgrading to 1.16 is not an option. (screenshot attached)

paulgmiller commented 4 years ago

Ack, thanks. That seems like a bug and we're looking into it.

palma21 commented 3 years ago

@marcjimz we're not able to repro in different clusters. Do you mind either capturing a network trace in your browser and/or opening a support ticket? (instructions below)

ghost commented 3 years ago

Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.

Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.

Please do mention this issue in the case description so our teams can coordinate to help you.

Thank you!

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

jabbera commented 3 years ago

I did an upgrade via ARM template from 1.20.2 to 1.20.5 and did not have this issue.

paulgmiller commented 3 years ago

Yes, this should actually have been handled for a couple of months now in VMSS by upgrading just the model.

ghost commented 3 years ago

Thanks for reaching out. I'm closing this issue as it was marked with "Fix released" and it hasn't had activity for 7 days.