Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Allow updating node pools using ARM template #2194

Open aelij opened 3 years ago

aelij commented 3 years ago

What happened: Currently, the only way to add a node pool in an ARM template is to create a separate child resource (.../providers/Microsoft.ContainerService/managedClusters/aks1/agentPools/p2). This is a problem when an update to the primary (system) node pool requires recreating it, for example to let it join an existing subnet or to enable "encryption at host".

If we add the new agent pool as a child resource in the template, the template is no longer idempotent (i.e. it can't be used to deploy a new, clean environment), and it also doesn't clean up the old pool. This forces us to complement ARM with scripts.

ARM deployments are meant to be idempotent, and this essentially breaks that.
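
To make the problem concrete, here is a minimal sketch of what such a template fragment might look like, built as a Python dict so it can be inspected. The pool names `p1`/`p2`, the VM size, the counts and the `<api-version>` placeholder are assumptions for illustration only, and the cluster resource is trimmed to the one property that matters here. The helper `pools_declared` is hypothetical; it just lists every pool the template would create on a clean deployment.

```python
import json

# Illustrative values only: pool names, sizes and counts are hypothetical,
# and "<api-version>" is a placeholder rather than a real version string.
API_VERSION = "<api-version>"
CLUSTER = "aks1"
CLUSTER_TYPE = "Microsoft.ContainerService/managedClusters"
POOL_TYPE = CLUSTER_TYPE + "/agentPools"

template = {
    "resources": [
        {
            # The cluster must keep its original system pool inline,
            # otherwise a clean deployment has no pool to start from.
            "type": CLUSTER_TYPE,
            "apiVersion": API_VERSION,
            "name": CLUSTER,
            "properties": {
                "agentPoolProfiles": [
                    {"name": "p1", "mode": "System", "count": 3,
                     "vmSize": "Standard_DS2_v2"}
                ]
            },
        },
        {
            # The replacement pool, added as a child resource.
            "type": POOL_TYPE,
            "apiVersion": API_VERSION,
            "name": f"{CLUSTER}/p2",
            "dependsOn": [f"[resourceId('{CLUSTER_TYPE}', '{CLUSTER}')]"],
            "properties": {"mode": "System", "count": 3,
                           "vmSize": "Standard_DS2_v2",
                           "enableEncryptionAtHost": True},
        },
    ]
}


def pools_declared(tpl):
    """Return the names of all pools the template declares, inline or child."""
    names = []
    for res in tpl["resources"]:
        if res["type"] == CLUSTER_TYPE:
            profiles = res["properties"].get("agentPoolProfiles", [])
            names += [p["name"] for p in profiles]
        elif res["type"] == POOL_TYPE:
            # Child resource names have the form "<cluster>/<pool>".
            names.append(res["name"].split("/", 1)[1])
    return sorted(names)


print(json.dumps(template, indent=2))
print(pools_declared(template))
```

A clean environment deployed from this fragment would get both `p1` and `p2`, while an existing environment would keep the old `p1` alongside the new `p2`. Neither outcome is the intended "replace `p1` with `p2`".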

Update: Another idempotency-related issue:

Code: OperationNotAllowed
Message: Updating Kubernetes version and agent node scaling are mutually exclusive operations.

AKS should be able to handle these kinds of updates on its own.
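
Until that happens, the kind of script this forces on template authors might look like the sketch below. It is not an AKS feature or an official workaround, only an illustration: it splits a change that touches both the Kubernetes version and the node count into two sequential passes. The function name `plan_passes`, the state shape and the version strings are hypothetical, and applying the version change first is an arbitrary choice.

```python
def plan_passes(current, desired):
    """Split a desired state into deployment passes that never change
    the Kubernetes version and the node count in the same pass.

    Both arguments are plain dicts with the hypothetical keys
    "kubernetesVersion" and "count".
    """
    version_changed = current["kubernetesVersion"] != desired["kubernetesVersion"]
    count_changed = current["count"] != desired["count"]

    if version_changed and count_changed:
        # Pass 1: upgrade only, keep the current node count.
        upgrade_only = dict(current, kubernetesVersion=desired["kubernetesVersion"])
        # Pass 2: the full desired state, which now differs only in count.
        return [upgrade_only, dict(desired)]
    if version_changed or count_changed:
        return [dict(desired)]
    return []  # nothing to deploy


# Hypothetical example values.
current = {"kubernetesVersion": "1.27.0", "count": 3}
desired = {"kubernetesVersion": "1.28.0", "count": 5}
for step in plan_passes(current, desired):
    print(step)
```

Each returned dict would feed one template deployment, which is exactly the extra orchestration outside ARM that this issue asks AKS to make unnecessary.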

What you expected to happen: Allow updating node pools using the agentPoolProfiles array of the managedClusters type.

ghost commented 3 years ago

Hi aelij, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2. Please abide by the AKS repo Guidelines and Code of Conduct.
3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 3 years ago

Triage required from @Azure/aks-pm

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

kelly-brown commented 3 years ago

I agree this really needs to be done. I put the idea out on the feedback site. Is there any way we could get an ETA or find out if this is even planned?

https://feedback.azure.com/forums/914020-azure-kubernetes-service-aks/suggestions/43214700-enable-nodepool-image-updates-via-arm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

danimal521 commented 3 years ago

@palma21 can you look into this one?

ghost commented 3 years ago

Triage required from @Azure/aks-pm @palma21

ghost commented 3 years ago

Action required from @palma21, @justindavies, @yizhang4321.

ghost commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

aelij commented 2 years ago

Not stale :)

ghost commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

aelij commented 2 years ago

Still not stale

ghost commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

aelij commented 2 years ago

Not stale

tspearconquest commented 2 years ago

This also affects the azurerm Terraform provider. It seems to be an antipattern to have the node pools as a separate resource rather than a block inside the cluster's resource definition block.

aelij commented 2 years ago

@tspearconquest I don't think the problem is that it's a child resource. The problem is the duality: it's both a child resource and an array property, and you have to keep them in sync. If I could choose, I'd go for something that allows creating a cluster with NO node pools and adding the pools exclusively using child resources. This allows for more granular updates.
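
As a rough illustration of what "keeping them in sync" means in practice, here is a small sketch that compares the two representations inside one template and reports where they disagree. The helper `find_pool_drift`, the template shape and the pool values are hypothetical, and the comparison is naive: it treats a property that is missing on one side as a difference.

```python
CLUSTER_TYPE = "Microsoft.ContainerService/managedClusters"
POOL_TYPE = CLUSTER_TYPE + "/agentPools"


def find_pool_drift(template, cluster_name):
    """Return {pool_name: [property, ...]} for pools that are declared both
    inline (agentPoolProfiles) and as a child resource with different values."""
    inline, children = {}, {}
    for res in template["resources"]:
        if res["type"] == CLUSTER_TYPE and res["name"] == cluster_name:
            for profile in res["properties"].get("agentPoolProfiles", []):
                inline[profile["name"]] = {
                    k: v for k, v in profile.items() if k != "name"
                }
        elif res["type"] == POOL_TYPE and res["name"].startswith(cluster_name + "/"):
            pool_name = res["name"].split("/", 1)[1]
            children[pool_name] = res.get("properties", {})

    drift = {}
    for name in sorted(inline.keys() & children.keys()):
        keys = inline[name].keys() | children[name].keys()
        differing = sorted(
            k for k in keys if inline[name].get(k) != children[name].get(k)
        )
        if differing:
            drift[name] = differing
    return drift


# Hypothetical template where pool "p1" is declared twice with different counts.
example = {
    "resources": [
        {"type": CLUSTER_TYPE, "name": "aks1", "properties": {
            "agentPoolProfiles": [{"name": "p1", "mode": "System", "count": 3}]}},
        {"type": POOL_TYPE, "name": "aks1/p1",
         "properties": {"mode": "System", "count": 5}},
    ]
}
print(find_pool_drift(example, "aks1"))
```

With a cluster that could be created without any inline pools, a check like this would have nothing to compare, because each pool would have exactly one definition.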

tspearconquest commented 2 years ago

Apologies, I didn't realize there was a nuance there that I was missing in my phrasing. Thank you.

On the Terraform side, there is a "default_node_pool" block which is part of the parent resource. Changing anything about that node pool in Terraform code, or changing it manually and then running Terraform, can end up destroying and recreating your cluster. This concept of a "default" node pool doesn't seem to be defined anywhere in Azure, whereas "system" and "user" node pools are well known to me. I agree with you regarding creating a cluster with no node pools, which is why I brought this up.

It definitely seems strange to have a default node pool definition in the cluster resource definition (for Terraform) in the first place, given that there is no "default" in Azure. In fact, I was just able to delete a default node pool using azure-cli the other day and re-link a secondary node pool I had set up as the "default" node pool in my Terraform state file without incurring any downtime.

Of course, that's why I'm here. It was an all-day affair to take care of it, all in order to change 2 settings that can only be modified when a node pool (or cluster) is created.

ghost commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

aelij commented 2 years ago

Not stale

acortelyou commented 2 years ago

Not stale

stack111 commented 2 years ago

@palma21 can you look into this one?

@palma21 any chance we could have some response for this issue?

denniszielke commented 1 year ago

@kaarthis would this be something for the upgrade node pool scenario? It would certainly make upgrading easier.

ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

fschmied commented 1 year ago

Not stale.

ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

fschmied commented 1 year ago

Not stale.

jaumecen commented 1 year ago

Facing this issue right now, 2 years on and still nothing. It's a headache trying to update the cluster's system node pool and the user ones at the same time automatically with an ARM template :/

ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

fschmied commented 1 year ago

Not stale

ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

tspearconquest commented 1 year ago

not stale

tspearconquest commented 8 months ago

not stale

matthchr commented 8 months ago

We're aware of this problem and discussing solutions internally. I don't have an ETA for when a fix will be forthcoming but we're very aware that this breaks certain use-cases. cc @phealy, @bmoore-msft

microsoft-github-policy-service[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

jaumecen commented 6 months ago

not stale

microsoft-github-policy-service[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

tspearconquest commented 4 months ago

not stale

microsoft-github-policy-service[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

fschmied commented 3 months ago

not stale

microsoft-github-policy-service[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

matthchr commented 2 months ago

not stale

microsoft-github-policy-service[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

fschmied commented 2 months ago

Not stale. This would be incredibly useful.

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.