Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Node reboots in single node system nodepool #3012

Closed Freakazoid182 closed 6 months ago

Freakazoid182 commented 2 years ago

What happened:

With the following setup, node reboots get stuck when using a single node system nodepool:

The node reboots do not complete because CoreDNS has a PodDisruptionBudget set with minAvailable: 1. There is only one node where this pod can be scheduled (the single system node), and that is the node that needs to restart, so the pod cannot be evicted without violating the budget.
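To make the blocking condition concrete, here is a minimal sketch of the arithmetic behind a minAvailable budget. It is a simplified model, not the actual Kubernetes controller code; the function name `disruptions_allowed` and the healthy-pod counts are assumptions chosen for illustration.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Simplified model: number of pods that may be evicted voluntarily right now."""
    return max(0, current_healthy - min_available)


# Assumed scenario: one CoreDNS pod is healthy on the only system node.
# The node is cordoned for the reboot, so a replacement pod has nowhere to start.
single_node = disruptions_allowed(current_healthy=1, min_available=1)
print(single_node)  # 0, so the eviction is refused and the drain stalls

# Assumed scenario: a second schedulable node already runs a healthy replica.
two_nodes = disruptions_allowed(current_healthy=2, min_available=1)
print(two_nodes)  # 1, so one pod can be evicted
```

In this model the drain can only proceed once a healthy replica exists somewhere other than the node being rebooted.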

My first thought was to let the cluster autoscaler scale to a maximum of 2 nodes. In theory, a new node would then start, CoreDNS would be rescheduled onto it, and the other node could restart. After a while the autoscaler should remove the restarted node again, as it would be below the utilization threshold. This will not happen, though, because the calico-typha deployment is required to run with 2 replicas, and the replicas cannot run on the same node due to conflicting ports. In other words, setting a maximum of 2 nodes on the autoscaler for the system nodepool will cause the nodepool to always run with 2 nodes.
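The scale-down problem can be sketched as a simple feasibility check. This is only an illustration of the reasoning above, not how the cluster autoscaler is implemented (it also weighs utilization, PodDisruptionBudgets, and more); the function names and parameters are hypothetical.

```python
def min_nodes_for(replicas: int, one_pod_per_node: bool) -> int:
    """Fewest nodes that can host all replicas of a deployment."""
    # A host-port conflict means each replica needs a node of its own.
    return max(1, replicas) if one_pod_per_node else 1


def can_scale_down(node_count: int, replicas: int, one_pod_per_node: bool) -> bool:
    """True if removing one node still leaves room for every replica."""
    return node_count - 1 >= min_nodes_for(replicas, one_pod_per_node)


# calico-typha as described above: 2 replicas that cannot share a node.
print(can_scale_down(node_count=2, replicas=2, one_pod_per_node=True))  # False

# Hypothetical single-replica configuration: the second node could be removed.
print(can_scale_down(node_count=2, replicas=1, one_pod_per_node=True))  # True
```

Under these assumptions the nodepool stays at 2 nodes for as long as calico-typha keeps 2 replicas.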

The CoreDNS PodDisruptionBudget and the calico-typha replica count don't seem to be configurable. Editing these resources directly in the cluster only works temporarily, as AKS eventually overwrites the changes.

Because of the stated reasons, it is currently impossible to run a single node system nodepool with this cluster setup.

In my case, a single node is preferred as it's for a development / test AKS-cluster which doesn't have any HA requirements. Always running an extra node just to support running a single calico-typha pod seems wasteful. The cluster can work just as well with a single calico-typha pod instance.

What you expected to happen:

It should be possible to run a single system node, for example by making the CoreDNS PodDisruptionBudget and / or the number of calico-typha replicas configurable.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

ghost commented 2 years ago

Hi Freakazoid182, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 2 years ago

Triage required from @Azure/aks-pm

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 1 year ago

Action required from @robbiezhang.

robbiezhang commented 1 year ago

@Freakazoid182, can you elaborate on how the "node reboot" is conducted? PDB was introduced in k8s to protect the application from downtime (together with multiple replicas). If the cluster only has 1 node, and you want to reboot it, you don't need to respect the PDB, since rebooting will always cause application downtime.

Freakazoid182 commented 1 year ago

@robbiezhang It has been a while since I worked on this issue, but as far as I remember the reboots were conducted by Kured.

IIRC, Kured logged an error that it could not reboot the node due to the PDB being in place.
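A quick way to confirm that a PDB is what blocks a drain is to look at its `status.disruptionsAllowed` field. The sketch below checks that field on a PDB object shaped like the JSON that `kubectl get pdb <name> -o json` returns. The sample object, including the name `coredns-pdb` and all status values, is a hypothetical example and not output captured from this cluster.

```python
import json

# Hypothetical sample, not taken from the cluster discussed in this issue.
SAMPLE_PDB = json.loads("""
{
  "metadata": {"name": "coredns-pdb", "namespace": "kube-system"},
  "spec": {"minAvailable": 1},
  "status": {
    "currentHealthy": 1,
    "desiredHealthy": 1,
    "disruptionsAllowed": 0,
    "expectedPods": 1
  }
}
""")


def blocks_eviction(pdb: dict) -> bool:
    """True if the PDB currently permits no voluntary disruptions."""
    # A missing status is treated as blocking, which is the cautious reading.
    return pdb.get("status", {}).get("disruptionsAllowed", 0) == 0


name = SAMPLE_PDB["metadata"]["name"]
print(f"{name} blocks eviction: {blocks_eviction(SAMPLE_PDB)}")
```

When the field is 0, eviction requests for pods covered by the budget are rejected, which matches the kind of error described above.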

What I tried was to change the PDB, or remove it. But because AKS manages all the CoreDNS resources, the PDB would eventually be re-installed.

A configuration option in AKS to disable the CoreDNS PDB would solve this issue.

microsoft-github-policy-service[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

microsoft-github-policy-service[bot] commented 9 months ago

Action required from @robbiezhang.

microsoft-github-policy-service[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

microsoft-github-policy-service[bot] commented 6 months ago

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. Freakazoid182, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion.