Closed Freakazoid182 closed 6 months ago
Hi Freakazoid182, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Triage required from @Azure/aks-pm
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Action required from @robbiezhang.
@Freakazoid182, can you elaborate on how the "node reboot" is conducted? The PDB was introduced in Kubernetes to protect the application from downtime (together with multiple replicas). If the cluster only has 1 node and you want to reboot it, you don't need to respect the PDB, since the reboot will always cause application downtime.
@robbiezhang It has been a while since I worked on this issue, but as far as I remember the reboots were conducted by Kured.
IIRC, Kured logged an error that it could not reboot the node due to the PDB being in place.
What I was trying was to change the PDB, or remove it. But because it's managed by AKS, it would eventually be reinstalled, since AKS manages all the CoreDNS resources.
Allowing a configuration option on AKS to disable the CoreDNS PDB would solve this issue.
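Roughly, the kind of edit that was attempted looks like the sketch below. The PDB name and selector here are assumptions and may not match the actual AKS-managed object; the point is only that any such change is reverted once AKS reconciles the resource.

```yaml
# Hypothetical edit to the managed CoreDNS PDB (name and selector are assumptions).
# Switching minAvailable: 1 to maxUnavailable: 1 would let the single replica be
# evicted, but because AKS manages the CoreDNS resources the object is restored
# to its original spec shortly afterwards, so the workaround does not stick.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb          # assumed name of the managed PDB
  namespace: kube-system
spec:
  maxUnavailable: 1          # replaces the managed minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns      # assumed CoreDNS pod selector
```

Applying an edit like this (e.g. with `kubectl apply -f coredns-pdb.yaml`) works briefly, but the managed values come back.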
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. Freakazoid182, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
What happened:
With the following setup, node reboots get stuck when using a single node system nodepool:
The reason the node reboots do not complete is that CoreDNS has a `PodDisruptionBudget` set with `minAvailable: 1`, while there is only one node where this pod can be scheduled (the single system node), and that is the node that requires restarting.

My first thought was to set the cluster autoscaler to a maximum of 2 nodes. Then, technically, a new node would start, CoreDNS could be rescheduled, and the other node would be allowed to restart. After a while the restarted node should be removed again by the autoscaler, being below the utilization threshold. This will not happen, though, because the `calico-typha` deployment requires 2 replicas and its pods cannot run on the same node due to conflicting ports. I.e. setting a maximum of 2 nodes on the autoscaler for the system nodepool will cause the nodepool to always run with 2 nodes.
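For context, the port conflict described above comes from `calico-typha` binding a port on the host, so a second replica cannot be scheduled onto the same node. A rough sketch of such a deployment fragment is shown below; the image, labels, and port number are assumptions, and the actual AKS-managed manifest may differ.

```yaml
# Illustrative fragment only; the real AKS-managed calico-typha deployment,
# image, labels, and port number may differ. The hostPort binding is what
# makes two replicas unschedulable on the same node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: calico-typha
  namespace: kube-system
spec:
  replicas: 2                   # managed replica count, not user-configurable on AKS
  selector:
    matchLabels:
      k8s-app: calico-typha
  template:
    metadata:
      labels:
        k8s-app: calico-typha
    spec:
      containers:
        - name: calico-typha
          image: calico/typha:v3.21.0   # placeholder image tag
          ports:
            - name: calico-typha
              containerPort: 5473
              hostPort: 5473            # only one pod per node can claim this host port
              protocol: TCP
```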
The CoreDNS `PodDisruptionBudget` and the `calico-typha` replica count don't seem to be configurable. Updating the Kubernetes state only works temporarily, as AKS will eventually overwrite these configurations. For these reasons, it is currently impossible to run a single-node system nodepool with this cluster setup.
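As an illustration, the AKS-managed CoreDNS PDB looks roughly like the following sketch; the object name, labels, and selector are assumptions based on typical CoreDNS deployments, and only the `minAvailable: 1` setting matters here.

```yaml
# Approximate shape of the managed CoreDNS PDB (name, labels, selector are assumptions).
# With minAvailable: 1 and a single node able to run CoreDNS, draining that node
# can never satisfy the budget, so the eviction (and therefore the reboot) blocks.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb            # assumed name
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile   # managed add-on; manual edits are reverted (assumption)
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns        # assumed CoreDNS pod selector
```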
In my case, a single node is preferred as this is a development / test AKS cluster which doesn't have any HA requirements. Always running an extra node just to support a single `calico-typha` pod seems wasteful. The cluster can work just as well with a single `calico-typha` pod instance.

What you expected to happen:

Being able to configure the CoreDNS `PodDisruptionBudget` and / or the number of `calico-typha` replicas would make it possible to run a single system node.

How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`): 1.22.4