
# Welcome to the Azure Kubernetes Service enabled by Azure Arc (AKS Arc) repo

This is where the AKS Arc team will track features and issues with AKS Arc. We will monitor this repo in order to engage with our community and discuss questions, customer scenarios, or feature requests. Check out our projects tab to see the roadmap for AKS Arc!

Set AntiAffinityClassNames on AKS-HCI VMs #281

Open nmdange2 opened 1 year ago

nmdange2 commented 1 year ago

Title: Set AntiAffinityClassNames on AKS-HCI VMs

Description: This is a feature request to make use of Hyper-V's built-in anti-affinity rules to ensure AKS VMs do not run on the same host. I have a 4-node AKS-HCI cluster, and I noticed on several occasions that multiple VMs within the same node pool ended up on the same physical host. This is not ideal for HA. If all the worker nodes in a pool are running on the same physical host, and that physical host goes down, then all workloads in that pool also go down. Pods with multiple replicas should run on separate physical hosts where possible.

Each control plane and worker node pool should have a unique anti-affinity class name assigned so that the Failover Cluster will ensure the VMs run on different physical hosts, while VMs in different node pools can still share a host. The class name could be computed from the AKS cluster name plus the worker node pool name (or "controlplane" for the control plane VMs).
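A minimal sketch of the proposed naming scheme (the function name and the `<cluster>-<pool>` format are illustrative assumptions, not an existing AKS Arc API):

```python
def anti_affinity_class_name(cluster_name: str, node_pool: str = "controlplane") -> str:
    """Compute a unique anti-affinity class name per node pool.

    VMs that share a class name are kept on different physical hosts by the
    Failover Cluster where possible; VMs with different class names (i.e. in
    different node pools) may still share a host. Control plane VMs use the
    reserved "controlplane" pool name.
    """
    return f"{cluster_name}-{node_pool}"

# Two pools in the same cluster get distinct class names,
# so only VMs within the same pool repel each other.
print(anti_affinity_class_name("aks-cluster1", "nodepool1"))  # aks-cluster1-nodepool1
print(anti_affinity_class_name("aks-cluster1"))               # aks-cluster1-controlplane
```

On the Windows Server side, the computed value would then be assigned to each clustered VM's `AntiAffinityClassNames` property (e.g. via PowerShell, `(Get-ClusterGroup $vmName).AntiAffinityClassNames = $className`), per the linked Failover Clustering documentation.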

Description of Anti-affinity feature: https://learn.microsoft.com/en-us/windows-server/failover-clustering/cluster-affinity

Elektronenvolt commented 1 year ago

@baziwane - this is what we discussed a while ago for control plane and load balancer nodes (Kubernetes cluster-critical nodes). @nmdange2 - Doing this for node pools as well is a good idea in our case too. Our physical Hyper-V nodes can host a lot of VMs, so if we create a small "special purpose" (labeled) node pool, all of its VMs may end up on the same physical node. If that physical node fails, the deployed application can't be rescheduled, because no other nodes in the cluster carry that label.

baziwane commented 1 year ago

That's correct - issue #76. We are still tracking this on the roadmap.