Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] Migration from Azure CNI+Dynamic IP Allocation to CNI+Overlay #4077

Open sleepy-manul opened 10 months ago

sleepy-manul commented 10 months ago

Is your feature request related to a problem? Please describe.

We currently run multiple AKS clusters (some of them in production), all of which use the network configuration "Azure CNI with Dynamic IP Allocation" (a dedicated Azure subnet for nodes and a dedicated Azure subnet for pods). We want to migrate to Azure CNI+Cilium with overlay networking. However, as of today (2024-01-29), such a change is not possible; the cluster has to be torn down and rebuilt. Although we have automated the configuration to a high degree, this still means considerable risk and downtime for production systems. Even if the AKS cluster had to be stopped to make such a configuration change, that would still be preferable to rebuilding from scratch.

Describe the solution you'd like

az aks update --name my-aks-cluster --network-plugin-mode overlay --pod-cidr "192.168.0.0/17"

works on clusters that are currently running with dedicated node and pod subnets.
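For readers sizing a similar request, here is a minimal Python sketch of what the proposed pod CIDR would provide. It assumes, based on the Azure CNI Overlay documentation as I understand it, that each node is assigned one /24 block carved out of the pod CIDR; the variable names are illustrative only and nothing here calls an Azure API.

```python
import ipaddress

# Pod CIDR taken from the command in the request above.
pod_cidr = ipaddress.ip_network("192.168.0.0/17")

# Assumption: overlay mode hands each node one /24 out of the pod CIDR.
NODE_PREFIX = 24

# Split the pod CIDR into per-node blocks.
node_blocks = list(pod_cidr.subnets(new_prefix=NODE_PREFIX))

total_addresses = pod_cidr.num_addresses
max_nodes = len(node_blocks)

print(f"pod CIDR {pod_cidr}: {total_addresses} addresses")
print(f"room for {max_nodes} nodes at one /{NODE_PREFIX} each")
print(f"first node block: {node_blocks[0]}, last node block: {node_blocks[-1]}")
```

Under that assumption, the /17 range limits the cluster by node count rather than by VNet pod addresses, since none of these addresses come out of an Azure subnet.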

Describe alternatives you've considered

Spending many potentially unnecessary work hours (estimated: 100+) on testing, preparing, and reinstalling our AKS clusters, including coordinating maintenance windows, preparing users for downtime, etc.

Additional context

N/A

paulgmiller commented 1 month ago

AKS is considering this, so upvote if you want it.

Also take a look at https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni-static-block-allocation and tell us why that doesn't meet your needs, as we're still trying to get feedback there (dual-stack is missing there, for example).

sleepy-manul commented 1 month ago

First of all, thank you for taking the time to work on an old issue ticket; it is really appreciated. To answer your inquiry:

The alternative (static block allocation) would not work for us. Due to corporate policies, the number of private IPs (from Azure VNets) that we can use for AKS overall is severely restricted, so making efficient use of them through dynamic allocation is crucial for us. However, we can use an arbitrary range of private IP addresses that are not bound to Azure VNets/subnets. Consequently, Azure CNI+Cilium would solve all our problems, but we do need a migration path. Destroying and recreating the cluster, especially the production cluster, is not feasible: even with full automation (Terraform etc.), the downtime would be too long for a system that is mission-critical for the customer.
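To illustrate the point about using a range that is not bound to any VNet, the following is a small Python sketch that checks a candidate overlay pod CIDR against existing VNet ranges. It assumes the overlay pod CIDR must not overlap with the cluster's VNet or peered networks; the function name `overlapping_ranges` and the example VNet ranges are hypothetical and do not come from this issue.

```python
import ipaddress

def overlapping_ranges(pod_cidr, vnet_ranges):
    """Return the VNet ranges that overlap with the proposed pod CIDR."""
    pod_net = ipaddress.ip_network(pod_cidr)
    return [r for r in vnet_ranges if pod_net.overlaps(ipaddress.ip_network(r))]

# Hypothetical VNet and peered-network ranges, for illustration only.
vnet_ranges = ["10.20.0.0/22", "10.20.4.0/24"]

# Pod CIDR from the original request.
conflicts = overlapping_ranges("192.168.0.0/17", vnet_ranges)
print(conflicts if conflicts else "no overlap with the listed VNet ranges")
```

An empty result only says the candidate range is free of the listed ranges; on-premises or other routed networks would need to be added to the list to be covered.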