Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] AKS APIServer public IP designated as External Public IP by Azure networking #4422

Open zmalik opened 1 month ago

zmalik commented 1 month ago

Describe the bug First of all, I'm not sure how helpful it will be to open this issue in this repository or whether it should be handled by Azure Networking. However, I believe all AKS customers could be affected by this, so perhaps the AKS team can drive this potential issue to a resolution or explain this behavior.

We are observing that in our Azure AKS cluster, the AKS API Server Public IP is being designated as ExternalPublic by Azure Networking. This means that if a NAT gateway is attached to the subnet where AKS nodes are running, all communication between nodes and the AKS API Server is charged as if the control plane is running on AWS, GCP, or any other external infrastructure.

Practically, this means that as we add more load (nodes/pods/operators) to our AKS cluster or use API-intensive features like Azure NPM, the cost of operating a simple AKS cluster can become unmanageable. The communication between the cluster and the control plane is charged at $0.045 per GB.

In the default AKS setup, an AKS mutating webhook also injects the public FQDN as environment variables into all pods in the kube-system namespace; pods in all other namespaces use the in-cluster private IP address. This makes controlling this traffic cost challenging for users, as Azure does not provide much flexibility to reduce the traffic. Given that only users with significant load in the AKS cluster will attach a NAT gateway (the default egress is insufficient under heavy load), this significantly increases the chances of overcharging. In our environment, a single node receives between 6 GiB and 13 GiB of traffic every hour from the API server. In a cluster with 250 nodes, this translates to a substantial amount of traffic and associated cost.
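As a concrete way to see the webhook behavior described above, here is a minimal sketch using the kubernetes Python client. It assumes the injected variable carrying the public FQDN is KUBERNETES_SERVICE_HOST; adjust if your cluster injects a different variable:

```python
# Sketch: compare the API server host injected into kube-system pods vs. other
# namespaces. Assumes kubeconfig access and the `kubernetes` Python client;
# the variable name KUBERNETES_SERVICE_HOST is an assumption about which
# injected variable carries the public FQDN.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for ns in ("kube-system", "default"):
    for pod in v1.list_namespaced_pod(ns).items:
        for container in pod.spec.containers:
            for env in container.env or []:
                if env.name == "KUBERNETES_SERVICE_HOST":
                    # kube-system pods are expected to show the public FQDN;
                    # pods in other namespaces typically have no override and
                    # fall back to the in-cluster service IP.
                    print(f"{ns}/{pod.metadata.name}: {env.value}")
```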

These costs can quickly get out of hand if you run Azure NPM on a big cluster or scale nodes and pods towards 1,000 nodes in an AKS cluster. And if you run tens of AKS clusters, it all adds up.
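To put rough numbers on this, here is a back-of-the-envelope estimate based only on the figures above. Treat it as illustrative, since it assumes all of the API server traffic is processed by the NAT Gateway:

```python
# Back-of-the-envelope estimate using only figures quoted in this issue:
# 6-13 GiB/hour of API server traffic per node, 250 nodes, and $0.045 per GB
# of NAT Gateway data processing. Assumes all of that traffic is processed
# by the NAT Gateway, which is the scenario described above.
GIB_TO_GB = 1.073741824            # 1 GiB expressed in GB
per_node_gib_per_hour = 10         # midpoint of the observed 6-13 GiB/hour
nodes = 250
price_per_gb = 0.045               # USD per GB of NAT Gateway data processing
hours_per_month = 730

monthly_gb = per_node_gib_per_hour * GIB_TO_GB * nodes * hours_per_month
monthly_cost = monthly_gb * price_per_gb
print(f"~{monthly_gb:,.0f} GB/month -> ~${monthly_cost:,.0f}/month for this traffic alone")
```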

Bear in mind this is all just kube-system and systemd units (node-problem-detector/kubelet) communicating with the API server public IP. The rest of the operators just use kubernetes.default.svc.cluster.local, which resolves to the internal IP address.
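A quick way to see the two paths from inside a pod is to resolve both names and check which one is private. A sketch; the API server FQDN below is a placeholder for your cluster's actual hostname:

```python
# Sketch: from inside a pod, resolve the in-cluster service name and the
# cluster's API server FQDN and classify each address. The FQDN below is a
# placeholder; substitute your cluster's actual API server hostname.
import ipaddress
import socket

names = [
    "kubernetes.default.svc.cluster.local",   # in-cluster service, private IP
    "<your-cluster>.hcp.<region>.azmk8s.io",  # public API server FQDN (placeholder)
]

for name in names:
    try:
        ip = socket.gethostbyname(name)
        kind = "private" if ipaddress.ip_address(ip).is_private else "public"
        print(f"{name} -> {ip} ({kind})")
    except socket.gaierror as err:
        print(f"{name}: resolution failed ({err})")
```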

To Reproduce

Here are a few diagrams of the current behavior we see.

image

Once we divert the traffic through some other route:

image

we see a direct reduction in the NAT Gateway cost, and the metrics also show a significant decrease.

image

Expected behavior Just like storage accounts or managed databases, traffic to a public IP in the same region should not go through the NAT Gateway, since it is designated AzurePublic.

We have tested this behavior: without Private Link, uploading a file of hundreds of GiB does not go through the NAT Gateway. If we upload a file to a storage account in another region, the traffic is also designated as AzurePublic in the flow logs, but it does go through the NAT Gateway, which is fully understandable since we are dealing with inter-region traffic.
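For reference, the storage side of that test can be reproduced with something like the sketch below; the connection string, container, and file path are placeholders, and the before/after comparison is made against the NAT Gateway's data-processed metrics:

```python
# Sketch of the same-region storage test: upload a large blob to a storage
# account's public endpoint (no Private Link) and compare NAT Gateway
# data-processed metrics before and after. Connection string, container,
# and file path are placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="nat-test",
    blob_name="large-test-file.bin",
)

with open("/tmp/large-test-file.bin", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Expectation per the test described above: for a same-region storage account
# the NAT Gateway metrics barely move; for a cross-region account the upload
# shows up as processed data.
```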

I'm aware of the private AKS cluster setup, but it feels unfair that this additional charge is not listed for the default AKS cluster setup.

Environment (please complete the following information):

Additional context Some additional behavior we have seen in the classification of AzurePublic vs. ExternalPublic:

image

We do see that the API server public IP of another cluster in the same region is identified as AzurePublic, and we are not charged for that traffic. So nodes talking to their own API server are charged extra, yet traffic to another AKS API server running in the same Azure region is free.
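For anyone who wants to reproduce the classification observation, here is a sketch that groups flow-log bytes by destination IP and flow type via the azure-monitor-query client. The table and field names (AzureNetworkAnalytics_CL, FlowType_s, DestIP_s, and the byte columns) are assumptions based on the standard Traffic Analytics schema; adjust them to what your workspace actually contains:

```python
# Sketch: group flow-log bytes by destination IP and FlowType (AzurePublic vs.
# ExternalPublic) from a Log Analytics workspace with Traffic Analytics enabled.
# The workspace ID is a placeholder; table/field names assume the standard
# Traffic Analytics schema.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

logs_client = LogsQueryClient(DefaultAzureCredential())

query = """
AzureNetworkAnalytics_CL
| where SubType_s == "FlowLog"
| summarize TotalBytes = sum(InboundBytes_d + OutboundBytes_d) by DestIP_s, FlowType_s
| order by TotalBytes desc
"""

response = logs_client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```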

There has been an Azure Support ticket ongoing for several weeks, and the first impression is that this is the desired behavior. I would like to hear the AKS engineering team's opinion on it.

eyltl commented 1 month ago

You sure about the nat cost & azure storage?

https://learn.microsoft.com/en-us/answers/questions/1084416/same-region-rounting-to-storage-account-from-subne

zmalik commented 1 month ago

You sure about the nat cost & azure storage?

yes, very sure. As in we tested it.

Please take a look at this recent link from May of this year: https://learn.microsoft.com/en-us/answers/questions/1663216/traffic-path-between-azure-storage-account-and-azu

here they mention:

Azure VM to Azure Storage in the Same Region: If private endpoints are not used, the Azure VM will connect to the storage account's public endpoint. However, even though it's a public endpoint, the traffic between the Azure VM and the storage account, when both are in the same region, typically remains within Microsoft's Azure internal network, not traversing the public internet. This setup leverages the Azure network, optimizing for security and performance within the same regional infrastructure.

kamilzzz commented 1 month ago

Storage is a special kind of service where the traffic path may indeed be different, as far as I know.

For network flows targeting public IPs in the same region, traffic definitely traverses the NAT Gateway; one use case for NAT Gateway is avoiding SNAT exhaustion, and we were using it for exactly that when targeting managed databases.

zmalik commented 1 month ago

some updates:

This aligns with our observed NAT Gateway cost increases. AKS users globally following best practices with managed NAT Gateway may have experienced unexpected cost increases after this change.

The investigation into the high traffic volume between the AKS control plane and the nodes continues.

kevinnowland commented 1 month ago

@zmalik Do you have documentation describing the fix for NAT Gateway Metrics? Thank you

zmalik commented 1 month ago

@kevinnowland unfortunately not. I am not aware of any public documentation for that fix.

zmalik commented 5 days ago

So most of the traffic is coming from the operators. Operators such as istio, kube-state-metrics, argoCD, or vertical-pod-autoscaler take the lead. Managed add-ons such as ama-metrics are also present. This is understandable, given that all of these operators watch a significant number of objects.

This also matches the traffic pattern where API server to node inbound traffic is 10x the outbound traffic. This was also verified by coroot. I did a second check by writing a small node agent which does the following:

I did this clunky manual verification step to fully rule out any errors in our analysis. I honestly feel that all of this traffic should not be charged by Azure by default, as it comes from the Kubernetes watch pattern.
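To illustrate the watch pattern being referred to, here is a minimal sketch (only an illustration of the pattern, not the node agent mentioned above): a single operator-style watch holds a long-lived connection, and the API server streams every change to the watched objects down to the client, which is exactly the API server to node direction that dominates here.

```python
# Minimal illustration of the Kubernetes watch pattern: the client opens one
# long-lived request and the API server streams object changes to it, so the
# inbound (API server -> client) volume grows with cluster size and churn.
import json

from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
bytes_seen = 0
for event in w.stream(v1.list_pod_for_all_namespaces, timeout_seconds=60):
    # Each event carries the full object, so busy clusters push a lot of data.
    obj = event["object"]
    bytes_seen += len(json.dumps(event["raw_object"]))
    print(f"{event['type']} {obj.metadata.namespace}/{obj.metadata.name}")
print(f"approximate watch payload observed in 60s: {bytes_seen} bytes")
```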