Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Scale out with NAP recently started failing #4542

Open vikas-rajvanshy opened 1 month ago

vikas-rajvanshy commented 1 month ago

Describe the bug
NodeClaims created by NAP are not launching, which causes scale-out to fail. This seems to be a recent regression; describing the NodeClaim shows this message:

{ "error": { "code": "MissingApiVersionParameter", "message": "The api-version query parameter (?api-version=) is required for all requests." } }

To Reproduce
Repros consistently on one of our clusters, but not the other. Perhaps this regression is starting to roll out.

Create a workload that needs to add nodes and uses NAP.
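A hedged sketch of such a workload (the deployment name, image, and resource sizes below are illustrative, not from the report): an idle deployment with requests large enough that the pending pods cannot fit on existing nodes, forcing NAP to provision NodeClaims.

kubectl create deployment inflate --image=mcr.microsoft.com/oss/kubernetes/pause:3.6 --replicas=0
kubectl set resources deployment inflate --requests=cpu=1,memory=1Gi
kubectl scale deployment inflate --replicas=20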

You will see the following message, but the node is never added to the cluster successfully. [Pod should schedule on: nodeclaim/default-x7kct]

kubectl describe nodeclaim -n kube-system

RESPO...
    Reason:                LaunchFailed
    Status:                False
    Type:                  Launched
    Last Transition Time:  2024-09-11T17:44:19Z
    Message:               Node not launched
    Reason:                NotLaunched
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-09-11T17:44:19Z
    Message:               Node not launched
    Reason:                NotLaunched
    Status:                False
    Type:                  Registered
Events:
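To check the same conditions across all claims, something like the following should work (hedged; NodeClaims are cluster-scoped in Karpenter/NAP, so the namespace flag is not strictly needed):

kubectl get nodeclaims
kubectl get nodeclaim default-x7kct -o jsonpath='{.status.conditions}'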

Expected behavior
Nodes launch and scale out the workload as expected.


justindavies commented 1 month ago

@tallaxes @Bryce-Soghigian

Bryce-Soghigian commented 1 month ago

I searched the logs based on the NodeClaim you provided and found this error message on the PUT for the network interface:

\"code\": \"CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool\",\n    \"message\": \"Mixing backend ipconfigurations and IPAddresses in backend pool /subscriptions//resourceGroups//providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes is not allowed.\"

vikas-rajvanshy commented 1 month ago

Thanks for looking this up, Bryce. What could cause this to happen - is there a setting in AKS that could trigger it?

tallaxes commented 1 month ago

Is this cluster (possibly unlike others) using IP-based SLB?

vikas-rajvanshy commented 1 month ago

I'm using a common Bicep file to provision both of my clusters, so they should have the same settings. I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this. The cluster also uses the Istio mesh and ingress gateway.
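A hedged way to confirm what each cluster actually ended up with (the query path is an assumption based on the AKS load balancer profile schema):

az aks show -g <resource-group> -n <cluster-name> --query networkProfile.loadBalancerProfile.backendPoolType -o tsv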

tallaxes commented 1 month ago

I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this

That's what I suspect

vikas-rajvanshy commented 1 month ago

Thanks for the suggestion - I'll try turning it off later this evening to see if it mitigates the issue.
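For reference, a hedged sketch of switching the setting back with the Azure CLI (flag name and values are taken from the AKS load balancer docs and may require a recent CLI version):

az aks update -g <resource-group> -n <cluster-name> --load-balancer-backend-pool-type nodeIPConfiguration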

vikas-rajvanshy commented 1 month ago

I tried the mitigation - applying the fix required me to tear down and rebuild the cluster. It seemed to be working fine for 3-4 days and then I ran into a similar set of symptoms again this morning. The logs look different this time though.

NodeClaims fail with:

Any ideas? Could this be related to https://github.com/Azure/AKS/issues/4545?

CCOLLOT commented 1 month ago

The only way to find out if it's related to the other issue is to either:

Node not registered / not found issues are often related to a connectivity problem between the node's kubelet and the API server. I would suggest making sure your firewall rules allow this traffic. Looking at the kubelet logs gives the answer most of the time.
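A hedged sketch of one way to pull kubelet logs from a node that never registered (assumes the NodeClaim is backed by a standalone VM in the node resource group and that run-command is permitted):

az vm run-command invoke -g <node-resource-group> -n <vm-name> --command-id RunShellScript --scripts "journalctl -u kubelet --no-pager | tail -n 200"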