Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Scale out with NAP recently started failing #4542

Open vikas-rajvanshy opened 1 month ago

vikas-rajvanshy commented 1 month ago

Describe the bug
NodeClaims created by NAP are not launching, which causes scale-out to fail. This seems to be a recent regression; describing the NodeClaim shows this message:

{ "error": { "code": "MissingApiVersionParameter", "message": "The api-version query parameter (?api-version=) is required for all requests." } }

To Reproduce
Repros consistently on one of our clusters, but not the other. Perhaps this regression is starting to roll out.

Create a workload that needs to add nodes and uses NAP.
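A hedged sketch of such a workload (the deployment name, image, and resource sizes below are illustrative, not from the report): an idle deployment with requests large enough that the pending pods cannot fit on existing nodes, forcing NAP to provision NodeClaims.

kubectl create deployment inflate --image=mcr.microsoft.com/oss/kubernetes/pause:3.6 --replicas=0
kubectl set resources deployment inflate --requests=cpu=1,memory=1Gi
kubectl scale deployment inflate --replicas=20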

You will see the following message, but the node is never added to the cluster successfully. [Pod should schedule on: nodeclaim/default-x7kct]

kubectl describe nodeclaim -n kube-system

RESPO...
    Reason:                LaunchFailed
    Status:                False
    Type:                  Launched
    Last Transition Time:  2024-09-11T17:44:19Z
    Message:               Node not launched
    Reason:                NotLaunched
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-09-11T17:44:19Z
    Message:               Node not launched
    Reason:                NotLaunched
    Status:                False
    Type:                  Registered
Events:
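To check the same conditions across all claims, something like the following should work (hedged; NodeClaims are cluster-scoped in Karpenter/NAP, so the namespace flag is not strictly needed):

kubectl get nodeclaims
kubectl get nodeclaim default-x7kct -o jsonpath='{.status.conditions}'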

Expected behavior
Nodes launch and scale out the workload as expected.


justindavies commented 1 month ago

@tallaxes @Bryce-Soghigian

Bryce-Soghigian commented 1 month ago

I searched the logs based on the NodeClaim you provided and found this error message on the PUT for the network interface:

\"code\": \"CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool\",\n    \"message\": \"Mixing backend ipconfigurations and IPAddresses in backend pool /subscriptions//resourceGroups//providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes is not allowed.\"

vikas-rajvanshy commented 1 month ago

Thanks for looking this up, Bryce. What could cause this to happen - is there a setting in AKS that could trigger it?

tallaxes commented 1 month ago

Is this cluster (possibly unlike others) using IP-based SLB?

vikas-rajvanshy commented 1 month ago

I'm using a common Bicep file to provision both of my clusters, so they should have the same settings. I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this. The cluster also uses the Istio mesh and ingress gateway.
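A hedged way to confirm what each cluster actually ended up with (the query path is an assumption based on the AKS load balancer profile schema):

az aks show -g <resource-group> -n <cluster-name> --query networkProfile.loadBalancerProfile.backendPoolType -o tsv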

tallaxes commented 1 month ago

I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this

That's what I suspect

vikas-rajvanshy commented 1 month ago

Thanks for the suggestion - I'll try turning it off later this evening to see if it mitigates the issue.
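For reference, a hedged sketch of switching the setting back with the Azure CLI (flag name and values are taken from the AKS load balancer docs and may require a recent CLI version):

az aks update -g <resource-group> -n <cluster-name> --load-balancer-backend-pool-type nodeIPConfiguration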

vikas-rajvanshy commented 1 month ago

I tried the mitigation - applying the fix required me to tear down and rebuild the cluster. It seemed to be working fine for 3-4 days and then I ran into a similar set of symptoms again this morning. The logs look different this time though.

NodeClaims fail with:

Any ideas? Could this be related to https://github.com/Azure/AKS/issues/4545?

CCOLLOT commented 1 month ago

The only way to find out if it's related to the other issue is to either:

Node not registered / not found issues are often related to a connectivity problem between the node's kubelet and the API server. I would suggest making sure your firewall rules allow this traffic. Looking at the kubelet logs gives the answer most of the time.
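A hedged sketch of one way to pull kubelet logs from a node that never registered (assumes the NodeClaim is backed by a standalone VM in the node resource group and that run-command is permitted):

az vm run-command invoke -g <node-resource-group> -n <vm-name> --command-id RunShellScript --scripts "journalctl -u kubelet --no-pager | tail -n 200"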