Open vikas-rajvanshy opened 1 month ago
@tallaxes @Bryce-Soghigian
I searched the logs based on the NodeClaim you provided and found this error message on the PUT for the network interface:
\"code\": \"CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool\",\n \"message\": \"Mixing backend ipconfigurations and IPAddresses in backend pool /subscriptions/
Thanks for looking this up Bryce. What could cause this to happen - is there a setting in AKS that could cause this?
Is this cluster (possibly unlike others) using IP-based SLB?
I'm using a common Bicep file to provision both of my clusters, so they should both have the same settings. I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this. The cluster also uses the Istio mesh and ingress gateway.
I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this
That's what I suspect
Thanks for the suggestion - I'll try turning it off later this evening to see if it mitigates the issue.
I tried the mitigation - applying the fix required me to tear down and rebuild the cluster. It seemed to be working fine for 3-4 days and then I ran into a similar set of symptoms again this morning. The logs look different this time though.
NodeClaims fail with:
Any ideas? Could this be related to https://github.com/Azure/AKS/issues/4545?
The only way to find out if it's related to the other issue is to either:
"Node not registered" / "not found" issues are often related to a connectivity problem between the node's kubelet and the API server. I would suggest making sure your firewall rules allow this traffic. Looking at the kubelet's logs gives the answer most of the time.
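As a rough first check of the connectivity point above, a plain TCP probe can be run from a machine in the node's subnet. This is only a sketch: the helper name `can_reach` is hypothetical, port 443 is an assumption about where the API server listens, and a successful connect shows only that TCP is not blocked. It does not validate TLS, authentication, or anything the kubelet does after connecting.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only tests TCP reachability, not TLS or auth.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `can_reach("<api-server-fqdn>")` with the cluster's own API server address returns `False` when a firewall drops or rejects the traffic; the kubelet logs are still the better source for the actual cause.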
Describe the bug NodeClaims created by NAP are not launching, which causes scale-out to fail. This seems to be a recent regression. Describing the NodeClaim leads to this message:
{ "error": { "code": "MissingApiVersionParameter", "message": "The api-version query parameter (?api-version=) is required for all requests." } }
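When triaging several of these responses, it can help to pull out just the `code` and `message` fields. The sketch below assumes the body always has the nested `error.code` / `error.message` shape shown above; `extract_error` is a hypothetical helper, not part of any Azure SDK.

```python
import json
from typing import Tuple


def extract_error(body: str) -> Tuple[str, str]:
    """Return (code, message) from an error body shaped like the one above.

    Missing fields come back as empty strings rather than raising.
    """
    payload = json.loads(body)
    err = payload.get("error", {})
    return err.get("code", ""), err.get("message", "")


# The error body quoted in this issue.
raw = (
    '{ "error": { "code": "MissingApiVersionParameter", '
    '"message": "The api-version query parameter (?api-version=) '
    'is required for all requests." } }'
)

code, message = extract_error(raw)
print(code)
print(message)
```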
To Reproduce Repros consistently on one of our clusters, but not the other. Perhaps this regression is starting to roll out.
Create a workload that needs to add nodes and uses NAP.
You will see the following message, but the node is never added to the cluster successfully. [Pod should schedule on: nodeclaim/default-x7kct]
kubectl describe nodeclaim -n kube-system
RESPO...
Reason: LaunchFailed
Status: False
Type: Launched
Last Transition Time: 2024-09-11T17:44:19Z
Message: Node not launched
Reason: NotLaunched
Status: False
Type: Ready
Last Transition Time: 2024-09-11T17:44:19Z
Message: Node not launched
Reason: NotLaunched
Status: False
Type: Registered
Events:
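To compare the conditions across several NodeClaims, the same information can be read from JSON instead of the describe output. This sketch assumes the object was fetched with `kubectl get nodeclaim <name> -o json` and that its `status.conditions` entries use the usual Kubernetes field names (`type`, `status`, `reason`, `message`). `failed_conditions` is a hypothetical helper, and the sample data only mirrors the values shown above, with the truncated message left out.

```python
from typing import Any, Dict, List


def failed_conditions(nodeclaim: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return every status condition whose status is not "True"."""
    conditions = nodeclaim.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Sample mirroring the conditions reported in this issue.
sample = {
    "status": {
        "conditions": [
            {"type": "Launched", "status": "False", "reason": "LaunchFailed"},
            {"type": "Ready", "status": "False", "reason": "NotLaunched",
             "message": "Node not launched"},
            {"type": "Registered", "status": "False", "reason": "NotLaunched",
             "message": "Node not launched"},
        ]
    }
}

for cond in failed_conditions(sample):
    print(cond["type"], cond["reason"])
```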
Expected behavior Nodes launch and scale out the workload as expected.
Environment (please complete the following information):