Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] UpdateLoadBalancerFailed exceptions as node names generated by AKS are not compatible #4520

Open vikas-rajvanshy opened 2 months ago

vikas-rajvanshy commented 2 months ago

Describe the bug
Events spam the event log with an error (screenshot attached). It suggests that the node names generated by AKS do not conform to Azure naming conventions, even though these nodes are created by AKS itself.

Error updating load balancer with new hosts [aks-default-jh7bj aks-default-l4wnl aks-default-s82r4 aks-default-wj8wg aks-systempool-27185127-vmss000000 aks-systempool-27185127-vmss000001] [node names limited, total number of nodes: 6], error: bi.EnsureHostsInPool: failed to update backend pool kubernetes: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {"error":{"code":"InvalidResourceName","message":"Resource name is invalid. The name can be up to 80 characters long. It must begin with a word character, and it must end with a word character or with '_'. The name may contain word characters or '.', '-', '_'.","details":[]}} source: component: service-controller
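The name rule quoted in the error can be approximated as a regular expression (an illustrative sketch, not Azure's exact validator): up to 80 characters, beginning and ending with a word character, with word characters, '.', '-', or '_' in between. An empty name, like the one in the PUT body later in this thread, fails this check:

```python
import re

# Approximation of the Azure InvalidResourceName rule quoted in the error:
# up to 80 chars, begins with a word character, ends with a word character
# (the '_' in the message is already a word character), middle may contain
# word characters or '.', '-', '_'.
AZURE_NAME_RE = re.compile(r"^\w([\w.-]{0,78}\w)?$")

def is_valid_backend_name(name: str) -> bool:
    """Return True if `name` passes the approximated Azure naming rule."""
    return AZURE_NAME_RE.fullmatch(name) is not None
```

Here `is_valid_backend_name("aks-systempool-27185127-vmss000000")` is true, while an empty string or a name starting with '-' is rejected.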

To Reproduce
Steps to reproduce the behavior:
1. Update to 1.30.3 with the Istio add-on enabled
2. View events in the AKS blade

Expected behavior
The event should not occur.

Screenshots
(screenshot of the event attached)



biefy commented 2 months ago

@feiskyer, this seems to be related to the cloud provider. Notice the last, empty entry in loadBalancerBackendAddresses:

PUT https://eastus.network.azure.com.../providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes?api-version=2022-07-01
{
    "id": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes",
    "name": "kubernetes",
    "properties": {
        "loadBalancerBackendAddresses": [
            {
                "name": "aks-systempool-27185127-vmss000001",
                "properties": {
                    "ipAddress": "10.224.0.<x>"
                }
            },
            {
                "name": "aks-systempool-27185127-vmss000000",
                "properties": {
                    "ipAddress": "10.224.0.<y>"
                }
            },

>             {
>                 "name": "",
>                 "properties": {
>                     "ipAddress": ""
>                 }
>             }

        ],
        "virtualNetwork": {
            "id": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>"
        }
    }
}
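A client-side mitigation would be to drop backend address entries whose name or IP is empty before issuing the PUT. A minimal sketch (a hypothetical Python helper, not the actual cloud-provider code, which is written in Go) of that filtering:

```python
def prune_backend_addresses(addresses):
    """Drop loadBalancerBackendAddresses entries with an empty name or
    ipAddress, which Azure rejects with InvalidResourceName (HTTP 400).

    `addresses` is a list of dicts shaped like the PUT body above:
    {"name": ..., "properties": {"ipAddress": ...}}.
    """
    return [
        a for a in addresses
        if a.get("name") and a.get("properties", {}).get("ipAddress")
    ]
```

Applied to the body above, this would remove only the final empty entry and keep both vmss nodes.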
feiskyer commented 2 months ago

This happened because cloud-node-manager was scheduled onto the node about 10 minutes late, which left the Node's internal IP empty at the time of the update. The retries after that successfully reconciled the service. Please ensure the kube-system Pods (including cloud-node-manager) are scheduled in time to avoid such issues.
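The symptom described above can be spotted by checking each Node's status.addresses for a missing InternalIP. A minimal sketch (a hypothetical helper, assuming address lists extracted from `kubectl get nodes -o json`):

```python
def nodes_missing_internal_ip(nodes):
    """Given (node_name, addresses) pairs, where `addresses` is the Node's
    status.addresses list, return names of nodes that have no InternalIP
    yet — the condition that produced the empty backend pool entry."""
    missing = []
    for name, addresses in nodes:
        has_ip = any(
            a.get("type") == "InternalIP" and a.get("address")
            for a in addresses
        )
        if not has_ip:
            missing.append(name)
    return missing
```

An empty result means every node has an InternalIP and the service controller should be able to build a valid backend pool.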

vikas-rajvanshy commented 2 months ago

Thanks for looking at this @feiskyer - the events seem to have self-resolved. Are there any settings I should configure as an AKS user to ensure kube-system pods are scheduled on time?