Azure / azure-container-networking

Azure Container Networking Solutions for Linux and Windows Containers
MIT License
376 stars 236 forks source link

CNI reports "No available addresses" due to orphan endpoints in azure-vnet.json #831

Closed zhiweiv closed 2 years ago

zhiweiv commented 3 years ago

What happened: Pods hang at ContainerCreating, the error is Failed to allocate pool: Failed to delegate: Failed to allocate address: No available addresses.

When this happen, there are 30 endpoints in azure-vnet.json, but the corresponding pods were not found. For the orphan endpoints, I found following error in azure cni logs. I aslo found multiple DEL calls in kubelet logs.

Seems the CNI failed to release endpoint, but the kubelet still deleted the pod after a few retries.

2021/03/24 00:30:24 [15540] [cni-net] Plugin azure-vnet version v1.2.2.
2021/03/24 00:30:24 [15540] [cni-net] Running on windows
2021/03/24 00:30:24 [15540] [net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:22:48:55:d7:21 Flags:up|broadcast|multicast} with IP: [fe80::8154:2d41:f3f:2011/64 10.6.0.4/16]
2021/03/24 00:30:24 [15540] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2021/03/24 00:30:24 [15540] [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:73:c3:ac Flags:up|broadcast|multicast} with IP: [fe80::9d95:5b76:3fa9:1a25/64 172.28.240.1/20]
2021/03/24 00:30:24 [15540] Formatted Boot time: 2021-03-23 01:38:51
2021/03/24 00:30:24 [15540] [net] reboot time 2021-03-23 01:38:51 +0000 UTC store mod time 2021-03-24 00:09:47.5907853 +0000 UTC
2021/03/24 00:30:24 [15540] [net] Restored state, &{Version:v1.2.2 TimeStamp:2021-03-24 00:09:47.5897849 +0000 UTC ExternalInterfaces:map[vEthernet (Ethernet 2):0xc00012e200] store:0xc000399230 Mutex:{state:0 sema:0}}
2021/03/24 00:30:24 [15540] External Interface &{Name:vEthernet (Ethernet 2) Networks:map[azure:0xc00006ef70] Subnets:[10.6.0.0/16] BridgeName: DNSInfo:{Suffix: Servers:[] Options:[]} MacAddress:00:22:48:55:d7:21 IPAddresses:[] Routes:[] IPv4Gateway:0.0.0.0 IPv6Gateway:::}
2021/03/24 00:30:24 [15540] Number of endpoints: 30
2021/03/24 00:30:24 [15540] [cni-net] Plugin started.
2021/03/24 00:30:24 [15540] CNI_COMMAND environment variable set to DEL
2021/03/24 00:30:24 [15540] [cni-net] Processing DEL command with args {ContainerID:062334ae2b057f93bb8844b2c0bf998f2a9544660735b8c424b77331ecf09463 Netns: IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=alita-job-2119f5b4-5fb18aaa-03232310-af-000;K8S_POD_INFRA_CONTAINER_ID=062334ae2b057f93bb8844b2c0bf998f2a9544660735b8c424b77331ecf09463 Path:c:/k/azurecni/bin, StdinData:{"AdditionalArgs":[{"Name":"EndpointPolicy","Value":{"ExceptionList":["10.6.0.0/16"],"Type":"OutBoundNAT"}},{"Name":"EndpointPolicy","Value":{"DestinationPrefix":"10.0.0.0/16","NeedEncap":true,"Type":"ROUTE"}}],"bridge":"azure0","capabilities":{"dns":true,"portMappings":true},"cniVersion":"0.3.0","dns":{"Nameservers":["10.0.0.10","168.63.129.16"],"Search":["svc.cluster.local"]},"ipam":{"type":"azure-vnet-ipam"},"mode":"bridge","name":"azure","runtimeConfig":{"dns":{"Servers":["10.0.0.10"],"Searches":["default.svc.cluster.local","svc.cluster.local","cluster.local"],"Options":["ndots:5"]}},"type":"azure-vnet"}}.
2021/03/24 00:30:24 [15540] [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: InfraVnetAddressSpace: IPV6Mode: ServiceCidrs: VnetCidrs: PodNamespaceForDualNetwork:[] IPsToRouteViaHost:[] MultiTenancy:false EnableSnatOnHost:false EnableExactMatchForPodName:false DisableHairpinOnHostInterface:false DisableIPTableLock:false CNSUrl: Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:} DNS:{Nameservers:[10.0.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} RuntimeConfig:{PortMappings:[] DNS:{Servers:[10.0.0.10] Searches:[default.svc.cluster.local svc.cluster.local cluster.local] Options:[ndots:5]}} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 54 46 48 46 48 47 49 54 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]}]}.
2021/03/24 00:30:24 [15540] Get number of endpoints for ifname vEthernet (Ethernet 2) network azure
2021/03/24 00:30:24 [15540] Trying to retrieve endpoint id 062334ae-eth0
2021/03/24 00:30:24 [15540] [net] Deleting endpoint 062334ae-eth0 from network azure.
2021/03/24 00:30:24 [15540] Trying to retrieve endpoint id 062334ae-eth0
2021/03/24 00:30:24 [15540] [net] Deleting hcn endpoint with id: 6fa6d738-c90b-49df-8dff-c87f92278f16
2021/03/24 00:30:24 [15540] [net] Failed to delete endpoint 062334ae-eth0, err:Failed to remove hcn endpoint: 6fa6d738-c90b-49df-8dff-c87f92278f16 from namespace: b0b21776-a2e8-4439-a82e-8442f482fa46 due to error: hcnOpenNamespace failed in Win32: Element not found. (0x490) {"Success":false,"Error":"Element not found. ","ErrorCode":2147943568}.
2021/03/24 00:30:24 [15540] [azure-vnet] Failed to delete endpoint: Failed to remove hcn endpoint: 6fa6d738-c90b-49df-8dff-c87f92278f16 from namespace: b0b21776-a2e8-4439-a82e-8442f482fa46 due to error: hcnOpenNamespace failed in Win32: Element not found. (0x490) {"Success":false,"Error":"Element not found. ","ErrorCode":2147943568}.
2021/03/24 00:30:24 [15540] [cni-net] DEL command completed with err:Failed to delete endpoint: Failed to remove hcn endpoint: 6fa6d738-c90b-49df-8dff-c87f92278f16 from namespace: b0b21776-a2e8-4439-a82e-8442f482fa46 due to error: hcnOpenNamespace failed in Win32: Element not found. (0x490) {"Success":false,"Error":"Element not found. ","ErrorCode":2147943568}.
2021/03/24 00:30:24 [15540] Failed to execute network plugin, err:Failed to delete endpoint: Failed to remove hcn endpoint: 6fa6d738-c90b-49df-8dff-c87f92278f16 from namespace: b0b21776-a2e8-4439-a82e-8442f482fa46 due to error: hcnOpenNamespace failed in Win32: Element not found. (0x490) {"Success":false,"Error":"Element not found. ","ErrorCode":2147943568}.
2021/03/24 00:30:24 [15540] [cni-net] Plugin stopped.

How to reproduce it:
It happens frequently(almost everyday) in cluster with high number for Pod Creation/Deletion, high CPU usage.

Orchestrator and Version (e.g. Kubernetes, Docker):
K8s 1.20.2 Containerd 1.14.3

Operating System (Linux/Windows):
Windows Server 2019 Datacenter 10.0.17763.1697

zhiweiv commented 3 years ago

Not seen recently.

zhiweiv commented 3 years ago

Occurred again in Windows nodes w/ AKS 1.21.2