Azure / azure-container-networking

Azure Container Networking Solutions for Linux and Windows Containers

Azure CNI not working when upgrading to Ubuntu 18.04 #632

Closed · terricain closed this issue 2 years ago

terricain commented 4 years ago

We have a Kubernetes cluster in Azure, running Kubernetes 1.18.6 and Azure CNI 1.1.5 on Ubuntu 16.04. As the kubelet starts, it runs the iptables command from here.
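For reference, the rule in question is the SNAT rule quoted again at the bottom of this report:

```bash
# SNAT everything leaving the node that is not local and not destined
# for the VNet range (10.4.0.0/18 in this cluster).
/sbin/iptables -t nat -A POSTROUTING -m addrtype ! --dst-type local ! -d 10.4.0.0/18 -j MASQUERADE
```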

What happened:

Upgraded a worker to 18.04. The kubelet was failing to register the node.

Kubelet log: lots of

```
Jul 29 14:03:09 tools-test-worker-00 kubelet[10323]: I0729 14:03:09.127669   10323 nodeinfomanager.go:403] Failed to publish CSINode: nodes "tools-test-worker-00" not found
Jul 29 14:03:09 tools-test-worker-00 kubelet[10323]: E0729 14:03:09.128737   10323 kubelet.go:2268] node "tools-test-worker-00" not found
```

I eventually saw:

```
kubelet[7777]: I0729 13:57:30.586855    7777 cloud_request_manager.go:115] Node addresses from cloud provider for node "tools-test-worker-00" not collected: Get http://169.254.169.254/metadata/instance?api-version=2019-03-11&format=json: dial tcp 169.254.169.254:80: i/o timeout
```
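For anyone following along, the timeout can be checked outside the kubelet with a direct IMDS probe (a sketch; the api-version is the one from the log above):

```bash
# Query the Azure Instance Metadata Service directly from the node.
# If POSTROUTING is masquerading this traffic, the request times out.
curl -s --max-time 5 -H "Metadata: true" \
  "http://169.254.169.254/metadata/instance?api-version=2019-03-11&format=json"
```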

So I added `/sbin/iptables -t nat -I POSTROUTING -d 169.254.169.254 -j RETURN` before running the kubelet (via a systemd ExecStartPre). The kubelet then started, but it was failing to start pods in the default namespace:

```
Jul 29 14:23:10 tools-test-worker-00 kubelet[9145]: I0729 14:23:10.780779    9145 kuberuntime_manager.go:422] No sandbox for pod "liveness3_default(242a5264-a782-43e4-8489-a595b3d3ccb2)" can be found. Need to start a new one
Jul 29 14:23:10 tools-test-worker-00 kubelet[9145]: I0729 14:23:10.838185    9145 kuberuntime_manager.go:422] No sandbox for pod "debug3_default(553e14e2-a256-47cd-9c28-67b0fbe0b4fb)" can be found. Need to start a new one
Jul 29 14:23:21 tools-test-worker-00 kubelet[9145]: E0729 14:23:21.135658    9145 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to setup network for sandbox "115a3a68cea1f3274d7f69ef12be865176f89bda891c97a693377f6c3f4c2584": Failed to allocate pool: Failed to delegate: Failed to allocate pool: Invalid address space
Jul 29 14:23:21 tools-test-worker-00 kubelet[9145]: E0729 14:23:21.135770    9145 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "liveness3_default(242a5264-a782-43e4-8489-a595b3d3ccb2)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "115a3a68cea1f3274d7f69ef12be865176f89bda891c97a693377f6c3f4c2584": Failed to allocate pool: Failed to delegate: Failed to allocate pool: Invalid address space
Jul 29 14:23:21 tools-test-worker-00 kubelet[9145]: E0729 14:23:21.135805    9145 kuberuntime_manager.go:727] createPodSandbox for pod "liveness3_default(242a5264-a782-43e4-8489-a595b3d3ccb2)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "115a3a68cea1f3274d7f69ef12be865176f89bda891c97a693377f6c3f4c2584": Failed to allocate pool: Failed to delegate: Failed to allocate pool: Invalid address space
```
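For completeness, the ExecStartPre workaround was wired in roughly like this (a sketch; the drop-in path and unit name are assumptions, adjust for your setup):

```bash
# Add a systemd drop-in so the IMDS exemption is inserted before kubelet starts.
# The drop-in filename below is hypothetical.
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' > /etc/systemd/system/kubelet.service.d/10-imds-exempt.conf
[Service]
ExecStartPre=/sbin/iptables -t nat -I POSTROUTING -d 169.254.169.254 -j RETURN
EOF
systemctl daemon-reload
systemctl restart kubelet
```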

I also repeated it with CNI 1.1.3 and got the same errors.
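For anyone hitting the same `Failed to allocate pool: Invalid address space` error, these are the places worth checking on the node (a sketch using the Azure CNI default paths; adjust if your install differs):

```bash
# Azure CNI network and IPAM plugin logs (default locations)
tail -n 50 /var/log/azure-vnet.log /var/log/azure-vnet-ipam.log

# CNI network configuration handed to the container runtime
cat /etc/cni/net.d/10-azure.conflist

# Persisted IPAM state: the address pool the plugin thinks it can allocate from
cat /var/run/azure-vnet-ipam.json
```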

What you expected to happen:

For it to just work.

How to reproduce it:

Not entirely sure how you would, as everyone is doing AKS these days :D

1. Stand up a VNet in Azure with 2 Ubuntu 18.04 VMs.
2. Set up a master node and a worker node (with the 30 IPs on the NIC).
3. On the worker node, start the kubelet with the 2 iptables rules mentioned above.
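A rough sketch of the infrastructure side with the Azure CLI (all names are placeholders, and this skips the Kubernetes bootstrap itself):

```bash
# Hypothetical resource group, VNet, and VM names throughout.
az group create -n cni-repro -l westeurope
az network vnet create -g cni-repro -n cni-vnet \
  --address-prefix 10.4.0.0/16 --subnet-name nodes --subnet-prefix 10.4.0.0/18

# One master, one worker, both Ubuntu 18.04
for vm in master-00 worker-00; do
  az vm create -g cni-repro -n "$vm" \
    --image Canonical:UbuntuServer:18.04-LTS:latest \
    --vnet-name cni-vnet --subnet nodes --generate-ssh-keys
done

# Add ~30 secondary IP configurations to the worker NIC for pod IPs.
# "worker-00VMNic" assumes the default NIC name that `az vm create` generates.
for i in $(seq 1 30); do
  az network nic ip-config create -g cni-repro \
    --nic-name worker-00VMNic --name "ipconfig-pod-$i"
done
```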

Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes 1.18.6, containerd 1.2.13

Operating System (Linux/Windows):
Ubuntu 18.04

Kernel (e.g. `uname -a` for Linux or `$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion` for Windows):
Linux tools-test-worker-00 5.3.0-1032-azure #33~18.04.1-Ubuntu SMP Fri Jun 26 15:01:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Anything else we need to know?:

iptables state whilst the kubelet was crashing (the MASQUERADE rule added with `/sbin/iptables -t nat -A POSTROUTING -m addrtype ! --dst-type local ! -d 10.4.0.0/18 -j MASQUERADE`):

I snipped out some rules, but I don't think they were relevant:

```
root@tools-test-worker-00:~# iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */
MASQUERADE  all  --  anywhere            !10.4.0.0/18          ADDRTYPE match dst-type !LOCAL

Chain KUBE-KUBELET-CANARY (0 references)
target     prot opt source               destination

Chain KUBE-MARK-DROP (2 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK or 0x8000

Chain KUBE-MARK-MASQ (68 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK or 0x4000

Chain KUBE-NODEPORTS (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  --  localhost/8          anywhere             /* ingress-nginx/ingress-nginx:https */ tcp dpt:30001
KUBE-XLB-4E7KSV2ABIFJRAUZ  tcp  --  anywhere             anywhere             /* ingress-nginx/ingress-nginx:https */ tcp dpt:30001
KUBE-MARK-MASQ  tcp  --  localhost/8          anywhere             /* ingress-nginx/ingress-nginx:http */ tcp dpt:30000
KUBE-XLB-REQ4FPVT7WYF4VLA  tcp  --  anywhere             anywhere             /* ingress-nginx/ingress-nginx:http */ tcp dpt:30000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere             mark match ! 0x4000/0x4000
MARK       all  --  anywhere             anywhere             MARK xor 0x4000
MASQUERADE  all  --  anywhere             anywhere             /* kubernetes service traffic requiring SNAT */

Chain KUBE-PROXY-CANARY (0 references)
target     prot opt source               destination
...

Chain KUBE-SEP-4VDVLS7C74EHFEEO (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.4.0.13            anywhere             /* kube-system/coredns:dns */
DNAT       udp  --  anywhere             anywhere             /* kube-system/coredns:dns */ udp to:10.4.0.13:53

Chain KUBE-SEP-63BCXCZXITMJZ6KL (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.4.0.37            anywhere             /* default/kibana:http */
DNAT       tcp  --  anywhere             anywhere             /* default/kibana:http */ tcp to:10.4.0.37:5601
...

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  -- !10.4.0.0/18          172.16.0.1           /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             172.16.0.1           /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-MARK-MASQ  udp  -- !10.4.0.0/18          172.16.0.10          /* kube-system/coredns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ZRLRAB2E5DTUX37C  udp  --  anywhere             172.16.0.10          /* kube-system/coredns:dns cluster IP */ udp dpt:domain
KUBE-MARK-MASQ  tcp  -- !10.4.0.0/18          172.16.0.10          /* kube-system/coredns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-FAITROITGXHS3QVF  tcp  --  anywhere             172.16.0.10          /* kube-system/coredns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-MARK-MASQ  tcp  -- !10.4.0.0/18          172.16.11.131        /* default/kibana:http cluster IP */ tcp dpt:http
KUBE-SVC-GYQBIT3U2LMZ4H3E  tcp  --  anywhere             172.16.11.131        /* default/kibana:http cluster IP */ tcp dpt:http
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

Chain KUBE-SVC-FAITROITGXHS3QVF (1 references)
target     prot opt source               destination
KUBE-SEP-V2TWSMP5HSEO77UF  all  --  anywhere             anywhere             /* kube-system/coredns:dns-tcp */ statistic mode random probability 0.50000000000
KUBE-SEP-AQIH42UCMJIZ52GE  all  --  anywhere             anywhere             /* kube-system/coredns:dns-tcp */

Chain KUBE-SVC-GYQBIT3U2LMZ4H3E (1 references)
target     prot opt source               destination
KUBE-SEP-63BCXCZXITMJZ6KL  all  --  anywhere             anywhere             /* default/kibana:http */

Chain KUBE-XLB-4E7KSV2ABIFJRAUZ (1 references)
target     prot opt source               destination
KUBE-SVC-4E7KSV2ABIFJRAUZ  all  --  10.4.0.0/18          anywhere             /* Redirect pods trying to reach external loadbalancer VIP to clusterIP */
KUBE-MARK-MASQ  all  --  anywhere             anywhere             /* masquerade LOCAL traffic for ingress-nginx/ingress-nginx:https LB IP */ ADDRTYPE match src-type LOCAL
KUBE-SVC-4E7KSV2ABIFJRAUZ  all  --  anywhere             anywhere             /* route LOCAL traffic for ingress-nginx/ingress-nginx:https LB IP to service chain */ ADDRTYPE match src-type LOCAL
KUBE-MARK-DROP  all  --  anywhere             anywhere             /* ingress-nginx/ingress-nginx:https has no local endpoints */

Chain KUBE-XLB-REQ4FPVT7WYF4VLA (1 references)
target     prot opt source               destination
KUBE-SVC-REQ4FPVT7WYF4VLA  all  --  10.4.0.0/18          anywhere             /* Redirect pods trying to reach external loadbalancer VIP to clusterIP */
KUBE-MARK-MASQ  all  --  anywhere             anywhere             /* masquerade LOCAL traffic for ingress-nginx/ingress-nginx:http LB IP */ ADDRTYPE match src-type LOCAL
KUBE-SVC-REQ4FPVT7WYF4VLA  all  --  anywhere             anywhere             /* route LOCAL traffic for ingress-nginx/ingress-nginx:http LB IP to service chain */ ADDRTYPE match src-type LOCAL
KUBE-MARK-DROP  all  --  anywhere             anywhere             /* ingress-nginx/ingress-nginx:http has no local endpoints */
```
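One note on the dump above: the `RETURN` rule for 169.254.169.254 only helps if it sits above the broad `MASQUERADE` rule in `POSTROUTING`, which is why the workaround uses `-I` (insert at the top) rather than `-A` (append). The ordering can be checked with:

```bash
# List POSTROUTING with rule numbers; the IMDS RETURN rule should appear
# above the broad MASQUERADE rule.
iptables -t nat -L POSTROUTING --line-numbers -n
```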


So I'm sure I'm missing something simple. If you need any more info/logs just ask, I can easily add a 18.04 worker to the cluster I have running.

terricain commented 4 years ago

^^ fixed cni version typo :)

cpressland commented 4 years ago

So, @terrycain and I are both working on this. If it's helpful, I can convert the Chef cookbooks we've written to deploy these clusters into a simplified bash version along with some basic Terraform. Hopefully that would make things easier to replicate.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.