aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

Pod routing issues preventing EKS rollout #204

Closed conversicachrisr closed 5 years ago

conversicachrisr commented 5 years ago

This may or may not be the same issue as https://github.com/aws/amazon-vpc-cni-k8s/issues/180, but the descriptions in that ticket make the problem seem rooted in a single node, which does not appear to be the case here.

We have been experiencing strange routing issues on our EKSv2 clusters. We are running the typical EKS + AWS-CNI setup and are not doing any other custom routing. All nodes are in the same security group, with self access configured as defined in the EKS docs. We have tried both AWS-CNI 1.1.0 and 1.2.1; both exhibit this issue. We are running m5d.12xl in the dev cluster and m5d.xl in the test cluster. The AMI is built from the EKS Packer repo with minor tweaks for internal use (like consuming the instance store volumes for Docker storage).

Previously I was only able to trigger this at scale; now I'm finding I can replicate it at will on any EKS cluster we launch. The issue becomes worse as the cluster's pod count scales. A deployment of pods has about a 1-in-20 pod failure rate on our test cluster (~40 pods across 3 nodes) and a 1-in-6 failure rate on our dev cluster (~300 pods across 3 nodes).

When a pod is launched, it cannot be reached. The pattern is usually that connections to pods on other nodes fail, but this is not always the case. In the example below, you can see the problem affect all pods on a node when reached from another node, and occur from one node's pod but not from an otherwise identical node's pod. I haven't been able to replicate the issue with the source pod on the same node as the destination pod, but that may simply be luck. Some pods on a node are reachable from another node/pod but not all, and the combination changes depending on the location of the source pod/node.

This problem does not impact pod liveness or readiness checks; they pass fine (presumably because the kubelet probes from the pod's own node).

Typically, terminating a pod resolves the issue, but occasionally even that still results in a pod that is unreachable from somewhere.

#php-apache deploy (a basic 200 OK server, the same one used in the k8s HPA example)
$ kubectl get deploy/php-apache
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
php-apache   12        12        12           12          5d

#deployment whose pods I'm checking from (one per node)
NAME                         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/chris-test   3         3         3            3           17m

$ kubectl get endpoints php-apache -o yaml | egrep "ip: |nodeName: "
  - ip: 10.115.17.95
    nodeName: ip-10-115-29-228.us-west-2.compute.internal
  - ip: 10.115.19.83
    nodeName: ip-10-115-29-228.us-west-2.compute.internal
  - ip: 10.115.31.136
    nodeName: ip-10-115-29-228.us-west-2.compute.internal
  - ip: 10.115.32.251
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.34.170
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.36.166
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.41.98
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.42.176
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.44.175
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.46.97
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.47.194
    nodeName: ip-10-115-47-179.us-west-2.compute.internal
  - ip: 10.115.5.82
    nodeName: ip-10-115-6-159.us-west-2.compute.internal

#from pod in same namespace located on node 179:
nc: connect to 10.115.17.95 port 80 (tcp) timed out: Operation in progress
nc: connect to 10.115.19.83 port 80 (tcp) timed out: Operation in progress
nc: connect to 10.115.31.136 port 80 (tcp) timed out: Operation in progress
Connection to 10.115.32.251 80 port [tcp/http] succeeded!
Connection to 10.115.34.170 80 port [tcp/http] succeeded!
Connection to 10.115.36.166 80 port [tcp/http] succeeded!
Connection to 10.115.41.98 80 port [tcp/http] succeeded!
Connection to 10.115.42.176 80 port [tcp/http] succeeded!
Connection to 10.115.44.175 80 port [tcp/http] succeeded!
Connection to 10.115.46.97 80 port [tcp/http] succeeded!
Connection to 10.115.47.194 80 port [tcp/http] succeeded!
nc: connect to 10.115.5.82 port 80 (tcp) timed out: Operation in progress

#from pod in same namespace located on node 159:
nc: connect to 10.115.17.95 port 80 (tcp) timed out: Operation in progress
nc: connect to 10.115.19.83 port 80 (tcp) timed out: Operation in progress
nc: connect to 10.115.31.136 port 80 (tcp) timed out: Operation in progress
nc: connect to 10.115.32.251 port 80 (tcp) timed out: Operation in progress
Connection to 10.115.34.170 80 port [tcp/http] succeeded!
Connection to 10.115.36.166 80 port [tcp/http] succeeded!
Connection to 10.115.41.98 80 port [tcp/http] succeeded!
Connection to 10.115.42.176 80 port [tcp/http] succeeded!
Connection to 10.115.44.175 80 port [tcp/http] succeeded!
Connection to 10.115.46.97 80 port [tcp/http] succeeded!
Connection to 10.115.47.194 80 port [tcp/http] succeeded!
Connection to 10.115.5.82 80 port [tcp/http] succeeded!

#from pod in same namespace located on node 228:
Connection to 10.115.17.95 80 port [tcp/http] succeeded!
Connection to 10.115.19.83 80 port [tcp/http] succeeded!
Connection to 10.115.31.136 80 port [tcp/http] succeeded!
nc: connect to 10.115.32.251 port 80 (tcp) timed out: Operation in progress
Connection to 10.115.34.170 80 port [tcp/http] succeeded!
Connection to 10.115.36.166 80 port [tcp/http] succeeded!
Connection to 10.115.41.98 80 port [tcp/http] succeeded!
Connection to 10.115.42.176 80 port [tcp/http] succeeded!
Connection to 10.115.44.175 80 port [tcp/http] succeeded!
Connection to 10.115.46.97 80 port [tcp/http] succeeded!
Connection to 10.115.47.194 80 port [tcp/http] succeeded!
nc: connect to 10.115.5.82 port 80 (tcp) timed out: Operation in progress

These outputs are repeatable, i.e. the pod with the issue doesn't change or flap.
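For reference, sweeps like the ones above can be reproduced with a simple nc loop over the endpoint IPs. A minimal sketch, assuming nc is available inside the source pod; the pod name is a placeholder:

#sweep every php-apache endpoint from a chosen source pod (SRC_POD is a placeholder)
SRC_POD=chris-test-xxxxx
for ip in $(kubectl get endpoints php-apache -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  kubectl exec "$SRC_POD" -- nc -zv -w 5 "$ip" 80
done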

The same check in the test cluster, from three different pods/nodes as before:

$ kubectl get endpoints php-apache -o yaml | egrep "ip: |nodeName: "
  - ip: 10.118.1.20
    nodeName: ip-10-118-3-190.us-west-2.compute.internal
  - ip: 10.118.15.127
    nodeName: ip-10-118-3-190.us-west-2.compute.internal
  - ip: 10.118.16.244
    nodeName: ip-10-118-18-46.us-west-2.compute.internal
  - ip: 10.118.18.248
    nodeName: ip-10-118-18-46.us-west-2.compute.internal
  - ip: 10.118.2.52
    nodeName: ip-10-118-3-190.us-west-2.compute.internal
  - ip: 10.118.29.74
    nodeName: ip-10-118-18-46.us-west-2.compute.internal
  - ip: 10.118.30.67
    nodeName: ip-10-118-18-46.us-west-2.compute.internal
  - ip: 10.118.33.244
    nodeName: ip-10-118-43-157.us-west-2.compute.internal
  - ip: 10.118.36.172
    nodeName: ip-10-118-43-157.us-west-2.compute.internal
  - ip: 10.118.37.36
    nodeName: ip-10-118-43-157.us-west-2.compute.internal
  - ip: 10.118.39.228
    nodeName: ip-10-118-43-157.us-west-2.compute.internal
  - ip: 10.118.4.187
    nodeName: ip-10-118-3-190.us-west-2.compute.internal

#node 1 pod
Connection to 10.118.1.20 80 port [tcp/http] succeeded!
Connection to 10.118.15.127 80 port [tcp/http] succeeded!
Connection to 10.118.16.244 80 port [tcp/http] succeeded!
Connection to 10.118.18.248 80 port [tcp/http] succeeded!
nc: connect to 10.118.2.52 port 80 (tcp) timed out: Operation in progress
Connection to 10.118.29.74 80 port [tcp/http] succeeded!
Connection to 10.118.30.67 80 port [tcp/http] succeeded!
Connection to 10.118.33.244 80 port [tcp/http] succeeded!
Connection to 10.118.36.172 80 port [tcp/http] succeeded!
Connection to 10.118.37.36 80 port [tcp/http] succeeded!
Connection to 10.118.39.228 80 port [tcp/http] succeeded!
Connection to 10.118.4.187 80 port [tcp/http] succeeded!

#node 2 pod
Connection to 10.118.1.20 80 port [tcp/http] succeeded!
Connection to 10.118.15.127 80 port [tcp/http] succeeded!
Connection to 10.118.16.244 80 port [tcp/http] succeeded!
Connection to 10.118.18.248 80 port [tcp/http] succeeded!
Connection to 10.118.2.52 80 port [tcp/http] succeeded!
Connection to 10.118.29.74 80 port [tcp/http] succeeded!
Connection to 10.118.30.67 80 port [tcp/http] succeeded!
Connection to 10.118.33.244 80 port [tcp/http] succeeded!
Connection to 10.118.36.172 80 port [tcp/http] succeeded!
Connection to 10.118.37.36 80 port [tcp/http] succeeded!
Connection to 10.118.39.228 80 port [tcp/http] succeeded!
Connection to 10.118.4.187 80 port [tcp/http] succeeded!

#node 3 pod
Connection to 10.118.1.20 80 port [tcp/http] succeeded!
Connection to 10.118.15.127 80 port [tcp/http] succeeded!
Connection to 10.118.16.244 80 port [tcp/http] succeeded!
Connection to 10.118.18.248 80 port [tcp/http] succeeded!
nc: connect to 10.118.2.52 port 80 (tcp) timed out: Operation in progress
Connection to 10.118.29.74 80 port [tcp/http] succeeded!
Connection to 10.118.30.67 80 port [tcp/http] succeeded!
Connection to 10.118.33.244 80 port [tcp/http] succeeded!
Connection to 10.118.36.172 80 port [tcp/http] succeeded!
Connection to 10.118.37.36 80 port [tcp/http] succeeded!
Connection to 10.118.39.228 80 port [tcp/http] succeeded!
Connection to 10.118.4.187 80 port [tcp/http] succeeded!

I'm happy to provide the debug output from the script; I just need to know who to send it to.

liwenwu-amazon commented 5 years ago

@conversicachrisr, you can send the debug output to liwenwu@amazon.com. Also, can you run /opt/cni/bin/aws-cni-support.sh on the node that has the "non-working" pod and send that output to me as well?
Thanks

conversicachrisr commented 5 years ago

@liwenwu-amazon sent you an email, thanks. I included the test cluster output where I was able to replicate the issue. If you also need it from the dev cluster let me know.

liwenwu-amazon commented 5 years ago

@conversicachrisr You might be running into issue #35. Here is one workaround for this:
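(The workaround snippet itself did not survive in this copy of the thread. Judging from the follow-up comments below, it appears to have been enabling external SNAT on the CNI; a sketch of that setting, assuming the standard aws-node daemonset in kube-system:)

#inferred workaround: tell the CNI that SNAT is handled externally so it skips its own SNAT rule
kubectl set env daemonset/aws-node -n kube-system AWS_VPC_K8S_CNI_EXTERNALSNAT=true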

conversicachrisr commented 5 years ago

This seems to resolve the issue for us. The link does help explain the cause, but it mostly covers external access. In our case, internal cluster access was affected (pod to pod within the same node group), and it appears to depend on which ENI the pod was assigned to; it was only a problem when the pod was on a secondary ENI?
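As an aside (not part of the original comment): one way to check which ENI a given pod IP landed on is to look at the policy routing rules the CNI installs on the node; pods on secondary ENIs get a rule pointing at a per-ENI route table. The IP and table number below are placeholders:

#on the node: find the rule and route table for a pod IP (example IP from above)
ip rule list | grep 10.115.17.95
#inspect the referenced table to see which interface it uses (secondary ENIs typically get tables 2, 3, ...)
ip route show table 2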

xubofei1983 commented 5 years ago

@liwenwu-amazon We also have this issue (on EKS). Upgrading to 1.3 and setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true does not solve the problem.

When we scale out to hundreds of nodes, some nodes end up problematic: all the Kubernetes pods on them have network issues, but networking on the nodes themselves (outside of the containers) is fine.

stongo commented 5 years ago

Also confirming the same issue with 1.3 and AWS_VPC_K8S_CNI_EXTERNALSNAT=true. Pods are sometimes created and immediately unreachable, but they can also become unreachable at a later time. We have been testing StatefulSets with applications requiring Raft consensus. When using 3 replicas, a cluster is successfully formed but then often loses quorum as one pod becomes unreachable. The other scenario is that a pod becomes wedged because of networking issues during init and needs to be deleted, after which a new pod is created and starts okay.

mejran commented 5 years ago

We're hitting this issue as well, with internal pod-to-pod communication failing. It goes away once we recreate the nodes in question. We're also being hit by a lot of intermittent DNS resolution issues, which may be due to the same underlying cause.

RAR commented 5 years ago

I've hit this as well, though on a new kops-created cluster. Generally nodes come up just fine, but at a later time (usually within ~30 minutes) kube-dns fails, along with anything else relying on inter-node communication, after being unable to contact the kube internal API IP. Around this time I see a large increase in "IPv4: martian source" errors on the affected nodes. I do have AWS_VPC_K8S_CNI_EXTERNALSNAT=true, as this cluster lives on a private subnet and we have API gateways set up. This cluster also reaches back over VPC peering to another network. AWS_VPC_CNI_NODE_PORT_SUPPORT is also set to true.

In my last bit of troubleshooting, I found that it started after ec2net rewrote the aliases of eth0:

Feb 23 14:48:06 ip-10-7-161-129 dhclient[2256]: DHCPREQUEST on eth0 to 10.7.160.1 port 67 (xid=0x7d7b31f5)
Feb 23 14:48:06 ip-10-7-161-129 dhclient[2256]: DHCPACK from 10.7.160.1 (xid=0x7d7b31f5)
Feb 23 14:48:06 ip-10-7-161-129 dhclient[2256]: bound to 10.7.161.129 -- renewal in 1483 seconds.
Feb 23 14:48:06 ip-10-7-161-129 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:63:72:fa:82:ee/local-ipv4s
Feb 23 14:48:06 ip-10-7-161-129 ec2net: [rewrite_aliases] Rewriting aliases of eth0
Feb 23 14:48:06 ip-10-7-161-129 kernel: IPv4: martian source 10.7.141.251 from 10.7.190.200, on dev eni80fc26778d0
Feb 23 14:48:06 ip-10-7-161-129 kernel: ll header: 00000000: 26 3f b8 a7 bc bf 6e 8e c2 98 9a 53 08 00        &?....n....S..
Feb 23 14:48:06 ip-10-7-161-129 kubelet: I0223 14:48:06.764845    3419 prober.go:111] Liveness probe for "kube-dns-6b4f4b544c-r8vwl_kube-system(d935b1b3-3776-11e9-afce-0a980781432e):sidecar" failed (failure): Get http://10.7.164.160:10054/metrics: dial tcp 10.7.164.160:10054: connect: connection refused
Feb 23 14:48:06 ip-10-7-161-129 kernel: IPv4: martian source 10.7.141.251 from 10.7.190.200, on dev eni80fc26778d0
Feb 23 14:48:06 ip-10-7-161-129 kernel: ll header: 00000000: 26 3f b8 a7 bc bf 6e 8e c2 98 9a 53 08 00        &?....n....S..
Feb 23 14:48:07 ip-10-7-161-129 kernel: IPv4: martian source 10.7.141.251 from 10.7.190.200, on dev eni80fc26778d0
Feb 23 14:48:07 ip-10-7-161-129 kernel: ll header: 00000000: 26 3f b8 a7 bc bf 6e 8e c2 98 9a 53 08 00        &?....n....S..

Removing the ec2-net-utils package at startup stops the cluster from failing over time, but I'm guessing this only removes the trigger (introducing the IPs on the instance?) rather than fixing the actual issue. If I add the package back later and run dhclient to trigger the script, it fails immediately.
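For reference, a sketch of checking for that trigger and applying the mitigation described above (log path assumed to be the standard Amazon Linux syslog; removing the package drops all of the ec2net hotplug handling, so weigh the trade-off):

#check whether ec2net has been rewriting eth0 aliases on DHCP renew (the trigger above)
grep -E 'ec2net.*(get_meta|rewrite_aliases)' /var/log/messages
#mitigation mentioned above: remove the package that ships the ec2net hooks
sudo yum remove -y ec2-net-utils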

We do build our own AMI; it's based on the upstream amzn2-ami-hvm* images. We also set a number of sysctl options, so one of them could be triggering this issue:

fs.suid_dumpable=0
kernel.randomize_va_space=2
net.ipv4.ip_forward=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.send_redirects=0
net.ipv4.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_source_route=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.default.secure_redirects=0
net.ipv4.conf.all.log_martians=1
net.ipv4.conf.default.log_martians=1
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.icmp_ignore_bogus_error_responses=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.tcp_syncookies=1
net.ipv6.conf.all.accept_ra=0
net.ipv6.conf.default.accept_ra=0
net.ipv6.conf.all.accept_redirects=0
net.ipv6.conf.default.accept_redirects=0

In addition, IPv6 is disabled in modprobe.d using options ipv6 disable=1.
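As a diagnostic aside (not part of the original comment): strict reverse-path filtering (rp_filter=1, as set above) is a common suspect when martian-source messages show up on a multi-ENI node, since return traffic for pods on secondary ENIs can look asymmetric to the kernel. The effective per-interface values can be checked with:

#list rp_filter for every interface: 0 = off, 1 = strict, 2 = loose
sysctl -a 2>/dev/null | grep '\.rp_filter'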

If this is unrelated or a different issue, I'm happy to open a separate issue.

Edit: I removed almost all of our modifications to the upstream AMI and tried again with that image, without success. Due to time constraints I'm going to move back to Weave, which is known good for us but not as optimal (I'd love the ALB -> pod IP path).

sdavids13 commented 5 years ago

We seem to be hitting this same issue. It is slightly masked because a lot of our external-facing pods sit behind ALBs, and the affected pods never come into rotation thanks to the ALB health checks. Just like the original poster, our readiness and liveness probes pass but external clients (like the ALB or Prometheus) can't talk to the pod. Once we kill the pod, it seems to resolve itself. This tends to happen quite a bit. We are running on CentOS 7 using scripts based on the provided EKS Packer scripts. @tabern

peterbroadhurst commented 5 years ago

I believe we are hitting this same issue. Like others on this thread, we are using AWS_VPC_K8S_CNI_EXTERNALSNAT (we have managed AWS NAT gateways in each AZ, plus VPN and VPC peering connections, which means we need to disable SNAT). We are using Calico network policy on top. In an environment experiencing the issue, I tried provisioning lots of pods onto the same k8s node to get a spread of IPs. All of the pods with broken networking are on eth2 on that node; the pods with IPs on eth0 and eth1 work fine, and no pods on eth2 work. However, on other k8s nodes pods work on all three interfaces (eth0/eth1/eth2), so the problem is not as simple as scaling to three ENIs.

So for us, the issue is specific to a particular instance + ENI combination.
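For anyone doing the same per-ENI analysis, the instance metadata service (the same endpoint ec2net queries in the logs earlier in this thread) lists the private IPs attached to each ENI; a quick sketch to run on the node:

#list each attached ENI's MAC, then the private IPs assigned to that ENI
for mac in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
  echo "ENI ${mac%/}"
  curl -s "http://169.254.169.254/latest/meta-data/network/interfaces/macs/${mac}local-ipv4s"
  echo
done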

peterbroadhurst commented 5 years ago

An observed difference between working and non-working nodes:

This seems like a symptom to me rather than the root cause. Maybe some required processing was missed for this ENI when the CNI plugin added it.

(edit) Here's an error message from the ipamd.log that might be closer to the root cause. It seems the CNI plugin continued to assign IPs to this ENI after the failure below was logged.

2019-03-10T14:24:51Z [ERROR] Failed to increase pool size: failed to setup eni eni-0eebab16947b030f9 network: eni network setup: failed to find the link which uses mac address 02:7a:75:54:f3:38: no interface found which uses mac address 02:7a:75:54:f3:38

I'm speculating about whether there's a timing window during attachment where the check below needs to be retried rather than resulting in an immediate failure: https://github.com/aws/amazon-vpc-cni-k8s/blob/8f55c72d13b29f104fdeda5414bc079652539f22/pkg/networkutils/network.go#L557-L562
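For anyone checking whether a misbehaving node hit the same failure, the ipamd log on the node (default path assumed) can be searched for that message:

#search the CNI's ipamd logs for the same ENI setup failure
grep "failed to find the link which uses mac address" /var/log/aws-routed-eni/ipamd.log*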

peterbroadhurst commented 5 years ago

We have begun work on a fix based on the above analysis. It's a release-1.3 derived branch, not ready for a PR yet: https://github.com/aws/amazon-vpc-cni-k8s/compare/release-1.3...kaleido-io:release-1.3-fix-204 We're setting up our build pipeline so we can begin testing.

mogren commented 5 years ago

Thanks @peterbroadhurst for working on this, and I agree that retrying is a sensible approach.

recollir commented 5 years ago

Could #318 be related to this?

mogren commented 5 years ago

Resolving, since v1.5.0 has been released.

recollir commented 5 years ago

👍 I can verify that it already works with 1.4.0.