Closed: conversicachrisr closed this issue 5 years ago.
@conversicachrisr, you can send the debug output to liwenwu@amazon.com. Also, can you run /opt/cni/bin/aws-cni-support.sh on the node that has the "non-working" pod and send me that output as well?
thanks
@liwenwu-amazon sent you an email, thanks. I included the test cluster output where I was able to replicate the issue. If you also need it from the dev cluster let me know.
@conversicachrisr You might be running into issue #35. Here is one workaround for it:
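The workaround itself isn't quoted above, but based on later comments in this thread it presumably refers to disabling the CNI's source NAT so return traffic can route back over secondary ENIs. A hedged sketch of applying it (the env var name is taken from the comments below; verify against the issue #35 discussion before using):

```shell
# Assumed reconstruction of the issue #35 workaround: disable the CNI's
# source NAT on the aws-node daemonset. This is a cluster config change,
# not something the thread confirms verbatim.
kubectl -n kube-system set env daemonset/aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true

# Verify the setting was applied to the daemonset spec:
kubectl -n kube-system describe daemonset aws-node | grep EXTERNALSNAT
```

Note that, per the replies below, several users report this workaround does not resolve the problem for them.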
This seems to resolve the issue for us. The link helps explain the cause of the issue, but it mostly covers external access. In our case the problem was internal cluster access (pod to pod within the same node group), and it appeared to depend on which ENI the pod was assigned to, only occurring when the pod was on a secondary ENI.
@liwenwu-amazon We also have this issue (on EKS). Upgrading to 1.3 and setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true does not solve the problem.
When we scale out to hundreds of nodes, there are some problematic nodes on which all the Kubernetes pods have network issues, but networking on the nodes themselves (outside the containers) is fine.
Also confirming the same issue with 1.3 and AWS_VPC_K8S_CNI_EXTERNALSNAT=true. Pods either get created and are immediately unreachable, or become unreachable at a later time. We have been testing StatefulSets with applications requiring Raft consensus. When using 3 replicas, a cluster forms successfully but then often loses quorum as one pod becomes unreachable. The other scenario is that a pod becomes wedged because of networking issues during init and needs to be deleted, after which a new pod is created and starts okay.
We're hitting this issue as well, with pods' internal communication failing. It goes away once we recreate the nodes in question. We're also being hit by a lot of intermittent DNS resolution issues, which may be due to the same underlying cause.
I've hit this as well, though on a new kops-created cluster. Generally nodes come up just fine, but at a later time, usually within ~30 minutes or so, kube-dns fails (along with anything else relying on inter-node communication) after being unable to contact the kube internal API IP. Around this time I see a large increase in IPv4: martian source errors on the affected nodes. I do have AWS_VPC_K8S_CNI_EXTERNALSNAT=true, as this cluster lives on a private subnet, and we do have API gateways set up. This cluster also reaches back over a VPC peering to another network. AWS_VPC_CNI_NODE_PORT_SUPPORT is also set to true.
In my last bit of troubleshooting I found that it started after ec2net rewrote the aliases of eth0:
Feb 23 14:48:06 ip-10-7-161-129 dhclient[2256]: DHCPREQUEST on eth0 to 10.7.160.1 port 67 (xid=0x7d7b31f5)
Feb 23 14:48:06 ip-10-7-161-129 dhclient[2256]: DHCPACK from 10.7.160.1 (xid=0x7d7b31f5)
Feb 23 14:48:06 ip-10-7-161-129 dhclient[2256]: bound to 10.7.161.129 -- renewal in 1483 seconds.
Feb 23 14:48:06 ip-10-7-161-129 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:63:72:fa:82:ee/local-ipv4s
Feb 23 14:48:06 ip-10-7-161-129 ec2net: [rewrite_aliases] Rewriting aliases of eth0
Feb 23 14:48:06 ip-10-7-161-129 kernel: IPv4: martian source 10.7.141.251 from 10.7.190.200, on dev eni80fc26778d0
Feb 23 14:48:06 ip-10-7-161-129 kernel: ll header: 00000000: 26 3f b8 a7 bc bf 6e 8e c2 98 9a 53 08 00 &?....n....S..
Feb 23 14:48:06 ip-10-7-161-129 kubelet: I0223 14:48:06.764845 3419 prober.go:111] Liveness probe for "kube-dns-6b4f4b544c-r8vwl_kube-system(d935b1b3-3776-11e9-afce-0a980781432e):sidecar" failed (failure): Get http://10.7.164.160:10054/metrics: dial tcp 10.7.164.160:10054: connect: connection refused
Feb 23 14:48:06 ip-10-7-161-129 kernel: IPv4: martian source 10.7.141.251 from 10.7.190.200, on dev eni80fc26778d0
Feb 23 14:48:06 ip-10-7-161-129 kernel: ll header: 00000000: 26 3f b8 a7 bc bf 6e 8e c2 98 9a 53 08 00 &?....n....S..
Feb 23 14:48:07 ip-10-7-161-129 kernel: IPv4: martian source 10.7.141.251 from 10.7.190.200, on dev eni80fc26778d0
Feb 23 14:48:07 ip-10-7-161-129 kernel: ll header: 00000000: 26 3f b8 a7 bc bf 6e 8e c2 98 9a 53 08 00 &?....n....S..
Removing the ec2-net-utils package on startup stops the cluster from failing over time, but I'm guessing this just removes the trigger (reintroducing the IPs on the instance?) rather than fixing the actual issue. If I add the package back later and run dhclient to trigger the script, it immediately fails.
We do build our own AMI; it's based on the upstream amzn2-ami-hvm* images. We also set a number of sysctl options, so it could be that one of them is triggering this issue:
fs.suid_dumpable=0
kernel.randomize_va_space=2
net.ipv4.ip_forward=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.send_redirects=0
net.ipv4.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_source_route=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.default.secure_redirects=0
net.ipv4.conf.all.log_martians=1
net.ipv4.conf.default.log_martians=1
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.icmp_ignore_bogus_error_responses=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.tcp_syncookies=1
net.ipv6.conf.all.accept_ra=0
net.ipv6.conf.default.accept_ra=0
net.ipv6.conf.all.accept_redirects=0
net.ipv6.conf.default.accept_redirects=0
In addition, IPv6 is disabled in modprobe.d using options ipv6 disable=1.
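For anyone comparing a custom AMI against the stock one, the effective values of these sysctls can be read back from /proc on a running node (read-only; the key names are a subset of the list above). Note that the "martian source" log lines quoted earlier only appear when log_martians is enabled, as it is in this list:

```shell
#!/bin/bash
# Read back the effective value of each sysctl from /proc/sys (read-only),
# so a custom AMI's runtime state can be diffed against a stock image.
for key in net.ipv4.conf.all.rp_filter \
           net.ipv4.conf.all.log_martians \
           net.ipv4.ip_forward; do
    # sysctl keys map to /proc/sys paths by replacing '.' with '/'
    path="/proc/sys/$(echo "$key" | tr . /)"
    printf '%s = %s\n' "$key" "$(cat "$path" 2>/dev/null || echo '?')"
done
```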
If this is unrelated or a different issue, I'm happy to open a separate issue.
Edit: I removed almost all of our modifications to the upstream AMI and tried again with that image, without success. Due to time constraints I'm going to move back to Weave, which is known good for us but less optimal (I'd love the ALB -> pod IP routing).
We seem to be hitting this same issue. It is slightly masked because a lot of our external-facing pods sit behind ALBs, which keep the broken pods out of rotation thanks to ALB health checks. Just like the original poster, our readiness and liveness probes pass, but external clients (like the ALB or Prometheus) can't talk to the pod. Once we kill the pod it seems to resolve itself. This tends to happen quite a bit. We are running on CentOS 7 using scripts based on the provided EKS packer scripts. @tabern
I believe we are hitting this same issue.
Like others on this thread we are using AWS_VPC_K8S_CNI_EXTERNALSNAT (we have managed AWS NAT gateways in each AZ, plus VPN and VPC peering connections, which means we need to disable SNAT). We are using Calico network policy on top.
In an environment experiencing the issue, I tried provisioning lots of pods onto the same k8s node to get a spread of IPs. All of the pods that have broken networking are on eth2 on that node. The pods on eth0 and eth1 IPs on that node work fine; no pods on eth2 are working.
However, on other k8s nodes pods are working on all three (eth0/eth1/eth2), so the problem is not as simple as scaling to the use of three ENIs.
So for us the issue is instance+ENI specific.
An observed difference between working and non-working nodes (the table 3 routes are present only on the working node):
Working node:
$ ip route show table all | grep eth2
default via 10.10.80.1 dev eth2 table 3
10.10.80.1 dev eth2 table 3 scope link
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev eth2 table local metric 256 pref medium
Non-working node:
$ ip route show table all | grep eth2
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev eth2 table local metric 256 pref medium
There are table 2 entries there for eth1.
This seems like a symptom to me, rather than root cause. Maybe some required processing was missed for this ENI when added by the CNI plugin.
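The comparison above can be reproduced as a read-only diagnostic. This sketch assumes iproute2 is installed and follows the CNI's convention of one route table per secondary ENI, numbered from 2 (the table numbers are from the outputs above; adjust for nodes with more ENIs):

```shell
#!/bin/bash
# Read-only diagnostic: dump the policy-routing state that the CNI plugin
# sets up, so a working node can be diffed against a non-working one.
ip rule show                      # per-pod-IP rules installed by the CNI
for table in 2 3; do
    echo "== table $table =="
    ip route show table "$table"  # empty output here is the failure symptom
done
```

Running this on both nodes and diffing the output makes the missing per-ENI table obvious.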
(edit) Here's an error message from the ipamd.log which might be closer to the root cause. It seems the CNI plugin continued to assign IPs to this ENI after the failure below was logged:
2019-03-10T14:24:51Z [ERROR] Failed to increase pool size: failed to setup eni eni-0eebab16947b030f9 network: eni network setup: failed to find the link which uses mac address 02:7a:75:54:f3:38: no interface found which uses mac address 02:7a:75:54:f3:38
I'm speculating whether there's a timing window during attachment, where the below check needs to be retried rather than resulting in an immediate failure: https://github.com/aws/amazon-vpc-cni-k8s/blob/8f55c72d13b29f104fdeda5414bc079652539f22/pkg/networkutils/network.go#L557-L562
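The proposed retry can be illustrated outside of Go. The sketch below is a hypothetical shell analogue of the lookup that network.go performs (scan interfaces for a matching MAC), wrapped in a retry instead of failing on the first miss; the function names, attempt counts, and delays are all made up for illustration:

```shell
#!/bin/bash
# Hypothetical sketch of the proposed fix: retry the MAC-to-link lookup
# instead of failing immediately, to ride out the ENI attachment window.

# Print the name of the interface whose MAC address matches $1, if any.
find_link_by_mac() {
    mac=$1
    for dev in /sys/class/net/*; do
        if [ "$(cat "$dev/address" 2>/dev/null)" = "$mac" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}

# Retry the lookup with a fixed delay (attempt/delay values are made up).
find_link_by_mac_with_retry() {
    mac=$1 attempts=${2:-5} delay=${3:-2} i=0
    while [ "$i" -lt "$attempts" ]; do
        find_link_by_mac "$mac" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    echo "no interface found which uses mac address $mac" >&2
    return 1
}
```

For example, find_link_by_mac_with_retry 02:7a:75:54:f3:38 5 2 would keep scanning for up to ~10 seconds before giving up, instead of surfacing the error from the log above on the first attempt.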
We have begun work on a fix based on the above analysis. release-1.3 derived branch, not ready for a PR yet: https://github.com/aws/amazon-vpc-cni-k8s/compare/release-1.3...kaleido-io:release-1.3-fix-204 Setting up our build pipeline so we can begin testing.
Thanks @peterbroadhurst for working on this, and I agree that retrying is a sensible approach.
Could #318 be related to this here?
Resolving since v1.5.0 is released.
👍 Can verify that it already works with 1.4.0
This may or may not be the same issue as https://github.com/aws/amazon-vpc-cni-k8s/issues/180, but the descriptions in that ticket make the problem seem rooted in a single node, which this does not appear to be.
We have been experiencing strange routing issues on our EKSv2 clusters. We are running the typical EKS + AWS-CNI setup, not doing any other custom routing. All nodes are in the same security group with self access as defined in the EKS docs. We have tried both AWS-CNI 1.1.0 and 1.2.1; both experience this issue. We are running m5d.12xl in the dev cluster and m5d.xl in the test cluster. The AMI is built from the EKS packer repo with minor tweaks for internal use (like consuming the instance-store volumes for docker storage).
Previously, I was only able to trigger this at scale; now I'm finding I can replicate it at will, on any EKS cluster we launch. This issue becomes worse as the cluster pod count scales. A deployment of pods has about a 1/20 pod failure rate on our test cluster (with ~40 pods across 3 nodes) and a 1/6 failure rate on our dev cluster (with ~300 pods across 3 nodes).
When a pod is launched, it is unreachable. The pattern is usually that connections fail to pods on other nodes, but this is not always the case. In the example below, you can see the problem affect all pods on a node when reached from another node, and occur from one node's pod but not from another identical node's pod. I haven't been able to replicate this issue with the source pod on the same node as the destination pod, but that may simply be my luck. You can see that some pods on a node are reachable from another node/pod but not all, and the combination changes depending on the source pod/node's location.
This problem does not impact pod liveness or readiness checks; they pass fine.
Typically, terminating a pod resolves the issue, but occasionally even this still results in a pod that is unreachable from somewhere.
These outputs are repeatable, i.e. the pod with the issue doesn't change or flap.
Same thing in the test cluster from three different pods/nodes like before:
I'm happy to provide the debug output from the script, I just need to know who to send it to.