aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

Pods that don't use hostNetwork on Ubuntu workers have no network connectivity #157

Closed. hobbsh closed this issue 6 years ago

hobbsh commented 6 years ago

I've been trying to get an Ubuntu worker running with EKS, and it has come down to one last hurdle, which I believe is related to this CNI. Any pod not running with hostNetwork: true has no network connectivity. The AL2 workers running alongside it are all fine. I have an aws-cni-support.tar.gz ready.

So, in this scenario, an Ubuntu worker can join the cluster because aws-node and kube-proxy both run with hostNetwork: true.

wylie:us-west-2 whobbs$ kubectl get po -o wide --all-namespaces | grep "ip-10-99-61-115"
kube-system   aws-node-tm57h                                         1/1       Running            3          44m       10.99.61.115   ip-10-99-61-115.us-west-2.compute.internal
kube-system   fluentd-49g27                                          0/1       CrashLoopBackOff   14         44m       10.99.61.211   ip-10-99-61-115.us-west-2.compute.internal
kube-system   kube-proxy-jdtwv                                       1/1       Running            1          44m       10.99.61.115   ip-10-99-61-115.us-west-2.compute.internal
lumo          fv-consumer-1534269600-zgvfd                           0/1       Error              0          12m       10.99.61.75    ip-10-99-61-115.us-west-2.compute.internal
monitoring    kube-prometheus-exporter-node-n6m95                    1/1       Running            1          44m       10.99.61.115   ip-10-99-61-115.us-west-2.compute.internal

I see connectivity errors in the pods that don't use hostNetwork:

2018-08-14 16:45:36 +0000 [error]: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server"
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='data.flightview.com', port=443): Max retries exceeded with url: /BatchService.asmx (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5b508ca160>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
liwenwu-amazon commented 6 years ago

@hobbsh I have tested the CNI on Ubuntu in a kops cluster. Can you check iptables -nvL and see where the packets are dropped? Have you enabled forwarding for IPv4 traffic?
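
The two checks suggested here could be scripted roughly as follows. This is a minimal sketch, not part of the CNI: the function name `ipv4_forwarding_enabled` is hypothetical, and the optional path argument is an assumption added only so the check can be pointed at a different file.

```shell
# Hypothetical helper: succeeds (exit status 0) when IPv4 forwarding is on.
# By default it reads the kernel flag from /proc; the optional first
# argument is an assumption so the check can be run against another file.
ipv4_forwarding_enabled() {
  [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

# Example usage on a worker node (commented out; iptables needs root):
# ipv4_forwarding_enabled && echo "forwarding on" || echo "forwarding off"
# sudo iptables -nvL FORWARD | head -n 1   # first line shows the chain policy
```

If the function fails, or the first line of the FORWARD listing shows a DROP policy, traffic from pods that do not use hostNetwork is unlikely to be routed off the host.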

hobbsh commented 6 years ago

I was starting to suspect iptables as well, and it looks like the FORWARD chain policy is set to DROP, despite having sudo iptables -P FORWARD ACCEPT in the AMI script and saved rules that looked right. Running that command manually on the host solved the problem. I will fix my AMI script and iptables-persistent setup. It may also be that iptables-restore was not run on boot, in which case it's still an issue with my script. Thank you for your quick reply!

# Generated by iptables-save v1.6.0 on Tue Aug 14 20:06:51 2018
*filter
:INPUT ACCEPT [10:1360]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [10:772]
COMMIT
# Completed on Tue Aug 14 20:06:51 2018
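
To confirm what a rules dump like the one above would set at boot, the FORWARD policy can be read straight out of iptables-save output. This is a minimal sketch under assumptions: `forward_policy` is a hypothetical helper name, and the rules file path in the usage comment is the usual iptables-persistent location, which may differ on a given AMI.

```shell
# Hypothetical helper: print the default policy of the filter table's
# FORWARD chain, reading iptables-save formatted text from stdin.
forward_policy() {
  awk '
    /^\*/ { in_filter = ($0 == "*filter") }        # track the current table
    in_filter && /^:FORWARD / { print $2; exit }   # second field is the policy
  '
}

# Example usage (commented out; needs root, and the path is an assumption):
# sudo iptables-save | forward_policy          # policy of the live ruleset
# forward_policy < /etc/iptables/rules.v4      # policy in the saved rules
```

Comparing the live output with the saved file shows whether the saved rules were actually restored on boot: a saved ACCEPT alongside a live DROP points at the restore step rather than at the rules themselves.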