flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes
Apache License 2.0

AWS - one node in AZ eu-central-1-b failing to connect through flannel #814

Closed: szuecs closed this issue 7 years ago

szuecs commented 7 years ago

We have ~3 nodes in each AZ in region eu-central-1. All nodes except one can connect to each other without problems. That one node fails to reach the pod network across AZs, although the EC2 network works across AZs from the same node.

Expected Behavior

ICMP ping and TCP connections to a pod IP in another AZ should work.

Current Behavior

ICMP ping and TCP connections from the failing node to a pod IP in another AZ did not work.

Possible Solution

Restarting flanneld fixed the problem.

Steps to Reproduce (for bugs)

unknown

Context

We run production workloads in Kubernetes using flannel and have 18 production clusters of different sizes.

Your Environment

Network investigation

ping node to POD IP

failing node:~ # ping 10.2.109.4
PING 10.2.109.4 (10.2.109.4) 56(84) bytes of data.
^C
working node:~ # ping 10.2.109.4
PING 10.2.109.4 (10.2.109.4) 56(84) bytes of data.
64 bytes from 10.2.109.4: icmp_seq=1 ttl=63 time=0.686 ms
^C

flannel config in etcd for the target

working node:~ # etcdctl get /coreos.com/network/subnets/10.2.109.0-24
{"PublicIP":"172.31.1.190","BackendType":"vxlan","BackendData":{"VtepMAC":"42:88:9e:f2:82:cb"}}
failing node:~ # etcdctl get /coreos.com/network/subnets/10.2.109.0-24
{"PublicIP":"172.31.1.190","BackendType":"vxlan","BackendData":{"VtepMAC":"42:88:9e:f2:82:cb"}}

local ARP table

working node:~# arp -n | grep 10.2.109.4
10.2.109.4               ether   42:88:9e:f2:82:cb   C                     flannel.1

failing node:~# arp -n | grep 10.2.109.4
10.2.109.4                       (incomplete)                              flannel.1
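The comparison above (VtepMAC in the etcd lease vs. the local ARP entry on flannel.1) can be scripted so it is easy to repeat on every node. This is only a rough sketch: the function names `lease_vtep_mac`, `arp_entry` and `check_neighbor` are made up for illustration and are not part of flannel, and the parsing assumes the `etcdctl` and `arp -n` output formats shown in this issue.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text in the formats pasted above.

# Print the VtepMAC from a flannel subnet lease (JSON on stdin).
# Plain sed is used so the sketch does not depend on jq.
lease_vtep_mac() {
    sed -n 's/.*"VtepMAC":"\([0-9A-Fa-f:]*\)".*/\1/p'
}

# Read `arp -n` output on stdin and print the state of the entry for IP $1:
# the MAC address if resolved, "incomplete" if not, "missing" if absent.
arp_entry() {
    awk -v ip="$1" '
        $1 == ip {
            if ($2 == "(incomplete)") print "incomplete"
            else print $3
            found = 1
            exit
        }
        END { if (!found) print "missing" }
    '
}

# Compare the lease MAC ($1) with the local neighbor state ($2).
check_neighbor() {
    if [ "$1" = "$2" ]; then
        echo "ok"
    else
        echo "mismatch: lease=$1 neighbor=$2"
    fi
}

# Example use on a node (not run here):
#   mac=$(etcdctl get /coreos.com/network/subnets/10.2.109.0-24 | lease_vtep_mac)
#   state=$(arp -n | arp_entry 10.2.109.4)
#   check_neighbor "$mac" "$state"
```

With the output pasted above, the working node would report `ok` and the failing node `mismatch: lease=42:88:9e:f2:82:cb neighbor=incomplete`.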

flanneld logs show:

$ journalctl -u flanneld
Sep 12 08:53:20 ip-172-31-15-202.eu-central-1.compute.internal flannel-wrapper[972]: I0912 08:53:20.388592     972 network.go:243] L3 miss but route for 10.2.53.0 not found
Sep 12 08:53:21 ip-172-31-15-202.eu-central-1.compute.internal flannel-wrapper[972]: I0912 08:53:21.412813     972 network.go:243] L3 miss but route for 10.2.53.0 not found
Sep 12 08:53:26 ip-172-31-15-202.eu-central-1.compute.internal flannel-wrapper[972]: I0912 08:53:26.148887     972 network.go:243] L3 miss but route for 10.2.53.0 not found
Sep 12 08:53:27 ip-172-31-15-202.eu-central-1.compute.internal flannel-wrapper[972]: I0912 08:53:27.172774     972 network.go:243] L3 miss but route for 10.2.53.0 not found
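
To see which destinations the "L3 miss" messages refer to and how often, the log can be summarised per IP. A minimal sketch, assuming the log line format shown above; `l3_miss_summary` is a made-up helper name, not a flannel tool.

```shell
#!/bin/sh
# Hypothetical helper: read flanneld log lines on stdin and print
# "<destination IP> <count>" for every "L3 miss but route for <ip> not found" line.
l3_miss_summary() {
    sed -n 's/.*L3 miss but route for \([0-9.]*\) not found.*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example use on a node (not run here):
#   journalctl -u flanneld | l3_miss_summary
```

For the four lines pasted above this prints `10.2.53.0 4`. A destination that keeps recurring may point to a subnet the local flanneld has no route for, but that is an interpretation of the message, not something confirmed in this issue.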

tomdee commented 7 years ago

@szuecs Are you able to try with the latest flannel release? The vxlan backend has changed quite a bit since 0.7.0.

szuecs commented 7 years ago

@tomdee we also updated production to version 0.9.0, and it fixed our problem.