aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

ecs-bridge configuration leads to occasional credentials endpoint unavailability in awsvpc #1230

Closed nmeyerhans closed 6 years ago

nmeyerhans commented 6 years ago

In awsvpc mode, the network configuration applied to the task's network namespace can, under certain circumstances, leave the credentials endpoint unreachable because the host is unable to resolve the task namespace's IP address to a MAC address.

Task NS has route table:

default via 10.0.254.1 dev eth14 
10.0.254.0/24 dev eth14 proto kernel scope link src 10.0.254.153 
169.254.170.2 via 169.254.172.1 dev ecs-eth0 

and interface ecs-eth0:

3: ecs-eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default  link-netnsid 0
    inet 169.254.172.15/22 scope global ecs-eth0
       valid_lft forever preferred_lft forever

Host NS has route table:

default via 10.0.0.1 dev eth0 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.135 
169.254.169.254 dev eth0 
169.254.172.0/22 dev ecs-bridge  proto kernel  scope link  src 169.254.172.1 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 

and interface ecs-bridge:

[ec2-user@ip-10-0-0-135 src]$ ip -4 addr show dev ecs-bridge
7: ecs-bridge: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 169.254.172.1/22 scope global ecs-bridge
       valid_lft forever preferred_lft forever

In the first case where things work, neither ARP cache has an entry for the other end.

In the second case where things work, the task and host ARP caches each have an entry for the other end.

In the case where things don't work, the caches are asymmetric.

This happens if, for some reason, the ARP cache entry for the task's bridge IP expires from the host's ARP cache while the task's cache still has an entry for the host's IP. The only way the host can learn the MAC address of the task's bridge interface is passively, from ARP queries sent by the task; the host can never successfully query for the task's MAC address itself, because the task's network configuration prevents it from answering the host's ARP requests (which is what options 2 and 3 below address).

With no way to resolve the task's MAC address, the host is unable to send traffic to the task, and the task's request to the credentials endpoint sees increased latency, possibly to the point of timing out. When the task's ARP cache entry times out and it needs to resolve the host's MAC address again, the situation recovers. However, the 60 second ARP cache timeout is more than long enough for a client to consider the connectivity problem fatal.
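
One way to observe this asymmetric state (a sketch only; <task-netns> is a placeholder for however the task's network namespace is exposed on the instance, e.g. via ip-netns or nsenter against the pause container's PID):

# Host side: in the failing case the neighbour entry for the task's
# 169.254.172.x address has expired (missing, or marked FAILED/INCOMPLETE)
$ ip neigh show dev ecs-bridge

# Task side: in the failing case the entry for the host's 169.254.172.1 is
# still present, so the task's requests reach the host, but the host cannot
# address replies back
$ sudo ip netns exec <task-netns> ip neigh show dev ecs-eth0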

This is a bug in our network configuration in awsvpc mode. We can consider a few options for fixing it (rough command sketches follow the list):

  1. Configure a static (never expiring) cache entry for all awsvpc tasks on the host.
  2. Disable rp_filter on ecs-eth0 in the task.
  3. Add a route to 169.254.172.0/22 on ecs-eth0 in the task.
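
Roughly, the three options correspond to the following commands (a sketch only: <task-netns> and <task-mac> are placeholders, the addresses are taken from the dumps above, and in practice the change would be made by the agent/CNI plugin when it sets up the namespace):

# Option 1: permanent neighbour entry on the host for the task's bridge-side address
$ sudo ip neigh replace 169.254.172.15 lladdr <task-mac> nud permanent dev ecs-bridge

# Option 2: turn off reverse-path filtering on ecs-eth0 inside the task namespace
$ sudo ip netns exec <task-netns> sysctl -w net.ipv4.conf.ecs-eth0.rp_filter=0

# Option 3: add the covering link route on ecs-eth0 inside the task namespace
$ sudo ip netns exec <task-netns> ip route add 169.254.172.0/22 dev ecs-eth0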
nmeyerhans commented 6 years ago

The unreachability of the credentials endpoint was previously reported as #1146

adnxn commented 6 years ago

@nmeyerhans do you have a preference for one of the options you've listed? I'm not familiar with the known side effects of each one (if any).

nmeyerhans commented 6 years ago

The 169.254.172.0/22 route should be added to the ecs-eth0 interface in the task. I think the expectation when adding an address with a /22 prefix length to an interface is that the corresponding /22 connected route is added automatically. Indeed, that's what happens when you add such an address to an interface using the ip(8) command:

admin@ip-10-0-0-60:~$ ip addr show dev vtapfoo
9: vtapfoo@eth0: <BROADCAST,MULTICAST> mtu 9001 qdisc noop state DOWN group default qlen 500
    link/ether d6:38:0a:30:71:fc brd ff:ff:ff:ff:ff:ff
admin@ip-10-0-0-60:~$ ip ro
default via 10.0.0.1 dev eth0 
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.60 
169.254.172.0/22 dev ecs-bridge proto kernel scope link src 169.254.172.1 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
admin@ip-10-0-0-60:~$ sudo ip addr add 172.18.0.0/24 dev vtapfoo
admin@ip-10-0-0-60:~$ sudo ip link set vtapfoo up
admin@ip-10-0-0-60:~$ ip ro
default via 10.0.0.1 dev eth0 
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.60 
169.254.172.0/22 dev ecs-bridge proto kernel scope link src 169.254.172.1 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.18.0.0/24 dev vtapfoo proto kernel scope link src 172.18.0.0 

Note the new route table entry in the last line.

aaithal commented 6 years ago

@nmeyerhans thanks for this detailed report! I'm glad that we have a root cause here. Some follow-up questions/comments:

Configure a static (never expiring) cache entry for all awsvpc tasks on the host.

  1. Do you mean an entry in the host's namespace for each awsvpc task that's launched on the instance? Something like this (169.254.172.2 is the task's link local IPv4 address in this example):
$ sudo arp -s -i ecs-bridge 169.254.172.2 0a:58:a9:fe:ac:1a
$ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
169.254.172.2           ether   0a:58:a9:fe:ac:1a   CM                     ecs-bridge

Disable rp_filter on ecs-eth0 in the task

  1. Preventing spoofing will be one reason to not do this, yeah?

The 169.254.172.0/22 route should be added to the ecs-eth0 interface in the task.

  1. We chose not to do that because it would mean that all containers/tasks connected to the ecs-bridge bridge could discover and communicate with each other, and we wanted to avoid that: the intention of the ecs-eth0 interface within a task is to let containers/tasks communicate only with the ECS agent.
nmeyerhans commented 6 years ago

Do you mean an entry in the host's namespace for each awsvpc task that's launched on the instance? Something like this (169.254.172.2 is the task's link local IPv4 address in this example):

yes

Disable rp_filter on ecs-eth0 in the task

Preventing spoofing will be one reason to not do this, yeah?

yes

The 169.254.172.0/22 route should be added to the ecs-eth0 interface in the task.

We chose not to do that because it would mean that all containers/tasks connected to the ecs-bridge bridge could discover and communicate with each other, and we wanted to avoid that: the intention of the ecs-eth0 interface within a task is to let containers/tasks communicate only with the ECS agent.

I suppose it doesn't need to be a whole /22 route; a /32 should work as well. If we don't want that communication to happen, though, we should probably enforce it with ebtables or iptables rules rather than rely on not-entirely-obvious routing behavior.
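
A minimal sketch of that narrower variant, assuming the /32 in question is a host route to the bridge's own address (169.254.172.1 from the host route table above) rather than to the whole link-local block, with <task-netns> again a placeholder:

# Inside the task namespace: make only the ecs-bridge end reachable,
# without adding a route that covers other tasks on the bridge
$ sudo ip netns exec <task-netns> ip route add 169.254.172.1/32 dev ecs-eth0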

richardpen commented 6 years ago

This has been fixed in aws/amazon-ecs-cni-plugins release 2018.02.0, which is included in agent v1.17.2.