NixOS / nixpkgs


Using Docker on AWS EC2 breaks EC2 metadata route because of DHCP #109389

Open · nh2 opened this issue 3 years ago

nh2 commented 3 years ago

For a workaround, see the Workaround section below.

Issue description

On NixOS 20.09 running on an AWS machine, if you configure

virtualisation.docker.enable = true;

and start a container (e.g. docker run redis), then after about 8 seconds any connection to the EC2 metadata service (169.254.169.254) breaks with No route to host.

How the error manifests

You may see it as No route to host failures from anything that talks to 169.254.169.254, such as curl, the AWS CLI, or cloud-init.

Explanation

route -n will show an entry:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
169.254.0.0     0.0.0.0         255.255.0.0     U     205    0        0 veth2c043f2

Here veth2c043f2 (veth followed by an arbitrary suffix) is the Docker container's host-side interface; there is one for each running container.

The metadata service lives at 169.254.169.254, which falls inside 169.254.0.0/16, so the kernel now sends metadata traffic into the container's veth instead of out the primary interface. This route is why we get No route to host.

journalctl -f (while starting a container) shows that dhcpcd is assigning the bad route:

17:37:49 test-i-0e8e4c4de839503ae dhcpcd[960]: veth8402269: soliciting a DHCP lease
17:37:54 test-i-0e8e4c4de839503ae dhcpcd[960]: veth8402269: probing for an IPv4LL address
17:37:58 test-i-0e8e4c4de839503ae dhcpcd[960]: veth8402269: using IPv4LL address 169.254.33.249
17:37:58 test-i-0e8e4c4de839503ae dhcpcd[960]: veth8402269: adding route to 169.254.0.0/16

I believe that dhcpcd should not run on this veth interface.

The dhcpcd module has a denyinterfaces list to exclude virtual adapters:

https://github.com/NixOS/nixpkgs/blob/80badc893dca2fc5196a5664473b187d9e9cfca9/nixos/modules/services/networking/dhcpcd.nix#L54-L57

veth* was added to the deny list in 2012 by @peti: https://github.com/NixOS/nixpkgs/commit/8b841505ff16054f87be6f760d3ce7f1efb9e27b

But it was reverted shortly afterwards by @edolstra: https://github.com/NixOS/nixpkgs/commit/be189991e0fc973cca908fa44955a6374504da84, because stan apparently also created veth interfaces for which DHCP made sense.

Workaround

{
  networking.dhcpcd.denyInterfaces = [ "veth*" ];
}
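
For context, a minimal sketch of an EC2 configuration combining Docker with this workaround (the amazon-image import is illustrative; your image likely pulls it in already):

{ config, pkgs, ... }:
{
  imports = [ <nixpkgs/nixos/modules/virtualisation/amazon-image.nix> ];

  virtualisation.docker.enable = true;

  # Keep dhcpcd off Docker's per-container veth interfaces so it never
  # assigns an IPv4LL address or adds a 169.254.0.0/16 route on them.
  networking.dhcpcd.denyInterfaces = [ "veth*" ];
}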

What to do to make it work out of the box?

Currently, enabling Docker on a NixOS EC2 machine breaks all AWS tooling that uses the EC2 metadata service out of the box.

CCing some people who have modified the respective parts in the past (the dhcpcd module, the Docker module's networking, or ec2-metadata-fetcher.nix) and who might know more:

@edolstra @peti @wkennington @fpletz @flokli @bachp @Mic92 @endgame @grahamc @nlewo

nh2 commented 3 years ago

This may also be relevant for the discussion of whether DHCP should be on by default for all interfaces: #75515

nh2 commented 3 years ago

I expect that it will also break time sync, which goes through a link-local address as well:

https://github.com/NixOS/nixpkgs/blob/e1ac6eba349b3cb9d94566e0be9fca7a41c9c7fc/nixos/modules/virtualisation/amazon-image.nix#L150-L151
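
(Those lines point the instance at the Amazon Time Sync Service; reconstructed roughly here, not quoted:)

  # The Amazon Time Sync Service is also link-local, so the bogus
  # 169.254.0.0/16 veth route captures NTP traffic too.
  networking.timeServers = [ "169.254.169.123" ];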

tomberek commented 3 years ago

I had some luck with this hack to work around the problem:

  # Pin Docker's bridge to a fixed subnet.
  virtualisation.docker.extraOptions = "--bip 192.168.1.1/24";
  # On every dhcpcd event, re-add a host route to the metadata service
  # via whatever interface currently carries the default route.
  networking.dhcpcd.runHook = ''
    iface=$(${pkgs.iproute}/bin/ip route get 8.8.8.8 | ${pkgs.gnused}/bin/sed -n 's/.*dev \([^ ]*\) src.*/\1/p')
    ${pkgs.iproute}/bin/ip route add 169.254.169.254/32 dev "$iface" || true
  '';
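
The crucial part is the /32 host route: it is more specific than the bogus 169.254.0.0/16 route on the veth interface, so the kernel prefers it for the metadata address. The || true keeps the hook from failing when the route already exists.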
stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

takeda commented 3 years ago

I believe this is still a problem.

ius commented 3 years ago

I'd like to +1 this, as I just ran into a similar issue, but much worse, on Oracle Cloud (OCI).

On OCI it's not just the metadata service: their network setup also originates ICMP unreachable messages from 169.254.0.0/16 addresses on the primary interface. Starting a single Docker/Podman container results in those ICMP unreachable packets being dropped because of the newly acquired route (even though rp_filter is set to 2), which is problematic if you rely on PMTU discovery working.

In my case, network connectivity to the machine failed, but only after the ~0-10 minutes it takes for a discovered MTU entry to be evicted from the cache. Lots of fun was had debugging that one.

Assigning an IPv4LL address in the first place seems like a fringe use case to me, even more so for veth* specifically. I'd argue it makes more sense to make DHCP on such interfaces opt-in.
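
(A note for readers: rp_filter here is the kernel's reverse-path filter, and 2 is its "loose" mode. On NixOS it would be set with something like this sketch, shown only to make the setting concrete, not taken from @ius's setup:)

  # Loose reverse-path filtering: accept a packet if its source is
  # reachable via any interface, not only the one it arrived on.
  boot.kernel.sysctl."net.ipv4.conf.all.rp_filter" = 2;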

KarolisL commented 2 years ago

Spent almost a day trying to find the root cause. On GCE, DNS would become unreachable around 10 seconds after container startup.

edit: I support @ius's suggestion, at least on cloud images.

ketzacoatl commented 1 year ago

I'm not smart enough (or familiar enough with the details) to comment on the right solution, but it would be great if Docker/NixOS/network experts could decide on the right thing to do, and the NixOS team could see that change through.

douglaz commented 1 year ago

This is still a problem :disappointed:

fpletz commented 1 year ago

Using systemd-networkd by setting networking.useNetworkd solves this issue, because the networkd implementation of networking.useDHCP from #167327 uses an allow list instead of a block list.
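
(For reference, a minimal sketch of that setting:)

{
  # Switch from scripted (dhcpcd-based) networking to systemd-networkd,
  # which runs DHCP only on explicitly allowed interfaces.
  networking.useNetworkd = true;
}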

lostbean commented 6 months ago

Confirming this is still an issue! But also confirming that networking.useNetworkd solves it (thanks @fpletz).

belevy commented 4 months ago

Just ran into this issue and wasted a fair bit of time. While networking.useNetworkd does work as suggested, this is a really confusing problem, and it is far more wide-reaching than just EC2 metadata. I believe GCE also uses the same metadata endpoint, and a whole host of other services expose 169.254.x.x addresses.

vst commented 4 months ago

"is far more wide-reaching than just EC2 metadata"

Yep, same for DigitalOcean, too.

arianvp commented 3 months ago

Ran into this today at work, and I'm the maintainer of the AWS image. I suggest we switch the image over to networkd-based networking. In general, I want to push for networking.useNetworkd to be the default for the next release.

arianvp commented 3 months ago

In case people were wondering where the link-local address comes from:

If dhcpcd fails to obtain a lease, it probes for a valid IPv4LL address (aka ZeroConf, aka APIPA). Once one is obtained, it restarts the process of looking for a DHCP server to get a proper address.

This is dhcpcd's fallback behaviour; it matches the soliciting → probing → using-IPv4LL sequence in the journal output above.

bbigras commented 3 months ago

With DigitalOcean we shouldn't use DHCP at all: https://github.com/nix-community/srvos/blob/4098b95dde07ec1ef75cd2cba1ebdde0576b59f1/nixos/hardware/digitalocean/droplet.nix

arianvp commented 3 months ago

That is not the NixOS source code but the SrvOS source code. They piggyback on cloud-init's DHCP client, which we don't ship in NixOS by default. DHCP is definitely still at play.

freelock commented 1 week ago

I just found this issue after arriving at the same diagnosis, posted to Discourse: https://discourse.nixos.org/t/ec2-metadata-not-available-in-runcommand/14597/7

Is there any downside to the workaround above?

{
  networking.dhcpcd.denyInterfaces = [ "veth*" ];
}

... or is networking.useNetworkd the better way to go?

arianvp commented 1 week ago

There is no downside. The default behaviour of scripted networking is just bad.