This may also be relevant for the discussion of whether DHCP should be on by default for all interfaces: #75515
I expect that it will also break time sync with the host machine.
I had some luck with this hack around the problem:
```nix
# --bip sets docker0's bridge address/subnet explicitly.
virtualisation.docker.extraOptions = "--bip 192.168.1.1/24";
# After each dhcpcd event, re-add a /32 route to the metadata IP via the
# default-route interface; the /32 beats the bogus /16 on the veth.
networking.dhcpcd.runHook = ''
  iface=$(${pkgs.iproute}/bin/ip route get 8.8.8.8 | ${pkgs.gnused}/bin/sed -n 's/.*dev \([^\ ]*\) src.*/\1/p')
  ${pkgs.iproute}/bin/ip r a 169.254.169.254/32 dev "$iface" || true  # r a = route add
'';
```
I marked this as stale due to inactivity.
I believe this is still a problem.
I'd like to +1 this as I just ran into a similar issue, but way worse on Oracle's Cloud (OCI).
For OCI it's not just the metadata service: their network setup also originates ICMP unreachable messages from `169.254.0.0/16` on the primary interface. Starting a single docker/podman container will result in those ICMP unreachables being dropped due to the newly acquired route (even though `rp_filter` is set to 2), which is problematic if you rely on PMTU discovery to work.
In my case, network connectivity to the machine failed, but only after the ~0-10 mins required for a discovered MTU entry to be evicted from the cache. Lots of fun was had debugging that one.
Assigning IPv4LL in the first place seems like a fringe use case to me, even more so when considering `veth*` specifically. I'd argue it makes more sense to make DHCP on such interfaces opt-in.
Spent almost a day trying to find the root cause. DNS would become unreachable around 10 seconds after container startup on GCE.
edit: I support @ius's suggestion, at least on cloud images.
I'm not smart enough (or familiar enough with the details) to comment on the right solution, but it would be super if Docker/NixOS/network experts were able to decide on the right thing to do, and the NixOS team could see that update through.
This is still a problem :disappointed:
Using systemd-networkd by setting `networking.useNetworkd` solves this issue due to the different implementation of `networking.useDHCP` from #167327, which uses an allow list instead of a block list.
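For reference, a minimal sketch of that switch (only the flag itself; the rest of your networking config stays as-is):

```nix
{
  # Replace scripted (dhcpcd-based) networking with systemd-networkd.
  # Its useDHCP implementation allow-lists interfaces, so per-container
  # veth* devices never get a DHCP/IPv4LL client attached.
  networking.useNetworkd = true;
}
```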
Confirming this is still an issue! But also confirming that `networking.useNetworkd` solves it (thanks @fpletz).
Just ran into this issue and wasted a fair bit of time. While `networking.useNetworkd` does work as suggested, this seems like a really confusing problem and is far more wide reaching than just EC2 metadata. I believe GCE also uses the same metadata endpoint, and there are a whole host of other services that expose 169.254.x.x IPs.
> is far more wide reaching than just EC2 metadata
Yep, same for DigitalOcean, too.
Ran into this today at work, and I'm the maintainer of the AWS image. I suggest we switch the image over to using networkd networking. I want to push in general for `networking.useNetworkd` to be the default for the next release.
In case people were wondering where the link-local address comes from, this is fallback behaviour of `dhcpcd`:

> If dhcpcd failed to obtain a lease, it probes for a valid IPv4LL address (aka ZeroConf, aka APIPA). Once obtained it restarts the process of looking for a DHCP server to get a proper address.
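If you need to stay on scripted networking, that fallback can also be turned off directly; a sketch using the `noipv4ll` directive from dhcpcd.conf(5) (not from this thread, treat it as an untested suggestion):

```nix
{
  # Tell dhcpcd never to fall back to a 169.254.0.0/16 (IPv4LL/ZeroConf)
  # address when it cannot obtain a DHCP lease.
  networking.dhcpcd.extraConfig = ''
    noipv4ll
  '';
}
```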
With DigitalOcean we shouldn't use DHCP at all: https://github.com/nix-community/srvos/blob/4098b95dde07ec1ef75cd2cba1ebdde0576b59f1/nixos/hardware/digitalocean/droplet.nix
That is not the NixOS source code but the SrvOS source code. They piggyback on the DHCP client of cloud-init, which we don't ship in NixOS by default. DHCP is definitely still at play.
I just found this issue after coming to this diagnosis, posted to Discourse: https://discourse.nixos.org/t/ec2-metadata-not-available-in-runcommand/14597/7
Is there any downside to the workaround above?
```nix
{
  networking.dhcpcd.denyInterfaces = [ "veth*" ];
}
```
... or is `networking.useNetworkd` the better way to go?
There is no downside. The default in scripted networking is just bad.
For a workaround, scroll down a bit.
Issue description
On NixOS 20.09 running on an AWS machine, if you configure Docker and start a container (e.g. `docker run redis`), then after 8 seconds any connection to the EC2 metadata service will break with `No route to host`.
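A minimal sketch of such a configuration (assuming that just enabling the Docker daemon is enough to trigger it):

```nix
{
  # Assumption: enabling the Docker daemon is all that is needed to reproduce.
  virtualisation.docker.enable = true;
}
```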
How the error manifests
You may see the error as:

- The `ec2-metadata` tool from PR #108804 will fail with `No route to host`.
- `curl 'http://169.254.169.254'` will fail with `No route to host`.

Explanation
`route -n` will show an extra entry routing the link-local range `169.254.0.0/16` out over a `veth` interface. This route is why we get `No route to host`.
Where `veth8402269` (`veth` plus some number) is the docker container's interface; there's one for each running container.

`journalctl -f` (while starting a container) shows that `dhcpcd` is assigning the bad route. I believe that `dhcpcd` should not run on this `veth` interface.

The `dhcpcd` module has a `denyinterfaces` list to exclude virtual adapters: https://github.com/NixOS/nixpkgs/blob/80badc893dca2fc5196a5664473b187d9e9cfca9/nixos/modules/services/networking/dhcpcd.nix#L54-L57
`veth*` was added to the deny list in 2012 by @peti: https://github.com/NixOS/nixpkgs/commit/8b841505ff16054f87be6f760d3ce7f1efb9e27b

But shortly afterwards it was reverted by @edolstra: https://github.com/NixOS/nixpkgs/commit/be189991e0fc973cca908fa44955a6374504da84 because `stan` apparently also created `veth` interfaces for which DHCP made sense.

Workaround
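The workaround is to deny `veth*` in dhcpcd's configuration; this is the same snippet quoted in the comments above:

```nix
{
  # Keep dhcpcd away from Docker's per-container veth interfaces.
  networking.dhcpcd.denyInterfaces = [ "veth*" ];
}
```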
What to do to make it work out of the box?
Currently installing Docker on an EC2 NixOS machine out-of-the-box breaks all AWS tooling that uses the EC2 metadata service.
Should `veth*` be added to the `denyinterfaces` by default?

CCing some people that have modified those respective parts in the past, or the Docker module's networking stuff, or the `ec2-metadata-fetcher.nix`, who might know more:

@edolstra @peti @wkennington @fpletz @flokli @bachp @Mic92 @endgame @grahamc @nlewo