Closed killua-eu closed 6 years ago
Not sure what's going on with resolved interacting with dnsmasq but that probably ought to be reported against resolved in Ubuntu.
The other problem is LXD failing to start when the IP address it's supposed to bind isn't available. That's obviously a deal breaker for a LXD daemon that's configured with clustering enabled as without being able to bind and connect to that address, it can't access its database and so would just crash.
I'm going to close this issue as there's nothing actionable for LXD itself, but you could take some of the following actions:
Hi Stéphane, thanks for assessing this. For further reference:
@stgraber, for now there seems to be zero activity on the systemd-networkd and snapd front. Could you kindly suggest any workarounds that you would consider good practice in production? I'm still trying to wrap my head around what works and what doesn't with the whole new networking stack; there are too many degrees of freedom.
@killua-eu you should be able to add a systemd unit of your own which blocks until your network is in a suitable state, then add an override to the lxd unit adding an After=your-unit.service so that lxd startup blocks on your unit completing.
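A minimal sketch of the blocking step suggested above, assuming that "suitable state" means a specific address (for example the one LXD is configured to bind) is present on the host. The function names, the ADDR_LIST_CMD hook and the 60-second default are illustrative assumptions, not part of LXD or systemd:

```shell
#!/bin/sh
# Hypothetical helper: block until a given address is assigned locally.

# has_addr ADDR: succeeds if ADDR appears in the address listing.
# ADDR_LIST_CMD is an assumption added so the listing can be swapped out;
# it defaults to iproute2's one-line-per-address output.
has_addr() {
    ${ADDR_LIST_CMD:-ip -o addr show} | grep -qF " $1/"
}

# wait_for_addr ADDR [TIMEOUT_SECONDS]: poll once per second until ADDR
# shows up; return 1 if it has not appeared when the timeout runs out.
wait_for_addr() {
    addr=$1
    remaining=${2:-60}
    while ! has_addr "$addr"; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        remaining=$((remaining - 1))
        sleep 1
    done
    return 0
}
```

A oneshot unit of your own could call wait_for_addr with the relevant address as its start command, and systemctl edit snap.lxd.daemon.service could then add After=your-unit.service under [Unit]. Keep in mind that After= only orders the two units; your unit still has to be enabled (or pulled in with Wants=) for the ordering to have any effect.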
@stgraber, thanks for the tip! After poking around I'm not sure I really understand the cause of the problem correctly and I wouldn't want to confuse upstream. I'd like to rephrase what I believe is the cause - I suspect a race condition caused by the following:
Per the netplan's
network:
    ethernets:
        enp0s31f6:
            addresses: []
            dhcp4: true
            optional: true
            nameservers:
                addresses: [240.102.0.1,1.1.1.1,8.8.8.8]
    version: 2
where I use the fan gw in order to get container-name resolution, systemd-networkd expects upon start to receive a response from the 240.102.0.1 nameserver, but this one isn't up yet, because there's a delay between systemd-resolved and snap.lxd.daemon starts. After snap.lxd.daemon starts, the network doesn't work properly, because systemd-resolved thinks the 240.102.0.1 is dead. The deadlock gets resolved by restarting snap.lxd.daemon (which will get enough network to start) and after that restarting systemd-resolved too (which will make systemd-resolved forget, and upon restart find 240.102.0.1 in a working state).
With respect to this, I believe there's more 'blame' on systemd-resolved. Would you concur?
ah yeah, a bit of a chicken and egg problem you have here with sending DNS requests to a server that's not up yet because LXD itself is the one bringing it up... If the goal is to have .lxd resolution working, then @simos recently posted what should be a better solution which effectively ties the DNS configuration to the particular bridge and should avoid that delay.
https://blog.simos.info/how-to-use-lxd-container-hostnames-on-the-host-in-ubuntu-18-04/
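As a rough sketch of what tying the DNS configuration to the bridge can look like with the systemd-resolve tool shipped in Ubuntu 18.04: the wrapper function and its DRY_RUN switch are illustrative assumptions, and the bridge name lxdfan0 and the address 240.102.0.1 are simply the values mentioned in this thread. See the linked post for the author's actual setup.

```shell
#!/bin/sh
# set_bridge_dns BRIDGE DNS_IP DOMAIN
# Tell systemd-resolved to use DNS_IP for DOMAIN on BRIDGE only, instead of
# listing the LXD DNS server as a global nameserver in netplan.
# With DRY_RUN=1 the command is printed instead of executed (assumption:
# this switch exists only in this sketch).
set_bridge_dns() {
    set -- systemd-resolve --interface "$1" --set-dns "$2" --set-domain "$3"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (needs root and an existing bridge):
#   set_bridge_dns lxdfan0 240.102.0.1 lxd
```

Settings applied this way are runtime state held by systemd-resolved, so they would likely need to be reapplied after the bridge is recreated or the service is restarted.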
@stgraber thanks for the tip! Using the approach @simos takes makes things neater. Nonetheless, the race condition still applies. I tried to turn off ipv6 on veth and the lxdfan0, but the fun part is that networkd insists on initializing the network interfaces with ipv6, only to disable ipv6 immediately after they get picked up. So I still have the 10 minutes delay after reboot before I can do the lxc list or before I can ping c1.lxd. The veth interfaces just don't show up before that.
Here is the promised report for @freeekanayaka as discussed in https://github.com/lxc/lxd/issues/4548.
Required information
Issue description
LXD snap fails to start due to networkd (?) problems. It restarts itself into a working state after a 10-minute timeout.
Steps to reproduce
lxc launch ubuntu:18.04 c1
etc.
Information to attach
here i waited a bit for snap/lxd to catch up. the networkctl status stopped at this:
after waiting 10 minutes (getting the timeout), we continue with
now i get
yaay. this is on a fresh machine (18.04 server install) with just lxd installed as snap. Now if I dare to want to ping c1.lxd (as I asked in https://github.com/lxc/lxd/issues/4625), I go for
looks fine, but the ping actually waits about 3 seconds and then starts to ping. so there seems to be a lag in hostname resolution. this lag also shows up when pinging google. you don't see it in the ping time. here some more logs:
and here from the reboot (including the reboot) for lxd snap
here also the full dmesg
example container info
no other logs seem to have anything interesting, but attaching anyway
[edit] Additional fun fact
ability to ping containers disappears after reboot:
looking at the netplan
gives no reason why this fails, but it still does
now the fun part, regenerating and applying netplan does the trick
again with the 3 seconds lag before
PING c2.lxd (240.102.0.119) 56(84) bytes of data.
appears. Can't find anything relevant in any log.
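Since the lag shows up before the first PING line and not in the reported ping times, one way to pin it down is to time the name lookup by itself. This is a small sketch, assuming GNU date (for nanosecond timestamps) and getent, both present on Ubuntu; time_ms is a hypothetical helper name:

```shell
#!/bin/sh
# time_ms CMD...: run CMD, discard its output, and print the elapsed
# wall-clock time in milliseconds. Relies on GNU date's %N format.
time_ms() {
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1 || true   # a failed lookup is still worth timing
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example: compare a container name with an external name.
#   time_ms getent hosts c1.lxd
#   time_ms getent hosts google.com
```

getent hosts goes through the same NSS configuration as ping, so a result of around 3000 ms here would point at the resolver path rather than at the network path to the container.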