canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

dnsmasq process exited prematurely if raw.dnsmasq auth-zone set when using core20 snap #8905

Closed geodb27 closed 3 years ago

geodb27 commented 3 years ago

Hi,

There might have been an update that broke the dnsmasq binary provided by the latest lxd snap. Running 'snap start lxd' emits the following error: lvl=eror msg="The dnsmasq process exited prematurely" driver=bridge err="Process exited with non-zero value 1" network=lxdbr0 project=default

I've dug further and here is what I did :

To my knowledge (and I admit that I don't know that much about snap), the problem is that I have an Ubuntu 18.04 host that uses snap core18, while lxd-4.15 has been built against core20...

I don't know how to get LXD's dnsmasq to run anymore. Any help would be appreciated. Thanks a lot!

tomponline commented 3 years ago

Hi, sorry to hear this.

Can you show me the output of lxc network show lxdbr0 please?

geodb27 commented 3 years ago

Sure, here it is:

config:
  dns.mode: managed
  dns.search: localdomain
  ipv4.address: 192.168.220.250/24
  ipv4.nat: "true"
  ipv6.address: none
  raw.dnsmasq: |
    auth-zone=lxd
    dns-loop-detect
description: ""
name: lxdbr0
type: bridge
used_by: (snip, container list)
managed: true
status: Created
locations:

tomponline commented 3 years ago

Can you try this:

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- bash
LD_LIBRARY_PATH=/snap/lxd/current/lib/:/snap/lxd/current/lib/x86_64-linux-gnu/ /snap/lxd/current/bin/dnsmasq --help

esosan commented 3 years ago

Usage: dnsmasq [options]

Valid options are:
-a, --listen-address= Specify local address(es) to listen on.
-A, --address=// Return ipaddr for all hosts in specified domains.
[...]

geodb27 commented 3 years ago

This seems to work. At least, it doesn't raise any error. I get the help output from the dnsmasq process.

tomponline commented 3 years ago

@geodb27 does using the same approach allow you to start the dnsmasq process using the full command you originally posted?

geodb27 commented 3 years ago

This seems to work, provided I stay inside the nsenter session. I'm giving it a try...

geodb27 commented 3 years ago

I've dropped the --keep-in-foreground option and was able to launch dnsmasq, leave the namespace and launch all my containers. So this works around the problem, but it is not reliable at the moment... Anyway, thanks for your help. I hope this bug will be fixed soon.

tomponline commented 3 years ago

@geodb27 @esosan if you remove "auth-zone" from raw.dnsmasq using lxc network set lxdbr0 raw.dnsmasq, does dnsmasq then start for you?

Of course, kill the existing manually started process first if you have started one.
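
One way to build the new value without retyping it is to filter the current one. The sketch below is only an illustration: strip_auth_zone is a hypothetical helper name, and it assumes the raw.dnsmasq value has one option per line, as in the network config shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: read a raw.dnsmasq value on stdin and print it
# without any auth-zone lines. Pure text filtering, no LXD required.
strip_auth_zone() {
    # grep -v exits non-zero when nothing is left; that is not an error here.
    grep -v -E '^[[:space:]]*auth-zone=' || true
}

# Example using the value posted in this thread:
printf 'auth-zone=lxd\ndns-loop-detect\n' | strip_auth_zone
```

On a real host the filtered text could then be fed back with something like lxc network set lxdbr0 raw.dnsmasq "$(lxc network get lxdbr0 raw.dnsmasq | strip_auth_zone)", assuming the other options should be kept as they are.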

esosan commented 3 years ago

now the instances have an IP!!

tomponline commented 3 years ago

OK, so we're narrowing it down. It's something to do with raw.dnsmasq: either a specific setting is preventing dnsmasq from starting, or the problem comes from the AppArmor profile, which we have to disable when that option is used (as the user may reference resources outside of the allowed profile).

esosan commented 3 years ago

btw, not all instances acquired an IPv4 address (only IPv6), but after restarting the instance the right IP is assigned

tomponline commented 3 years ago

@esosan yes, the instances probably needed to be restarted to make a new DHCP request and get IPv4, whereas IPv6 addresses are advertised using RA.

tomponline commented 3 years ago

Found the problem. It's the auth-zone=lxd setting:

dnsmasq: --auth-server required when an auth zone is defined.

I'll see if we can find a way to surface that error better.
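
Until that error is surfaced, a rough pre-flight check on the raw.dnsmasq text can catch this particular combination. This is a minimal sketch, not something LXD does: check_auth_zone is a hypothetical name, it only looks at the text line by line, and it does not validate the options the way dnsmasq itself would.

```shell
#!/bin/sh
# Hypothetical pre-flight check: read a raw.dnsmasq value on stdin and
# return non-zero when auth-zone is present without auth-server.
check_auth_zone() {
    conf=$(cat)
    if printf '%s\n' "$conf" | grep -q -E '^[[:space:]]*auth-zone=' &&
       ! printf '%s\n' "$conf" | grep -q -E '^[[:space:]]*auth-server='; then
        echo "auth-zone is set but auth-server is missing" >&2
        return 1
    fi
    return 0
}

# Example using the value posted in this thread:
if printf 'auth-zone=lxd\ndns-loop-detect\n' | check_auth_zone; then
    echo "no auth-zone problem found"
else
    echo "dnsmasq would likely refuse to start"
fi
```

The check is deliberately narrow: it only mirrors the one rule quoted in the dnsmasq error message above.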

jadjay commented 3 years ago

Hello, I can confirm this.

Try unsetting the auth-zone line; it is what blocks dnsmasq.

loxK commented 3 years ago

Hello, I can confirm too. But after unsetting raw.dnsmasq, I only get IPv6. I rebooted the host, but still, no IPv4.

It was working properly before I rebooted the host earlier today.

tomponline commented 3 years ago

> Hello, I can confirm too. But after unsetting raw.dnsmasq, I only get IPv6. I rebooted the host, but still, no IPv4.
>
> It was working properly before I rebooted the host earlier today.

Make sure it's not a Docker-related firewall issue like https://discuss.linuxcontainers.org/t/containers-suddenly-stopped-working-no-more-ips-assigned/11360/19

geodb27 commented 3 years ago

Shouldn't the point be to find the link between raw.dnsmasq, AppArmor and the requirement for libnettle.so.7? As for me, my LXD clusters are set up on "pure" Ubuntu 18.04 virtual machines. Docker is not involved at any point in the process. My setup was made so that dnsmasq responds to anything like instance.lxd and forwards the rest to our DNS servers. So, to have dnsmasq run after the nsenter, I had to add a server to the --auth-server parameter. Yet this still doesn't explain the libnettle.so.7 thing.

loxK commented 3 years ago

> Hello, I can confirm too. But after unsetting raw.dnsmasq, I only get IPv6. I rebooted the host, but still, no IPv4. It was working properly before I rebooted the host earlier today.

> Make sure it's not a Docker-related firewall issue like https://discuss.linuxcontainers.org/t/containers-suddenly-stopped-working-no-more-ips-assigned/11360/19

Thanks, I use ufw and I had to run ufw allow in on lxdbr0. So all good now, but I don't get why it was working before I rebooted that box earlier today.

tomponline commented 3 years ago

@loxK most likely because modifying the lxdbr0 network's raw.dnsmasq setting caused LXD to remove and re-add its firewall rules, potentially changing the order of its rules relative to another ruleset that is normally added after LXD has started.

See https://discuss.linuxcontainers.org/t/lxd-bridge-doesnt-work-with-ipv4-and-ufw-with-nftables/10034/17?u=tomp for a more thorough example.

tomponline commented 3 years ago

> Shouldn't the point be to find the link between raw.dnsmasq, AppArmor and the requirement for libnettle.so.7? As for me, my LXD clusters are set up on "pure" Ubuntu 18.04 virtual machines. Docker is not involved at any point in the process. My setup was made so that dnsmasq responds to anything like instance.lxd and forwards the rest to our DNS servers. So, to have dnsmasq run after the nsenter, I had to add a server to the --auth-server parameter. Yet this still doesn't explain the libnettle.so.7 thing.

I think the libnettle thing is not the issue here. The command originally run to get that error would have always failed because LD_LIBRARY_PATH=/snap/lxd/current/lib/:/snap/lxd/current/lib/x86_64-linux-gnu/ environment var was not set.

This should work:

LD_LIBRARY_PATH=/snap/lxd/current/lib/:/snap/lxd/current/lib/x86_64-linux-gnu/ sudo --preserve-env=LD_LIBRARY_PATH nsenter --mount=/run/snapd/ns/lxd.mnt -- <command>
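
If the variable does not survive the trip through sudo on a given system, a variant is to set it with env on the far side of nsenter. The sketch below is only an illustration under that assumption: snap_lib_path and run_in_lxd_ns are hypothetical helper names, and the paths are the ones quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; the paths are the ones quoted in this thread.
SNAP_ROOT=/snap/lxd/current

# Print the library search path for binaries shipped in the LXD snap.
snap_lib_path() {
    printf '%s/lib/:%s/lib/x86_64-linux-gnu/\n' "$1" "$1"
}

# Run a command inside the snap's mount namespace with that path set.
# env(1) sets the variable after sudo and nsenter have already run.
run_in_lxd_ns() {
    sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- \
        env LD_LIBRARY_PATH="$(snap_lib_path "$SNAP_ROOT")" "$@"
}

# Example (needs root and the LXD snap, so it is left commented out):
#   run_in_lxd_ns /snap/lxd/current/bin/dnsmasq --help
```

Only snap_lib_path can be tried without the snap installed; run_in_lxd_ns needs root and the /run/snapd/ns/lxd.mnt namespace file to exist.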

The AppArmor thing doesn't appear to be related either; there is no evidence of that now that we have seen that it is possible to start dnsmasq with raw.dnsmasq set as long as an auth-server is specified alongside auth-zone=lxd.

The issue appears to be that the LXD snap package's switch to a core20 base introduced a newer version of dnsmasq that has additional rules around when the auth-zone=lxd setting can be used.

As the dnsmasq error states:

dnsmasq: --auth-server required when an auth zone is defined.

This is one of the downsides of using the raw.dnsmasq setting: it doesn't get tested, because the settings that are passed are user defined and unknown.

From an LXD perspective, we just need to better surface these dnsmasq startup errors to aid in debugging future issues like this.

Some users are experiencing DHCPv4 issues after unsetting raw.dnsmasq, but as they are getting IPv6 addresses, dnsmasq is running and the original problem has been resolved. The DHCPv4 problem is likely a side effect of removing or modifying raw.dnsmasq: this causes LXD to clear its firewall rules and re-add them, so external rules that would normally be added after LXD's rules now come before them and could block LXD's DHCP traffic.

tomponline commented 3 years ago

If anyone is still experiencing firewall issues after fixing dnsmasq, it may be due to the snap core20 change subtly affecting the cases where nftables would be used: https://discuss.linuxcontainers.org/t/lxd-stopped-generating-firewall-rules-after-switch-to-core20/11367/9?u=tomp