canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD 4.0.4 IP address not assigned to container (previously issue #7935) #8246

Closed: xvd7 closed this issue 3 years ago

xvd7 commented 3 years ago

After upgrading to LXD 4.0.4 the issue was resolved on roughly 1/3 of my servers. Unfortunately, the other 2/3 still suffer from this problem: the IP address is not assigned to the container unless I apply the bridge.mtu trick. systemctl restart snap.lxd.daemon does not fix it.
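
For reference, the bridge.mtu trick amounts to roughly the following (a sketch; it assumes the bridge is named lxdbr0, and the exact MTU value is arbitrary since the point is just to apply and then remove a network config change, which restarts dnsmasq):

lxc network set lxdbr0 bridge.mtu 1400      # any value; applying the change restarts dnsmasq
lxc network unset lxdbr0 bridge.mtu         # remove the override again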

These are logs from a freshly re-installed server:

root@server /var/snap/lxd/common/lxd/logs # dmesg | grep DENIED
[   22.811524] audit: type=1400 audit(1607716199.515:56): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-test_</var/snap/lxd/common/lxd>" name="/run/systemd/unit-root/snap/" pid=3692 comm="(networkd)" flags="ro, remount, noatime, bind"
[   23.015912] audit: type=1400 audit(1607716199.719:57): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-test_</var/snap/lxd/common/lxd>" name="/run/systemd/unit-root/snap/" pid=3694 comm="(resolved)" flags="ro, remount, noatime, bind"
[  144.388010] audit: type=1400 audit(1607716321.108:58): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-test_</var/snap/lxd/common/lxd>" name="/run/systemd/unit-root/boot/" pid=4559 comm="(crub_all)" flags="ro, remount, noatime, bind"
[  144.473233] audit: type=1400 audit(1607716321.192:59): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-test_</var/snap/lxd/common/lxd>" name="/run/systemd/unit-root/snap/" pid=4566 comm="(d-logind)" flags="ro, remount, noatime, bind"
root@server /var/snap/lxd/common/lxd/logs # lxc network show lxdbr0
config:
  ipv4.address: 10.2.35.1/24
  ipv4.nat: "false"
  ipv6.address: none
description: ""
name: lxdbr0
type: bridge
used_by:
- /1.0/instances/test
- /1.0/profiles/default
managed: true
status: Created
locations:
- none
root@server /var/snap/lxd/common/lxd/logs # lxc list
+------+---------+------+------+-----------+-----------+
| NAME |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+------+---------+------+------+-----------+-----------+
| test | RUNNING |      |      | CONTAINER | 0         |
+------+---------+------+------+-----------+-----------+
root@server /var/snap/lxd/common/lxd/logs # cat ./lxd.log
t=2020-12-11T19:49:52+0000 lvl=info msg="LXD 4.0.4 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-12-11T19:49:52+0000 lvl=info msg="Kernel uid/gid map:" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - u 0 0 4294967295" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - g 0 0 4294967295" 
t=2020-12-11T19:49:52+0000 lvl=info msg="Configured LXD uid/gid map:" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - u 0 1000000 1000000000" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - g 0 1000000 1000000000" 
t=2020-12-11T19:49:52+0000 lvl=info msg="Kernel features:" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - closing multiple file descriptors efficiently: no" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - pidfds: yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - uevent injection: yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - seccomp listener: yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - seccomp listener continue syscalls: yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - seccomp listener add file descriptors: no" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - attach to namespaces via pidfds: no" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - safe native terminal allocation : yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - cgroup layout: hybrid" 
t=2020-12-11T19:49:52+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored" 
t=2020-12-11T19:49:52+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - shiftfs support: disabled" 
t=2020-12-11T19:49:52+0000 lvl=info msg="Initializing local database" 
t=2020-12-11T19:49:52+0000 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-12-11T19:49:52+0000 lvl=info msg="REST API daemon:" 
t=2020-12-11T19:49:52+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-12-11T19:49:52+0000 lvl=info msg=" - binding TCP socket" socket=10.2.35.1:8443
t=2020-12-11T19:49:52+0000 lvl=info msg="Initializing global database" 
t=2020-12-11T19:49:53+0000 lvl=info msg="Firewall loaded driver \"xtables\"" 
t=2020-12-11T19:49:53+0000 lvl=info msg="Initializing storage pools" 
t=2020-12-11T19:49:53+0000 lvl=info msg="Initializing daemon storage mounts" 
t=2020-12-11T19:49:53+0000 lvl=info msg="Initializing networks" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Pruning leftover image files" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Done pruning leftover image files" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Loading daemon configuration" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2020-12-11T19:49:54+0000 lvl=info msg="Pruning expired images" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Done pruning expired images" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Pruning expired instance backups" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Done pruning expired instance backups" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Expiring log files" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Done expiring log files" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Updating instance types" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Done updating instance types" 
t=2020-12-11T19:49:54+0000 lvl=info msg="Starting container" action=start created=2020-12-10T07:52:29+0000 ephemeral=false name=test project=default stateful=false used=2020-12-11T19:45:50+0000
t=2020-12-11T19:49:55+0000 lvl=info msg="Started container" action=start created=2020-12-10T07:52:29+0000 ephemeral=false name=test project=default stateful=false used=2020-12-11T19:45:50+0000
root@server /var/snap/lxd/common/lxd/logs # lxc config show test --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20201201)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20201201"
  image.type: squashfs
  image.version: "20.04"
  volatile.base_image: 3e9403fe7645000fc49ec89bca056c7fd53e9a142a3a9054ee02c13a2f14b6d0
  volatile.eth0.host_name: veth1068a460
  volatile.eth0.hwaddr: 00:16:3e:0c:d5:9d
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
root@server ~ # snap list
Name    Version   Rev    Tracking       Publisher   Notes
core18  20200929  1932   latest/stable  canonical  base
lxd     4.0.4     18150  4.0/stable     canonical  -
snapd   2.48.1    10492  latest/stable  canonical  snapd
root@server /var/log # cat ./syslog|grep audit
Dec 11 19:49:48 server kernel: [    1.285921] audit: initializing netlink subsys (disabled)
Dec 11 19:49:48 server kernel: [    1.285926] audit: type=2000 audit(1607716177.052:1): state=initialized audit_enabled=0 res=1
Dec 11 19:49:48 server kernel: [   10.895969] audit: type=1400 audit(1607716187.599:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1199 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.895972] audit: type=1400 audit(1607716187.599:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1199 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.900539] audit: type=1400 audit(1607716187.603:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1200 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.905644] audit: type=1400 audit(1607716187.607:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1201 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.905646] audit: type=1400 audit(1607716187.607:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1201 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.918467] audit: type=1400 audit(1607716187.623:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=1202 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.918469] audit: type=1400 audit(1607716187.623:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=1202 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.918470] audit: type=1400 audit(1607716187.623:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=1202 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.918472] audit: type=1400 audit(1607716187.623:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/{,usr/}sbin/dhclient" pid=1202 comm="apparmor_parser"
Dec 11 19:49:48 server kernel: [   10.996273] audit: type=1400 audit(1607716187.699:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/10492/usr/lib/snapd/snap-confine" pid=1217 comm="apparmor_parser"
Dec 11 19:49:54 server kernel: [   17.516443] kauditd_printk_skb: 14 callbacks suppressed
Dec 11 19:49:54 server kernel: [   17.516446] audit: type=1400 audit(1607716194.219:26): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=2796 comm="apparmor_parser"
Dec 11 19:49:54 server kernel: [   18.095057] audit: type=1400 audit(1607716194.799:27): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-test_</var/snap/lxd/common/lxd>" pid=3169 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.619848] audit: type=1400 audit(1607716197.323:28): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="/usr/lib/snapd/snap-confine" pid=3629 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.619862] audit: type=1400 audit(1607716197.323:29): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3629 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.636779] audit: type=1400 audit(1607716197.339:30): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="nvidia_modprobe" pid=3630 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.636787] audit: type=1400 audit(1607716197.339:31): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="nvidia_modprobe//kmod" pid=3630 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.641083] audit: type=1400 audit(1607716197.343:32): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="lsb_release" pid=3628 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.642462] audit: type=1400 audit(1607716197.347:33): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="/usr/bin/man" pid=3631 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.642468] audit: type=1400 audit(1607716197.347:34): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="man_filter" pid=3631 comm="apparmor_parser"
Dec 11 19:49:57 server kernel: [   20.642473] audit: type=1400 audit(1607716197.347:35): apparmor="STATUS" operation="profile_load" label="lxd-test_</var/snap/lxd/common/lxd>//&:lxd-test_<var-snap-lxd-common-lxd>:unconfined" name="man_groff" pid=3631 comm="apparmor_parser"
Dec 11 19:49:59 server kernel: [   22.811521] kauditd_printk_skb: 20 callbacks suppressed
Dec 11 19:49:59 server kernel: [   22.811524] audit: type=1400 audit(1607716199.515:56): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-test_</var/snap/lxd/common/lxd>" name="/run/systemd/unit-root/snap/" pid=3692 comm="(networkd)" flags="ro, remount, noatime, bind"
Dec 11 19:49:59 server kernel: [   23.015912] audit: type=1400 audit(1607716199.719:57): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-test_</var/snap/lxd/common/lxd>" name="/run/systemd/unit-root/snap/" pid=3694 comm="(resolved)" flags="ro, remount, noatime, bind"

Is there any other information that would be helpful?

Originally posted by @xvd7 in https://github.com/lxc/lxd/issues/7935#issuecomment-743397331

stgraber commented 3 years ago

@tomponline any suggestions?

I suspect you'd want to check what's on disk in /var/snap/lxd/common/lxd/networks as that's what dnsmasq consumes.

tomponline commented 3 years ago

Is dnsmasq running?

xvd7 commented 3 years ago
root@server /var/snap/lxd/common/lxd/networks # ls -la
total 12
drwx--x--x  3 root root 4096 Dec  9 20:36 .
drwx--x--x 17 root root 4096 Dec 11 19:49 ..
drwx--x--x  3 root root 4096 Dec  9 20:36 lxdbr0
root@server /var/snap/lxd/common/lxd/networks/lxdbr0 # ls -la
total 20
drwx--x--x 3 root root 4096 Dec  9 20:36 .
drwx--x--x 3 root root 4096 Dec  9 20:36 ..
drwxr-xr-x 2 root root 4096 Dec 11 19:49 dnsmasq.hosts
-rw-r--r-- 1 root root    0 Dec  9 20:36 dnsmasq.leases
-rw-r--r-- 1 root root  669 Dec 11 19:49 dnsmasq.pid
-rw-r--r-- 1 root root    1 Dec 11 19:49 dnsmasq.raw
root@server /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts # ls -la
total 12
drwxr-xr-x 2 root root 4096 Dec 11 19:49 .
drwx--x--x 3 root root 4096 Dec  9 20:36 ..
-rw-r--r-- 1 root root   23 Dec 11 19:49 test
root@server /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts # cat ./test 
00:16:3e:0c:d5:9d,test

After applying the bridge.mtu trick (setting and unsetting), the previously empty dnsmasq.leases file now contains:

root@server /var/snap/lxd/common/lxd/networks/lxdbr0 # cat ./dnsmasq.leases 
1607725689 00:16:3e:0c:d5:9d 10.2.35.86 test ff:a7:63:ac:d2:00:02:00:00:ab:11:58:2f:15:ab:7b:1d:90:84
xvd7 commented 3 years ago

After rebooting the server:

root@server ~ # ps -ef|grep dnsmasq
root        3939    3883  0 21:34 pts/0    00:00:00 grep --color=auto dnsmasq

After applying the bridge.mtu trick:

root@server ~ # ps -ef|grep dnsmasq
lxd         4296    1891 20 21:34 ?        00:00:00 dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --pid-file= --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.2.35.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.2.35.2,10.2.35.254,1h -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd -g lxd
root        4298    3883  0 21:34 pts/0    00:00:00 grep --color=auto dnsmasq
tomponline commented 3 years ago

So, to confirm my understanding: dnsmasq is not starting automatically, but when it does, IPs are allocated OK?

xvd7 commented 3 years ago

That's correct. Running the bridge.mtu trick starts dnsmasq, which then assigns an IP address to the container.

stgraber commented 3 years ago

What's in lxd.log and journalctl -b 0 | grep dnsmasq?

xvd7 commented 3 years ago
root@server /var/snap/lxd/common/lxd/logs # cat ./lxd.log
t=2020-12-12T15:33:15+0000 lvl=info msg="LXD 4.0.4 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-12-12T15:33:15+0000 lvl=info msg="Kernel uid/gid map:" 
t=2020-12-12T15:33:15+0000 lvl=info msg=" - u 0 0 4294967295" 
t=2020-12-12T15:33:15+0000 lvl=info msg=" - g 0 0 4294967295" 
t=2020-12-12T15:33:15+0000 lvl=info msg="Configured LXD uid/gid map:" 
t=2020-12-12T15:33:15+0000 lvl=info msg=" - u 0 1000000 1000000000" 
t=2020-12-12T15:33:15+0000 lvl=info msg=" - g 0 1000000 1000000000" 
t=2020-12-12T15:33:16+0000 lvl=info msg="Kernel features:" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - closing multiple file descriptors efficiently: no" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - pidfds: yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - uevent injection: yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - seccomp listener: yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - seccomp listener continue syscalls: yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - seccomp listener add file descriptors: no" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - attach to namespaces via pidfds: no" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - safe native terminal allocation : yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - cgroup layout: hybrid" 
t=2020-12-12T15:33:16+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored" 
t=2020-12-12T15:33:16+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - shiftfs support: disabled" 
t=2020-12-12T15:33:16+0000 lvl=info msg="Initializing local database" 
t=2020-12-12T15:33:16+0000 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-12-12T15:33:16+0000 lvl=info msg="REST API daemon:" 
t=2020-12-12T15:33:16+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-12-12T15:33:16+0000 lvl=info msg=" - binding TCP socket" socket=10.2.35.1:8443
t=2020-12-12T15:33:16+0000 lvl=info msg="Initializing global database" 
t=2020-12-12T15:33:16+0000 lvl=info msg="Firewall loaded driver \"xtables\"" 
t=2020-12-12T15:33:16+0000 lvl=info msg="Initializing storage pools" 
t=2020-12-12T15:33:17+0000 lvl=info msg="Initializing daemon storage mounts" 
t=2020-12-12T15:33:17+0000 lvl=info msg="Initializing networks" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Pruning leftover image files" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Done pruning leftover image files" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Loading daemon configuration" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2020-12-12T15:33:18+0000 lvl=info msg="Pruning expired images" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Done pruning expired images" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Pruning expired instance backups" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Done pruning expired instance backups" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Expiring log files" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Updating instance types" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Done expiring log files" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Done updating instance types" 
t=2020-12-12T15:33:18+0000 lvl=info msg="Starting container" action=start created=2020-12-12T08:57:17+0000 ephemeral=false name=test project=default stateful=false used=2020-12-12T09:37:16+0000
t=2020-12-12T15:33:18+0000 lvl=info msg="Started container" action=start created=2020-12-12T08:57:17+0000 ephemeral=false name=test project=default stateful=false used=2020-12-12T09:37:16+0000
root@server ~ # journalctl -b 0 | grep dnsmasq
Dec 12 15:33:17 server audit[3092]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=3092 comm="apparmor_parser"
Dec 12 15:33:17 server kernel: audit: type=1400 audit(1607787197.863:26): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=3092 comm="apparmor_parser"
Dec 12 15:33:18 server dnsmasq[3119]: unknown interface lxdbr0
Dec 12 15:33:18 server dnsmasq[3119]: FAILED to start up

However, lxdbr0 exists:

root@server ~ # ifconfig lxdbr0
lxdbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.2.35.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::216:3eff:fe7d:7425  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:7d:74:25  txqueuelen 1000  (Ethernet)
        RX packets 41  bytes 10280 (10.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Another piece of information from syslog:

root@server /var/log # cat ./syslog | grep lxdbr0
Dec 12 15:33:17 server systemd-udevd[473]: lxdbr0: Could not generate persistent MAC: No data available
Dec 12 15:33:17 server systemd-networkd[1248]: lxdbr0: Link UP
Dec 12 15:33:17 server kernel: [   18.194394] audit: type=1400 audit(1607787197.863:26): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=3092 comm="apparmor_parser"
Dec 12 15:33:18 server kernel: [   18.658325] lxdbr0: port 1(veth2771ed3f) entered blocking state
Dec 12 15:33:18 server kernel: [   18.658328] lxdbr0: port 1(veth2771ed3f) entered disabled state
Dec 12 15:33:18 server kernel: [   18.658475] lxdbr0: port 1(veth2771ed3f) entered blocking state
Dec 12 15:33:18 server kernel: [   18.658477] lxdbr0: port 1(veth2771ed3f) entered forwarding state
Dec 12 15:33:18 server kernel: [   18.658540] lxdbr0: port 1(veth2771ed3f) entered disabled state
Dec 12 15:33:18 server dnsmasq[3119]: unknown interface lxdbr0
Dec 12 15:33:18 server systemd-networkd[1248]: lxdbr0: Gained carrier
Dec 12 15:33:18 server systemd-networkd[1248]: lxdbr0: Gained IPv6LL
Dec 12 15:33:18 server kernel: [   19.022889] lxdbr0: port 1(veth2771ed3f) entered blocking state
Dec 12 15:33:18 server kernel: [   19.022892] lxdbr0: port 1(veth2771ed3f) entered forwarding state
Dec 12 15:33:18 server kernel: [   19.022952] IPv6: ADDRCONF(NETDEV_CHANGE): lxdbr0: link becomes ready
stgraber commented 3 years ago

That's odd, it suggests the interface vanished prior to dnsmasq starting...

tomponline commented 3 years ago

Maybe something else is trying to configure it?

xvd7 commented 3 years ago

It has something to do with my VPN (tinc). If I turn tinc off then dnsmasq seems fine. I will keep digging.

xvd7 commented 3 years ago

I have two interfaces with the same IP (so that all LXD containers deployed on various physical servers can talk to each other through the VPN):

lxdbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.2.35.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::216:3eff:fe7d:7425  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:7d:74:25  txqueuelen 1000  (Ethernet)
        RX packets 135  bytes 24510 (24.5 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19  bytes 2800 (2.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vpn: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet 10.2.35.1  netmask 255.255.255.255  destination 10.2.35.1
        inet6 fe80::e131:51d1:f236:a5fd  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 3483  bytes 448721 (448.7 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3236  bytes 683271 (683.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This used to work fine up until version 4.0.2. It looks like a race condition on some servers (it affects roughly 2/3 of my Ubuntu 20.04 servers; all Ubuntu 18.04 servers, which are on LXD 3.0, are fine).

I have a hack for now: adding ExecStartPre=/bin/sleep 15 to the tinc systemd service.
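
As a sketch of that hack (the exact unit name depends on how tinc is installed; tinc@vpn.service here is just an example), a systemd drop-in created with systemctl edit does it:

# systemctl edit tinc@vpn   -> creates /etc/systemd/system/tinc@vpn.service.d/override.conf
[Service]
# delay tinc start so LXD has time to bring up lxdbr0 and start dnsmasq first
ExecStartPre=/bin/sleep 15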

tomponline commented 3 years ago

Do you use the fan network at all?

xvd7 commented 3 years ago

I don't use the fan network. I found this solution using tinc VPN long before the fan network was introduced, and I followed the "if it ain't broke, don't fix it" rule. Looks like I should look into a fan network setup...

tomponline commented 3 years ago

It's fine, I only mentioned it because it can look up interfaces by IP, so having two interfaces with the same IP may confuse it, but that won't be the cause if you don't use it.

toby63 commented 3 years ago

It is interesting to see this, my macvlan setup also stopped working yesterday, but I have not investigated it yet.

xvd7 commented 3 years ago

My initial hack (adding ExecStartPre=/bin/sleep 15 to the tinc systemd service) sometimes has negative side effects:

It prevents LXD from listening on port 8443.

root@server ~ # netstat -tulpn | grep lxd does not return any output.

This can be fixed by running snap restart lxd.

On some servers an alternative hack works: adding ExecStartPre=/bin/sleep 15 to the [Service] section in /etc/systemd/system/snap.lxd.daemon.service.

xvd7 commented 3 years ago

The same issue appears when the wireguard interface has the same IP as the LXD bridge:

4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.2.66.1/32 scope global wg0
       valid_lft forever preferred_lft forever
5: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 00:16:3e:63:3b:f3 brd ff:ff:ff:ff:ff:ff
    inet 10.2.66.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fe63:3bf3/64 scope link 
       valid_lft forever preferred_lft forever

This has not been an issue in LXD 3 and early releases of LXD 4. I think it started somewhere around version 4.0.2 or 4.0.3.

tomponline commented 3 years ago

Does dnsmasq start OK? Can you show the output of ps aux | grep dnsmasq as well as sudo ss -ulpn? Thanks.

xvd7 commented 3 years ago
root@server:~# ps aux|grep dnsmasq
root        5443  0.0  0.0   8160  2580 pts/0    S+   11:06   0:00 grep --color=auto dnsmasq
root@server:~# ss -ulpn
State                      Recv-Q                     Send-Q                                               Local Address:Port                                            Peer Address:Port                     Process                                                                        
UNCONN                     0                          0                                                          0.0.0.0:2049                                                 0.0.0.0:*                                                                                                       
UNCONN                     0                          0                                                          0.0.0.0:34937                                                0.0.0.0:*                                                                                                       
UNCONN                     0                          0                                                          0.0.0.0:51820                                                0.0.0.0:*                                                                                                       
UNCONN                     0                          0                                                          0.0.0.0:53033                                                0.0.0.0:*                         users:(("rpc.mountd",pid=1771,fd=16))                                         
UNCONN                     0                          0                                                          0.0.0.0:53826                                                0.0.0.0:*                         users:(("rpc.mountd",pid=1771,fd=12))                                         
UNCONN                     0                          0                                                          0.0.0.0:47522                                                0.0.0.0:*                         users:(("rpc.mountd",pid=1771,fd=8))                                          
UNCONN                     0                          0                                                    127.0.0.53%lo:53                                                   0.0.0.0:*                         users:(("systemd-resolve",pid=1625,fd=12))                                    
UNCONN                     0                          0                                               X.X.X.X%eno3:68                                                   0.0.0.0:*                         users:(("systemd-network",pid=1619,fd=19))                                    
UNCONN                     0                          0                                                          0.0.0.0:111                                                  0.0.0.0:*                         users:(("rpcbind",pid=1595,fd=5),("systemd",pid=1,fd=36))                     
UNCONN                     0                          0                                                             [::]:2049                                                    [::]:*                                                                                                       
UNCONN                     0                          0                                                             [::]:34932                                                   [::]:*                         users:(("rpc.mountd",pid=1771,fd=10))                                         
UNCONN                     0                          0                                                             [::]:51555                                                   [::]:*                         users:(("rpc.mountd",pid=1771,fd=14))                                         
UNCONN                     0                          0                                                             [::]:51820                                                   [::]:*                                                                                                       
UNCONN                     0                          0                                                             [::]:43070                                                   [::]:*                                                                                                       
UNCONN                     0                          0                                                             [::]:111                                                     [::]:*                         users:(("rpcbind",pid=1595,fd=7),("systemd",pid=1,fd=38))                     
UNCONN                     0                          0                                                             [::]:50174                                                   [::]:*                         users:(("rpc.mountd",pid=1771,fd=18))                  

where X.X.X.X is the public IP address

xvd7 commented 3 years ago

I have a temporary fix (it is not pretty, though): adding ExecStartPre=/bin/sleep 15 to the [Service] section in /etc/systemd/system/multi-user.target.wants/wg-quick@wg0.service

tomponline commented 3 years ago

Can you remove the workaround, enable debug mode, check whether dnsmasq fails to start with an error in syslog, and then review the LXD start-up logs to see whether LXD reports failing to start dnsmasq:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon
sudo tail -f /var/snap/lxd/common/lxd/logs/lxd.log
xvd7 commented 3 years ago
root@server:~# cat /var/snap/lxd/common/lxd/logs/lxd.log
t=2021-01-30T08:24:08+0000 lvl=info msg="LXD 4.0.4 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2021-01-30T08:24:08+0000 lvl=info msg="Kernel uid/gid map:" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - u 0 0 4294967295" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - g 0 0 4294967295" 
t=2021-01-30T08:24:08+0000 lvl=info msg="Configured LXD uid/gid map:" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - u 0 1000000 1000000000" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - g 0 1000000 1000000000" 
t=2021-01-30T08:24:08+0000 lvl=info msg="Kernel features:" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - closing multiple file descriptors efficiently: no" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - pidfds: yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - uevent injection: yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - seccomp listener: yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - seccomp listener continue syscalls: yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - seccomp listener add file descriptors: no" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - attach to namespaces via pidfds: no" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - safe native terminal allocation : yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - cgroup layout: hybrid" 
t=2021-01-30T08:24:08+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored" 
t=2021-01-30T08:24:08+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored" 
t=2021-01-30T08:24:08+0000 lvl=info msg=" - shiftfs support: disabled" 
t=2021-01-30T08:24:08+0000 lvl=info msg="Initializing local database" 
t=2021-01-30T08:24:08+0000 lvl=dbug msg="Initializing database gateway" 
t=2021-01-30T08:24:08+0000 lvl=dbug msg="Start database node" address= id=1 role=voter
t=2021-01-30T08:24:09+0000 lvl=info msg="Starting /dev/lxd handler:" 
t=2021-01-30T08:24:09+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2021-01-30T08:24:09+0000 lvl=info msg="REST API daemon:" 
t=2021-01-30T08:24:09+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2021-01-30T08:24:09+0000 lvl=info msg="Initializing global database" 
t=2021-01-30T08:24:09+0000 lvl=dbug msg="Dqlite: attempt 0: server 1: connected" 
t=2021-01-30T08:24:09+0000 lvl=dbug msg="Firewall xtables detected iptables is in use" 
t=2021-01-30T08:24:09+0000 lvl=info msg="Firewall loaded driver \"xtables\"" 
t=2021-01-30T08:24:09+0000 lvl=info msg="Initializing storage pools" 
t=2021-01-30T08:24:09+0000 lvl=dbug msg="Initializing and checking storage pool \"default\"" 
t=2021-01-30T08:24:09+0000 lvl=dbug msg="Mount started" driver=zfs pool=default
t=2021-01-30T08:24:09+0000 lvl=dbug msg="Mount finished" driver=zfs pool=default
t=2021-01-30T08:24:10+0000 lvl=info msg="Initializing daemon storage mounts" 
t=2021-01-30T08:24:10+0000 lvl=info msg="Initializing networks" 
t=2021-01-30T08:24:10+0000 lvl=dbug msg=Start driver=bridge network=lxdbr0
t=2021-01-30T08:24:10+0000 lvl=dbug msg="Setting up network" driver=bridge network=lxdbr0
t=2021-01-30T08:24:10+0000 lvl=dbug msg="Stable MAC generated" driver=bridge hwAddr=00:16:3e:63:3b:f3 network=lxdbr0 seed=b1f38ff95313c1ba382c3f3c8584c3dc5c1bad42213aa4c700203871c7582c94.0.1
t=2021-01-30T08:24:10+0000 lvl=dbug msg="New task Operation: 8a206038-7090-4120-b6a5-f3db48a0669c" 
t=2021-01-30T08:24:10+0000 lvl=info msg="Pruning leftover image files" 
t=2021-01-30T08:24:10+0000 lvl=dbug msg="Started task operation: 8a206038-7090-4120-b6a5-f3db48a0669c" 
t=2021-01-30T08:24:10+0000 lvl=info msg="Done pruning leftover image files" 
t=2021-01-30T08:24:10+0000 lvl=info msg="Loading daemon configuration" 
t=2021-01-30T08:24:10+0000 lvl=dbug msg="Initialized inotify with file descriptor 29" 
t=2021-01-30T08:24:10+0000 lvl=dbug msg="Success for task operation: 8a206038-7090-4120-b6a5-f3db48a0669c" 
t=2021-01-30T08:24:10+0000 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2021-01-30T08:24:10+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:10+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:11+0000 lvl=dbug msg="New task Operation: a6870700-e583-4f42-8ae6-99b906c44e85" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Pruning expired images" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Started task operation: a6870700-e583-4f42-8ae6-99b906c44e85" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Done pruning expired images" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Success for task operation: a6870700-e583-4f42-8ae6-99b906c44e85" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:11+0000 lvl=dbug msg="New task Operation: c3d7099d-7a97-4b63-851f-4fb28e5f9baa" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Pruning expired instance backups" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Started task operation: c3d7099d-7a97-4b63-851f-4fb28e5f9baa" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Done pruning expired instance backups" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Success for task operation: c3d7099d-7a97-4b63-851f-4fb28e5f9baa" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="New task Operation: c7d66ec7-462b-4db4-a098-4cd1a0f4f9b6" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Updating instance types" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Started task operation: c7d66ec7-462b-4db4-a098-4cd1a0f4f9b6" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="New task Operation: a627aa6c-dfab-4e5c-ae08-7d5c70cec283" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Done updating instance types" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Expiring log files" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Started task operation: a627aa6c-dfab-4e5c-ae08-7d5c70cec283" 
t=2021-01-30T08:24:11+0000 lvl=info msg="Done expiring log files" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Success for task operation: a627aa6c-dfab-4e5c-ae08-7d5c70cec283" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="MountInstance started" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Mounted ZFS dataset" dev=tank/lxd/containers/test1 driver=zfs path=/var/snap/lxd/common/lxd/storage-pools/default/containers/test1 pool=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="MountInstance finished" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Starting device" device=eth0 instance=test1 project=default type=nic
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Scheduler: network: vetha78c31dd has been added: updating network priorities" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Scheduler: network: vethb2ddf64e has been added: updating network priorities" 
t=2021-01-30T08:24:11+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:11+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Starting device" device=root instance=test1 project=default type=disk
t=2021-01-30T08:24:11+0000 lvl=dbug msg="UpdateInstanceBackupFile started" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="UpdateInstanceBackupFile finished" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="MountInstance started" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="MountInstance finished" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=info msg="Starting container" action=start created=2020-12-23T20:38:43+0000 ephemeral=false name=test1 project=default stateful=false used=2021-01-30T08:03:52+0000
t=2021-01-30T08:24:11+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/containers/1/onstart username=root
t=2021-01-30T08:24:11+0000 lvl=dbug msg="MountInstance started" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="MountInstance finished" driver=zfs instance=test1 pool=default project=default
t=2021-01-30T08:24:11+0000 lvl=dbug msg="Scheduler: container test1 started: re-balancing" 
t=2021-01-30T08:24:12+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:12+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:12+0000 lvl=info msg="Started container" action=start created=2020-12-23T20:38:43+0000 ephemeral=false name=test1 project=default stateful=false used=2021-01-30T08:03:52+0000
t=2021-01-30T08:24:12+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:12+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:12+0000 lvl=dbug msg="Success for task operation: c7d66ec7-462b-4db4-a098-4cd1a0f4f9b6" 
t=2021-01-30T08:24:13+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:13+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:13+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:13+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:14+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:14+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:14+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:14+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:15+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:15+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:15+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:15+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:16+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:16+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:16+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:16+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:17+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:17+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root
t=2021-01-30T08:24:17+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-30T08:24:17+0000 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/internal/ready username=root

root@server:~# lxc list
+----------+---------+------+------+-----------+-----------+
|   NAME   |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+----------+---------+------+------+-----------+-----------+
| test1    | RUNNING |      |      | CONTAINER | 0         |
+----------+---------+------+------+-----------+-----------+
tomponline commented 3 years ago

Sounds to me like dnsmasq is starting and then exiting, so LXD doesn't know about it.

Can you look in the syslog/journal for dnsmasq-related messages?

xvd7 commented 3 years ago

Logs around the time when the problem occurred:

root@server:~# cat ./syslog.2 | grep dns
Jan 30 08:23:59 seaborgium NetworkManager[1666]: <info>  [1611995037.8370] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-dns-resolved.conf, 10-globally-managed-devices.conf, no-mac-addr-change.conf) (etc: default-wifi-powersave-on.conf)
Jan 30 08:23:58 seaborgium systemd[1]: pppd-dns.service: Succeeded.
Jan 30 08:23:59 seaborgium kernel: [   48.227624] Key type dns_resolver registered
Jan 30 08:24:00 seaborgium NetworkManager[1666]: <info>  [1611995040.5319] dns-mgr[0x560640e0a290]: init: dns=systemd-resolved rc-manager=symlink, plugin=systemd-resolved
Jan 30 08:24:07 seaborgium lxd.daemon[3299]: - shared_pidns
Jan 30 08:24:10 seaborgium kernel: [   99.663938] audit: type=1400 audit(1611995050.704:30): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=3449 comm="apparmor_parser"
Jan 30 08:24:11 seaborgium dnsmasq[3460]: unknown interface lxdbr0
Jan 30 08:24:11 seaborgium dnsmasq[3460]: FAILED to start up

Is there any other keyword I should be searching for in the syslog?

tomponline commented 3 years ago

@xvd7 well I think that highlights the problem:

Jan 30 08:24:11 seaborgium dnsmasq[3460]: unknown interface lxdbr0
Jan 30 08:24:11 seaborgium dnsmasq[3460]: FAILED to start up

So as @stgraber said here https://github.com/lxc/lxd/issues/8246#issuecomment-743782852 it looks like something external to LXD on your system is tearing down the lxdbr0 interface as dnsmasq starts. I would check NetworkManager and systemd-networkd to make sure they are not trying to configure it.
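
For example (a sketch; the file names are arbitrary and this assumes you want both daemons to leave the bridge entirely to LXD), lxdbr0 could be explicitly marked as unmanaged in both:

# /etc/NetworkManager/conf.d/90-lxdbr0-unmanaged.conf
[keyfile]
unmanaged-devices=interface-name:lxdbr0

# /etc/systemd/network/90-lxdbr0.network
[Match]
Name=lxdbr0

[Link]
Unmanaged=yes

Then restart NetworkManager and systemd-networkd for the change to take effect.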

xvd7 commented 3 years ago

I could not find anything that would be tearing down lxdbr0. It seems more like a timing issue:

Jan 30 08:24:10 seaborgium lxd.daemon[3310]: t=2021-01-30T08:24:10+0000 lvl=dbug msg=Start driver=bridge network=lxdbr0
Jan 30 08:24:10 seaborgium lxd.daemon[3310]: t=2021-01-30T08:24:10+0000 lvl=dbug msg="Setting up network" driver=bridge network=lxdbr0
Jan 30 08:24:10 seaborgium NetworkManager[1666]: <info>  [1611995050.4649] manager: (lxdbr0): new Bridge device (/org/freedesktop/NetworkManager/Devices/5)
Jan 30 08:24:10 seaborgium systemd-udevd[566]: lxdbr0: Could not generate persistent MAC: No data available
Jan 30 08:24:10 seaborgium lxd.daemon[3310]: t=2021-01-30T08:24:10+0000 lvl=dbug msg="Stable MAC generated" driver=bridge hwAddr=00:16:3e:63:3b:f3 network=lxdbr0 seed=b1f38ff95313c1ba382c3f3c8584c3dc5c1bad42213aa4c700203871c7582c94.0.1
Jan 30 08:24:10 seaborgium systemd-networkd[1585]: lxdbr0: Link UP
Jan 30 08:24:10 seaborgium kernel: [   99.663938] audit: type=1400 audit(1611995050.704:30): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=3449 comm="apparmor_parser"
Jan 30 08:24:11 seaborgium dnsmasq[3460]: unknown interface lxdbr0
Jan 30 08:24:11 seaborgium dnsmasq[3460]: FAILED to start up
Jan 30 08:24:11 seaborgium kernel: [  100.589586] lxdbr0: port 1(vethb2ddf64e) entered blocking state
Jan 30 08:24:11 seaborgium kernel: [  100.589588] lxdbr0: port 1(vethb2ddf64e) entered disabled state
Jan 30 08:24:12 seaborgium systemd-networkd[1585]: lxdbr0: Gained carrier
Jan 30 08:24:12 seaborgium NetworkManager[1666]: <info>  [1611995052.0412] device (lxdbr0): carrier: link connected
Jan 30 08:24:12 seaborgium kernel: [  100.980349] lxdbr0: port 1(vethb2ddf64e) entered blocking state
Jan 30 08:24:12 seaborgium kernel: [  100.980354] lxdbr0: port 1(vethb2ddf64e) entered forwarding state
Jan 30 08:24:12 seaborgium kernel: [  100.980468] IPv6: ADDRCONF(NETDEV_CHANGE): lxdbr0: link becomes ready
xvd7 commented 3 years ago

After adding ExecStartPre=/bin/sleep 15 to the wireguard service the log changes to:

Feb  1 16:11:39 seaborgium lxd.daemon[424204]: t=2021-02-01T16:11:39+0000 lvl=dbug msg=Start driver=bridge network=lxdbr0
Feb  1 16:11:39 seaborgium lxd.daemon[424204]: t=2021-02-01T16:11:39+0000 lvl=dbug msg="Setting up network" driver=bridge network=lxdbr0
Feb  1 16:11:39 seaborgium NetworkManager[1657]: <info>  [1612195899.5600] manager: (lxdbr0): new Bridge device (/org/freedesktop/NetworkManager/Devices/19)
Feb  1 16:11:39 seaborgium systemd-udevd[423629]: lxdbr0: Could not generate persistent MAC: No data available
Feb  1 16:11:39 seaborgium lxd.daemon[424204]: t=2021-02-01T16:11:39+0000 lvl=dbug msg="Stable MAC generated" driver=bridge hwAddr=00:16:3e:63:3b:f3 network=lxdbr0 seed=b1f38ff95313c1ba382c3f3c8584c3dc5c1bad42213aa4c700203871c7582c94.0.1
Feb  1 16:11:39 seaborgium systemd-networkd[1588]: lxdbr0: Link UP
Feb  1 16:11:39 seaborgium kernel: [194907.750490] audit: type=1400 audit(1612195899.660:56): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd_dnsmasq-lxdbr0_</var/snap/lxd/common/lxd>" pid=424315 comm="apparmor_parser"
Feb  1 16:11:40 seaborgium kernel: [194908.104113] lxdbr0: port 1(veth91deecf4) entered blocking state
Feb  1 16:11:40 seaborgium kernel: [194908.104116] lxdbr0: port 1(veth91deecf4) entered disabled state
Feb  1 16:11:40 seaborgium kernel: [194908.104263] lxdbr0: port 1(veth91deecf4) entered blocking state
Feb  1 16:11:40 seaborgium kernel: [194908.104266] lxdbr0: port 1(veth91deecf4) entered forwarding state
Feb  1 16:11:40 seaborgium kernel: [194908.104473] lxdbr0: port 1(veth91deecf4) entered disabled state
Feb  1 16:11:40 seaborgium systemd-networkd[1588]: lxdbr0: Gained carrier
Feb  1 16:11:40 seaborgium NetworkManager[1657]: <info>  [1612195900.3144] device (lxdbr0): carrier: link connected
Feb  1 16:11:40 seaborgium kernel: [194908.397895] lxdbr0: port 1(veth91deecf4) entered blocking state
Feb  1 16:11:40 seaborgium kernel: [194908.397898] lxdbr0: port 1(veth91deecf4) entered forwarding state
Feb  1 16:11:40 seaborgium kernel: [194908.397945] IPv6: ADDRCONF(NETDEV_CHANGE): lxdbr0: link becomes ready
Feb  1 16:11:40 seaborgium dnsmasq[424318]: started, version 2.79 cachesize 150
Feb  1 16:11:40 seaborgium dnsmasq[424318]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify
Feb  1 16:11:40 seaborgium dnsmasq-dhcp[424318]: DHCP, IP range 10.2.66.2 -- 10.2.66.254, lease time 1h
Feb  1 16:11:40 seaborgium dnsmasq[424318]: using local addresses only for domain lxd
Feb  1 16:11:40 seaborgium dnsmasq[424318]: reading /etc/resolv.conf
Feb  1 16:11:40 seaborgium dnsmasq[424318]: using local addresses only for domain lxd
Feb  1 16:11:40 seaborgium dnsmasq[424318]: using nameserver 213.186.33.99#53
Feb  1 16:11:40 seaborgium dnsmasq[424318]: read /etc/hosts - 9 addresses
Feb  1 16:11:40 seaborgium dnsmasq-dhcp[424318]: read /var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts/test1
Feb  1 16:11:40 seaborgium dnsmasq-dhcp[424318]: DHCPDISCOVER(lxdbr0) 00:16:3e:d1:47:cd 
Feb  1 16:11:40 seaborgium dnsmasq-dhcp[424318]: DHCPOFFER(lxdbr0) 10.2.66.2 00:16:3e:d1:47:cd 
Feb  1 16:11:40 seaborgium dnsmasq-dhcp[424318]: DHCPREQUEST(lxdbr0) 10.2.66.2 00:16:3e:d1:47:cd 
Feb  1 16:11:40 seaborgium dnsmasq-dhcp[424318]: DHCPACK(lxdbr0) 10.2.66.2 00:16:3e:d1:47:cd test1
Feb  1 16:11:41 seaborgium systemd-networkd[1588]: lxdbr0: Gained IPv6LL
Feb  1 16:41:41 seaborgium dnsmasq-dhcp[424318]: DHCPREQUEST(lxdbr0) 10.2.66.2 00:16:3e:d1:47:cd 
Feb  1 16:41:41 seaborgium dnsmasq-dhcp[424318]: DHCPACK(lxdbr0) 10.2.66.2 00:16:3e:d1:47:cd test1
tomponline commented 3 years ago

OK thanks, let me see if I can recreate here.

tomponline commented 3 years ago

Out of interest, what is the reason for using the lxdbr0 IP address on the wireguard interface? Why not use a separate address and then publish the lxdbr0 subnet as a route on that endpoint?

xvd7 commented 3 years ago

Out of interest, what is the reason for using the lxdbr0 IP address on the wireguard interface? Why not use a separate address and then publish the lxdbr0 subnet as a route on that endpoint?

It started historically. I was experimenting with LXC (yes, good old LXC, before LXD was born). I used 10.X.Y.1 for the node and 10.X.Y.2 etc. for the containers. Then I got more servers and needed an encrypted connection among all the containers, so I bolted on a peer-to-peer VPN. I did not know much about networking, so I used the same IP for the VPN interface and everything just worked. I never looked back. It was a rock-solid setup up until around LXD 4.0.1.

tomponline commented 3 years ago

I've not been able to recreate it using dummy interfaces, and @stgraber also tried using wireguard interfaces. Without being able to recreate it there's not much we can do, especially as this is likely a dnsmasq bug (we bundle our own dnsmasq inside the snap, so that may be why it correlates with a specific LXD version).

Does it occur if the wireguard interface doesn't share the same IP as the lxdbr0 interface?

It would seem beneficial to use a separate subnet for the wireguard network, and then publish the lxdbr0 route over the wireguard link so that peers on the wireguard link can reach other containers on lxdbr0, not just the lxdbr0 host?
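
As a rough sketch of that layout (the addresses, keys, and hostnames below are purely illustrative), each host would get its own VPN address on a separate subnet and route the remote lxdbr0 subnets via AllowedIPs:

# /etc/wireguard/wg0.conf on host A (whose lxdbr0 is 10.2.66.0/24)
[Interface]
# dedicated VPN address, not shared with lxdbr0
Address = 10.99.0.1/32
PrivateKey = <host-A-private-key>
ListenPort = 51820

[Peer]
# host B, whose lxdbr0 subnet is 10.2.67.0/24 in this example
PublicKey = <host-B-public-key>
AllowedIPs = 10.99.0.2/32, 10.2.67.0/24
Endpoint = hostb.example.net:51820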

xvd7 commented 3 years ago

Thank you for helping to narrow down the issue.

The problem does NOT occur if the wireguard interface has a different IP than the lxdbr0 interface.

I found a few changes in dnsmasq from July 2020, such as one labeled "Handle listening on duplicate addresses": https://github.com/pi-hole/FTL/pull/848/commits. That would confirm the suspicion that the problem is related to a dnsmasq bug.

tomponline commented 3 years ago

Good spotting. Yes, the l->used = 1; bit in https://github.com/pi-hole/FTL/pull/848/commits/85c788d7c3febdac604d0319acbcd3f4b62520a8 is what relates to the unknown interface lxdbr0 error in the logs you see.

I'll close this now, but if you report the issue to dnsmasq we can re-open once it is fixed upstream and then pull in the changes.

trombonebuster commented 3 years ago

Hi, I'm having a problem with LXD not assigning IPv4 or IPv6 addresses to my containers as well. I'm getting this error in lxd.log: Failed enabling VLAN filtering on bridge \"lxdbr0\": open /sys/class/net/lxdbr0/bridge/vlan_filtering: permission denied. Does anyone know how to fix this?