
Networking does not work in fresh Bionic container #4510

Closed Kramerican closed 6 years ago

Kramerican commented 6 years ago

Tried with LXD v2.21 on Ubuntu 16.04 and with LXD v3.0.0 on Ubuntu 18.04 (system upgraded from 16.04).

Networking does not come up and the container does not get an IP assigned on my network bridge.

On both my 16.04 and 18.04 host systems, a Xenial image comes up just fine.

I have tried provisioning from ubuntu:bionic as well as images:ubuntu/bionic/amd64, with identical results.

/var/log/syslog on the host shows, in all cases, lines similar to:

Apr 29 20:25:15 krellide kernel: [6056886.886248] audit: type=1400 audit(1525026315.592:23530): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-bionic-template-xlemp72_</var/lib/lxd>" name="/sys/fs/cgroup/unified/" pid=19042 comm="systemd" fstype="cgroup2" srcname="cgroup" flags="rw, nosuid, nodev, noexec"
Apr 29 20:25:15 krellide kernel: [6056886.886297] audit: type=1400 audit(1525026315.592:23531): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-bionic-template-xlemp72_</var/lib/lxd>" name="/sys/fs/cgroup/unified/" pid=19042 comm="systemd" fstype="cgroup2" srcname="cgroup" flags="rw, nosuid, nodev, noexec"
Apr 29 20:25:16 krellide kernel: [6056887.323323] audit: type=1400 audit(1525026316.029:23532): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-bionic-template-xlemp72_</var/lib/lxd>" name="/run/systemd/unit-root/var/lib/lxcfs/" pid=19482 comm="(networkd)" flags="ro, nosuid, nodev, remount, bind"

These lines are not present in syslog when provisioning other versions of Ubuntu (Xenial/Zesty). Interestingly, upgrading an existing Xenial container to Bionic does not cause any networking issues.

Without knowing much about AppArmor, I am assuming that the DENIED ... (networkd) line is an indicator of the culprit here. Any assistance would be much appreciated :)
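One quick way to correlate these denials with a container start, assuming the host logs kernel audit messages to the systemd journal (the container name here is just the one from the profile above):

# On the host, restart the container and look for fresh denials:
lxc restart bionic-template-xlemp72
journalctl -k --since "2 min ago" | grep 'apparmor="DENIED"'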

stgraber commented 6 years ago
stgraber@castiana:~$ lxc launch ubuntu:bionic bionic
Creating bionic
Starting bionic
stgraber@castiana:~$ lxc list bionic
+--------+---------+----------------------+----------------------------------------------+------------+-----------+
|  NAME  |  STATE  |         IPV4         |                     IPV6                     |    TYPE    | SNAPSHOTS |
+--------+---------+----------------------+----------------------------------------------+------------+-----------+
| bionic | RUNNING | 10.166.11.129 (eth0) | 2001:470:b368:4242:216:3eff:fe40:9c2c (eth0) | PERSISTENT | 0         |
+--------+---------+----------------------+----------------------------------------------+------------+-----------+

Anything special about your bionic container?

Can you post lxc config show --expanded NAME?

Kramerican commented 6 years ago

Ah beaverballs ... I was really hoping this would not be something specific to my system(s). Anyway, as requested, the output of lxc config show --expanded NAME:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20180426.2)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20180426.2"
  image.version: "18.04"
  limits.cpu.allowance: 50%
  limits.memory: 4GB
  limits.memory.enforce: soft
  linux.kernel_modules: ip_tables,ip6_tables
  volatile.base_image: b36ec647e374da4816104a98807633a2cc387488083d3776557081c4d0333618
  volatile.eth0.hwaddr: 00:16:3e:bf:58:5a
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: lxd
    size: 60GB
    type: disk
ephemeral: false
profiles:
- webdock
stateful: false
description: ""

Thank you for your assistance as always

stgraber commented 6 years ago

Ok, that looks reasonable.

What does ps fauxww look like inside the container?

Kramerican commented 6 years ago

ps fauxww

root@bionic-template:~# ps fauxwww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       263  0.0  0.0  21900  2584 ?        Ss   08:50   0:00 bash
root       273  0.0  0.0  37792  2156 ?        R+   08:50   0:00  \_ ps fauxwww
root         1  0.0  0.0  77484  6004 ?        Ss   08:07   0:00 /sbin/init
root        58  0.0  0.0  42104  2016 ?        Ss   08:07   0:00 /lib/systemd/systemd-udevd
root        62  0.0  0.0  78436  7604 ?        Ss   08:07   0:00 /lib/systemd/systemd-journald
systemd+   112  0.0  0.0  71816  3844 ?        Ss   08:07   0:00 /lib/systemd/systemd-networkd
systemd+   113  0.0  0.0  70608  4024 ?        Ss   08:07   0:00 /lib/systemd/systemd-resolved
root       150  0.0  0.0  61996  3780 ?        Ss   08:09   0:00 /lib/systemd/systemd-logind
daemon     154  0.0  0.0  28332  1568 ?        Ss   08:09   0:00 /usr/sbin/atd -f
root       156  0.0  0.0  31748  1880 ?        Ss   08:09   0:00 /usr/sbin/cron -f
syslog     159  0.0  0.0 197636  2824 ?        Ssl  08:09   0:00 /usr/sbin/rsyslogd -n
message+   160  0.0  0.0  49928  2836 ?        Ss   08:09   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root       163  0.0  0.0 170860 14264 ?        Ssl  08:09   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher
root       164  0.0  0.0 287980  4496 ?        Ssl  08:09   0:00 /usr/lib/accountsservice/accounts-daemon
root       165  0.0  0.0 1210256 14000 ?       Ssl  08:09   0:00 /usr/lib/snapd/snapd
root       166  0.0  0.0  72296  3652 ?        Ss   08:09   0:00 /usr/sbin/sshd -D
root       173  0.0  0.0  16412  1484 console  Ss+  08:09   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
root       179  0.0  0.0 288868  4480 ?        Ssl  08:09   0:00 /usr/lib/policykit-1/polkitd --no-debug
stgraber commented 6 years ago

Also, what kernel are you running? You mentioned 18.04, so that's LXD 3.0.0 from the archive running on Ubuntu 18.04 with the 4.15 kernel, correct?

Kramerican commented 6 years ago

@stgraber Yes that is correct, although the command outputs here are from my test container on my 16.04 system. I just provisioned a container with lxc init ubuntu:bionic on my 18.04 system and ran the same commands:

lxc config show --expanded NAME

architecture: x86_64
config:
  boot.autostart: "true"
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20180426.2)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20180426.2"
  image.version: "18.04"
  limits.cpu.allowance: 50%
  limits.memory: 8GB
  limits.memory.enforce: soft
  linux.kernel_modules: ip_tables,ip6_tables
  volatile.base_image: b36ec647e374da4816104a98807633a2cc387488083d3776557081c4d0333618
  volatile.eth0.hwaddr: 00:16:3e:d3:ff:75
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    limits.read: 40000iops
    limits.write: 10000iops
    path: /
    pool: lxd
    size: 50GB
    type: disk
ephemeral: false
profiles:
- webdockssd50
stateful: false
description: ""

ps fauxwww

root@bionic-template:~# ps fauxwww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       169  0.0  0.0  21900  2572 ?        Ss   09:54   0:00 bash
root       294  0.0  0.0  37792  2104 ?        R+   09:55   0:00  \_ ps fauxwww
root         1  0.0  0.0  77544  5980 ?        Ss   09:53   0:00 /sbin/init
root        54  0.0  0.0  78436  7580 ?        Ss   09:53   0:00 /lib/systemd/systemd-journald
root        63  0.0  0.0  42104  2104 ?        Ss   09:53   0:00 /lib/systemd/systemd-udevd
systemd+   166  0.0  0.0  71816  3932 ?        Ss   09:53   0:00 /lib/systemd/systemd-networkd
systemd+   167  0.0  0.0  70608  4148 ?        Ss   09:53   0:00 /lib/systemd/systemd-resolved
root       212  0.0  0.0 767224 14912 ?        Ssl  09:55   0:00 /usr/lib/snapd/snapd
root       213  0.0  0.0 170864 14232 ?        Ssl  09:55   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher
root       217  0.0  0.0  61996  3792 ?        Ss   09:55   0:00 /lib/systemd/systemd-logind
syslog     218  0.0  0.0 197636  2812 ?        Ssl  09:55   0:00 /usr/sbin/rsyslogd -n
root       219  0.0  0.0 287980  4432 ?        Ssl  09:55   0:00 /usr/lib/accountsservice/accounts-daemon
message+   221  0.0  0.0  49928  2880 ?        Ss   09:55   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
daemon     235  0.0  0.0  28332  1612 ?        Ss   09:55   0:00 /usr/sbin/atd -f
root       236  0.0  0.0  31748  1996 ?        Ss   09:55   0:00 /usr/sbin/cron -f
root       237  0.0  0.0  72296  3712 ?        Ss   09:55   0:00 /usr/sbin/sshd -D
root       239  0.0  0.0 288868  4432 ?        Ssl  09:55   0:00 /usr/lib/policykit-1/polkitd --no-debug
root       244  0.0  0.0  16412  1496 console  Ss+  09:55   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux

For completeness: on my 16.04 system, uname -r inside the container shows 4.13.0-32-generic.

And on 18.04, 4.15.0-20-generic.

stgraber commented 6 years ago

What storage backend are you using?

Kramerican commented 6 years ago

@stgraber ZFS :)

Interesting questions you are asking! I would have thought we'd be looking more specifically at the network interface. I'd love to know what theories/causes you think might be at work here ...

stgraber commented 6 years ago

No luck reproducing it here on a bunch of systems. Our automated testing also shows that this image has come up with working networking for weeks now...

stgraber commented 6 years ago

In your container, can you try running systemctl start systemd-networkd and see if that causes the same error to show up in dmesg?
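A minimal sketch of that check, assuming the container is named bionic-template and you have a second shell open on the host:

# Host shell: follow the kernel log for new denials
dmesg --follow | grep 'apparmor="DENIED"'
# From another host shell, kick networkd inside the container:
lxc exec bionic-template -- systemctl start systemd-networkd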

Kramerican commented 6 years ago

Oww man - this is basically a worst-case scenario then. There's something in my setup messing with networking in a Bionic container specifically, but not Zesty or Xenial. On the surface this is rather bizarre. I really hate wasting your time, so I really hope this isn't down to some boneheaded config I have done on my end.

In any case, I am not sure if I follow your instructions correctly here, but I executed systemctl start systemd-networkd and it completed with no message (success, I assume). Running dmesg in the container gives me dmesg: read kernel buffer failed: Operation not permitted, which is expected AFAIK. Running dmesg on the host reveals nothing new, and cat /var/log/syslog on the host doesn't reveal anything new either.

If we assume I have something "unrelated" to LXD in my setup here which is messing things up, any suggestions on steps I could take to narrow things down?

stgraber commented 6 years ago

That line:

Apr 29 20:25:16 krellide kernel: [6056887.323323] audit: type=1400 audit(1525026316.029:23532): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-bionic-template-xlemp72_</var/lib/lxd>" name="/run/systemd/unit-root/var/lib/lxcfs/" pid=19482 comm="(networkd)" flags="ro, nosuid, nodev, remount, bind"

is clearly what's causing the network problem you're seeing; most likely it's preventing systemd-networkd from starting dhclient or something along those lines...

Can you try:

lxc launch ubuntu:18.04 test
lxc exec test bash

That may help figure out what's going on.

Kramerican commented 6 years ago

Sure thing :)

root@test:~# journalctl | cat
-- Logs begin at Mon 2018-04-30 14:48:10 UTC, end at Mon 2018-04-30 14:48:12 UTC. --
Apr 30 14:48:10 test systemd-journald[55]: Journal started
Apr 30 14:48:10 test systemd-journald[55]: Runtime journal (/run/log/journal/897ae676489745b4b5f6303bdde70ccf) is 8.0M, max 3.1G, 3.1G free.
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '4 4 1 7' to 'kernel/printk', ignoring: Permission denied
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '1' to 'kernel/kptr_restrict', ignoring: Permission denied
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '1' to 'fs/protected_hardlinks', ignoring: Permission denied
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '1' to 'fs/protected_symlinks', ignoring: Permission denied
Apr 30 14:48:10 test keyboard-setup.sh[52]: Couldn't get a file descriptor referring to the console
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '1024' to 'fs/inotify/max_user_instances', ignoring: Permission denied
Apr 30 14:48:10 test keyboard-setup.sh[52]: Couldn't get a file descriptor referring to the console
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '176' to 'kernel/sysrq', ignoring: Permission denied
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '1' to 'kernel/yama/ptrace_scope', ignoring: Permission denied
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write '65536' to 'vm/mmap_min_addr', ignoring: Permission denied
Apr 30 14:48:10 test systemd-sysctl[58]: Couldn't write 'fq_codel' to 'net/core/default_qdisc', ignoring: No such file or directory
Apr 30 14:48:10 test systemd[1]: Started Apply Kernel Variables.
Apr 30 14:48:10 test systemd-journald[55]: Forwarding to syslog missed 1 messages.
Apr 30 14:48:10 test systemd[1]: systemd-journal-flush.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting Flush Journal to Persistent Storage...
Apr 30 14:48:10 test systemd[1]: Started udev Kernel Device Manager.
Apr 30 14:48:10 test systemd-journald[55]: Time spent on flushing to /var is 1.671ms for 18 entries.
Apr 30 14:48:10 test systemd-journald[55]: System journal (/var/log/journal/897ae676489745b4b5f6303bdde70ccf) is 512B, max 4.0G, 3.9G free.
Apr 30 14:48:10 test systemd[1]: Started Set the console keyboard layout.
Apr 30 14:48:10 test keyboard-setup.sh[52]: setupcon: We are not on the console, the console is left unconfigured.
Apr 30 14:48:10 test systemd[1]: Reached target Local File Systems (Pre).
Apr 30 14:48:10 test systemd[1]: Reached target Local File Systems.
Apr 30 14:48:10 test systemd[1]: apparmor.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting AppArmor initialization...
Apr 30 14:48:10 test systemd[1]: console-setup.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting Set console font and keymap...
Apr 30 14:48:10 test systemd[1]: ebtables.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting ebtables ruleset management...
Apr 30 14:48:10 test systemd[1]: plymouth-read-write.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
Apr 30 14:48:10 test systemd[1]: Started Set console font and keymap.
Apr 30 14:48:10 test systemd[1]: Started Tell Plymouth To Write Out Runtime Data.
Apr 30 14:48:10 test systemd[1]: Started Flush Journal to Persistent Storage.
Apr 30 14:48:10 test systemd[1]: Started ebtables ruleset management.
Apr 30 14:48:10 test apparmor[85]:  * Not starting AppArmor in container
Apr 30 14:48:10 test apparmor[85]:    ...done.
Apr 30 14:48:10 test systemd[1]: systemd-tmpfiles-setup.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting Create Volatile Files and Directories...
Apr 30 14:48:10 test systemd[1]: Started AppArmor initialization.
Apr 30 14:48:10 test systemd[1]: Started Create Volatile Files and Directories.
Apr 30 14:48:10 test systemd[1]: systemd-update-utmp.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:10 test systemd[1]: Starting Update UTMP about System Boot/Shutdown...
Apr 30 14:48:10 test systemd[1]: Reached target System Time Synchronized.
Apr 30 14:48:10 test systemd[1]: Started udev Coldplug all Devices.
Apr 30 14:48:10 test systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
Apr 30 14:48:10 test systemd[1]: Reached target Local Encrypted Volumes.
Apr 30 14:48:10 test systemd[1]: Started Update UTMP about System Boot/Shutdown.
Apr 30 14:48:11 test cloud-init[53]: Cloud-init v. 18.2 running 'init-local' at Mon, 30 Apr 2018 14:48:10 +0000. Up 0.00 seconds.
Apr 30 14:48:11 test systemd[1]: Started Initial cloud-init job (pre-networking).
Apr 30 14:48:11 test systemd[1]: Reached target Network (Pre).
Apr 30 14:48:11 test systemd[1]: systemd-networkd.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Starting Network Service...
Apr 30 14:48:11 test systemd-networkd[123]: Enumeration completed
Apr 30 14:48:11 test systemd[1]: Started Network Service.
Apr 30 14:48:11 test systemd[1]: systemd-networkd-wait-online.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Starting Wait for Network to be Configured...
Apr 30 14:48:11 test systemd[1]: systemd-resolved.service: Failed to reset devices.list: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Apr 30 14:48:11 test systemd[1]: Starting Network Name Resolution...
Apr 30 14:48:11 test systemd-resolved[125]: Positive Trust Anchors:
Apr 30 14:48:11 test systemd-resolved[125]: . IN DS 19036 8 2 49aac11d7b6f6446702e54a1607371607a1a41855200fd2ce1cdde32f24e8fb5
Apr 30 14:48:11 test systemd-resolved[125]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Apr 30 14:48:11 test systemd-resolved[125]: Negative trust anchors: 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.172.in-addr.arpa 19.172.in-addr.arpa 20.172.in-addr.arpa 21.172.in-addr.arpa 22.172.in-addr.arpa 23.172.in-addr.arpa 24.172.in-addr.arpa 25.172.in-addr.arpa 26.172.in-addr.arpa 27.172.in-addr.arpa 28.172.in-addr.arpa 29.172.in-addr.arpa 30.172.in-addr.arpa 31.172.in-addr.arpa 168.192.in-addr.arpa d.f.ip6.arpa corp home internal intranet lan local private test
Apr 30 14:48:11 test systemd-resolved[125]: Using system hostname 'test'.
Apr 30 14:48:11 test systemd[1]: Started Network Name Resolution.
Apr 30 14:48:11 test systemd[1]: Reached target Host and Network Name Lookups.
Apr 30 14:48:11 test systemd[1]: Reached target Network.
Apr 30 14:48:12 test systemd-networkd[123]: eth0: Gained IPv6LL
root@test:~# cat /proc/1/mountinfo
430 431 0:80 /rootfs / rw,relatime master:98 - zfs lxd/containers/test rw,xattr,noacl
475 430 0:81 / /dev rw,relatime - tmpfs none rw,size=492k,mode=755,uid=100000,gid=100000
476 430 0:83 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
477 430 0:84 / /sys rw,relatime - sysfs sysfs rw
478 475 0:6 /fuse /dev/fuse rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
479 475 0:6 /net/tun /dev/net/tun rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
480 476 0:44 / /proc/sys/fs/binfmt_misc rw,relatime master:82 - binfmt_misc binfmt_misc rw
481 477 0:42 / /sys/fs/fuse/connections rw,relatime master:29 - fusectl fusectl rw
482 477 0:28 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore pstore rw
483 477 0:8 / /sys/kernel/debug rw,relatime master:24 - debugfs debugfs rw
484 483 0:11 / /sys/kernel/debug/tracing rw,relatime master:26 - tracefs tracefs rw
485 477 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - securityfs securityfs rw
486 475 0:19 / /dev/mqueue rw,relatime master:25 - mqueue mqueue rw
487 475 0:47 / /dev/lxd rw,relatime master:88 - tmpfs tmpfs rw,size=100k,mode=755
488 475 0:46 /test /dev/.lxd-mounts rw,relatime master:86 - tmpfs tmpfs rw,size=100k,mode=711
489 476 0:45 /proc/cpuinfo /proc/cpuinfo rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
490 476 0:45 /proc/diskstats /proc/diskstats rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
491 476 0:45 /proc/meminfo /proc/meminfo rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
492 476 0:45 /proc/stat /proc/stat rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
493 476 0:45 /proc/swaps /proc/swaps rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
494 476 0:45 /proc/uptime /proc/uptime rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
495 430 0:45 / /var/lib/lxcfs rw,nosuid,nodev,relatime master:84 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
496 475 0:6 /null /dev/null rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
497 475 0:6 /zero /dev/zero rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
498 475 0:6 /full /dev/full rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
499 475 0:6 /urandom /dev/urandom rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
500 475 0:6 /random /dev/random rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
501 475 0:6 /tty /dev/tty rw,nosuid,relatime master:2 - devtmpfs udev rw,size=32854604k,nr_inodes=8213651,mode=755
502 475 0:22 /4 /dev/console rw,nosuid,noexec,relatime master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
432 475 0:85 / /dev/pts rw,relatime - devpts devpts rw,gid=100005,mode=620,ptmxmode=666
433 475 0:85 /ptmx /dev/ptmx rw,relatime - devpts devpts rw,gid=100005,mode=620,ptmxmode=666
434 475 0:86 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw,uid=100000,gid=100000
435 430 0:87 / /run rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=100000,gid=100000
436 435 0:88 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=5120k,uid=100000,gid=100000
437 477 0:89 / /sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,mode=755,uid=100000,gid=100000
438 437 0:27 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
439 437 0:36 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
440 437 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
441 437 0:35 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
442 437 0:29 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
443 437 0:31 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
444 437 0:39 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
445 437 0:37 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
446 437 0:38 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
447 437 0:30 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
448 437 0:34 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
449 437 0:33 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
stgraber commented 6 years ago

What do you see in /run/systemd/network/10-netplan-eth0.network?

stgraber commented 6 years ago

And let's do networkctl status eth0 too.

Kramerican commented 6 years ago
cat /run/systemd/network/10-netplan-eth0.network
[Match]
Name=eth0

[Network]
DHCP=ipv4

[DHCP]
UseMTU=true
RouteMetric=100
root@test:~# networkctl status eth0
● 246: eth0
       Link File: n/a
    Network File: /run/systemd/network/10-netplan-eth0.network
            Type: ether
           State: degraded (configuring)
      HW Address: 00:16:3e:8d:f6:22 (Xensource, Inc.)
         Address: fe80::216:3eff:fe8d:f622

Comparing the journalctl output with that from a working Xenial container, it seems to me your hunch is right: DHCP isn't coming up.
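One way to confirm that only DHCP is affected would be to give eth0 a static lease inside the container; a rough sketch, with the file name and addresses purely illustrative:

# Inside the container: add a throwaway static config and re-apply netplan
cat > /etc/netplan/51-static-test.yaml <<'EOF'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [192.168.1.50/24]
      gateway4: 192.168.1.1
EOF
netplan apply
networkctl status eth0   # if only DHCP is broken, the static address should appear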

davidfavor commented 6 years ago

Tried...

sudo ln -s /etc/apparmor.d/usr.sbin.tcpdump /etc/apparmor.d/disable/
# Then container restart. No effect.
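
Note that the profile named in the denials is generated per-container by LXD rather than shipped under /etc/apparmor.d, so disabling the tcpdump profile won't affect it. With the deb-based LXD 3.0 layout from the logs above, the generated profile should live roughly here (the path is an assumption about LXD's defaults):

# List and inspect the per-container AppArmor profiles LXD generated:
ls /var/lib/lxd/security/apparmor/profiles/
cat /var/lib/lxd/security/apparmor/profiles/lxd-bionic-template-xlemp72*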
davidfavor commented 6 years ago

Any suggestions of a fix appreciated.

Complex/Manual steps are fine.

Thanks.

davidfavor commented 6 years ago

The problem appears to be in cloud-init.

The /etc/netplan/50-cloud-init.yaml runs + silently fails.

Cloud-init log output shows all public addresses assigned + the DHCP address fails to be assigned + there's no error.

Refer to the upstream Ubuntu bug "/etc/netplan/50-cloud-init.yaml fails silently with no dhcp address assigned" which I opened this morning.

davidfavor commented 6 years ago

The info you've posted above is non-netplan type networking.

I'm using pure Netplan at the machine level + inside containers, so my setup will be very different from yours.

Kramerican commented 6 years ago

@davidfavor Fair enough - I was grasping at straws :) I have deleted my comment so as not to crud up this thread too much. Seems like you may have found the cause - good work David!

stgraber commented 6 years ago

@Kramerican oh, so in your case you had some custom network config for your containers? It wasn't a straight lxc launch ubuntu:bionic blah without any additional modification?

stgraber commented 6 years ago

@davidfavor do you see something in your dmesg, like the DENIAL listed above here?

And to be clear, the problem is also specifically with Ubuntu 18.04 containers running on either a 18.04 or 16.04 host, non-18.04 containers work fine for you right?

Kramerican commented 6 years ago

@stgraber Yes and no. Just doing a straight lxc launch ubuntu:bionic blah without any additional modification also resulted in the apparmor error message and DHCP not coming up. So I don't think it's my network config per se at fault here.

stgraber commented 6 years ago

Ok, I don't suppose you've got that happening on a system that you could give me access to? That'd make this way easier to debug :)

Kramerican commented 6 years ago

@stgraber Of course! I'll send you a mail with details post-haste :)

davidfavor commented 6 years ago

@stgraber, Apologies. I should have been more clear about machine/container OS.

The container OS versions are yakkety, zesty, artful.

These containers run on machines running yakkety, zesty, artful, bionic.

As soon as any machine upgrades to bionic, all containers fail.

When Netplan processes /etc/netplan/50-cloud-init.yaml there are no errors or warnings; the DHCP address simply isn't defined. The cloud-init log files provide no hint of a possible problem (see the data in the Ubuntu cloud-init ticket referenced above).

Bionic (18.04) containers work fine running on machines where yakkety, zesty, or artful is installed.

It's only the machine update to bionic which causes all containers to fail.

davidfavor commented 6 years ago
dmesg | grep -i denial

Both machine level + container level produce no output.

This relates to all old containers + also the new test container I just launched, which has zero modifications.

davidfavor commented 6 years ago

@stgraber, I can provide you full access to the machine having problem.

Skype me at ID davidfavor to talk about access.

davidfavor commented 6 years ago

Geez. Bionic packaging is all screwed up.

1) dnsmasq-base isn't installed at machine level.

FIX: apt-get install dnsmasq-base

2) named starts ahead of LXD setup and hijacks all interfaces, so named ends up listening on 10.X.X.X before dnsmasq, producing /var/log/syslog messages of... (see the port check sketched below)

dnsmasq[1846]: failed to create listening socket for 10.245.137.1: Address already in use

FIX: apt-get purge bind9

3) After reboot, some progress is made.

LXD never starts + at least an error message is produced which might be useful:

Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory" t=2018-05-01T13:11:38+0000

So lxdbr0 never becomes available + dnsmasq can't run to route over the 10.X.X.X address.

Let me know what to debug next.
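For what it's worth, a quick check for this kind of port conflict before purging anything; ss ships with iproute2 on both releases:

# Which processes already hold the DNS (53) and DHCP (67) ports?
ss -ulpn | grep -E ':53|:67'
ss -tlpn | grep ':53'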

davidfavor commented 6 years ago
net10 # snap refresh
All snaps up to date.
net10 # snap list
Name  Version    Rev   Tracking  Developer  Notes
core  16-2.32.5  4486  stable    canonical  core
lxd   3.0.0      6862  stable    canonical  -

Machine history: this machine had a fresh Artful install running; no upgrade from any previous OS.

LXD was installed via snap, so no migration of containers was done.

Containers arrived on this machine via "lxc copy" from other machines.

davidfavor commented 6 years ago

Hum...

net10 # service lxd restart
Failed to restart lxd.service: Unit lxd.service not found.
net10 # apt-file search lxd.service
lxd: /lib/systemd/system/lxd.service

So... next question is... if there's no /lib/systemd/system/lxd.service installed via the snap install of LXD, how can lxd be restarted from the command line?

snap start lxd

Produces a loop of "Failed connecting to LXD daemon (attempt 690): Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory" until the waitready timeout expires.

stgraber commented 6 years ago

@davidfavor sounds like you're mixing the LXD deb and snap here. When you're using the snap, you don't need dnsmasq-base installed on the host as it's included inside the snap.

The systemd unit for the snap is snap.lxd.daemon, you shouldn't have a lxd.service or lxd.socket unit around.

I suspect the problem in your case was very much bind9 conflicting with the LXD dnsmasq. To fix that, removing bind9 and then bouncing LXD with systemctl reload snap.lxd.daemon should take care of it.

You should have had a dnsmasq error in the LXD log at /var/snap/lxd/common/lxd/logs/lxd.log indicating some kind of port/address conflict. After getting rid of the conflicting service and bouncing LXD, things should work again.
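For reference, the service lifecycle with the snap looks like this; the two forms should be equivalent:

# Restart snap-packaged LXD (there is no lxd.service with the snap):
systemctl restart snap.lxd.daemon
# or via snapd directly:
snap restart lxd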

davidfavor commented 6 years ago

No APT + SNAP mixing on this machine.

net10 # dpkg -l | grep lxd
net10 # dpkg -l | grep lxc

There is no bind9/named conflict anymore as I've removed the packages + rebooted.

systemctl reload snap.lxd.daemon
Job for snap.lxd.daemon.service failed because the control process exited with error code.
See "systemctl status snap.lxd.daemon.service" and "journalctl -xe" for details.

This is because lxd.daemon is trapped in "waitready 600" after boot.

So let me know if I should just kill the "lxd waitready --timeout=600" process + try the restart.

Had to kill the "lxd waitready --timeout=600" process, because the code seems broken.

I'm guessing the code's broken as /var/log/syslog shows...

Connecting to LXD daemon (attempt 1030)

Which far exceeds --timeout=600 passed to lxd waitready.

stgraber commented 6 years ago

Yeah, kill that lxd waitready and then try the reload again; if that still fails, show the output of:

davidfavor commented 6 years ago

Issuing systemctl reload snap.lxd.daemon after killing off lxd waitready --timeout=600 just starts another lxd waitready --timeout=600 process, so this fails also.

davidfavor commented 6 years ago
net10 # systemctl status snap.lxd.daemon
● snap.lxd.daemon.service - Service for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; enabled; vendor preset: enabled)
   Active: active (running) (Result: exit-code) since Tue 2018-05-01 08:50:33 CDT; 3min 41s ago
  Process: 34806 ExecReload=/usr/bin/snap run --command=reload lxd.daemon (code=exited, status=1/FAILURE)
 Main PID: 24257 (daemon.start)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/snap.lxd.daemon.service
           ‣ 24257 /bin/sh /snap/lxd/6862/commands/daemon.start

May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: ==> Escaping the systemd cgroups
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: ==> Escaping the systemd process resource limits
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: => Re-using existing LXCFS
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: => Starting LXD
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-05-01T13:50:34+0000
May 01 08:52:29 net10.bizcooker.com systemd[1]: Reloading Service for snap application lxd.daemon.
May 01 08:52:29 net10.bizcooker.com systemd[1]: Reloaded Service for snap application lxd.daemon.
May 01 08:53:05 net10.bizcooker.com systemd[1]: Reloading Service for snap application lxd.daemon.
May 01 08:53:05 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=1
May 01 08:53:05 net10.bizcooker.com systemd[1]: Reload failed for Service for snap application lxd.daemon.
davidfavor commented 6 years ago
-- Reboot --
May 01 08:10:27 net10.bizcooker.com systemd[1]: Started Service for snap application lxd.daemon.
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: => Preparing the system
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: ==> Loading snap configuration
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: ==> Setting up mntns symlink
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: ==> Setting up kmod wrapper
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: ==> Preparing /boot
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: ==> Preparing a clean copy of /run
May 01 08:10:31 net10.bizcooker.com lxd.daemon[1047]: ==> Preparing a clean copy of /etc
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: ==> Setting up ceph configuration
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: ==> Setting up LVM configuration
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: ==> Rotating logs
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: ==> Setting up ZFS (0.7)
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: ==> Escaping the systemd cgroups
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: ==> Escaping the systemd process resource limits
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: => Starting LXCFS
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: => Starting LXD
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: mount namespace: 5
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]: hierarchies:
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   0: fd:   6: devices
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   1: fd:   7: freezer
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   2: fd:   8: cpuset
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   3: fd:   9: perf_event
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   4: fd:  10: rdma
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   5: fd:  11: blkio
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   6: fd:  12: cpu,cpuacct
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   7: fd:  13: hugetlb
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   8: fd:  14: net_cls,net_prio
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:   9: fd:  15: memory
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:  10: fd:  16: pids
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:  11: fd:  17: name=systemd
May 01 08:10:32 net10.bizcooker.com lxd.daemon[1047]:  12: fd:  18: unified
May 01 08:20:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
May 01 08:20:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
May 01 08:20:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
May 01 08:20:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 1.
May 01 08:20:33 net10.bizcooker.com systemd[1]: Stopped Service for snap application lxd.daemon.
May 01 08:20:33 net10.bizcooker.com systemd[1]: Started Service for snap application lxd.daemon.
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: => Preparing the system
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Loading snap configuration
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Setting up mntns symlink
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Setting up kmod wrapper
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Preparing /boot
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Preparing a clean copy of /run
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Preparing a clean copy of /etc
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Setting up ceph configuration
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Setting up LVM configuration
May 01 08:20:33 net10.bizcooker.com lxd.daemon[6776]: ==> Rotating logs
May 01 08:20:34 net10.bizcooker.com lxd.daemon[6776]: ==> Setting up ZFS (0.7)
May 01 08:20:34 net10.bizcooker.com lxd.daemon[6776]: ==> Escaping the systemd cgroups
May 01 08:20:34 net10.bizcooker.com lxd.daemon[6776]: ==> Escaping the systemd process resource limits
May 01 08:20:34 net10.bizcooker.com lxd.daemon[6776]: => Re-using existing LXCFS
May 01 08:20:34 net10.bizcooker.com lxd.daemon[6776]: => Starting LXD
May 01 08:20:34 net10.bizcooker.com lxd.daemon[6776]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-05-01T13:20:34+0000
May 01 08:30:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
May 01 08:30:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
May 01 08:30:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
May 01 08:30:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 2.
May 01 08:30:34 net10.bizcooker.com systemd[1]: Stopped Service for snap application lxd.daemon.
May 01 08:30:34 net10.bizcooker.com systemd[1]: Started Service for snap application lxd.daemon.
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: => Preparing the system
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Loading snap configuration
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Setting up mntns symlink
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Setting up kmod wrapper
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Preparing /boot
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Preparing a clean copy of /run
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Preparing a clean copy of /etc
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Setting up ceph configuration
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Setting up LVM configuration
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Rotating logs
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Setting up ZFS (0.7)
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Escaping the systemd cgroups
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: ==> Escaping the systemd process resource limits
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: => Re-using existing LXCFS
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: => Starting LXD
May 01 08:30:34 net10.bizcooker.com lxd.daemon[12938]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-05-01T13:30:34+0000
May 01 08:34:20 net10.bizcooker.com systemd[1]: Reloading Service for snap application lxd.daemon.
May 01 08:34:20 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=1
May 01 08:34:20 net10.bizcooker.com systemd[1]: Reload failed for Service for snap application lxd.daemon.
May 01 08:34:34 net10.bizcooker.com systemd[1]: Reloading Service for snap application lxd.daemon.
May 01 08:34:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=1
May 01 08:34:34 net10.bizcooker.com systemd[1]: Reload failed for Service for snap application lxd.daemon.
May 01 08:40:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
May 01 08:40:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
May 01 08:40:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
May 01 08:40:34 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 3.
May 01 08:40:34 net10.bizcooker.com systemd[1]: Stopped Service for snap application lxd.daemon.
May 01 08:40:34 net10.bizcooker.com systemd[1]: Started Service for snap application lxd.daemon.
May 01 08:40:34 net10.bizcooker.com lxd.daemon[18639]: => Preparing the system
May 01 08:40:34 net10.bizcooker.com lxd.daemon[18639]: ==> Loading snap configuration
May 01 08:40:34 net10.bizcooker.com lxd.daemon[18639]: ==> Setting up mntns symlink
May 01 08:40:34 net10.bizcooker.com lxd.daemon[18639]: ==> Setting up kmod wrapper
May 01 08:40:34 net10.bizcooker.com lxd.daemon[18639]: ==> Preparing /boot
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Preparing a clean copy of /run
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Preparing a clean copy of /etc
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Setting up ceph configuration
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Setting up LVM configuration
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Rotating logs
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Setting up ZFS (0.7)
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Escaping the systemd cgroups
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: ==> Escaping the systemd process resource limits
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: => Re-using existing LXCFS
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: => Starting LXD
May 01 08:40:35 net10.bizcooker.com lxd.daemon[18639]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-05-01T13:40:35+0000
May 01 08:50:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=143/n/a
May 01 08:50:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
May 01 08:50:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
May 01 08:50:33 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 4.
May 01 08:50:33 net10.bizcooker.com systemd[1]: Stopped Service for snap application lxd.daemon.
May 01 08:50:33 net10.bizcooker.com systemd[1]: Started Service for snap application lxd.daemon.
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: => Preparing the system
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Loading snap configuration
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Setting up mntns symlink
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Setting up kmod wrapper
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Preparing /boot
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Preparing a clean copy of /run
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Preparing a clean copy of /etc
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Setting up ceph configuration
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Setting up LVM configuration
May 01 08:50:33 net10.bizcooker.com lxd.daemon[24257]: ==> Rotating logs
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: ==> Setting up ZFS (0.7)
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: ==> Escaping the systemd cgroups
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: ==> Escaping the systemd process resource limits
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: => Re-using existing LXCFS
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: => Starting LXD
May 01 08:50:34 net10.bizcooker.com lxd.daemon[24257]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-05-01T13:50:34+0000
May 01 08:52:29 net10.bizcooker.com systemd[1]: Reloading Service for snap application lxd.daemon.
May 01 08:52:29 net10.bizcooker.com systemd[1]: Reloaded Service for snap application lxd.daemon.
May 01 08:53:05 net10.bizcooker.com systemd[1]: Reloading Service for snap application lxd.daemon.
May 01 08:53:05 net10.bizcooker.com systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=1
May 01 08:53:05 net10.bizcooker.com systemd[1]: Reload failed for Service for snap application lxd.daemon.
stgraber commented 6 years ago

Ok, so LXD is pretty unhappy about something.

What do you get if you run:

lxd --debug --group lxd
davidfavor commented 6 years ago

Dude! You've got me laughing now.

LXD == unhappy. Yep.

net10 # lxd --debug --group lxd
DBUG[05-01|14:08:03] Connecting to a local LXD over a Unix socket 
DBUG[05-01|14:08:03] Sending request to LXD                   etag= method=GET url=http://unix.socket/1.0
DBUG[05-01|14:08:03] Got response struct from LXD 
DBUG[05-01|14:08:03] 
    {
        "config": {
            "core.https_address": ":8443",
            "core.trust_password": true
        },
        "api_extensions": [
            "storage_zfs_remove_snapshots",
            "container_host_shutdown_timeout",
            "container_stop_priority",
            "container_syscall_filtering",
            "auth_pki",
            "container_last_used_at",
            "etag",
            "patch",
            "usb_devices",
            "https_allowed_credentials",
            "image_compression_algorithm",
            "directory_manipulation",
            "container_cpu_time",
            "storage_zfs_use_refquota",
            "storage_lvm_mount_options",
            "network",
            "profile_usedby",
            "container_push",
            "container_exec_recording",
            "certificate_update",
            "container_exec_signal_handling",
            "gpu_devices",
            "container_image_properties",
            "migration_progress",
            "id_map",
            "network_firewall_filtering",
            "network_routes",
            "storage",
            "file_delete",
            "file_append",
            "network_dhcp_expiry",
            "storage_lvm_vg_rename",
            "storage_lvm_thinpool_rename",
            "network_vlan",
            "image_create_aliases",
            "container_stateless_copy",
            "container_only_migration",
            "storage_zfs_clone_copy",
            "unix_device_rename",
            "storage_lvm_use_thinpool",
            "storage_rsync_bwlimit",
            "network_vxlan_interface",
            "storage_btrfs_mount_options",
            "entity_description",
            "image_force_refresh",
            "storage_lvm_lv_resizing",
            "id_map_base",
            "file_symlinks",
            "container_push_target",
            "network_vlan_physical",
            "storage_images_delete",
            "container_edit_metadata",
            "container_snapshot_stateful_migration",
            "storage_driver_ceph",
            "storage_ceph_user_name",
            "resource_limits",
            "storage_volatile_initial_source",
            "storage_ceph_force_osd_reuse",
            "storage_block_filesystem_btrfs",
            "resources",
            "kernel_limits",
            "storage_api_volume_rename",
            "macaroon_authentication",
            "network_sriov",
            "console",
            "restrict_devlxd",
            "migration_pre_copy",
            "infiniband",
            "maas_network",
            "devlxd_events",
            "proxy",
            "network_dhcp_gateway",
            "file_get_symlink",
            "network_leases",
            "unix_device_hotplug",
            "storage_api_local_volume_handling",
            "operation_description",
            "clustering",
            "event_lifecycle",
            "storage_api_remote_volume_handling",
            "nvidia_runtime"
        ],
        "api_status": "stable",
        "api_version": "1.0",
        "auth": "trusted",
        "public": false,
        "auth_methods": [
            "tls"
        ],
        "environment": {
            "addresses": [
                ":8443"
            ],
            "architectures": [
                "x86_64",
                "i686"
            ],
            "certificate": "-----BEGIN CERTIFICATE-----\nMIIFTzCCAzegAwIBAgIQGsJzQkIE1hqVEWDhRnSHmTANBgkqhkiG9w0BAQsFADAz\nMRwwGgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRMwEQYDVQQDDApyb290QG5l\ndDEwMB4XDTE4MDMwNTE4MjIyMVoXDTI4MDMwMjE4MjIyMVowMzEcMBoGA1UEChMT\nbGludXhjb250YWluZXJzLm9yZzETMBEGA1UEAwwKcm9vdEBuZXQxMDCCAiIwDQYJ\nKoZIhvcNAQEBBQADggIPADCCAgoCggIBANQGHoWaz4caeFjFOok26mENtjhrhm/G\nJ3+9JJEdJreC6Jm3hqo6pypZVgxPZ5sdYRWWr1cSWJ4QgvkuKGGfprHsSxyJ2loL\n2yROd4uvWcMUIe3f+1B5lrSR5aC+RRFCaC8g/G+EfOWJvZDNJmAqTBDkZqtrgUR/\ng5UA4WTtwcDo8P2a+LnlcjXrdNCnyJn2Uqy6M1i5SPaPffgLL709nlWyvb1tBUg3\nshs139ZeMaxI9ZbO3Ucilv8qgQXPLUuzJofDAa9So2wDhtaL2Vd4zZWWSZXoBU9f\nqa8R+cEMpm4yWpNQ1l3RymSsshidkG9YGbGp2bfYq5nFAU4vDQX4HFNWXRQh9HTJ\n2JZHai4gEgtsiz9XUW32HcFYPp2+I0lERwYAfuMvYo5Exyfufm3VsvTQhc+7X8Vh\nCylSqqPieKYCzd1P11gOU6kDkBGTUCddnR5JgbCQaW/uwkATcyvHiuWmb8KKjMDU\nyYmoGlG8UM1zUi2pZYEi+y1VnMaMAty3CzM78IgU2kmHbViiVXjMu3VDgxX5h+IM\nP+2CZagMcsZ+aEMTh4lcgacC0xubNJkDISWlDReaGqZlD9SIHpSFq785rRxlMahR\nifHA2TPS78PUF/IoNOb56NOjFcQovgWq+tgp/EM3kE6SdoXqLbSyVruS8swdp2hX\n6cwW2w6WNwiTAgMBAAGjXzBdMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggr\nBgEFBQcDATAMBgNVHRMBAf8EAjAAMCgGA1UdEQQhMB+CBW5ldDEwhwSeRX6vhxAm\nB1MAAGCTrwAAAAAAAAAAMA0GCSqGSIb3DQEBCwUAA4ICAQCabzlzG4+ISjk6e9Ca\nt0CZ6HpnYCZfd7uPc/Wqc2emfse1NOfpdTY/x98o3mSi74x4wi7JHZ/CAQM6XnSo\nGJeHmKO1UPGS8xzgRew7wOrxcQY1b/txTicpGbzJA5n1MXHzxzBLs1DM7gttUCwn\n9qDLWGeZAgECJnBjOk2Y/0Wt0PPWnzMn2T7POf9R4h2gIyZHRfdJOqXDI/RMf+Zh\nh5cw3AvS9zRpy92QOzhGy9QxdbFE+ZViO0NzA2RC8lMzZprqbrPWglGOo/vnK7Vx\n8wPECiCzdDxZ90yuK5Hk9dVx+haXlitQJLEQ0/S2D1ZA9SES6Wm+kzucANdRB0JX\ne9GfTPWw6uyFBYB5aBppjy06lSJXDTdi0lqjbKWFER/3ZI6Dy1xYOdv6Z2i3ZRhM\nAN615gIb21jCnZltQvj+ICp2UGh6nQxo/2y80nJCMSPBO/R4+qnIBqcEUh3lDKHt\n0OYlAirfYqKE6lqrbOHjpwIdsyMfWTY9/rppZotV7m2oacBEq5aMU+gZqbN57vnb\nV6kJbSs0e7ayjgeB/p5OIgARphSiuTCHNTBhG0pRu4JN4PzFLXVTzqkt1drQOKAp\n+SdJnr0hvPZSxPXRPOTer73PvyyCASwiUE3ED2G2ESF6kPjmHES+PGd6tiIpkFbs\n1sJpykR/jOxj6Z4gce/gJQbcHQ==\n-----END CERTIFICATE-----\n",
            "certificate_fingerprint": "e61592926f6a55523d6edaf736b38aa71d75d53d409a8684b74c1a72057e4876",
            "driver": "lxc",
            "driver_version": "3.0.0",
            "kernel": "Linux",
            "kernel_architecture": "x86_64",
            "kernel_version": "4.15.0-20-generic",
            "server": "lxd",
            "server_pid": 38205,
            "server_version": "3.0.0",
            "storage": "dir",
            "storage_version": "1",
            "server_clustered": false,
            "server_name": "net10.bizcooker.com"
        }
    } 
EROR[05-01|14:08:03] Failed to start the daemon: LXD is already running 
INFO[05-01|14:08:03] Starting shutdown sequence 
DBUG[05-01|14:08:03] Not unmounting temporary filesystems (containers are still running) 
INFO[05-01|14:08:03] Saving simplestreams cache 
INFO[05-01|14:08:03] Saved simplestreams cache 
Error: LXD is already running
stgraber commented 6 years ago

"already running", hmm - can you paste a full ps fauxww? I wonder if there are some leftovers somewhere in there.

davidfavor commented 6 years ago

Rebooting now to provide clean info, as killing off lxd waitready created many partially running containers + made the ps output thousands of lines long.

davidfavor commented 6 years ago

After reboot, with lxd waitready hung...

net10 # lxd --debug --group lxd
INFO[05-01|14:34:38] LXD 3.0.0 is starting in normal mode     path=/var/snap/lxd/common/lxd
INFO[05-01|14:34:38] Kernel uid/gid map: 
INFO[05-01|14:34:38]  - u 0 0 4294967295 
INFO[05-01|14:34:38]  - g 0 0 4294967295 
INFO[05-01|14:34:38] Configured LXD uid/gid map: 
INFO[05-01|14:34:38]  - u 0 1000000 1000000000 
INFO[05-01|14:34:38]  - g 0 1000000 1000000000 
WARN[05-01|14:34:38] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[05-01|14:34:38] Initializing local database 
INFO[05-01|14:34:38] Initializing database gateway 
INFO[05-01|14:34:38] Start database node                      address= id=1
INFO[05-01|14:34:38] Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}] 
INFO[05-01|14:34:38] Raft: Node at 0 [Leader] entering Leader state 
INFO[05-01|14:34:38] LXD isn't socket activated 
INFO[05-01|14:34:38] Starting /dev/lxd handler: 
INFO[05-01|14:34:38]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-01|14:34:38] REST API daemon: 
INFO[05-01|14:34:38]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-01|14:34:38]  - binding TCP socket                    socket=[::]:8443
INFO[05-01|14:34:38] Initializing global database
# hangs here, likely waiting for lxd waitready to end

The lxc command hangs at the same place...

net10 # lxd --debug --group lxd
DBUG[05-01|14:35:33] Connecting to a local LXD over a Unix socket 
DBUG[05-01|14:35:33] Sending request to LXD                   etag= method=GET url=http://unix.socket/1.0
INFO[05-01|14:35:33] LXD 3.0.0 is starting in normal mode     path=/var/snap/lxd/common/lxd
INFO[05-01|14:35:33] Kernel uid/gid map: 
INFO[05-01|14:35:33]  - u 0 0 4294967295 
INFO[05-01|14:35:33]  - g 0 0 4294967295 
INFO[05-01|14:35:33] Configured LXD uid/gid map: 
INFO[05-01|14:35:33]  - u 0 1000000 1000000000 
INFO[05-01|14:35:33]  - g 0 1000000 1000000000 
WARN[05-01|14:35:33] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[05-01|14:35:33] Initializing local database 
INFO[05-01|14:35:33] Initializing database gateway 
INFO[05-01|14:35:33] Start database node                      address= id=1
INFO[05-01|14:35:33] Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}] 
INFO[05-01|14:35:33] Raft: Node at 0 [Leader] entering Leader state 
INFO[05-01|14:35:33] LXD isn't socket activated 
DBUG[05-01|14:35:33] Connecting to a local LXD over a Unix socket 
DBUG[05-01|14:35:33] Sending request to LXD                   etag= method=GET url=http://unix.socket/1.0
DBUG[05-01|14:35:33] Detected stale unix socket, deleting 
DBUG[05-01|14:35:33] Detected stale unix socket, deleting 
INFO[05-01|14:35:33] Starting /dev/lxd handler: 
INFO[05-01|14:35:33]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-01|14:35:33] REST API daemon: 
INFO[05-01|14:35:33]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-01|14:35:33]  - binding TCP socket                    socket=[::]:8443
INFO[05-01|14:35:33] Initializing global database 
davidfavor commented 6 years ago

The bottom line seems to be that /var/snap/lxd/common/lxd/unix.socket never gets created.

Let me know how to debug the process which creates the socket + maybe that will net some better information.
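Two low-tech ways to watch for the socket while the daemon starts, using the snap paths from the logs above:

# Has the socket appeared yet?
ls -la /var/snap/lxd/common/lxd/unix.socket
# Follow the daemon's own journal during startup:
journalctl -u snap.lxd.daemon -f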

davidfavor commented 6 years ago

lxd --debug finally came back. It took 2+ minutes to create the socket.

net10 # r
lxd --debug --group lxd
INFO[05-01|14:38:55] LXD 3.0.0 is starting in normal mode     path=/var/snap/lxd/common/lxd
INFO[05-01|14:38:55] Kernel uid/gid map: 
INFO[05-01|14:38:55]  - u 0 0 4294967295 
INFO[05-01|14:38:55]  - g 0 0 4294967295 
INFO[05-01|14:38:55] Configured LXD uid/gid map: 
INFO[05-01|14:38:55]  - u 0 1000000 1000000000 
INFO[05-01|14:38:55]  - g 0 1000000 1000000000 
WARN[05-01|14:38:55] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[05-01|14:38:55] Initializing local database 
INFO[05-01|14:38:55] Initializing database gateway 
INFO[05-01|14:38:55] Start database node                      address= id=1
INFO[05-01|14:38:55] Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}] 
INFO[05-01|14:38:55] Raft: Node at 0 [Leader] entering Leader state 
INFO[05-01|14:38:55] LXD isn't socket activated 
INFO[05-01|14:38:55] Starting /dev/lxd handler: 
INFO[05-01|14:38:55]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-01|14:38:55] REST API daemon: 
INFO[05-01|14:38:55]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-01|14:38:55]  - binding TCP socket                    socket=[::]:8443
INFO[05-01|14:38:55] Initializing global database 
INFO[05-01|14:40:11] Initializing storage pools 
DBUG[05-01|14:40:11] Initializing and checking storage pool "default". 
DBUG[05-01|14:40:11] Initializing a DIR driver. 
DBUG[05-01|14:40:11] Checking DIR storage pool "default". 
DBUG[05-01|14:40:11] Initializing a DIR driver. 
INFO[05-01|14:40:11] Initializing networks 
INFO[05-01|14:40:12] Loading configuration 
DBUG[05-01|14:40:12] Initialized inotify with file descriptor 16 
INFO[05-01|14:40:12] Pruning expired images 
DBUG[05-01|14:40:12] handling                                 ip=@ method=GET url=/1.0
INFO[05-01|14:40:12] Done pruning expired images 
INFO[05-01|14:40:12] Expiring log files 
INFO[05-01|14:40:12] Updating instance types 
DBUG[05-01|14:40:12] handling                                 ip=@ method=GET url=/internal/ready
INFO[05-01|14:40:12] Updating images 
DBUG[05-01|14:40:12] Processing image                         alias=18.04 fp=b36ec647e374da4816104a98807633a2cc387488083d3776557081c4d0333618 protocol=simplestreams server=https://cloud-images.ubuntu.com/releases
DBUG[05-01|14:40:12] Connecting to a remote simplestreams server 
INFO[05-01|14:40:13] Done expiring log files 
DBUG[05-01|14:40:13] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:13] handling                                 ip=@ method=GET url=/internal/ready
DBUG[05-01|14:40:14] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:14] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:14] handling                                 ip=@ method=GET url=/internal/ready
DBUG[05-01|14:40:14] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:14] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:14] handling                                 ip=@ method=GET url=/internal/ready
INFO[05-01|14:40:15] Done updating instance types 
DBUG[05-01|14:40:15] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:15] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:15] handling                                 ip=@ method=GET url=/internal/ready
DBUG[05-01|14:40:16] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:16] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:16] handling                                 ip=@ method=GET url=/internal/ready
DBUG[05-01|14:40:16] Image already exists in the db           image=b36ec647e374da4816104a98807633a2cc387488083d3776557081c4d0333618
DBUG[05-01|14:40:16] Already up to date                       fp=b36ec647e374da4816104a98807633a2cc387488083d3776557081c4d0333618
DBUG[05-01|14:40:16] Processing image                         alias=17.10 fp=f7febb8cbebc6aa8a993eb1ce534963a6b288fde23b9594bb3ba4560704dd65c protocol=simplestreams server=https://cloud-images.ubuntu.com/releases
DBUG[05-01|14:40:16] Using SimpleStreams cache entry          expiry=2018-05-01T15:40:16+0000 server=https://cloud-images.ubuntu.com/releases
DBUG[05-01|14:40:16] Image already exists in the db           image=f7febb8cbebc6aa8a993eb1ce534963a6b288fde23b9594bb3ba4560704dd65c
DBUG[05-01|14:40:16] Already up to date                       fp=f7febb8cbebc6aa8a993eb1ce534963a6b288fde23b9594bb3ba4560704dd65c
INFO[05-01|14:40:16] Done updating images 
DBUG[05-01|14:40:17] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:17] handling                                 ip=@ method=GET url=/1.0
DBUG[05-01|14:40:17] handling                                 ip=@ method=GET url=/internal/ready
davidfavor commented 6 years ago

Sigh... And the socket is unusable...

net10 # ll /var/snap/lxd/common/lxd/unix.socket
srw-rw---- 1 root lxd 0 May  1 09:42 /var/snap/lxd/common/lxd/unix.socket=

net10 # lxc list
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused
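
Note that the socket file existing does not mean the daemon is accepting on it; "connection refused" usually means there is no live listener behind the file. A quick way to tell the two apart (a sketch, run as root):

# ss only lists sockets that currently have a listener attached
ss -lxp | grep unix.socket || echo "socket file is stale: no process is accepting on it"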
stgraber commented 6 years ago

And you still have that first shell running, right? You didn't ctrl+c the shell you're running "lxd --debug --group lxd" from?

It looks like your database took two minutes to come online, which is a bit unusual; @freeekanayaka is already looking into a similar issue.

davidfavor commented 6 years ago

Yes, I left the original lxd --debug --group lxd command running, which is when I saw the 2-minute lag.

If I run the same command again, it hangs at Initializing global database again.

davidfavor commented 6 years ago

Running this command again, I get...

net10# lxd --debug --group lxd
INFO[05-01|14:51:04] LXD 3.0.0 is starting in normal mode     path=/var/snap/lxd/common/lxd
INFO[05-01|14:51:04] Kernel uid/gid map: 
INFO[05-01|14:51:04]  - u 0 0 4294967295 
INFO[05-01|14:51:04]  - g 0 0 4294967295 
INFO[05-01|14:51:04] Configured LXD uid/gid map: 
INFO[05-01|14:51:04]  - u 0 1000000 1000000000 
INFO[05-01|14:51:04]  - g 0 1000000 1000000000 
WARN[05-01|14:51:04] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[05-01|14:51:04] Initializing local database 
INFO[05-01|14:51:04] Initializing database gateway 
INFO[05-01|14:51:04] Start database node                      address= id=1
INFO[05-01|14:51:04] Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}] 
INFO[05-01|14:51:04] Raft: Node at 0 [Leader] entering Leader state 
INFO[05-01|14:51:04] LXD isn't socket activated 
INFO[05-01|14:51:04] Starting /dev/lxd handler: 
INFO[05-01|14:51:04]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-01|14:51:04] REST API daemon: 
INFO[05-01|14:51:04]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-01|14:51:04]  - binding TCP socket                    socket=[::]:8443
INFO[05-01|14:51:04] Initializing global database 
2018/05/01 14:51:50 http: TLS handshake error from 196.52.43.109:6666: tls: client offered an unsupported, maximum protocol version of 302
DBUG[05-01|14:52:04] Database error: failed to begin transaction: gRPC BEGIN response error: rpc error: code = Unknown desc = failed to handle BEGIN request: FSM out of sync: timed out enqueuing operation 
EROR[05-01|14:52:04] Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to begin transaction: gRPC BEGIN response error: rpc error: code = Unknown desc = failed to handle BEGIN request: FSM out of sync: timed out enqueuing operation 
INFO[05-01|14:52:04] Starting shutdown sequence 
INFO[05-01|14:52:04] Stopping REST API handler: 
INFO[05-01|14:52:04]  - closing socket                        socket=[::]:8443
INFO[05-01|14:52:04]  - closing socket                        socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-01|14:52:04] Stopping /dev/lxd handler 
INFO[05-01|14:52:04]  - closing socket                        socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-01|14:52:04] Stop database gateway 
INFO[05-01|14:52:04] Stop raft instance 
INFO[05-01|14:52:05] Stopping REST API handler: 
INFO[05-01|14:52:05] Stopping /dev/lxd handler 
INFO[05-01|14:52:05] Stopping REST API handler: 
INFO[05-01|14:52:05] Stopping /dev/lxd handler 
DBUG[05-01|14:52:05] Not unmounting temporary filesystems (containers are still running) 
INFO[05-01|14:52:05] Saving simplestreams cache 
INFO[05-01|14:52:05] Saved simplestreams cache 
Error: failed to open cluster database: failed to ensure schema: failed to begin transaction: gRPC BEGIN response error: rpc error: code = Unknown desc = failed to handle BEGIN request: FSM out of sync: timed out enqueuing operation
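
The "FSM out of sync: timed out enqueuing operation" error suggests the raft/dqlite layer is still replaying its log, or is wedged, when the schema transaction is attempted. As a rough first check (assuming the snap layout above; a large database directory may explain a slow replay):

du -sh /var/snap/lxd/common/lxd/database
ls -la /var/snap/lxd/common/lxd/database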
davidfavor commented 6 years ago

Ah... I also see that lxd.db is missing.

net10 # ll /var/snap/lxd/common/lxd
total 136K
drwx------  2 root root 4.0K Mar  5 12:58 cache/
drwx--x--x  2 root root 4.0K Apr 30 13:57 containers/
drwx------  3 root root 4.0K Apr 30 06:14 database/
drwx--x--x 18 root root 4.0K Apr 30 13:57 devices/
drwxr-xr-x  2 root root 4.0K Mar  5 12:22 devlxd/
drwx------  2 root root 4.0K Mar  5 12:22 disks/
drwx------  2 root root 4.0K Apr 30 13:54 images/
drwx------ 18 root root 4.0K May  1 09:40 logs/
drwx--x--x  3 root root 4.0K Mar  5 12:24 networks/
drwx------  4 root root 4.0K Mar  8 07:59 security/
drwx--x--x  2 root root 4.0K Mar  5 12:22 shmounts/
drwx------  2 root root 4.0K Mar  5 12:22 snapshots/
drwx--x--x  3 root root 4.0K Mar  5 12:24 storage-pools/
-rw-r--r--  1 root root  74K Apr  3 03:52 lxd.db.bak
-rw-r--r--  1 root root 1.9K Mar  5 12:22 server.crt
-rw-------  1 root root 3.2K Mar  5 12:22 server.key
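
For what it's worth, a missing lxd.db is expected on LXD 3.0: the old SQLite file is migrated into the database/ directory on first start (lxd.db.bak being the pre-migration backup), so its absence is likely not the problem here. To confirm, assuming the snap layout above:

# LXD 3.0 keeps node-local state in database/local.db and
# the cluster (dqlite) data under database/global/
ls -la /var/snap/lxd/common/lxd/database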