canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD/LXC non-functional after upgrading from Xenial -> Bionic #4509

Closed samvde closed 6 years ago

samvde commented 6 years ago

Required information

Issue description

I have upgraded a home NAS to Bionic using 'do-release-upgrade -d', which apparently upgrades LXD from version 2 to 3 as well. This system is a fileserver, and it has a few non-critical containers (which is the sole reason I haven't rolled back to before the upgrade).

I followed the standard upgrade procedure. After upgrading, LXD is completely non-functional for me.

I see 2 big issues:

  1. lxd-containers/lxd.service won't start.

lxd-containers.service status:

systemctl status lxd-containers.service
● lxd-containers.service - LXD - container startup/shutdown
   Loaded: loaded (/lib/systemd/system/lxd-containers.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2018-04-29 13:55:33 CEST; 8min ago
     Docs: man:lxd(1)
  Process: 25371 ExecStart=/usr/bin/lxd activateifneeded (code=exited, status=1/FAILURE)
 Main PID: 25371 (code=exited, status=1/FAILURE)

Apr 29 13:55:30 willie systemd[1]: Starting LXD - container startup/shutdown...
Apr 29 13:55:33 willie lxd[25371]: Error: Get http://unix.socket/1.0: EOF
Apr 29 13:55:33 willie systemd[1]: lxd-containers.service: Main process exited, code=exited, status=1/FAILURE
Apr 29 13:55:33 willie systemd[1]: lxd-containers.service: Failed with result 'exit-code'.
Apr 29 13:55:33 willie systemd[1]: Failed to start LXD - container startup/shutdown.

I have not been able to troubleshoot the "Error: Get http://unix.socket/1.0: EOF" error. I noticed this error is sometimes related to user permissions, but these did not change, and I have tried the relevant commands both as my own user and as root.

Current permissions (removed 'x' from output):

grep lxd /etc/group
lxd::124:sam

grep lxd /etc/passwd
lxd::117:65534::/var/lib/lxd/:/bin/false
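As an aside, a quick way to double-check that the socket itself is owned by the lxd group and that the group membership is active in the current session (a minimal sketch using standard commands; the socket path and user name are simply the ones shown above):

# Socket should be owned by root:lxd
ls -l /var/lib/lxd/unix.socket

# 'lxd' must appear in the user's group list; group changes only take
# effect in new login sessions
id sam
getent group lxd

# Force the lxd group for the current shell without logging out
newgrp lxd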

lxd.socket status:

systemctl status lxd.socket
● lxd.socket - LXD - unix socket
   Loaded: loaded (/lib/systemd/system/lxd.socket; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2018-04-29 02:54:58 CEST; 11h ago
     Docs: man:lxd(1)
   Listen: /var/lib/lxd/unix.socket (Stream)
    Tasks: 0 (limit: 4607)
   CGroup: /system.slice/lxd.socket

Apr 29 02:54:58 willie systemd[1]: Starting LXD - unix socket.
Apr 29 02:54:58 willie systemd[1]: Listening on LXD - unix socket.

lxd.service status:

systemctl status lxd.service
● lxd.service - LXD - main daemon
   Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
   Active: activating (start-post) (Result: exit-code) since Sun 2018-04-29 14:05:31 CEST; 2min 0s ago
     Docs: man:lxd(1)
  Process: 30833 ExecStart=/usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log (code=exited, status=1/FAILURE)
  Process: 30814 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
 Main PID: 30833 (code=exited, status=1/FAILURE); Control PID: 30834 (lxd)
    Tasks: 7
   CGroup: /system.slice/lxd.service
           └─30834 /usr/lib/lxd/lxd waitready --timeout=600

Apr 29 14:05:33 willie lxd[30833]: lvl=warn msg="Storage volumes database already contains an entry for the container." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=warn msg="Storage volumes database already contains an entry for the container." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=warn msg="Storage volumes database already contains an entry for the container." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=warn msg="Storage volumes database already contains an entry for the container." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=warn msg="Storage volumes database already contains an entry for the container." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=eror msg="Failed to clear old profile configuration for profile default: no such table: profiles_config." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=eror msg="Failed to clear old profile configuration for profile docker: no such table: profiles_config." t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: lvl=eror msg="Failed to start the daemon: ZFS storage pool \"default\" could not be imported: cannot import 'default': no such pool available\n" t=2018-04-29T14:05:33+0200
Apr 29 14:05:33 willie lxd[30833]: Error: ZFS storage pool "default" could not be imported: cannot import 'default': no such pool available
Apr 29 14:05:33 willie systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
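The same errors can also be pulled out of the journal and the daemon logfile referenced in the ExecStart line above; a minimal sketch with standard commands (the logfile path is the one from the unit, adjust if yours differs):

# Last journal entries written by the failing unit
journalctl -u lxd.service --no-pager -n 50

# The unit starts lxd with --logfile=/var/log/lxd/lxd.log, so the same
# errors should be in that file as well
tail -n 50 /var/log/lxd/lxd.log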

I'm able to run sudo lxd --group lxd --verbose --debug:

sudo lxd --group lxd --verbose --debug
DBUG[04-29|14:19:46] Connecting to a local LXD over a Unix socket
DBUG[04-29|14:19:46] Sending request to LXD etag= method=GET url=http://unix.socket/1.0
INFO[04-29|14:25:35] LXD 3.0.0 is starting in normal mode path=/var/lib/lxd
INFO[04-29|14:25:35] Kernel uid/gid map:
INFO[04-29|14:25:35]  - u 0 0 4294967295
INFO[04-29|14:25:35]  - g 0 0 4294967295
INFO[04-29|14:25:35] Configured LXD uid/gid map:
INFO[04-29|14:25:35]  - u 0 231072 65536
INFO[04-29|14:25:35]  - g 0 231072 65536
WARN[04-29|14:25:35] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[04-29|14:25:35] Initializing database gateway
INFO[04-29|14:25:35] Start database node address= id=1
INFO[04-29|14:25:35] Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}]
INFO[04-29|14:25:35] Raft: Node at 0 [Leader] entering Leader state
INFO[04-29|14:25:35] LXD isn't socket activated
DBUG[04-29|14:19:46] Connecting to a local LXD over a Unix socket
DBUG[04-29|14:19:46] Sending request to LXD etag= method=GET url=http://unix.socket/1.0

Then this basically repeats.

  2. lxd.db has been replaced with a new, empty/standard one

After the upgrade, I noticed the original lxd.db file was replaced with a new one:

ls -l /var/lib/lxd/lxd.db*
-rw-r--r-- 1 root root 72704 Apr 29 00:51 /var/lib/lxd/lxd.db
-rw-r--r-- 1 root root 58368 Apr 29 00:51 /var/lib/lxd/lxd.db.bak

I can confirm lxd.db.bak is the one I had before. I have tried restarting everything with the original lxd.db file, which failed.

Looking at the db files with sqlitebrowser, it's clear all of my config has been lost. A screenshot comparing the two:

[screenshot: side-by-side comparison of lxd.db and lxd.db.bak in sqlitebrowser]
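The same comparison can be done from the command line with the sqlite3 client instead of sqlitebrowser; a sketch only, assuming sqlite3 is installed and that lxd.db and lxd.db.bak are plain SQLite files (table names differ between the 2.x and 3.0 schemas, so the 'config' table below is just an example):

# List the tables in the new and the old database
sqlite3 /var/lib/lxd/lxd.db ".tables"
sqlite3 /var/lib/lxd/lxd.db.bak ".tables"

# Dump one table from the old database for comparison
sqlite3 /var/lib/lxd/lxd.db.bak "SELECT * FROM config;"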

So I'm quite stuck here. All ZFS datasets are mounted where they should be, and everything seems fine from that perspective:

ls -l /var/lib/lxd/containers/lxc-notrack
lrwxrwxrwx 1 root root 57 Apr 29 00:51 /var/lib/lxd/containers/lxc-notrack -> /var/lib/lxd/storage-pools/default/containers/lxc-notrack

ls -l /var/lib/lxd/containers/lxc-notrack/
total 22
-rw-r--r--  1 root   root   1566 Mar 30  2017 metadata.yaml
drwxr-xr-x 22 231072 231072   22 Apr 26  2017 rootfs
drwxr-xr-x  2 root   root      8 Mar 30  2017 templates
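For completeness, here is a rough sketch of how the ZFS side can be verified with standard zpool/zfs commands (the dataset name below is an assumption based on the default LXD layout; replace <pool> with the actual pool name):

# Pools currently imported and their datasets/mountpoints
zpool list
zfs list -o name,mountpoint

# Confirm the container's dataset is mounted where the symlink above
# points (dataset name is an assumption)
zfs get mounted,mountpoint <pool>/containers/lxc-notrack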

There are no LXD entries in the kernel log, but I do see LXC entries:

At start:

[ 71.303745] audit: type=1400 audit(1524962728.796:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=5156 comm="apparmor_parser"
[ 71.303748] audit: type=1400 audit(1524962728.796:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=5156 comm="apparmor_parser"
[ 71.303750] audit: type=1400 audit(1524962728.796:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=5156 comm="apparmor_parser"
[ 71.303751] audit: type=1400 audit(1524962728.796:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-nesting" pid=5156 comm="apparmor_parser"

Then repeating:

[42074.449732] audit: type=1400 audit(1525004731.627:391): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=30989 comm="apparmor_parser"
[42074.461308] audit: type=1400 audit(1525004731.639:392): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxc-container-default" pid=30993 comm="apparmor_parser"
[42074.461313] audit: type=1400 audit(1525004731.639:393): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxc-container-default-cgns" pid=30993 comm="apparmor_parser"
[42074.461316] audit: type=1400 audit(1525004731.639:394): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxc-container-default-with-mounting" pid=30993 comm="apparmor_parser"
[42074.461318] audit: type=1400 audit(1525004731.639:395): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxc-container-default-with-nesting" pid=30993 comm="apparmor_parser"

lxc list --debug --verbose
DBUG[04-29|14:41:24] Connecting to a local LXD over a Unix socket
DBUG[04-29|14:41:24] Sending request to LXD etag= method=GET url=http://unix.socket/1.0


stgraber commented 6 years ago

The upgrade to 3.0 moves most of the DB data from lxd.db over to raft/, so that's expected. lxd.db then just stores a partial schema with node-specific information.
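For anyone comparing on disk, this split is roughly visible as follows (a sketch only; the raft directory is assumed to live under /var/lib/lxd as referenced above, and exact filenames may differ between 3.0.x releases):

# Node-local database, small after the 3.0 upgrade
ls -l /var/lib/lxd/lxd.db /var/lib/lxd/lxd.db.bak

# Global data moved into the raft directory
ls -l /var/lib/lxd/raft/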

So the fact that lxd.db appears to shrink when upgrading to 3.0 shouldn't be a worry. To try and get things straight, can you:

If all goes well, that last one will hang, indicating that LXD is running. If that's the case, then try to run lxc list from a separate terminal (without interrupting the one running the lxd daemon).
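In general terms the procedure looks like this (a rough sketch only, built from commands already shown in this issue rather than the exact step list above):

# Terminal 1: stop the units and run the daemon in the foreground
sudo systemctl stop lxd.service lxd.socket
sudo lxd --group lxd --verbose --debug

# Terminal 2: once the daemon hangs (i.e. it is up and running), query
# it without interrupting terminal 1
lxc list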

samvde commented 6 years ago

Hi Stéphane

I logged all output into this file: lxd.log-a.txt

The procedure fails. Relevant error:

DBUG[04-29|21:25:02] Initializing a ZFS driver.
DBUG[04-29|21:25:02] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:02] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:03] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:03] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:03] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:03] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:03] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:04] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:04] Database error: &errors.errorString{s:"sql: no rows in result set"}
DBUG[04-29|21:25:04] Database error: &errors.errorString{s:"sql: no rows in result set"}
EROR[04-29|21:25:04] Failed to clear old profile configuration for profile default: no such table: profiles_config.
EROR[04-29|21:25:04] Failed to clear old profile configuration for profile docker: no such table: profiles_config.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:04] Initializing a ZFS driver.
DBUG[04-29|21:25:05] Initializing a ZFS driver.
DBUG[04-29|21:25:05] Initializing a ZFS driver.
DBUG[04-29|21:25:05] Initializing and checking storage pool "default".
DBUG[04-29|21:25:05] Initializing a ZFS driver.
DBUG[04-29|21:25:05] Checking ZFS storage pool "default".
DBUG[04-29|21:25:05] ZFS storage pool "default" does not exist. Trying to import it.
EROR[04-29|21:25:05] Failed to start the daemon: ZFS storage pool "default" could not be imported: cannot import 'default': no such pool available

I don't know what to make of the database errors, but the ZFS error corresponds to the error from my initial post. It seems it wants to import a zfs pool called default, which I don't have. I used an existing zpool for LXD and, if I recall correctly, the LXD storage pool was called default.
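For the record, which pool names actually exist versus which could still be imported can be checked with standard zpool commands (no LXD-specific assumptions here):

# Pools currently imported on this system
zpool list -o name,health

# Pools that exist on disk but are not imported; an exported pool named
# 'default' would show up here if it existed
sudo zpool import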

The pool name is correct in the original database:

[screenshot: the storage pool entry in the original database]

I have no idea how to query or look at the contents of db.bin.

I also tried using a version of lxd.db from 2 weeks ago; it produces the same error output.

I can keep this machine in this state for troubleshooting for the time being.

stgraber commented 6 years ago

Ok, so that error actually makes sense and it's something I fixed already.

Please do:

This will effectively do the same as before but this time with a LXD that will set the right pool name.

samvde commented 6 years ago

Ok, that worked :-) After this procedure I was able to start the lxd.service as well. It seems the correct container states are restored and all are running.
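A minimal way to confirm that, for reference (standard LXD client commands; lxc-notrack is the container shown earlier in this issue):

# Overview of all containers and their states
lxc list

# Detailed state of one recovered container
lxc info lxc-notrack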

Here's the log file (sorry for the bad layout, I should have logged printable output only): lxd-a.log

Should I be cautious about LXD upgrades for the time being, or is the core issue fixed? If you want, I can provide you with the lxd.db.bak file in case you need to reproduce this.

BTW: very impressive support here, many thanks!

stgraber commented 6 years ago

That machine is now fine; it was a one-time step that was failing, so you won't ever run that code again. If you have other machines with a similar ZFS configuration, they'll likely hit the same problem and will also need the custom binary until we push the next package update with it included.