canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Containers in lvm storage do not survive server reboot #6828

Closed jsnjack closed 4 years ago

jsnjack commented 4 years ago

Required information

Issue description

A container with LVM storage doesn't survive a server reboot

Steps to reproduce

  1. Initialize an LXD cluster with the LVM backend (see the command sketch below)
  2. Restart the server
  3. None of the containers are able to start
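
A rough command sketch of the above (hedged: the exact lxd init answers, the image alias, and the container name are examples, not taken from the original setup):

    # On each node, run the interactive init; answer "yes" to clustering and pick "lvm" as the storage backend
    sudo lxd init
    # Create a test container, for example:
    lxc launch ubuntu:18.04 c1
    # Restart the server
    sudo reboot
    # After the reboot, starting any container fails with the mount error shown below
    lxc start c1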

Information to attach

lxc start output:

Error: Common start logic: Failed to mount LVM logical volume: Failed to mount '/dev/local/containers_c1' on '/var/snap/lxd/common/lxd/storage-pools/local/containers/c1': no such file or directory

Container log is empty.

client@node2:~$ sudo cat  /var/snap/lxd/common/lxd/logs/lxd.log
t=2020-02-03T21:30:00+0100 lvl=info msg="LXD 3.19 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-02-03T21:30:00+0100 lvl=info msg="Kernel uid/gid map:" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - u 0 0 4294967295" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - g 0 0 4294967295" 
t=2020-02-03T21:30:00+0100 lvl=info msg="Configured LXD uid/gid map:" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - u 0 1000000 1000000000" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - g 0 1000000 1000000000" 
t=2020-02-03T21:30:00+0100 lvl=warn msg="AppArmor support has been disabled because of lack of kernel support" 
t=2020-02-03T21:30:00+0100 lvl=info msg="Kernel features:" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - uevent injection: yes" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - seccomp listener: no" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - seccomp listener continue syscalls: no" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - cgroup layout: legacy" 
t=2020-02-03T21:30:00+0100 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored" 
t=2020-02-03T21:30:00+0100 lvl=info msg=" - shiftfs support: disabled" 
t=2020-02-03T21:30:00+0100 lvl=info msg="Initializing local database" 
t=2020-02-03T21:30:01+0100 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-02-03T21:30:01+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-03T21:30:01+0100 lvl=info msg="REST API daemon:" 
t=2020-02-03T21:30:01+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-03T21:30:01+0100 lvl=info msg=" - binding TCP socket" socket=10.0.0.2:8443
t=2020-02-03T21:30:01+0100 lvl=info msg="Initializing global database" 
t=2020-02-03T21:30:22+0100 lvl=info msg="Initializing storage pools" 
t=2020-02-03T21:30:23+0100 lvl=info msg="Initializing networks" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Pruning leftover image files" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done pruning leftover image files" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Loading daemon configuration" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Pruning expired images" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done pruning expired images" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Synchronizing images across the cluster" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done synchronizing images across the cluster" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Pruning expired container backups" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done pruning expired container backups" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Updating instance types" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done updating instance types" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Expiring log files" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done expiring log files" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Updating images" 
t=2020-02-03T21:30:24+0100 lvl=info msg="Done updating images" 
t=2020-02-03T21:30:27+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:10.0.0.4:8443 RaftID:0 Raft:false LastHeartbeat:2020-02-03 18:25:36.719892808 +0100 CET Online:false updated:false}" 
t=2020-02-03T21:30:34+0100 lvl=eror msg="Failed to start container 'c1': Common start logic: Failed to mount LVM logical volume: Failed to mount '/dev/local/containers_c1' on '/var/snap/lxd/common/lxd/storage-pools/local/containers/c1': no such file or directory" 
t=2020-02-03T21:30:38+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:10.0.0.4:8443 RaftID:0 Raft:false LastHeartbeat:2020-02-03 18:25:36.719892808 +0100 CET Online:false updated:false}" 
t=2020-02-03T21:30:44+0100 lvl=eror msg="Failed to start container 'c3': Common start logic: Failed to mount LVM logical volume: Failed to mount '/dev/local/containers_c3' on '/var/snap/lxd/common/lxd/storage-pools/local/containers/c3': no such file or directory" 
client@node2:~$ lxc storage show local
config:
  lvm.thinpool_name: LXDThinPool
description: ""
name: local
driver: lvm
used_by:
- /1.0/containers/c1
- /1.0/containers/c3
- /1.0/images/793ac61f572b7512805b91795936e5dfbd4608b312a399b95b74ecf42f35c402
- /1.0/profiles/default
status: Created
locations:
- node2
- node3
stgraber commented 4 years ago

Do you have that source path in /dev/local/?

stgraber commented 4 years ago

@tomponline giving you this one to take a look at

jsnjack commented 4 years ago

No, /dev/local/ does not exist, and the /var/snap/lxd/common/lxd/storage-pools/local/containers/c1 directory is empty

tomponline commented 4 years ago

@jsnjack can you post the output of lvs, please?

Also @stgraber, I notice the source property is missing from the storage pool config in this case, which is likely the issue. Is this expected with a cluster config?
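
(For comparison, a hedged illustration of the LVM pool keys one would normally expect to see alongside lvm.thinpool_name; the values are placeholders and key availability may vary by LXD version:)

    config:
      lvm.thinpool_name: LXDThinPool
      lvm.vg_name: <name of the volume group, e.g. local>
      source: <block device, loop file, or existing volume group>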

jsnjack commented 4 years ago

Output of the lvs command:

  LV                                                                      VG    Attr       LSize   Pool        Origin                                                                  Data%  Meta%  Move Log Cpy%Sync Convert
  LXDThinPool                                                             local twi---tz-- <11.97g                                                                                                                            
  containers_c1                                                           local Vwi---tz--  <9.32g LXDThinPool images_793ac61f572b7512805b91795936e5dfbd4608b312a399b95b74ecf42f35c402                                        
  containers_c3                                                           local Vwi---tz--  <9.32g LXDThinPool images_793ac61f572b7512805b91795936e5dfbd4608b312a399b95b74ecf42f35c402                                        
  images_793ac61f572b7512805b91795936e5dfbd4608b312a399b95b74ecf42f35c402 local Vwi---tz--  <9.32g LXDThinPool                                                                                                                
tomponline commented 4 years ago

OK, so the container volumes are there, that's good.

How did you create the storage pool local? I believe the issue is that the containers' volumes are not marked as 'active' (they should have an 'a' in the Attr column), which in turn is why they are not in /dev/local.

Perhaps the volume group itself is not active; can you show the output of vgs, please?
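
For reference, one way to check the activation state from the shell (these commands are illustrative, not required by LXD):

    sudo lvs -o lv_name,vg_name,lv_attr local   # the 5th character of lv_attr is 'a' when the LV is active
    sudo vgs local                              # is the volume group itself visible?
    ls /dev/local                               # device nodes only appear here for active LVs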

tomponline commented 4 years ago

Regarding my note above that the source property is missing from the storage pool config: this is an old driver behaviour we should probably handle in the new driver too, i.e. if lvm.vg_name is missing from the config, assume it is the storage pool name.

Anyway, apparently running vgchange -ay with no volume group argument activates all of the volume groups, so this shouldn't be preventing the volumes from being accessible.
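
For clarity, the two invocations being discussed (illustrative usage):

    sudo vgchange -ay          # no volume group named: activates every volume group on the system
    sudo vgchange -ay local    # activates only the "local" volume group backing this pool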

jsnjack commented 4 years ago

client@node2:~$ sudo vgs
  WARNING: PV /dev/loop2 in VG local is using an old PV header, modify the VG to update.
  VG    #PV #LV #SN Attr   VSize   VFree
  local   1   4   0 wz--n- <13.97g    0 

I created the storage pool by following the prompts of the lxd init command.

tomponline commented 4 years ago

@jsnjack if you run vgchange -ay local and then repeat vgs, please can you show the output?

Also, does /dev/local then contain the volume device files?

jsnjack commented 4 years ago

Thanks @tomponline! That solved the issue:

client@node2:~$ sudo vgchange -ay local
  WARNING: PV /dev/loop2 in VG local is using an old PV header, modify the VG to update.
  4 logical volume(s) in volume group "local" now active
client@node2:~$ sudo vgs
  WARNING: PV /dev/loop2 in VG local is using an old PV header, modify the VG to update.
  VG    #PV #LV #SN Attr   VSize   VFree
  local   1   4   0 wz--n- <13.97g    0 
client@node2:~$ cd /dev/local/
client@node2:/dev/local$ ls
containers_c1  containers_c3  images_793ac61f572b7512805b91795936e5dfbd4608b312a399b95b74ecf42f35c402
client@node2:/dev/local$ lxc list
+------+---------+-----------------------+------+-----------+-----------+----------+
| NAME |  STATE  |         IPV4          | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+------+---------+-----------------------+------+-----------+-----------+----------+
| c1   | RUNNING | 10.192.128.206 (eth0) |      | CONTAINER | 0         | node2    |
+------+---------+-----------------------+------+-----------+-----------+----------+
| c2   | STOPPED |                       |      | CONTAINER | 0         | node3    |
+------+---------+-----------------------+------+-----------+-----------+----------+
| c3   | RUNNING | 10.192.128.216 (eth0) |      | CONTAINER | 0         | node2    |
+------+---------+-----------------------+------+-----------+-----------+----------+
tomponline commented 4 years ago

@jsnjack great! The question is why LXD didn't do that for you on start. Can you reboot the machine, try starting LXD again, and see whether it activates the volumes this time? If not, can you try running just vgchange -ay and then repeat the vgs command? I'm wondering if it needs the specific volume group name added to it (and I've already identified an issue with that not occurring).
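
A possible check sequence after the reboot (suggested commands, adjust as needed):

    sudo vgs                               # is the "local" volume group visible?
    sudo lvs -o lv_name,lv_attr local      # are the LVs active ('a' in the 5th lv_attr character)?
    ls /dev/local                          # do the device nodes exist?
    lxc list                               # did the containers come back up?
    # If they did not:
    sudo vgchange -ay                      # no VG name given, so all volume groups are activated
    sudo vgs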

jsnjack commented 4 years ago

After running the sudo vgchange -ay local command, the containers now start automatically after a reboot. :)

jsnjack commented 4 years ago

Many thanks for such a fast response @tomponline and @stgraber!

tomponline commented 4 years ago

@jsnjack you're welcome. I'm going to keep this open a little longer to put up a patch that I think will avoid this happening in the future.