canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

lxd forgets container FS and treats all containers as ext4 on ceph, lxd 4.0.4 #8301

Closed (vvolas closed this issue 3 years ago)

vvolas commented 3 years ago

For better or worse, at some point I had set volume.block.filesystem xfs on the pool, so I have containers with both xfs and ext4. This is on ceph. When I try to resize an xfs container, LXD treats it as ext4: the ceph volume gets resized, but the filesystem is not grown. I fixed the size with:

mount /dev/rbd1 /mnt
xfs_growfs /mnt

If I hadn't messed around with that option, everything would probably have been fine. A solution would be to store each container's filesystem in its config, or to detect it. Probably not a very common problem.

lxc config edit katalogas-stage
Config parsing error: Failed to update device "root": Failed to mount '/dev/rbd1' on '/var/snap/lxd/common/lxd/storage-pools/ceph-lxd/containers/katalogas-stage': invalid argument
Press enter to start the editor again

kernel EXT4-fs (rbd1): VFS: Can't find ext4 filesystem
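A quick way to confirm the mismatch is to compare what is actually on the block device with what LXD has recorded for the volume. A rough sketch, assuming the RBD image is still mapped as /dev/rbd1 as in the error above:

# Filesystem actually present on the mapped RBD device:
blkid -o value -s TYPE /dev/rbd1

# Filesystem LXD has recorded for the volume:
lxc storage volume show ceph-lxd container/katalogas-stage | grep block.filesystem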
tomponline commented 3 years ago

The storage pool setting volume.block.filesystem is only applied to new volumes created after it is changed; it is not applied to existing volumes.

Please can you show the output of lxc storage volume ls <pool> and then lxc storage volume show <pool> container/<instance volume>

E.g. I'm using an LVM storage pool here as it also has the same volume.block.filesystem setting as ceph pools.

lxc storage create lvm lvm
lxc launch images:ubuntu/focal c1 -s lvm

lxc storage volume ls lvm
+-----------+------------------------------------------------------------------+-------------+--------------+---------+
|   TYPE    |                               NAME                               | DESCRIPTION | CONTENT TYPE | USED BY |
+-----------+------------------------------------------------------------------+-------------+--------------+---------+
| container | c1                                                               |             | filesystem   | 1       |
+-----------+------------------------------------------------------------------+-------------+--------------+---------+

lxc storage volume show lvm container/c1
config:
  block.filesystem: ext4
  block.mount_options: discard
description: ""
name: c1
type: container
used_by:
- /1.0/instances/c1
location: none
content_type: filesystem
/dev/mapper/lvm-containers_c1 on /var/lib/lxd/storage-pools/lvm/containers/c1 type ext4 (rw,relatime,discard,stripe=16)

This shows the volume is using the default ext4 filesystem.

Now, after changing the volume.block.filesystem setting on the storage pool, the existing volume remains ext4 and new ones are xfs:

lxc storage set lvm volume.block.filesystem xfs
lxc storage volume show lvm container/c1
config:
  block.filesystem: ext4
  block.mount_options: discard
description: ""
name: c1
type: container
used_by:
- /1.0/instances/c1
location: none
content_type: filesystem

lxc launch images:ubuntu/focal c2 -s lvm
lxc storage volume show lvm container/c2
config:
  block.filesystem: xfs
  block.mount_options: discard
description: ""
name: c2
type: container
used_by:
- /1.0/instances/c2
location: none
content_type: filesystem
mount | grep c2
/dev/mapper/lvm-containers_c2 on /var/lib/lxd/storage-pools/lvm/containers/c2 type xfs (rw,relatime,attr2,discard,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota)

Resizing the volume keeps the filesystem type:

lxc config device set c2 root size=12GB
mount | grep c2
/dev/mapper/lvm-containers_c2 on /var/lib/lxd/storage-pools/lvm/containers/c2 type xfs (rw,relatime,attr2,discard,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota)

If you have manually changed the filesystem and haven't informed LXD of this, then you need to update the block.filesystem setting on the volume itself. However, validation prevents you from doing that using the lxc command, as we do not support changing filesystems. You would need to do this via a manual DB query using lxd sql global...
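A hedged sketch of what that manual DB update could look like (back up the LXD database first; the storage_volumes_config table and its columns are assumed by analogy with the snapshot tables shown later in this thread):

# Find the volume's row id (on a cluster, older LXD versions may return one row per member):
lxd sql global 'select id, name, node_id from storage_volumes where name = "katalogas-stage"'

# Point the stored filesystem at what is actually on disk:
lxd sql global 'update storage_volumes_config set value = "xfs" where key = "block.filesystem" and storage_volume_id = <id from previous query>'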

Are you able to provide the steps to reproduce LXD changing the filesystem of a volume without manual actions happening outside of LXD?

vvolas commented 3 years ago

lxc storage volume list ceph-lxd

+-----------+-----------------+-------------+---------+---------------+
|   TYPE    |      NAME       | DESCRIPTION | USED BY |   LOCATION    |
+-----------+-----------------+-------------+---------+---------------+
| container | katalogas-stage |             | 1       | cluster-linux |
+-----------+-----------------+-------------+---------+---------------+

and that line appears on all cluster members

root@cluster-linux:~# lxc storage volume show ceph-lxd container/katalogas-stage
config:
  block.filesystem: xfs
  block.mount_options: discard
  size: 60GB **actual size is 120+ now**
description: ""
name: katalogas-stage
type: container
used_by:

vvolas commented 3 years ago

I doubt I can reproduce this and can't experiment now. As I said, the ceph volume got resized, but the filesystem still needed growing.

What happened: when I created the ceph pool I set XFS, then some time later I changed back to the default ext4. This is a leftover xfs instance, and this bug occurs when resizing it. If nobody else is complaining, maybe this can be left behind.

tomponline commented 3 years ago

@vvolas did you manually change the filesystem on the volume without updating LXD's volume metadata about its filesystem?

vvolas commented 3 years ago

No, I have never made any manual changes to the filesystem beyond the xfs_growfs command.

It could have changed (as far as I understand how lxd works) when I copied the container with lxc copy; I can't tell the exact scenario now because that was a long time ago. For example, from btrfs -> ceph. I often had problems simply moving containers from one cluster member or volume to another, so I might have used lxc copy instead and then updated the MAC address. I know this is probably not very helpful.

tomponline commented 3 years ago

@vvolas thanks, so it could be that there is an issue with inheriting the volume-specific block.filesystem setting for the ceph driver, and it's reverting to the current pool volume.block.filesystem setting (which, if missing, also defaults to ext4).

You've shown that the volume config itself has block.filesystem: xfs set, so one would expect resizing the volume using lxc config device set <instance> root size=x to use the xfs filesystem commands.

What is interesting about your originally reported error is that it isn't the resize command that is failing, but the actual mount attempt (which is required before running xfs_growfs). I will try to reproduce it.
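The "invalid argument" mount failure together with the EXT4-fs "Can't find ext4 filesystem" kernel messages is what you get when an xfs-formatted device is mounted with -t ext4. A sketch that reproduces the same error outside LXD, using a throwaway loop-backed image (file name and size are arbitrary):

truncate -s 1G /tmp/fs-test.img
mkfs.xfs /tmp/fs-test.img
sudo mount -t ext4 -o loop /tmp/fs-test.img /mnt
# mount fails with EINVAL ("wrong fs type, bad option, bad superblock ...")
# dmesg shows: EXT4-fs (loop0): VFS: Can't find ext4 filesystem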

vvolas commented 3 years ago

I encountered the same errors when trying to copy the vm to the same or different pools: ceph -> ceph, ceph -> a different ceph, or ceph -> hdd (btrfs).

tomponline commented 3 years ago

@vvolas can you enable debug logging and then show the output when you try to perform one of these operations:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon
sudo tail -f /var/snap/lxd/common/lxd/logs/lxd.log
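
Once the relevant output has been captured, the same pair of commands can be used to switch debug logging back off:

sudo snap set lxd daemon.debug=false; sudo systemctl reload snap.lxd.daemon
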
vvolas commented 3 years ago

This is a copy to the same pool:

lxc copy  katalogas-ng/snap0 katalogas-ng-dev  --target cluster-linux -s ceph-lxd

t=2021-01-12T13:30:27+0200 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0 username=root
t=2021-01-12T13:30:27+0200 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0/instances/katalogas-ng/snapshots/snap0 username=root
t=2021-01-12T13:30:27+0200 lvl=dbug msg="GetInstanceUsage started" driver=ceph instance=katalogas-ng/snap0 pool=ceph-lxd project=default
t=2021-01-12T13:30:32+0200 lvl=dbug msg="GetInstanceUsage finished" driver=ceph instance=katalogas-ng/snap0 pool=ceph-lxd project=default
t=2021-01-12T13:30:32+0200 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url=/1.0/instances/katalogas-ng username=root
t=2021-01-12T13:30:32+0200 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url="/1.0/events?target=cluster-linux" username=root
t=2021-01-12T13:30:32+0200 lvl=dbug msg="New event listener: 68c26991-7755-45da-8076-e46132a7153f"
t=2021-01-12T13:30:32+0200 lvl=dbug msg=Handling ip=@ method=POST protocol=unix url="/1.0/instances?target=cluster-linux" username=root
t=2021-01-12T13:30:32+0200 lvl=dbug msg="\n\t{\n\t\t\"architecture\": \"x86_64\",\n\t\t\"config\": {\n\t\t\t\"boot.autostart\": \"true\",\n\t\t\t\"boot.autostart.priority\": \"180\",\n\t\t\t\"image.architecture\": \"amd64\",\n\t\t\t\"image.os\": \"ubuntu\",\n\t\t\t\"volatile.base_image\": \"e7efe234857d8f2e6f4b9351949e0b9c8f6db8f94266855aacd6ee18b7c1d0f8\"\n\t\t},\n\t\t\"devices\": {\n\t\t\t\"root\": {\n\t\t\t\t\"path\": \"/\",\n\t\t\t\t\"pool\": \"ceph-lxd\",\n\t\t\t\t\"size\": \"125GB\",\n\t\t\t\t\"type\": \"disk\"\n\t\t\t}\n\t\t},\n\t\t\"ephemeral\": false,\n\t\t\"profiles\": [\n\t\t\t\"ceph-ssd\"\n\t\t],\n\t\t\"stateful\": false,\n\t\t\"description\": \"\",\n\t\t\"name\": \"katalogas-ng-dev\",\n\t\t\"source\": {\n\t\t\t\"type\": \"copy\",\n\t\t\t\"certificate\": \"\",\n\t\t\t\"base-image\": \"e7efe234857d8f2e6f4b9351949e0b9c8f6db8f94266855aacd6ee18b7c1d0f8\",\n\t\t\t\"source\": \"katalogas-ng/snap0\"\n\t\t},\n\t\t\"instance_type\": \"\",\n\t\t\"type\": \"\"\n\t}"
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Responding to instance create"
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Skipping volatile key from copy source" key=volatile.eth0.host_name
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Skipping volatile key from copy source" key=volatile.eth0.hwaddr
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Skipping volatile key from copy source" key=volatile.idmap.next
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Skipping volatile key from copy source" key=volatile.idmap.base
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Skipping volatile key from copy source" key=volatile.idmap.current
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Skipping volatile key from copy source" key=volatile.last_state.power
t=2021-01-12T13:30:32+0200 lvl=dbug msg="New task Operation: 3a07f1b4-2458-4adc-96f7-7873d3d232c6"
t=2021-01-12T13:30:32+0200 lvl=dbug msg="Started task operation: 3a07f1b4-2458-4adc-96f7-7873d3d232c6"
t=2021-01-12T13:30:32+0200 lvl=dbug msg="\n\t{\n\t\t\"type\": \"async\",\n\t\t\"status\": \"Operation created\",\n\t\t\"status_code\": 100,\n\t\t\"operation\": \"/1.0/operations/3a07f1b4-2458-4adc-96f7-7873d3d232c6\",\n\t\t\"error_code\": 0,\n\t\t\"error\": \"\",\n\t\t\"metadata\": {\n\t\t\t\"id\": \"3a07f1b4-2458-4adc-96f7-7873d3d232c6\",\n\t\t\t\"class\": \"task\",\n\t\t\t\"description\": \"Creating container\",\n\t\t\t\"created_at\": \"2021-01-12T13:30:32.255707972+02:00\",\n\t\t\t\"updated_at\": \"2021-01-12T13:30:32.255707972+02:00\",\n\t\t\t\"status\": \"Running\",\n\t\t\t\"status_code\": 103,\n\t\t\t\"resources\": {\n\t\t\t\t\"containers\": [\n\t\t\t\t\t\"/1.0/containers/katalogas-ng-dev\",\n\t\t\t\t\t\"/1.0/containers/katalogas-ng/snap0\"\n\t\t\t\t],\n\t\t\t\t\"instances\": [\n\t\t\t\t\t\"/1.0/instances/katalogas-ng-dev\",\n\t\t\t\t\t\"/1.0/instances/katalogas-ng/snap0\"\n\t\t\t\t]\n\t\t\t},\n\t\t\t\"metadata\": null,\n\t\t\t\"may_cancel\": false,\n\t\t\t\"err\": \"\",\n\t\t\t\"location\": \"cluster-linux\"\n\t\t}\n\t}"
t=2021-01-12T13:30:32+0200 lvl=dbug msg=Handling ip=@ method=GET protocol=unix url="/1.0/operations/3a07f1b4-2458-4adc-96f7-7873d3d232c6?target=cluster-linux" username=root
t=2021-01-12T13:30:32+0200 lvl=info msg="Creating container" ephemeral=false name=katalogas-ng-dev project=default
t=2021-01-12T13:30:32+0200 lvl=info msg="Created container" ephemeral=false name=katalogas-ng-dev project=default
t=2021-01-12T13:30:32+0200 lvl=dbug msg="CreateInstanceFromCopy started" driver=ceph instance=katalogas-ng-dev pool=ceph-lxd project=default snapshots=true src=katalogas-ng/snap0
t=2021-01-12T13:30:32+0200 lvl=dbug msg="CreateInstanceFromCopy same-pool mode detected" driver=ceph instance=katalogas-ng-dev pool=ceph-lxd project=default snapshots=true src=katalogas-ng/snap0
t=2021-01-12T13:30:34+0200 lvl=dbug msg="Found cert" name=0
t=2021-01-12T13:30:46+0200 lvl=dbug msg="Found cert" name=0
t=2021-01-12T13:30:54+0200 lvl=dbug msg="Found cert" name=0
t=2021-01-12T13:31:00+0200 lvl=dbug msg="DeleteInstance started" driver=ceph instance=katalogas-ng-dev pool=ceph-lxd project=default
t=2021-01-12T13:31:00+0200 lvl=dbug msg="Deleting instance volume" driver=ceph instance=katalogas-ng-dev pool=ceph-lxd project=default volName=katalogas-ng-dev
t=2021-01-12T13:31:00+0200 lvl=dbug msg="DeleteInstance finished" driver=ceph instance=katalogas-ng-dev pool=ceph-lxd project=default
t=2021-01-12T13:31:00+0200 lvl=dbug msg="CreateInstanceFromCopy finished" driver=ceph instance=katalogas-ng-dev pool=ceph-lxd project=default snapshots=true src=katalogas-ng/snap0
t=2021-01-12T13:31:00+0200 lvl=info msg="Deleting container" created=2021-01-12T13:30:32+0200 ephemeral=false name=katalogas-ng-dev project=default used=1970-01-01T03:00:00+0300
t=2021-01-12T13:31:00+0200 lvl=dbug msg="Database error: &errors.errorString{s:\"No such object\"}"
t=2021-01-12T13:31:00+0200 lvl=info msg="Deleted container" created=2021-01-12T13:30:32+0200 ephemeral=false name=katalogas-ng-dev project=default used=1970-01-01T03:00:00+0300
t=2021-01-12T13:31:00+0200 lvl=dbug msg="Failure for task operation: 3a07f1b4-2458-4adc-96f7-7873d3d232c6: Create instance from copy: Failed to mount '/dev/rbd6' on '/var/snap/lxd/common/lxd/storage-pools/ceph-lxd/containers/katalogas-ng-dev': invalid argument"
t=2021-01-12T13:31:00+0200 lvl=dbug msg="Event listener finished: 68c26991-7755-45da-8076-e46132a7153f"
t=2021-01-12T13:31:00+0200 lvl=dbug msg="Disconnected event listener: 68c26991-7755-45da-8076-e46132a7153f"
dmesg

[An saus. 12 13:25:27 2021] rbd: rbd6: capacity 126000000000 features 0x1
[An saus. 12 13:25:28 2021] EXT4-fs (rbd6): VFS: Can't find ext4 filesystem
[An saus. 12 13:25:28 2021] EXT4-fs (rbd6): VFS: Can't find ext4 filesystem
[An saus. 12 13:25:29 2021] EXT4-fs (rbd6): VFS: Can't find ext4 filesystem
[An saus. 12 13:25:29 2021] EXT4-fs (rbd6): VFS: Can't find ext4 filesystem
[An saus. 12 13:25:30 2021] EXT4-fs (rbd6): VFS: Can't find ext4 filesystem
tomponline commented 3 years ago

Please show output of

lxc storage volume show ceph-lxd container/katalogas-ng
lxc storage volume show ceph-lxd container/katalogas-ng/snap0
lxc config show katalogas-ng --expanded
lxc storage show ceph-lxd
vvolas commented 3 years ago
root@cluster-linux:~# lxc storage volume show ceph-lxd container/katalogas-ng
config:
  block.filesystem: xfs
  block.mount_options: discard
  size: 60GB **incorrect, actual size is 120-something GB**
description: ""
name: katalogas-ng
type: container
used_by:
- /1.0/instances/katalogas-ng
location: cluster-linux
content_type: ""

root@cluster-linux:~# lxc storage volume show ceph-lxd container/katalogas-ng/snap0
config:
  block.filesystem: ext4 **??? the error is probably here**
  block.mount_options: discard
description: ""
name: katalogas-ng/snap0
type: container
used_by:
- /1.0/instances/katalogas-ng/snapshots/snap0
location: cluster-linux
content_type: ""

root@cluster-linux:~# lxc config show katalogas-ng --expanded
architecture: x86_64
config:
  boot.autostart: "true"
  boot.autostart.priority: "180"
  image.architecture: amd64
  image.os: ubuntu
  limits.memory: 16GB
  volatile.base_image: e7efe234857d8f2e6f4b9351949e0b9c8f6db8f94266855aacd6ee18b7c1d0f8
  volatile.eth0.host_name: veth4737977c
  volatile.eth0.hwaddr: 00:16:3e:92:ba:0f
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: ceph-lxd
    size: 125GB
    type: disk
ephemeral: false
profiles:
- ceph-ssd
stateful: false
description: ""

root@cluster-linux:~# lxc storage show ceph-lxd
config:       
  ceph.cluster_name: ceph
  ceph.osd.pg_num: "128"
  ceph.osd.pool_name: ceph-lxd
  ceph.user.name: lxd
  volatile.pool.pristine: "true"
  volume.size: 35GB **I increase size if needed**
description: ""
name: ceph-lxd
driver: ceph
used_by:
- /1.0/images/567c1693452aa159d5626f63a6b4b2dc61e89e616ddacc549d39f72c2dbc419f
-  * many *
- /1.0/instances/katalogas-ng
- /1.0/profiles/ceph-ssd
status: Created
locations:
- servers
- servers
tomponline commented 3 years ago

Yes, it looks like when the snapshot was taken the volume's metadata indicated the filesystem was ext4, but it was actually something else (xfs). You could try altering the DB record for the snapshot and see if that helps:

# Get the `id` column value for the snapshot volume.
lxd sql global 'select storage_volumes_snapshots.* from storage_volumes_snapshots join storage_volumes on storage_volumes.id = storage_volumes_snapshots.storage_volume_id where storage_volumes.name = "katalogas-ng" and storage_volumes_snapshots.name = "snap0"'

# Update config for snapshot volume.
lxd sql global 'update storage_volumes_snapshots_config set value = "xfs" where key = "block.filesystem" and storage_volume_snapshot_id = <snapshot ID column value from previous query>'
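
After updating, a quick sanity check against the same table should show the new value:

lxd sql global 'select * from storage_volumes_snapshots_config where key = "block.filesystem" and storage_volume_snapshot_id = <snapshot ID column value from previous query>'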
vvolas commented 3 years ago

Thanks for the workaround, but is this output correct? I'm getting multiple IDs; should I update them all?

root@cluster-linux:~# lxd sql global 'select storage_volumes_snapshots.* from storage_volumes_snapshots join storage_volumes on storage_volumes.id = storage_volumes_snapshots.storage_volume_id where storage_volumes.name = "katalogas-ng" and storage_volumes_snapshots.name = "snap0"'
+------+-------------------+-------+-------------+---------------------------+
|  id  | storage_volume_id | name  | description |        expiry_date        |
+------+-------------------+-------+-------------+---------------------------+
| 2227 | 2108              | snap0 |             | 0001-01-01T01:41:16+01:41 |
| 2228 | 735               | snap0 |             | 0001-01-01T01:41:16+01:41 |
| 2229 | 272               | snap0 |             | 0001-01-01T01:41:16+01:41 |
| 2230 | 762               | snap0 |             | 0001-01-01T01:41:16+01:41 |
| 2231 | 274               | snap0 |             | 0001-01-01T01:41:16+01:41 |
| 2232 | 947               | snap0 |             | 0001-01-01T01:41:16+01:41 |
+------+-------------------+-------+-------------+---------------------------+
tomponline commented 3 years ago

@vvolas do you have multiple projects containing an instance of the same name?

Try this:

lxd sql global 'select * from storage_volumes_snapshots join storage_volumes on storage_volumes.id = storage_volumes_snapshots.storage_volume_id where storage_volumes.name = "katalogas-ng" and storage_volumes_snapshots.name = "snap0"'
vvolas commented 3 years ago

No, I don't use projects. Maybe this has something to do with cluster members?

root@cluster-linux:~# lxd sql global 'select * from storage_volumes_snapshots join storage_volumes on storage_volumes.id = storage_volumes_snapshots.storage_volume_id where storage_volumes.name = "katalogas-ng" and storage_volumes_snapshots.name = "snap0"'
+------+-------------------+-------+-------------+---------------------------+------+--------------+-----------------+---------+------+-------------+------------+
|  id  | storage_volume_id | name  | description |        expiry_date        |  id  |     name     | storage_pool_id | node_id | type | description | project_id |
+------+-------------------+-------+-------------+---------------------------+------+--------------+-----------------+---------+------+-------------+------------+
| 2227 | 2108              | snap0 |             | 0001-01-01T01:41:16+01:41 | 2108 | katalogas-ng | 5               | 11      | 0    |             | 1          |
| 2228 | 735               | snap0 |             | 0001-01-01T01:41:16+01:41 | 735  | katalogas-ng | 5               | 8       | 0    |             | 1          |
| 2229 | 272               | snap0 |             | 0001-01-01T01:41:16+01:41 | 272  | katalogas-ng | 5               | 3       | 0    |             | 1          |
| 2230 | 762               | snap0 |             | 0001-01-01T01:41:16+01:41 | 762  | katalogas-ng | 5               | 9       | 0    |             | 1          |
| 2231 | 274               | snap0 |             | 0001-01-01T01:41:16+01:41 | 274  | katalogas-ng | 5               | 2       | 0    |             | 1          |
| 2232 | 947               | snap0 |             | 0001-01-01T01:41:16+01:41 | 947  | katalogas-ng | 5               | 10      | 0    |             | 1          |
+------+-------------------+-------+-------------+---------------------------+------+--------------+-----------------+---------+------+-------------+------------+
tomponline commented 3 years ago

@vvolas ah yeah, older versions of LXD stored a copy of each volume's config for each cluster member (newer versions don't do that and instead have a single row with a NULL node_id value).

You should update the config for each one of them.
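
For example, using the six snapshot ids returned by the query above (verify them on your own cluster before running this):

lxd sql global 'update storage_volumes_snapshots_config set value = "xfs" where key = "block.filesystem" and storage_volume_snapshot_id in (2227, 2228, 2229, 2230, 2231, 2232)'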

vvolas commented 3 years ago

Copying to the same pool failed with the same error and log messages as before, but the copy to the btrfs hdd seems to be proceeding.

tomponline commented 3 years ago

Found the bug.

lxc snapshot <instance> appears to be copying the current value of the pool's volume.block.filesystem setting into the snapshot's volume config, rather than using the instance volume's config that it is snapshotting.

Also confirmed this is a problem in current master, and with LVM too.
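
Based on that description, a minimal reproduction sketch (reusing the LVM example from earlier in this thread; the comments show the expected, incorrect result):

lxc storage create lvm lvm
lxc launch images:ubuntu/focal c1 -s lvm           # volume is created with the default ext4
lxc storage set lvm volume.block.filesystem xfs    # change the pool default afterwards
lxc snapshot c1 snap0
lxc storage volume show lvm container/c1           # block.filesystem: ext4 (correct)
lxc storage volume show lvm container/c1/snap0     # block.filesystem: xfs (wrong: copied from the pool)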