canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

LXD does not send nested BTRFS subvolume snapshots efficiently in relation to parent snapshots #8410

Open tomponline opened 3 years ago

tomponline commented 3 years ago

Originally discussed here https://discuss.linuxcontainers.org/t/no-space-left-on-device-while-changing-the-btrfs-storage-pool-of-an-lxd-container-with-docker/10040/34?u=tomp

While LXD correctly provides the parent subvolume to the BTRFS send tool in relation to LXD created snapshots when performing an optimized migration or export, it does not appear to use the relationship when transferring subvolumes inside the container.

This causes size amplification on the target when using a tool that creates many subvolumes (such as docker) inside the container.

tomponline commented 3 years ago

I've confirmed that subvolumes are sent/exported with the correct parent set to the appropriate LXD created snapshot.

E.g. /var/lib/lxd/storage-pools/btrfs/containers/c1/rootfs/mnt/somesubvol will be sent with a parent of /var/lib/lxd/storage-pools/btrfs/containers-snapshots/c1/snap0/rootfs/mnt/somesubvol so only the differences between the two LXD snapshots will be sent.

However where we are inefficiently sending/exporting subvolumes is where the snapshots are created inside the container and do not necessarily have a consistent hierarchical directory relationship.

This issue rears its head with Docker inside an LXD container using the BTRFS storage driver, because Docker creates BTRFS snapshots of the different layers that make up a Docker container.

Then when that LXD container is sent/exported, the relationship between those layer snapshot subvolumes is lost, causing disk usage amplification as each layer is duplicated.

Because we cannot rely on the directory structure to indicate relationships between subvolumes and their snapshots, we need to use the BTRFS tools to get an understanding of the relationship.

We can see this relationship here:

lxc launch images:ubuntu/focal c1 -s btrfs
lxc exec c1 -- apt install btrfs-progs -y
lxc exec c1 -- btrfs subvolume create /mnt/testvol
lxc exec c1 -- btrfs subvolume snapshot /mnt/testvol /mnt/testvolsnap1

sudo btrfs subvolume list -q -u -o  /var/lib/lxd/storage-pools/btrfs/containers/c1/
ID 308 gen 220 top level 307 parent_uuid -                                    uuid ac2b25df-5930-614f-a4d3-a11b4125537e path containers/c1/rootfs/mnt/testvol
ID 309 gen 220 top level 307 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d path containers/c1/rootfs/mnt/testvolsnap1

The containers/c1/rootfs/mnt/testvolsnap1 subvolume has a parent UUID of ac2b25df-5930-614f-a4d3-a11b4125537e that matches the uuid of the containers/c1/rootfs/mnt/testvol subvolume.

So in principle it should be possible to scan the container source before sending volumes to build up a map of what needs to be sent using the correct relationship.
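As a rough illustration of that scan, the sketch below parses output in the format shown above and builds a map from each subvolume path to the path of its parent. It is only a sketch: the names `Subvol`, `parse_subvolume_list` and `parent_paths` are made up for this example and are not part of LXD, and the regular expression assumes exactly the column layout printed by `btrfs subvolume list -q -u` in this issue.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Subvol(NamedTuple):
    """One row of `btrfs subvolume list -q -u` output (hypothetical type)."""
    id: int
    gen: int
    top_level: int
    parent_uuid: Optional[str]
    uuid: str
    path: str


# Assumes the column layout shown in this issue.
LINE_RE = re.compile(
    r"^ID (\d+) gen (\d+) top level (\d+) "
    r"parent_uuid\s+(\S+)\s+uuid\s+(\S+)\s+path\s+(.+)$"
)


def parse_subvolume_list(output: str) -> List[Subvol]:
    """Parse the listing text, skipping lines that do not match."""
    subvols = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        sid, gen, top, parent, uuid, path = m.groups()
        # A parent_uuid of "-" means the subvolume is not a snapshot.
        subvols.append(Subvol(int(sid), int(gen), int(top),
                              None if parent == "-" else parent, uuid, path))
    return subvols


def parent_paths(subvols: List[Subvol]) -> Dict[str, str]:
    """Map each snapshot's path to the path of the subvolume it was taken from."""
    by_uuid = {s.uuid: s for s in subvols}
    return {s.path: by_uuid[s.parent_uuid].path
            for s in subvols if s.parent_uuid in by_uuid}


SAMPLE = """\
ID 308 gen 220 top level 307 parent_uuid -                                    uuid ac2b25df-5930-614f-a4d3-a11b4125537e path containers/c1/rootfs/mnt/testvol
ID 309 gen 220 top level 307 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d path containers/c1/rootfs/mnt/testvolsnap1
"""

if __name__ == "__main__":
    for child, parent in parent_paths(parse_subvolume_list(SAMPLE)).items():
        print(child, "<-", parent)
```

Only parents that appear in the same listing are resolved; a parent UUID pointing outside the scanned path is silently dropped in this sketch.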

However, we send the LXD snapshots first (oldest first), so we need to work out the relationship between subvolume snapshots within the LXD snapshot itself. This is a problem because each subvolume inside an LXD snapshot has a parent UUID pointing at the corresponding subvolume in the main volume, not at another subvolume in the same LXD snapshot.

E.g.

lxc snapshot c1
lxc snapshot c1

sudo btrfs subvolume list -q -u -a -o  /var/lib/lxd/storage-pools/btrfs/containers/c1
ID 257 gen 214 top level 5 parent_uuid -                                    uuid eb6d4f2c-4faa-6e48-8608-d33e1bd817ee path <FS_TREE>/images/e88e00f8cc6312c328093106faf3a7145200bdea4f76619e27f53fcdac86210c
ID 307 gen 233 top level 5 parent_uuid eb6d4f2c-4faa-6e48-8608-d33e1bd817ee uuid 4296bb59-35ca-1949-9785-53ef69c97ec8 path <FS_TREE>/containers/c1
ID 308 gen 234 top level 307 parent_uuid -                                    uuid ac2b25df-5930-614f-a4d3-a11b4125537e path containers/c1/rootfs/mnt/testvol
ID 309 gen 235 top level 307 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d path containers/c1/rootfs/mnt/testvolsnap1
ID 310 gen 227 top level 5 parent_uuid 4296bb59-35ca-1949-9785-53ef69c97ec8 uuid 39ecb8b3-f416-b84c-8c10-1fbd8533364f path <FS_TREE>/containers-snapshots/c1/snap0
ID 311 gen 226 top level 310 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid cd5262e0-f39b-5c47-abdf-00301d8ac61c path <FS_TREE>/containers-snapshots/c1/snap0/rootfs/mnt/testvol
ID 312 gen 227 top level 310 parent_uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d uuid 60730de8-d929-fd42-8f43-f39da74500aa path <FS_TREE>/containers-snapshots/c1/snap0/rootfs/mnt/testvolsnap1
ID 313 gen 235 top level 5 parent_uuid 4296bb59-35ca-1949-9785-53ef69c97ec8 uuid 0adc5856-a159-e443-9b7c-94a92f2bb443 path <FS_TREE>/containers-snapshots/c1/snap1
ID 314 gen 234 top level 313 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid 180b5f8e-900d-b148-89d1-c8f4651ce151 path <FS_TREE>/containers-snapshots/c1/snap1/rootfs/mnt/testvol
ID 315 gen 235 top level 313 parent_uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d uuid 925fe28a-fc53-a241-953c-bd36cbda2645 path <FS_TREE>/containers-snapshots/c1/snap1/rootfs/mnt/testvolsnap1

We can see that the subvolume testvol has the same parent in both LXD snapshots snap0 and snap1, which relates to the main container's subvolume containers/c1/rootfs/mnt/testvol

ID 308 gen 234 top level 307 parent_uuid -                                    uuid ac2b25df-5930-614f-a4d3-a11b4125537e path containers/c1/rootfs/mnt/testvol
ID 311 gen 226 top level 310 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid cd5262e0-f39b-5c47-abdf-00301d8ac61c path <FS_TREE>/containers-snapshots/c1/snap0/rootfs/mnt/testvol
ID 314 gen 234 top level 313 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid 180b5f8e-900d-b148-89d1-c8f4651ce151 path <FS_TREE>/containers-snapshots/c1/snap1/rootfs/mnt/testvol

Similarly, we can see that the copies of testvolsnap1 (the snapshot of testvol) in the LXD snapshots relate back to the main volume's testvolsnap1 subvolume:

ID 309 gen 235 top level 307 parent_uuid ac2b25df-5930-614f-a4d3-a11b4125537e uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d path containers/c1/rootfs/mnt/testvolsnap1
ID 312 gen 227 top level 310 parent_uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d uuid 60730de8-d929-fd42-8f43-f39da74500aa path <FS_TREE>/containers-snapshots/c1/snap0/rootfs/mnt/testvolsnap1
ID 315 gen 235 top level 313 parent_uuid a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d uuid 925fe28a-fc53-a241-953c-bd36cbda2645 path <FS_TREE>/containers-snapshots/c1/snap1/rootfs/mnt/testvolsnap1

So in principle if we scanned the main LXD volume and built a map of subvolume snapshot paths to relative parent paths and UUIDs, when sending the first LXD snapshot we could discover the relationship between subvolumes by relating their parent UUIDs to relative paths.

E.g. In the main LXD volume the subvolume snapshot containers/c1/rootfs/mnt/testvolsnap1 has a parent UUID of ac2b25df-5930-614f-a4d3-a11b4125537e which has a relative path of containers/c1/rootfs/mnt/testvol.

So when sending the first snapshot (snap0) we can discover that the containers-snapshots/c1/snap0/rootfs/mnt/testvolsnap1 subvolume has a parent UUID of a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d, which is the main volume's containers/c1/rootfs/mnt/testvolsnap1 subvolume, whose parent UUID is ac2b25df-5930-614f-a4d3-a11b4125537e, which relates to a subvolume with a relative path of containers/c1/rootfs/mnt/testvol. We can then match that relative path in the LXD snapshot's subvolume list to find the parent we should use to send the subvolume.
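The lookup described above could be sketched roughly as follows. This is an illustration only, not LXD code: `send_parent` is a hypothetical helper, and the dictionaries are hand-built from the listing earlier in this issue, keyed by path relative to the volume root with `(uuid, parent_uuid)` values.

```python
from typing import Dict, Optional, Tuple

# path relative to the volume root -> (uuid, parent_uuid)
VolumeMap = Dict[str, Tuple[str, Optional[str]]]

# Main volume containers/c1 (values taken from the listing in this issue).
MAIN: VolumeMap = {
    "rootfs/mnt/testvol": (
        "ac2b25df-5930-614f-a4d3-a11b4125537e", None),
    "rootfs/mnt/testvolsnap1": (
        "a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d",
        "ac2b25df-5930-614f-a4d3-a11b4125537e"),
}

# LXD snapshot containers-snapshots/c1/snap0.
SNAP0: VolumeMap = {
    "rootfs/mnt/testvol": (
        "cd5262e0-f39b-5c47-abdf-00301d8ac61c",
        "ac2b25df-5930-614f-a4d3-a11b4125537e"),
    "rootfs/mnt/testvolsnap1": (
        "60730de8-d929-fd42-8f43-f39da74500aa",
        "a9702a0e-4f46-de4d-b5b9-d90dab3d4d0d"),
}


def send_parent(rel_path: str, snapshot: VolumeMap,
                main: VolumeMap) -> Optional[str]:
    """Return the relative path, inside the same LXD snapshot, of the
    subvolume to use as the send parent, or None if there is none."""
    _, parent_uuid = snapshot[rel_path]
    main_by_uuid = {uuid: (path, parent)
                    for path, (uuid, parent) in main.items()}

    # Step 1: the snapshot copy points at its counterpart in the main volume.
    origin = main_by_uuid.get(parent_uuid)
    if origin is None:
        return None

    # Step 2: follow the counterpart's own parent UUID within the main volume.
    _, origin_parent_uuid = origin
    origin_parent = main_by_uuid.get(origin_parent_uuid)
    if origin_parent is None:
        return None

    # Step 3: match that relative path in the LXD snapshot's subvolume list.
    parent_rel_path, _ = origin_parent
    return parent_rel_path if parent_rel_path in snapshot else None


if __name__ == "__main__":
    print(send_parent("rootfs/mnt/testvolsnap1", SNAP0, MAIN))
    print(send_parent("rootfs/mnt/testvol", SNAP0, MAIN))
```

For testvolsnap1 in snap0 this resolves to rootfs/mnt/testvol, while testvol itself has no intra-volume parent and yields None.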

@stgraber would appreciate any input you have on this and whether you think it's worth spending time on this right now. Thanks

stgraber commented 3 years ago

It'd certainly be neat to handle this properly, btrfs is our most used backend currently so spending a bit of time on it is probably a good idea. I guess the first step would be a function we can pass a path and have it return data about every subvolume in that path and their relationship, we can then update the logic that figures out subvol ordering to consider that and the directory structure to sort things properly and use the appropriate parent?

tomponline commented 3 years ago

@stgraber based on our conversation earlier, we need to take into account that BTRFS allows creating a snapshot of a subvolume and then moving the parent subvolume underneath the snapshot of itself. If we transfer the parent before the snapshot (so that the snapshot subvolume can be transferred efficiently as a differential), this breaks, because the parent directory of the source subvolume depends on its own snapshot. In that case transferring by ogen (original created generation) won't work.

As such, I believe in order to achieve both:

We will need to change the way the backup export/import and migration strategies work.

Currently (at least after https://github.com/lxc/lxd/pull/8542 is merged) the oldest LXD snapshot volume is transferred first, and then each subsequent snapshot, followed last by the main volume, with each volume referencing its preceding volume as the parent for the optimized transfer (and any subvolumes inside that volume are transferred in file tree order referencing the same path in the preceding volume as parent).
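To make that ordering concrete, here is a small sketch of the plan as described in the paragraph above. It is not LXD's actual implementation: `transfer_plan` and `subvolume_parent` are hypothetical names, and the paths simply follow the pool layout seen in the listings in this issue.

```python
import posixpath
from typing import List, Optional, Tuple


def transfer_plan(container: str,
                  snapshots: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Return (volume, parent) pairs in send order.

    `snapshots` must be ordered oldest first; the main volume goes last,
    and each volume uses the one sent just before it as its parent.
    """
    volumes = [f"containers-snapshots/{container}/{name}" for name in snapshots]
    volumes.append(f"containers/{container}")

    plan = []
    previous = None
    for volume in volumes:
        plan.append((volume, previous))
        previous = volume
    return plan


def subvolume_parent(parent_volume: str, rel_path: str) -> str:
    """A nested subvolume references the same relative path in the
    preceding volume as its parent."""
    return posixpath.join(parent_volume, rel_path)


if __name__ == "__main__":
    for volume, parent in transfer_plan("c1", ["snap0", "snap1"]):
        print(volume, "parent:", parent)
```

This path-based pairing is exactly what loses the relationship between subvolumes that were snapshotted inside the container, since those do not share a relative path.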

In order to support intra-volume subvolume snapshot parents we will need to use a two-stage approach. Firstly transfer all subvolumes into a flat temporary holding area in order of created generation (with the oldest being transferred first) to allow efficient differential transfer. And then once all subvolumes have been transferred, the recipient will re-organise them into the target file tree as per the original source structure.
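A minimal sketch of how such a two-stage plan might be computed is shown below. Everything in it is an assumption for illustration: the `Subvol` type, the `.holding` directory name, and the choice to place subvolumes shallowest path first so that each destination's parent directory already exists. The UUIDs and generation numbers are made up.

```python
from typing import List, NamedTuple, Tuple


class Subvol(NamedTuple):
    uuid: str   # hypothetical identifier used as the flat holding name
    ogen: int   # generation at which the subvolume was created
    path: str   # final path relative to the volume root


def plan_two_stage(subvols: List[Subvol], holding_dir: str = ".holding"
                   ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (sends, moves).

    sends: (uuid, flat holding path), oldest created generation first, so a
           snapshot's origin is received before the snapshot itself.
    moves: (flat holding path, final path), shallowest destination first.
    """
    send_order = sorted(subvols, key=lambda s: s.ogen)
    sends = [(s.uuid, f"{holding_dir}/{s.uuid}") for s in send_order]

    place_order = sorted(subvols, key=lambda s: s.path.count("/"))
    moves = [(f"{holding_dir}/{s.uuid}", s.path) for s in place_order]
    return sends, moves


if __name__ == "__main__":
    # testvol was created first, snapshotted, then moved underneath its
    # own snapshot (made-up UUIDs and generation numbers).
    subvols = [
        Subvol("uuid-snap", 12, "mnt/testvolsnap1"),
        Subvol("uuid-orig", 10, "mnt/testvolsnap1/testvol"),
    ]
    sends, moves = plan_two_stage(subvols)
    print("stage 1:", sends)
    print("stage 2:", moves)
```

In the example, the original subvolume is sent first even though its final location is inside its own snapshot, and the snapshot is moved into place first so the nested destination exists.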

I have found this section of the btrfs-clone tool to be most useful in describing the different clone strategies:

https://github.com/mwilck/btrfs-clone#cloning-strategies

Moreover, file systems will not be cloned in the order of their creation, thus when a subvolume is cloned, we can't be sure that its parent in the filesystem tree (btrfs parent_id, don't confuse with parent_uuid) has already been transferred. Therefore subvolumes are first cloned flatly into a temporary directory. After all subvolumes have been transferred, they are moved into their file position in the target filesystem tree.

As this has grown in scope I will put it on hold until it is prioritised.

stgraber commented 2 years ago

@tomponline I wonder, did this get solved as part of the refresh work?

stgraber commented 2 years ago

Or was this one mostly around the nested snapshot case, which I don't think we've done much on as part of the refresh work?

tomponline commented 2 years ago

@stgraber I very much doubt it. It was all about the nested snapshot case.