canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

CephRBD-backed instances do not respect `--refresh` when copied #12668

Closed: slapcat closed this issue 9 months ago

slapcat commented 11 months ago


Issue description

When doing lxc cp --refresh on CephRBD-backed instances, the entire disk is transferred instead of just the delta between snapshots. This happens regardless of changes to the filesystem or snapshots of the source instance. It does not happen for containers or for other storage backends such as ZFS. I tested this when copying to a remote, because at the time bug #12631 prevented me from testing copies to a cluster member node.

Steps to reproduce

  1. Set up MicroCeph.
  2. Set up LXD and add a ceph-rbd storage pool.
  3. lxc launch images:debian/12 --vm v1
  4. lxc snapshot v1
  5. lxc cp v1 remote:v1
  6. lxc cp v1 remote:v1 --refresh
  7. lxc snapshot v1
  8. lxc cp v1 remote:v1 --refresh
  9. lxc cp v1 remote:v1 --refresh
roosterfish commented 11 months ago

Hi, how do you determine that --refresh is ignored?

slapcat commented 11 months ago

Both the time taken and the live reporting of the data transferred, which shows the full size of the disk being transferred each time. I used time to measure the exact transfer times between refreshes and compared them with the same commands on a ZFS-backed pool. The ZFS pool behaves very differently: repeated --refresh copies take almost no time at all, while the same commands on CephRBD transfer the full disk each time.

slapcat commented 10 months ago

Here are some timings from a test of copying an instance to another node in the same cluster. In both cases I used the debian/11 image. I took a second snapshot on the source instance before the last refresh; no other changes were made.

Ceph RBD

root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1

real    1m3.129s
user    0m0.055s
sys 0m0.063s
root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    14m34.771s
user    0m0.288s
sys 0m0.337s
root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    15m11.626s
user    0m0.348s
sys 0m0.302s
root@lxd-1:~# time lxc cp c2 c3 --target=lxd-1 --refresh

real    36m10.129s
user    0m0.679s
sys     0m0.996s

ZFS

root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0

real    0m8.293s
user    0m0.038s
sys 0m0.011s
root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.577s
user    0m0.049s
sys 0m0.081s
root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.229s
user    0m0.042s
sys 0m0.031s
root@lxd-0:~# time lxc cp v1 v2 --target=lxd-0 --refresh

real    0m1.116s
user    0m0.028s
sys 0m0.022s
roosterfish commented 10 months ago

I was able to reproduce this now with a larger Ceph cluster.

I found that both containers and VMs are affected, but it depends on whether you provide a remote when specifying the target instance. Only when refreshing a container via a remote target does LXD appear to behave correctly. @slapcat can you confirm this using a container with a remote?
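
For reference, these are the two shapes of refresh being compared (the target member and remote names are placeholders):

```
# refresh within the same server/cluster (no remote in the target)
lxc cp c1 c2 --refresh --target lxd-1

# refresh using a remote in the target
lxc cp c1 remote:c2 --refresh
```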

Container

VM

slapcat commented 10 months ago

@roosterfish I can confirm the same behavior in my environment.

roosterfish commented 10 months ago

Following up on the post above, I can narrow it down a bit more.

Using latest/candidate (which has the fix for https://github.com/canonical/lxd/pull/12632), you can copy/refresh the VM without any issues as long as you never use a remote in between the refreshes.
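
For example, a sequence like the following (m2 is a placeholder cluster member name) keeps refreshing without problems:

```
lxc cp v1 v2 --target m2
lxc cp v1 v2 --refresh --target m2
lxc snapshot v1
lxc cp v1 v2 --refresh --target m2
```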

I am not sure yet what is happening on the backend side, but it looks like as soon as a remote is used for v2, the storage capacity consumed on Ceph also grows significantly. Maybe the comparison no longer works when doing lxc cp v1 v2 --refresh --target m2, so it needs to sync everything?
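
One way to observe the growth (the pool name is a placeholder):

```
# show provisioned vs. actually used space per RBD image in the pool
rbd du --pool <pool-name>
```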

Update: This is the expected behavior when you mix and match copies with and without a remote pointing to the same host. It "recovers" itself after the first try.

Update: This looks to be a timing issue. When the snapshots on both ends get compared, one of them isn't using the Unix timestamp: https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_qemu.go#L6965

roosterfish commented 10 months ago

We are facing two inconsistencies here in both LXD and the docs. That seems to be the reason why the refresh sometimes feels "slower" than expected.

Unlike ZFS/Btrfs, the Ceph storage driver has to use rsync or a bit-by-bit copy to transfer the data when refreshing, depending on the volume type. Since containers use filesystem volumes, rsync can compare the files and transfer only the delta. For VMs, the block volumes can only be transferred bit by bit.

Containers

Refreshing on the same LXD server (or cluster) uses the --checksum flag for rsync (see https://github.com/canonical/lxd/blob/main/lxd/rsync/rsync.go#L75), unlike refreshing between LXD servers via the network (see https://github.com/canonical/lxd/blob/main/lxd/rsync/rsync.go#L160). This means a checksum is computed for each file instead of just checking mod-time and size. That is why lxc cp c1 c2 --refresh takes longer than lxc cp c1 remote:c2 --refresh.
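
Roughly, the difference corresponds to these two rsync invocations (the paths are placeholders for the mounted instance volumes, and the real invocations carry additional flags):

```
# same server/cluster: every file is checksummed before deciding what to transfer
rsync -a --checksum /path/to/source/ /path/to/target/

# between servers over the network: only mod-time and size are compared
rsync -a /path/to/source/ /path/to/target/
```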

Of course, if the remote is on a completely different server, this could still take longer depending on the network connection.

VMs

Refreshing VMs, on the other hand, requires transferring the entire block volume. On a local system, lxc cp v1 v2 --refresh performs this with dd if=/path/to/rbd/v1 of=/path/to/rbd/v2, which reads the data from Ceph and writes it back to the other volume.

In case the refresh is performed using a remote (lxc cp v1 remote:v2 --refresh), LXD invokes genericVFSMigrateVolume(), which reads the VM's block volume and sends it via websocket to the target, simply reading the file and copying it to the opened connection (see https://github.com/canonical/lxd/blob/main/lxd/storage/drivers/generic_vfs.go#L201).

Potential fixes

Btrfs, ZFS and Ceph RBD have an internal send/receive mechanism that allows for optimized volume transfer.

But from the code it looks like this only applies to ZFS and Btrfs, via their own send/receive functions. The Ceph rbd export-diff/import-diff functions are only used when copying instances for the first time.
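
For reference, Ceph's incremental mechanism looks roughly like this (pool, image, and snapshot names are placeholders; the target image must already have the starting snapshot):

```
# send only the changes between snap0 and snap1 and apply them to the target image
rbd export-diff --from-snap snap0 <pool>/v1@snap1 - | rbd import-diff - <pool>/v2
```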

roosterfish commented 10 months ago

The PRs #12708, #12715 and #12720 address all the findings from this issue. Adding support for optimized Ceph RBD volume refreshes is now tracked in https://github.com/canonical/lxd/issues/12721.

roosterfish commented 9 months ago

The original issue is described separately in https://github.com/canonical/lxd/issues/12721. We can close this issue as the other findings reported here are already fixed.