canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Performance regression with `lvm` (non-thin) on `latest/edge` snap #14341

Open simondeziel opened 1 week ago

simondeziel commented 1 week ago

We noticed a while ago that the `tests/storage-vm lvm` job from lxd-ci is much slower with `latest/edge` than with `5.21/edge`.

Here are some logs taken from CI runs comparing the 2 snap channels.

storage-vm lvm (latest/edge - 24.04) taking ~44 minutes to complete:

==> Checking VM can be migrated with snapshots (different storage pool)
+ lxc copy v1 localhost:v2 -s vmpool-lvm-33922 --stateless
Transferring instance: v2/snap0: 82.51MB (82.50MB/s)
Transferring instance: v2/snap0: 163.06MB (81.51MB/s)
Transferring instance: v2/snap0: 243.90MB (81.29MB/s)
...
Transferring instance: v2/snap0: 3.53GB (233.38MB/s)                                                    
Transferring instance: v2: 785.26MB (785.25MB/s)
...
Transferring instance: v2: 3.60GB (449.27MB/s)

storage-vm lvm (5.21/edge - 24.04) taking ~24 minutes to complete:

==> Checking VM can be migrated with snapshots (different storage pool)
+ lxc copy v1 localhost:v2 -s vmpool-lvm-34072 --stateless
Transferring instance: v2/snap0: 315.72MB (315.72MB/s)
Transferring instance: v2/snap0: 631.64MB (315.82MB/s)
...
Transferring instance: v2/snap0: 3.68GB (73.25MB/s)                                                   
Transferring instance: v2: 745.92MB (745.91MB/s)
...
Transferring instance: v2: 3.55GB (317.22MB/s)

In both cases, we see that transferring v2/snap0 is much slower than transferring v2 itself. However, in the latest/edge case, the snapshot transfer is noticeably slower than that of 5.21/edge.

In those 2 CI runs, the GHA runners use the exact same 24.04 image, so only LXD's snap and its core2X base differ: 5.21/edge uses core22 while latest/edge uses core24. To rule out an lvm2 version issue, I used lvm.external=true in https://github.com/canonical/lxd-ci/pull/328 and got identical results, which seems to indicate a regression introduced in LXD itself between stable-5.21 and main.
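For reference, switching to the host's lvm2 tools is a snap option; a sketch of how to toggle it (I'm assuming the usual `snap set` form here, with the option name taken from the PR above):

```shell
# Tell the LXD snap to use the host's lvm2 tools instead of the bundled ones
sudo snap set lxd lvm.external=true

# The snap's configure hook should pick this up; reload the daemon if needed
sudo systemctl reload snap.lxd.daemon
```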

One way to compare CI logs is to download raw logs from storage-vm lvm (latest/edge - 24.04) and storage-vm lvm (5.21/edge - 24.04). Once downloaded, they can be stripped of their datestamp prefix with sed:

$ sed 's/^[^Z]\+Z //' lvm-latest.raw > lvm-latest.txt
$ sed 's/^[^Z]\+Z //' lvm-521.raw > lvm-521.txt

meld can then compare them line by line (meld lvm-latest.txt lvm-521.txt). This is the method I used to extract the log snippets above. Both the .raw and .txt files are included in the attached tarball. Due to GH policy, .tgz files cannot be attached, so I added a .txt placeholder.
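As a quick sanity check, the sed expression above can be exercised on a single sample line (the line content here is made up; the timestamp format is the GHA-style RFC 3339 prefix ending in "Z "):

```shell
# A sample raw log line with its timestamp prefix
line='2024-10-25T12:34:56.7890123Z ==> Checking VM can be migrated'

# Same expression as above: drop everything up to and including the first "Z "
printf '%s\n' "$line" | sed 's/^[^Z]\+Z //'
# ==> Checking VM can be migrated
```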

tomponline commented 1 week ago

@simondeziel please can you try building LXD from main (on an Ubuntu 22.04 system) and then sideloading it into the 5.21/edge snap and repeating the tests. This will help us rule out LXD itself and hopefully narrow it down to something in the latest/edge snap packaging (with the most likely candidate being the core24 base snap, I expect).
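A sketch of that sideload, assuming the snap's documented lxd.debug mechanism (paths and the reload step are taken from LXD's debugging docs; adjust to your setup):

```shell
# On an Ubuntu 22.04 builder with Go installed:
git clone https://github.com/canonical/lxd
cd lxd
make   # produces the lxd binary under $(go env GOPATH)/bin

# On the machine running the 5.21/edge snap:
sudo cp "$(go env GOPATH)/bin/lxd" /var/snap/lxd/common/lxd.debug
sudo systemctl reload snap.lxd.daemon   # picks up the sideloaded binary
```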