tlaurion opened 1 week ago
@tlaurion Having recently helped someone (and myself) with degraded Btrfs performance, two things stood out:
btrfs fs defrag -r -t 256K /var/lib/qubes
made lagging filesystems performant again. Note this is not 'autodefrag'. For multi-terabyte filesystems, consider values larger than 256K.

fstrim can be a trigger for sudden onset of poor performance. (This also happens with tLVM, except in that case it can quickly degrade into a corrupted pool.) I'd expect deduplication to have a similar impact.

What I prescribe is pretty simple:
Also:

- Running the wyng monitor command in-between infrequent backups will reduce the snapshot frag footprint to near zero each time it's run. Frequent backups also have this effect. (And it's now possible to use wyng receive --use-snapshot as a kind of stand-in for Qubes' volume revert feature, having a single snapshot serve both roles.)
- Worth trying: the 'ssd_spread' mount option, if it has any effect on metadata.
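One way to schedule the periodic monitor run is a cron entry; this is an illustrative sketch only, and the wyng install path and cadence below are assumptions, not something from this thread:

```
# /etc/cron.d/wyng-monitor -- illustrative; adjust path and schedule to taste
# Run 'wyng monitor' daily to keep the snapshot frag footprint near zero
0 4 * * * root /usr/local/bin/wyng monitor
```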
Batch defrag is by far the most important factor above, IMO. Using a margin of additional space in exchange for smooth operation seems like a small price to pay (note that other filesystems make space/frag tradeoffs automatically). Making the defrag a batch op, with days in between, gives us some of the best aspects of what various storage systems do, preserving responsiveness while avoiding the worst write-amplification effects.
'autodefrag' will make baseline performance more like tLVM and could also increase write-amplification more than other options. But it can help avoid hitting a performance wall if for some reason you don't want to use btrfs fs defrag.
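The batch-defrag idea above can be sketched as a small script; the size thresholds here are illustrative (the thread only suggests "larger than 256K" for multi-terabyte filesystems), and the commented btrfs invocation assumes the same /var/lib/qubes target used earlier:

```shell
#!/bin/sh
# Batch defrag sketch (NOT autodefrag): pick a -t extent target based on
# filesystem size, per the advice above. Thresholds are illustrative.

defrag_target() {
    # $1 = filesystem size in bytes; print a -t value for btrfs defragment
    fs_bytes=$1
    if [ "$fs_bytes" -ge $((4 * 1024 * 1024 * 1024 * 1024)) ]; then
        echo 1M          # multi-terabyte: use a larger extent target
    else
        echo 256K
    fi
}

# Example usage (commented so this sketch stays side-effect free):
# t=$(defrag_target "$(df -B1 --output=size /var/lib/qubes | tail -1)")
# btrfs filesystem defragment -r -t "$t" /var/lib/qubes

defrag_target $((2 * 1024 * 1024 * 1024 * 1024))   # 2 TB pool: prints 256K
```

Running this on a schedule "with days in between", as suggested, keeps the defrag a batch operation rather than a continuous one.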
Long term, I would suggest someone convince the Btrfs devs to take a hard look at the container/VM image use case so they might help people avoid these pitfalls. Qubes could help here as well: if we created one subvol per VM and used subvol snapshots instead of reflinks, then the filesystem could be in 'nodatacow' mode and you would have a metadata performance profile closer to NTFS, with generally less fragmentation because not every re-write would create detached/split extents. Qubes could also create a designation for backup snapshots, including them in the revisions_to_keep count.
With that said, all the CoW filesystems have these same issues. Unless some new mathematical principle is applied to create a new kind of write-history, then the trade-offs will be similar across different formats. We also need to reflect on what deduplication means for active online systems and the degree to which it should be used; the fact that we can dedup intensively doesn't mean that practice isn't better left to archival or offline 'warehouse' roles (one of Btrfs' target use cases).
FWIW, the Btrfs volume I use most intensively has some non-default properties:
Probably 'no-holes' has the greatest impact. I suspect the JBOD arrangement hurts performance slightly. The safety margins on Btrfs are such that I'd feel safe turning off RAID1 metadata if it enhances performance. Also, I never initiate balancing (for no particular reason).
Better, but not quite there yet.
Some options, unchanged from the Qubes OS 4.2.1 installer's filesystem-creation defaults:
(130)$ sudo btrfs inspect-internal dump-super /dev/mapper/luks-5d997862-6372-4574-aa47-563060917b19
superblock: bytenr=65536, device=/dev/mapper/luks-5d997862-6372-4574-aa47-563060917b19
---------------------------------------------------------
csum_type 0 (crc32c)
csum_size 4
csum 0x7e1ea852 [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
fsid d6cf356b-495b-4b99-bd6d-1071f51cf1ef
metadata_uuid 00000000-0000-0000-0000-000000000000
label qubes_dom0
generation 153330
root 3625061007360
sys_array_size 129
chunk_root_generation 90053
root_level 0
chunk_root 3158019358720
chunk_root_level 1
log_root 3625006022656
log_root_transid (deprecated) 0
log_root_level 0
total_bytes 1973612969984
bytes_used 486205415424
sectorsize 4096
nodesize 16384
leafsize (deprecated) 16384
stripesize 4096
root_dir 6
num_devices 1
compat_flags 0x0
compat_ro_flags 0x3
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID )
incompat_flags 0x371
( MIXED_BACKREF |
COMPRESS_ZSTD |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
NO_HOLES )
cache_generation 0
uuid_tree_generation 153330
dev_item.uuid feaab371-72eb-488a-ae0a-923cf57cf6f2
dev_item.fsid d6cf356b-495b-4b99-bd6d-1071f51cf1ef [match]
dev_item.type 0
dev_item.total_bytes 1973612969984
dev_item.bytes_used 1973611921408
dev_item.io_align 4096
dev_item.io_width 4096
dev_item.sector_size 4096
dev_item.devid 1
dev_item.dev_group 0
dev_item.seek_speed 0
dev_item.bandwidth 0
dev_item.generation 0
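The incompat_flags 0x371 line above can be decoded by hand; here's a small sketch using the BTRFS_FEATURE_INCOMPAT_* bit values from the kernel's on-disk format headers (only the flags relevant to this dump are handled):

```shell
# Decode a btrfs superblock incompat_flags value into feature names.
# Bit values follow the kernel's BTRFS_FEATURE_INCOMPAT_* constants.
decode_incompat() {
    flags=$(( $1 ))
    [ $((flags & 0x001)) -ne 0 ] && echo MIXED_BACKREF
    [ $((flags & 0x008)) -ne 0 ] && echo COMPRESS_LZO
    [ $((flags & 0x010)) -ne 0 ] && echo COMPRESS_ZSTD
    [ $((flags & 0x020)) -ne 0 ] && echo BIG_METADATA
    [ $((flags & 0x040)) -ne 0 ] && echo EXTENDED_IREF
    [ $((flags & 0x100)) -ne 0 ] && echo SKINNY_METADATA
    [ $((flags & 0x200)) -ne 0 ] && echo NO_HOLES
    return 0
}

decode_incompat 0x371   # prints the six flags dump-super listed above
```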
fstab:
# BTRFS pool within LUKSv2:
UUID=d6cf356b-495b-4b99-bd6d-1071f51cf1ef / btrfs subvol=root,x-systemd.device-timeout=0,ssd_spread,space_cache=v2 0 0 #w/o autodefrag, w/o discard=async, w/o compress=zstd (incompatible with bees?)
Ran
sudo btrfs filesystem defragment -r -t 256K /var/lib/qubes
Still:
This happens on cp --reflink=always and on wyng calls, with beesd enforced (but not currently running in the background).
See the amount of IO writes without reads? I'm a bit confused here about what to tweak, @Zygo.
@tasket: I thought reflink was not supposed to copy the image but to reference disk images. I'm really not sure I understand what happens here nor how to dig deeper.
> Thought reflink was not supposed to copy image but reference disk images. Really not sure I get an understanding of what happens nor how to dig that deeper.
Reflink copy will duplicate all the extent information in the source file's metadata to the dest file. It's not like a hard link (which is just one pointer to an inode) but usually much bigger. I am pretty sure Wyng is using reflink copy the same way the Qubes Btrfs driver is. One difference is that after making reflinks, Wyng creates a read-only subvol snapshot, reads extent metadata from it, then deletes the snapshot (when it displays "Acquiring deltas"). You might try looking at a 'top' listing during that phase to see if there is anything unusual. For volumes over a certain size (about 128GB) Wyng will use a tmp directory in /var instead of /tmp; the more complex/deduped a large volume is, the more data it will write to /var (it's vaguely possible this is creating your spike, but unlikely). Also check for swap activity.
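For reference, reflink-copy behavior can be observed with plain coreutils; this sketch uses --reflink=auto so it degrades to an ordinary copy on filesystems without CoW support (on Btrfs, the two files share extents until either is rewritten, which filefrag -v from e2fsprogs can show):

```shell
# Make a file, reflink-copy it, and confirm the copy has identical contents.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/src.img" bs=1M count=4 status=none
cp --reflink=auto "$tmpdir/src.img" "$tmpdir/dst.img"   # CoW clone where supported
cmp -s "$tmpdir/src.img" "$tmpdir/dst.img" && echo identical
rm -r "$tmpdir"
```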
PS: Look at the output of btrfs subvolume list / to check if there are any extra/stray subvolumes on that filesystem. You should see only the default and the one you made for /var/lib/qubes.
I'll revisit this issue and update this post with details as I gather more information.
First, sharing the current frozen screen while the system deals with iowait and writes changes to disk, after having run beesd on a Qubes OS system deployed with qusal. That means a lot of clones were created from base minimal templates and then specialized with different packages installed in the derived templates, and the clone origins were also updated. So the origins of the clones stayed intact, the cloned disk images in the reflink pool diverged over time, and bees deduplicated the extents that remained identical between them.
Notes:

The qvm-volume revert qube:volume helper permits the end user to revert up to two past states of a qube after shutting it down (e.g. after realizing they made a mistake such as wiping ~/), for up to two subsequent reboots of the qube, without needing to rely on backups to restore files/disk-image states.

@tasket @Zygo: do you have guidelines on proper Btrfs tuning, i.e. what is best known to work for CoW disk images in a virtualization context, and more specifically what should be tweaked for Qubes OS' use of Btrfs? I'm willing to reinstall and restore from backup if needed, though from my current understanding most of this can/should be tweakable via balancing/fstab/tunefs without needing to reinstall.
Any insights welcome. I repeat: if I defrag, the deduplication is undone and performance goes back to normal. Thanks for your time.