Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

Demystifying needed options with QubesOS pool in btrfs reflink (multiple cow snapshots rotating, beesd dedup and load avg hitting 65+ on 12 cores setup) #283

Open tlaurion opened 1 week ago

tlaurion commented 1 week ago

I'll revisit this issue and update this post with details as I gather information along.

First, sharing the current frozen screen: the system is stuck in iowait, writing changes to disk, after running beesd on a QubesOS install deployed with qusal. That deployment means a lot of clones were created from base minimal templates and then specialized by installing different packages in the derived templates, while the origin templates were also updated. So the clones started out identical to their origins, the disk images in the reflink pool then diverged, and bees deduplicated the extents that stayed the same.

Notes:

(photo attached: PXL_20240627_143958263.jpg)


@tasket @Zygo: do you have any guidelines on btrfs settings that are known to work well for CoW disk images under virtualization, and more specifically for the QubesOS btrfs use case? What should be tweaked? I'm willing to reinstall and restore from backup if needed, but my current understanding is that most of this can/should be tweakable via balancing, fstab options, or btrfstune without needing to reinstall.

Any insights welcome. To repeat: if I defrag, the deduplication is undone and performance goes back to normal. Thanks for your time.

tasket commented 1 week ago

@tlaurion Having recently helped someone (and myself) with degraded Btrfs performance, two things stood out:

What I prescribe is pretty simple:

Also:

Worth trying: the 'ssd_spread' option, if it has any effect on metadata

Batch defrag is by far the most important factor above, IMO. Using a margin of additional space in exchange for smooth operation seems like a small price to pay (note that other filesystems make space/frag tradeoffs automatically). Making the defrag a batch op, with days in between, gives us some of the best aspects of what various storage systems do, preserving responsiveness while avoiding the worst write-amplification effects.
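For concreteness, a sketch of what a batch defrag job could look like; the 256K target and the /var/lib/qubes path match the defragment command used later in this thread, but both are assumptions to adjust for your own pool.

```sh
#!/bin/sh
# Hypothetical /etc/cron.weekly/btrfs-defrag -- batch defrag of the Qubes pool.
# Merges extents smaller than 256K; note that defragmenting un-shares
# reflinked/deduped extents, so space usage grows until bees re-dedupes.
btrfs filesystem defragment -r -t 256K /var/lib/qubes
```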

'autodefrag' will make baseline performance more like tLVM and could also increase write-amplification more than other options. But it can help avoid hitting a performance wall if for some reason you don't want to use btrfs fs defrag.
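If you do want autodefrag, it is just a mount option; a minimal sketch, assuming the root-filesystem fstab entry quoted later in this thread:

```sh
# Enable autodefrag on the running system without a reboot:
sudo mount -o remount,autodefrag /

# Or persistently, by appending it to the btrfs option list in /etc/fstab:
#   ...  btrfs  subvol=root,ssd_spread,space_cache=v2,autodefrag  0 0
```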

Long term, I would suggest someone convince the Btrfs devs to take a hard look at the container/VM image use case so they might help people avoid these pitfalls. Qubes could also help here: if we created one subvolume per VM and used subvolume snapshots instead of reflinks, the filesystem could run in 'nodatacow' mode and you would get a metadata performance profile closer to NTFS, with generally less fragmentation because not every re-write would create detached/split extents. Qubes could also create a designation for backup snapshots, including them in the revisions_to_keep count.
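To illustrate the one-subvol-per-VM idea (not something Qubes does today), a rough sketch; the paths are hypothetical:

```sh
# Dedicated subvolume per VM, marked NOCOW so image re-writes don't fragment
# (chattr +C on a directory only affects files created afterwards, and it
# disables checksums/compression for those files):
btrfs subvolume create /var/lib/qubes/appvms/work
chattr +C /var/lib/qubes/appvms/work

# Keep revisions as read-only subvolume snapshots instead of reflink copies:
btrfs subvolume snapshot -r /var/lib/qubes/appvms/work \
    /var/lib/qubes/appvms/work-back1
```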

With that said, all CoW filesystems have these same issues. Unless some new mathematical principle is applied to create a new kind of write-history, the trade-offs will be similar across different formats. We also need to reflect on what deduplication means for active online systems and the degree to which it should be used; the fact that we can dedup intensively doesn't mean the practice isn't better left to archival or offline 'warehouse' roles (one of Btrfs' target use cases).


FWIW, the Btrfs volume I use most intensively has some non-default properties:

Probably 'no-holes' has the greatest impact. I suspect the jbod hurts performance slightly. The safety margins on Btrfs are such that I'd feel safe turning off RAID1 metadata if it enhances performance. Also, I never initiate balancing (I've had no reason to).
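For anyone comparing against their own pool: the enabled features show up in dump-super (as in the output below), no-holes can be turned on after mkfs with btrfstune, and the metadata profile can be converted with a filtered balance. A sketch; the device and mount point are placeholders:

```sh
# Check which features are enabled (NO_HOLES appears under incompat_flags):
sudo btrfs inspect-internal dump-super /dev/mapper/luks-XXXX | grep -A8 incompat_flags

# Enable no-holes on an existing, unmounted filesystem:
sudo btrfstune -n /dev/mapper/luks-XXXX

# What "turning off RAID1 metadata" would look like on a multi-device fs:
sudo btrfs balance start -mconvert=single /mnt/pool
```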

tlaurion commented 5 days ago

Better, but not quite there yet.

Some options still unchanged from the Qubes OS 4.2.1 installer's filesystem creation defaults:

(130)$ sudo btrfs inspect-internal dump-super /dev/mapper/luks-5d997862-6372-4574-aa47-563060917b19
superblock: bytenr=65536, device=/dev/mapper/luks-5d997862-6372-4574-aa47-563060917b19
---------------------------------------------------------
csum_type       0 (crc32c)
csum_size       4
csum            0x7e1ea852 [match]
bytenr          65536
flags           0x1
            ( WRITTEN )
magic           _BHRfS_M [match]
fsid            d6cf356b-495b-4b99-bd6d-1071f51cf1ef
metadata_uuid       00000000-0000-0000-0000-000000000000
label           qubes_dom0
generation      153330
root            3625061007360
sys_array_size      129
chunk_root_generation   90053
root_level      0
chunk_root      3158019358720
chunk_root_level    1
log_root        3625006022656
log_root_transid (deprecated)   0
log_root_level      0
total_bytes     1973612969984
bytes_used      486205415424
sectorsize      4096
nodesize        16384
leafsize (deprecated)   16384
stripesize      4096
root_dir        6
num_devices     1
compat_flags        0x0
compat_ro_flags     0x3
            ( FREE_SPACE_TREE |
              FREE_SPACE_TREE_VALID )
incompat_flags      0x371
            ( MIXED_BACKREF |
              COMPRESS_ZSTD |
              BIG_METADATA |
              EXTENDED_IREF |
              SKINNY_METADATA |
              NO_HOLES )
cache_generation    0
uuid_tree_generation    153330
dev_item.uuid       feaab371-72eb-488a-ae0a-923cf57cf6f2
dev_item.fsid       d6cf356b-495b-4b99-bd6d-1071f51cf1ef [match]
dev_item.type       0
dev_item.total_bytes    1973612969984
dev_item.bytes_used 1973611921408
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size    4096
dev_item.devid      1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0

fstab:

# BTRFS pool within LUKSv2:
UUID=d6cf356b-495b-4b99-bd6d-1071f51cf1ef           /                       btrfs   subvol=root,x-systemd.device-timeout=0,ssd_spread,space_cache=v2 0 0 #w/o  autodefrag, w/o discard=async, w/o compress=zstd (incompatible with bees?)
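A quick way to confirm which of those options are actually active on the live mount (a sketch; the compression check only matters if compress is ever re-enabled):

```sh
# Effective mount options for the root btrfs filesystem:
findmnt -no OPTIONS /

# Per-path compression property (prints nothing if none is set):
sudo btrfs property get /var/lib/qubes compression
```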

Ran sudo btrfs filesystem defragment -r -t 256K /var/lib/qubes

Still: (screenshot attached: 2024-07-03-112521)

It happens on the cp --reflink=always calls wyng makes, with beesd enforced (but not currently running in the background). See the amount of write IO without any reads? I'm a bit confused here about what to tweak, @Zygo.

@tasket: I thought reflink was not supposed to copy the image, just reference the existing disk image. I'm really not sure I understand what is happening here, nor how to dig into it deeper.

tasket commented 4 days ago

> I thought reflink was not supposed to copy the image, just reference the existing disk image. I'm really not sure I understand what is happening here, nor how to dig into it deeper.

Reflink copy will duplicate all the extent information from the source file's metadata to the dest file. It's not like a hard link (which is just one pointer to an inode); the metadata copy is usually much bigger. I am pretty sure Wyng is using reflink copy the same way the Qubes Btrfs driver is. One difference is that after making reflinks, Wyng creates a read-only subvol snapshot, reads extent metadata from it, then deletes the snapshot (when it displays "Acquiring deltas"). You might try looking at a 'top' listing during that phase to see if there is anything unusual. For volumes over a certain size (about 128GB), Wyng will use a tmp directory in /var instead of /tmp; the more complex/deduped a large volume is, the more data it will write to /var (it's vaguely possible that's creating your spike, but unlikely). Also check for swap activity.
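A couple of concrete ways to check both points; the image path is illustrative, not taken from this system:

```sh
# How much of a reflinked image is actually shared vs. exclusive:
sudo btrfs filesystem du -s /var/lib/qubes/appvms/work/private.img

# Per-extent view of one file; reflinked extents carry the "shared" flag:
sudo filefrag -v /var/lib/qubes/appvms/work/private.img | head -n 30

# Swap activity while the backup runs (si/so should stay near zero):
vmstat 2
```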

tasket commented 4 days ago

PS: Look at btrfs subvolume list / to check if there are any extra/stray subvolumes on that filesystem. You should see only the default and the one you made for /var/lib/qubes.
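For completeness, a sketch of that check, plus cleanup of a stray snapshot if one turns up (the stray path is hypothetical):

```sh
# List every subvolume on the root filesystem:
sudo btrfs subvolume list /

# Remove a leftover snapshot once you are sure nothing references it:
sudo btrfs subvolume delete /path/to/stray-snapshot
```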