Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

Multiple iSCSI fileIO Block Devices on One Btrfs Disk with bees #263

Open a-priestley opened 10 months ago

a-priestley commented 10 months ago

I'm currently working on a project with the goal of setting up a remote game streaming service with simultaneous client capabilities, with a focus on efficient use of storage available over network.

For network storage availability, I'm using iSCSI fileIO image backstores -- one per client.

For the streaming service, I'm using wolf, which is a containerized service that dynamically spins up headless streaming displays using gstreamer for clients to connect to using moonlight.

The problem I am trying to solve when using this service is related to storage use. Each client would need its own filesystem, which means that storage use for duplicate files on the underlying disk would be multiplied by the number of clients using those files. Basically, two clients with the same game installed will double the storage requirements for that game.

The general idea is to use bees to reduce the storage requirements for these duplicate files. So I've currently got a 2TB hybrid drive formatted as a partitionless btrfs, which I'm running bees on, with a 256MB DB_SIZE. Inside the volume I've created an iSCSI target on a sparse image file.
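
For reference, the bees side of this is just a small per-filesystem config file plus the systemd unit - roughly like this (a sketch, not my exact config; the UUID is the one visible in the mount path of the log lines below):

# /etc/bees/424c18a1-e198-45af-b13e-7e64654e9e68.conf
UUID=424c18a1-e198-45af-b13e-7e64654e9e68
DB_SIZE=$((256*1024*1024))   # 256MB hash table, specified in bytes
# then: systemctl enable --now beesd@424c18a1-e198-45af-b13e-7e64654e9e68.service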

For testing purposes, I'm connecting to the target on the same machine that I'm hosting from, and I've mounted the resulting block device as ext4, which I'm now testing out as a Steam library.
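
Concretely, the loopback test is something like this (device name and mount point are illustrative):

# discover and log in to the target on the same host
iscsiadm -m discovery -t sendtargets -p 127.0.0.1
iscsiadm -m node -p 127.0.0.1 --login
# the LUN shows up as a new block device, e.g. /dev/sdb
mkfs.ext4 /dev/sdb
mount /dev/sdb /mnt/steam-library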

Unfortunately, at this point I'm running into some trouble. While many titles install and run with no issue, others, particularly larger ones, cannot be verified by Steam due to disk read errors. I can see these coming in as btrfs corruption errors in the journal, as well as beesd crawl errors such as:

Aug 18 10:28:45 bob beesd[837]: crawl_5_258[1016]: scan bbd BeesBlockData { 4K 0x1a425b17000 fd = 10 '/run/bees/mnt/424c18a1-e198-45af-b13e-7e64654e9e68/games.img', address = 0x118e523b000 }
Aug 18 10:28:45 bob beesd[837]: crawl_5_258[1016]: Extent { begin = 0x1a420000000, end = 0x1a428000000, physical = 0x118df724000, flags = 0, physical_len = 0x8000000, logical_len = 0x8000000 } block_count 32768
Aug 18 10:28:45 bob kernel: BTRFS error (device sda): bdev /dev/sda errs: wr 0, rd 0, flush 0, corrupt 4107, gen 0
Aug 18 10:28:45 bob beesd[837]: crawl_5_258[1016]: scan extent Extent { begin = 0x1a420000000, end = 0x1a428000000, physical = 0x118df724000, flags = 0, physical_len = 0x8000000, logical_len = 0x8000000 }

I'm trying to ascertain whether or not this project is even feasible. Is it advisable to use bees in this way? If anyone thinks it could work, are there any pointers you can give on how I can tweak the service?

Thanks!

kakra commented 10 months ago

Try using chattr +m on your fileIO backend storage directory. It disables compression for newly created files, so you'd need to re-create the files (not just move, but copy, maybe with rsync); newly created directories will inherit the flag. If that does not work, try disabling direct io in the iSCSI service. I've seen similar issues when using qemu with direct io raw files; running with qcow2 and chattr +m works fine. So I believe it's either a combination of both, or probably just direct io: btrfs doesn't seem to handle direct io well. While the files themselves seem to be intact, btrfs will show corruption errors because checksums don't match.

a-priestley commented 10 months ago

Hi @kakra and thanks for the quick response.

Just so there's no confusion, you're suggesting I do chattr +m on the directory where the .img file is stored? And not the path where the resulting block device is mounted, right? I created the fileIO image directly on the path of the btrfs mount point so it's sharing a directory with @beeshome. Maybe I should move it down to its own directory so that the attribute does not affect the subvolume?

kakra commented 10 months ago

Then I suggest creating a subvol for the images: btrfs sub create /mnt/point/tobtrfs/images, then chattr +m that image directory. Now you can rsync -av --remove-source-files *.img images/ to copy all img files over without reflinking; each source file is removed automatically after copying, to save some space. Stop bees while doing it so it won't try to re-share the newly created extents.

Validate with lsattr that the m is set.
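
Roughly, the whole sequence would look like this (paths and the unit name are illustrative):

# stop bees first so it won't dedupe the fresh copies
systemctl stop beesd@<filesystem-uuid>.service
# dedicated subvolume for the images, with compression disabled for new files
btrfs subvolume create /mnt/point/tobtrfs/images
chattr +m /mnt/point/tobtrfs/images
# copy (not move) the images so new, uncompressed extents are written;
# --remove-source-files deletes each original after it has been copied
rsync -av --remove-source-files /mnt/point/tobtrfs/*.img /mnt/point/tobtrfs/images/
# verify the 'm' flag on the directory and on the copied file
lsattr -d /mnt/point/tobtrfs/images
lsattr /mnt/point/tobtrfs/images/games.img
# restart bees afterwards
systemctl start beesd@<filesystem-uuid>.service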

a-priestley commented 10 months ago

While carrying out your instructions, I discovered that the image file itself already has the m attribute. Wouldn't this mean that compression is already disabled? If so, I can still try disabling iSCSI direct IO.

kakra commented 10 months ago

I discovered that the image file itself already has the m attribute. Wouldn't this mean that compression is already disabled?

Okay, so your iSCSI tools already take care of that. Then disabling direct io may fix it. Cached io may work best if that is available, and it should be safe because CoW is still enabled on the files. Also, check if auto-defrag in btrfs is enabled and maybe try disabling it.
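
Checking for autodefrag is quick (mount point illustrative):

# see whether autodefrag is among the active mount options
findmnt -no OPTIONS /mnt/point/tobtrfs | tr ',' '\n' | grep defrag
# if it is, remount without it
mount -o remount,noautodefrag /mnt/point/tobtrfs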

I don't think the observed problems are bees' fault. It only uncovers a flaw in btrfs with concurrent access in the direct io path. You should be able to observe it even with bees not running, just high IO load on the images.

a-priestley commented 10 months ago

Thanks for the advice. I did make sure not to set the autodefrag mount option, as per your documentation. I've created the qcow2 file and am now testing it out as a Steam library. I'll report back with the result!

kakra commented 10 months ago

If you suffer performance problems, you should consider a kernel patch to put btrfs meta-data on a dedicated SSD, also maybe use bcache to cache IO reads and writes. I have some patches here for metadata-{preferred,only} partitions: https://github.com/kakra/linux/pull/26

BTW: Meanwhile I converted to two NVMe disks for dedicated meta-data in a btrfs-raid1 setup and bcache in a mdraid1 setup. Remaining HDDs are still the same.

a-priestley commented 10 months ago

So I'm currently installing Baldur's Gate 3. It hasn't failed yet, but I am getting lots of journal errors as before. Improving performance may be ill-fated at this point, but I'll wait until it finishes before coming to any conclusions.

kakra commented 10 months ago

Errors from bees logged to journal? That's okay, I think. bees is very chatty about situations it didn't expect - which happens while writing to the image files.

If you're no longer seeing complaints from the kernel, everything is fine.

a-priestley commented 10 months ago

It's mostly crawl entries from bees, but I am getting a few BTRFS kernel errors, although they may be a result of the old .img file still being present on the volume -- bees does seem to be detecting duplicate content between the .img and the .qcow2. If this doesn't work, I'll remove them and start from scratch.

kakra commented 10 months ago

Yeah, the checksums of the img files are borked... But bees creates its own checksums, so it still detects the duplicates and tries to read the files which btrfs refuses to read - probably resulting in the qcow images becoming damaged too, in the end.

Zygo commented 10 months ago

The errors will likely persist until the old .img is removed. bees will try to read every reachable block on the filesystem, so if some of them still have csum errors, then bees will eventually find (and complain about) all of them. bees will only try reading each block reference once, so it will skip over the extents with errors as it finds them. There can be multiple references to the same extent with a bad block, but there's a finite number of those, and bees should eventually run out of references to try. I run bees on a large enough fleet that there's always some drive failing somewhere, so this error detection and recovery path is fairly well tested.

bees might create new references to extents that contain errors; however, cp --reflink or snapshots can also do this, and are more likely to, because bees will not create a reference to data blocks it has not first read successfully. The other commands don't try to read the data, so they can very easily create new references to extents with errors.
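
As an illustration (file names hypothetical):

# a reflink copy clones extent references without reading the data,
# so it will happily reference extents whose csums are already bad
cp --reflink=always games.img games-clone.img
# an actual read is what trips the csum check
cat games-clone.img > /dev/null    # reports the same I/O errors as the original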

bees should not be creating any new errors. The kernel dedupe ioctl doesn't allow modifications of data if one side of the dedupe isn't readable, and the inode-level locks during dedupe should prevent concurrent direct IO. bees does add a lot of IO workload to the system, so it can make any existing data corruption bug worse, especially if the corruption is caused by a race condition (as direct IO with concurrent in-memory data modification is). Historically there have been several kernel bugs found and fixed over the years. Direct IO is an exception because its behavior isn't considered a kernel bug for some reason.

Zygo commented 10 months ago

bees should eventually run out of references to try

There is one exception to this: if a block goes bad after a block is stored in the bees hash table (e.g. due to device-level data corruption), bees will keep hitting that bad block every time it reads a new block with a matching hash. That continues until the bad block is deleted (then the hash won't match and bees will remove the hash table entry). This could be handled differently, e.g. bees could detect the read error and remove the hash table entry immediately, or bees could simply exit when any IO error is detected. Right now bees assumes all errors are temporary, and tries to continue after skipping the task that found the error.

That shouldn't happen in this case, because data csums corrupted by direct IO are bad starting at the time of their creation. Data blocks aren't stored in the page cache when using direct IO, so the csum failure can't be bypassed by reading the block from cache (which wouldn't verify the data against the csum). The combination of those would mean there's no way for bees to read the block, which would prevent the block from ever reaching the bees hash table.

a-priestley commented 10 months ago

Hello and thanks for the info @Zygo. I've started over by removing everything from the filesystem. Fresh .qcow2 in a dedicated uncompressed subvolume, connected over iSCSI and mounted as ext4. The installation is running better than before, but it's not completely error-free (a few btrfs errors but not nearly as many as before).

Example: BTRFS error (device sda): bdev /dev/sda errs: wr 0, rd 0, flush 0, corrupt 1, gen 0

This will take several hours to complete, though, so this may change.

Again, thank you for your interest and patience!

a-priestley commented 10 months ago

No good I'm afraid... As the installation progressed, more and more kernel and beesd errors started showing up, and ultimately it resulted in failure with a disk read error in Steam during the install verification. As @kakra pointed out, it's likely that this isn't a problem with bees, and I have read through a few forum posts saying that fileIO backstores for iSCSI on top of btrfs just aren't a good idea. And without capabilities akin to ZFS zvols enabling block backstores, btrfs might not be up to the task yet. I'm happy to make more attempts at this in case your team finds the information useful, but otherwise I won't take up any more of your time.

Forza-tng commented 10 months ago

@a-priestley The fact that you are getting errors is worrisome. The /dev/sda device where you see the errors, is that inside the initiator (iSCSI client) OS or on the host device where the iscsi target images are stored?

What iscsi target software do you use (fileio sounds like LIO target?) and what settings do you use for exporting each target? I have had issues with LIO fileio targets over the years (which I am sure @Zygo remembers from #btrfs). Those issues have mostly been errors when initiators reboot or when the target reboots. Though I did not have csum errors on the clients unless I used the aio option in the fileio backingstore. Nowadays I use tgtd instead, and have had no further issues at all.

a-priestley commented 10 months ago

hi @Forza-tng,

I was using targetcli to set up my fileIO backstores. I chose it simply because it is currently in-kernel, and the Arch Wiki leans toward it. I have not tried tgtd. For the most recent attempt, I created the backstore using a sparse .qcow2 with a max size of the total capacity of the disk (which I never even got close to filling up). I'm not seeing anything about an "aio" option though.
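
For reference, the backstore and target were created roughly like this (IQN, size, and paths are illustrative, and the portal/ACL setup is omitted):

targetcli /backstores/fileio create name=games file_or_dev=/mnt/point/tobtrfs/images/games.qcow2 size=2T sparse=true
# (some targetcli versions also accept write_back=true here to use cached rather than direct IO)
targetcli /iscsi create iqn.2003-01.org.linux-iscsi.bob:games
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.bob:games/tpg1/luns create /backstores/fileio/games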

Forza-tng commented 10 months ago

hi @Forza-tng,

I was using targetcli to set up my fileIO backstores. I chose it simply because it is currently in-kernel, and the Arch Wiki leans toward it. I have not tried tgtd. For the most recent attempt, I created the backstore using a sparse .qcow2 with a max size of the total capacity of the disk (which I never even got close to filling up). I'm not seeing anything about an "aio" option though.

I believe that aio has to be set in the savedconfig.json, so it is unlikely to be the issue.

I also can't see how using qcow2 instead of raw should make any difference to data corruption, unless there's a bug in the fileio driver itself.

Where are you seeing the errors, on the host dmesg or in the clients?

About tgtd: it's an iSCSI server written in user-space and does not need any kernel modules. Personally, I have found this to be a more stable approach. At work I recently converted a rather large storage server from targetcli (LIO) to tgtd. The server uses btrfs too.

A little while back I wrote a small wiki entry on setting up tgtd. https://wiki.tnonline.net/w/Blog/iSCSI_target_in_user-space#Configuration_example
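
For comparison, a minimal target definition in /etc/tgt/targets.conf looks roughly like this (IQN, path, and network are illustrative; details are in the wiki page above):

<target iqn.2024-08.net.tnonline:games>
    backing-store /mnt/point/tobtrfs/images/games.img
    initiator-address 192.168.1.0/24
</target>
# apply the configuration without restarting tgtd:
# tgt-admin --update ALL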

a-priestley commented 10 months ago

Well to be honest with you, I think Steam just doesn't like btrfs for reasons I don't fully understand. I did some more testing with it using just btrfs, freshly formatted and mounted as a games library on a completely separate disk -- no deduplication. The same issues were happening: Steam downloads the files, tries to verify them, a bunch of corruption errors show up, and the process fails with a "disk read error". It seems to correlate with reading very large files.

I'm not sure what to make of it, but I do know that the Steam Deck ships on ext4. There's probably a good reason for that.

kakra commented 10 months ago

Well to be honest with you, I think Steam just doesn't like btrfs for reasons I don't fully understand. I did some more testing with it using just btrfs, freshly formatted and mounted as a games library on a completely separate disk -- no deduplication. The same issues were happening: Steam downloads the files, tries to verify them, a bunch of corruption errors show up, and the process fails with a "disk read error". It seems to correlate with reading very large files.

This problem does not exist here (running completely on btrfs), and Steam has actually explicitly supported btrfs for 2+ years according to a dev I've chatted with: Proton uses reflink copies of the wine prefix to clone new prefixes per game to save space, and this will probably be ported to the Steam Deck once btrfs supports case-folding.

My library contains over 2 TB of downloaded games, and not even one game has failed to verify - not in the past, and not now; all files are pristine. bees is running on the library and finds a lot of duplicate extents, so it's also not just write-once data.

"correlate with reading very large files" more likely indicates a statistical observation: your hardware may introduce bit errors or your storage software stack on the lower levels may introduce cache inconsistencies (like direct IO on btrfs) and it is more likely to be visible in large files. Did your test really use a native disk? Or some software block device? Any pre-fail conditions in smartctl? Did you check the Steam logs which files at which location really failed checks?

Also: "completely separate disk" may not mean much: Steam may download to a temp folder first, then move files over to the library. So verification may have failed early on the temp folder.

a-priestley commented 10 months ago

It turns out I have been having memory errors for the past while. One of my RAM modules appears to be faulty. After taking it out, none of the errors I've been seeing are happening any longer. I'll have to reassess the situation here.