Disable direct IO on Btrfs based storage

ImBearChild commented 1 month ago

Qubes OS release

Qubes Release 4.2.3 (R4.2)

Brief summary

There shouldn't be a use of direct-io on Btrfs because it can lead to various problems.

This would cause CSUM error, which is described in the documentation of PROXMOX and there's also an related issue on the Bugzilla for Fedora. What's more, there are also related discussions on the Btrfs mailing list.

Even if direct IO does not cause errors, this setting would also prevent the built-in compression functionality of Btrfs from working properly.

Steps to reproduce

Run sudo losetup --list on a Btrfs based Qubes OS to see that DIO is listed as having a value of 1.

Expected behavior

Direct IO should be disabled.

Actual behavior

It's enabled on Btrfs based installation.

Note

The commit states that this behavior has been reverted, but I assume it's still being used since the not-script.c introduced in R4.2, which uses LO_FLAGS_DIRECT_IO. (I'm unfamiliar with the code of Qubes OS, so this might be incorrect.)

And maybe it leads to some regression on Btrfs, making qubes on Btrfs slow. I made this conjecture because the file "not-script.c" does not exist in R4.1. I'm not sure about this, because I can't simply disable the DIO for testing purposes; if someone could tell me how to do it, I'd be happy to try.

marmarek commented 1 month ago

@DemiMarie @rustybird ?

rustybird commented 1 month ago

My understanding is that Btrfs used to be more prone to checksum errors with direct I/O mainly because direct I/O exercises substantially different parts of the Btrfs codebase that had some bugs. Since then more attention has been paid to the combination, bugs have been fixed, and it's pretty calm nowadays.

OTOH, besides bugs in the Btrfs codebase itself, these checksum errors also happen(ed?) when a "misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device, so I don't know if that's relevant?

Personally I haven't experienced any such corruption. I've been running Btrfs with direct I/O enabled on my main Qubes OS notebook since R4.1, hacked into the old block script.

And maybe it leads to some regression on Btrfs, making qubes on Btrfs slow.

It's plausible that direct I/O leads to different allocation patterns, which could be worse or even pathological in some cases, maybe linked to specific SSD hardware (handling fragmentation especially badly) or maybe not. But if that's a common problem to the point where it makes direct I/O a bad trade-off, shouldn't we have seen way more reports of such a regression with R4.2 from users who had R4.1 working fine with Btrfs?

Even if direct IO does not cause errors, this setting would also prevent the built-in compression functionality of Btrfs from working properly.

Huh, that's an important caveat (previously stated on the linux-btrfs mailing list) I didn't have on my radar, because I never use compression.

Maybe the not-script should have a mechanism to disable direct I/O. That same mechanism could also be useful to override the 512 byte logical block size.

Or I guess for disabling direct I/O it could also automatically look at the file to see if it's hosted on Btrfs and has compression enabled?

ImBearChild commented 1 month ago

OTOH, besides bugs in the Btrfs codebase itself, these checksum errors also happen(ed?) when a "misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device, so I don't know if that's relevant?

Thank you for your hint. I did indeed encounter this issue. After some testing, it turned out that using the Xen PV Block driver in Windows qubes would cause a CSUM error, but Linux qubes and those not using the Xen PV Block driver wouldn't have such an issue. I have attached the error logs for reference:

dom0 kernel: BTRFS warning (device dm-1): csum failed root 256 ino 255232 off 4297732096 csum 0xc73039cd expected csum 0xaf05393c mirror 1
dom0 kernel: BTRFS warning (device dm-1): csum failed root 256 ino 255232 off 4297728000 csum 0xc73039cd expected csum 0xbea01aaf mirror 1
dom0 kernel: BTRFS error (device dm-1): bdev /dev/mapper/luks-fedc3ffe-b16b-45ec-b968-2a929b8f5a81 errs: wr 0, rd 0, flush 0, corrupt 268806, gen 0
dom0 kernel: BTRFS error (device dm-1): bdev /dev/mapper/luks-fedc3ffe-b16b-45ec-b968-2a929b8f5a81 errs: wr 0, rd 0, flush 0, corrupt 268807, gen 0
dom0 kernel: BTRFS warning (device dm-1): direct IO failed ino 255232 op 0x0 offset 0x1002a2000 len 4096 err no 10
dom0 kernel: BTRFS warning (device dm-1): direct IO failed ino 255232 op 0x0 offset 0x1002a3000 len 4096 err no 10
dom0 kernel: I/O error, dev loop28, sector 8393992 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 2

I've been running Btrfs with direct I/O enabled on my main Qubes OS notebook since R4.1, hacked into the old block script.

Cool! To compare, can you reveal your Btrfs configuration? My Btrfs mount uses compress=zstd:1. Additionally, my dom0 shutdown time is longer than the LVM configuration on same hardware. Journalctl shows that file system synchronization took a long time and systemd had to kill corresponding processes in order for it to shut down. Maybe I should try not using compression? Or is using the option "autodefrag" helpful in alleviating this problem?

Also, I want to know why the default behavior of Qubes OS is to use COW on disk image files while Fedora and openSUSE disable COW for libvirtd's disk images.

Maybe the not-script should have a mechanism to disable direct I/O. That same mechanism could also be useful to override the 512 byte logical block size.

Making users have more choices is better indeed. Many virtual machine management programs support adjusting the cache mode. So it's good to have this feature in Qubes OS.

rustybird commented 1 month ago

"misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device

it turned out that using the Xen PV Block driver in Windows qubes would cause a CSUM error, but Linux qubes and those not using the Xen PV Block driver wouldn't have such an issue.

Looks like the untrusted qube side can cause checksum errors on the trusted dom0 side. I was hoping that the dom0 backend prevents this.

Can you easily reproduce the errors with a new Windows qube?

To compare, can you reveal your Btrfs configuration?

Sure: kernel-latest, default mount options plus noatime and since R4.2 discard=async (the default on modern kernels, but the installer adds discard which unfortunately now means discard=sync for Btrfs). Also since R4.2 I create the filesystem with the --checksum=xxhash.

I don't use Windows qubes though.

Additionally, my dom0 shutdown time is longer than the LVM configuration on same hardware. Journalctl shows that file system synchronization took a long time and systemd had to kill corresponding processes in order for it to shut down.

Try shutting down the qubes individually and/or take a look at the qubesd journal to see which .img files are causing long delays and could benefit from manual defragmentation. (Definitely don't use autodefrag for file-reflink pools. On Btrfs, defragmentation duplicates data shared across snapshots/reflinks so it must be done selectively.)

Also, I want to know why the default behavior of Qubes OS is to use COW on disk image files while Fedora and openSUSE disable COW for libvirtd's disk images.

There's currently an inherent amount of COW required by the Qubes OS storage API: #8767

So if a nocow file is reflinked into another, Btrfs still has to do a COW operation whenever a block is then modified for the first time in one of the files. nocow only avoids subsequent COW operations after that first write per block - but also totally disables data checksums (protection against bit rot) :(

DemiMarie commented 1 month ago

"misbehaving (...) application modifies a buffer in-flight when doing an O_DIRECT write". In our case the application is xen-blkback on a loop device

it turned out that using the Xen PV Block driver in Windows qubes would cause a CSUM error, but Linux qubes and those not using the Xen PV Block driver wouldn't have such an issue.

Looks like the untrusted qube side can cause checksum errors on the trusted dom0 side. I was hoping that the dom0 backend prevents this.

Is this a security vulnerability from an upstream Xen PoV?

ImBearChild commented 1 month ago

Try shutting down the qubes individually and/or take a look at the qubesd journal to see which .img files are causing long delays and could benefit from manual defragmentation.

It seems that the qubesd shutdown process is relatively quick on my machine; from the first "Refilinked file" message to "qubesdb-daemon: terminating," it takes only 8 seconds. However, systemd complains about "Syncing filesystems and block devices" taking too long (30 seconds), which causes it to be killed. Manual defragmentation does not seem to affect this "syncing" issue, but it has helped shorten my qubesd shutdown time.

dom0 systemd-shutdown[1]: Syncing filesystems and block devices.
dom0 systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 17446.

Is this a security vulnerability from an upstream Xen PoV?

Maybe not that bad? As far as I know, a CSUM error will prevent further read access, making the qube unstable. (But not damage the filesystem.) However, after I turned off the Windows qube, btrfs scrub and btrfs check show no errors found.

Can you easily reproduce the errors with a new Windows qube?

Yes. Here are steps to reproduce:

Prepare a new Windows qube
Install XEN PV Block driver from qubes tools installer and reboot
See Btrfs CSUM error in dmesg.

apparatius commented 1 month ago

I can reproduce this as well in Windows 10 qube with testsigning on:

bcdedit.exe /set testsigning on

And latest xenbus and xenvbd drivers installed from here: https://xenbits.xenproject.org/pvdrivers/win/

rustybird commented 1 month ago

Looks like the untrusted qube side can cause checksum errors on the trusted dom0 side. I was hoping that the dom0 backend prevents this.

Is this a security vulnerability from an upstream Xen PoV?

It would allow a malicious VM to DOS any backup procedure that (like the Qubes OS one) fails at the first read error and then doesn't back up the rest of the system (e.g. remaining VMs).

Although the more alarming angle might be dom0 causing a VM to unintentionally corrupt its internal data. Even if that's not necessarily a security vulnerability.

rustybird commented 1 month ago

IMHO we should indeed disable direct I/O in the not-script at the moment. @DemiMarie @marmarek

It's difficult to figure out who's technically wrong here in this stack of blkfront-on-Windows, blkback, loop devices, and Btrfs. Not to mention various manpages/docs that could all be more explicit about interactions with direct I/O.

But in practice, it looks like direct I/O can interfere with system backups, nullify compression, and break Windows VMs.

And it can't be ruled out that technically misused (by whatever part of the stack) direct I/O could cause real data corruption beyond a benign(?) checksum mismatch, especially considering that file-reflink can be hosted by XFS/ZFS/bcachefs too or even any filesystem, in degraded mode. There's also the still not quite dead legacy 'file' driver that's compatible with arbitrary filesystems, edit: does it use the not-script?

DemiMarie commented 1 month ago

@rustybird I can have the not-script check the filesystem type (via fstatfs(2)) and enable direct I/O only on ext4 or XFS (where it is a free performance win). There might be other filesystems (bcachefs?) where it is safe. ZFS users should be using the ZFS storage driver, which doesn’t have this issue.

rustybird commented 1 month ago

Do we know that it is always safe for XFS and ext4 though? That Btrfs commit said it's simply application misbehavior to modify the buffer during a direct I/O write. (Which isn't explictly mentioned for O_DIRECT in open(2), but it makes sense. Come to think of it, blkback modifying the buffer during a write doesn't seem great even without direct I/O, right?) As non data checksumming filesystems, they have one less reason to look at the buffer twice. But maybe they do in some circumstances, e.g. depending on journalling options.

It's silly but I like to treat filesystems as if they're developed by ruthless "language lawyers" who will see the slightest incorrect use as an opportunity to deploy an optimization destroying the maximum allowable amount of data. So now I'm a bit spooked about direct I/O until it's been diagnosed what exactly is going on here.

DemiMarie commented 1 month ago

Modifying a buffer that is in-flight is a straightforward data race and undefined behavior. No application should ever do this. The only difference here is that when the junk data is read, BTRFS returns an error, while XFS/ext4 return garbage.

Theoretically, what BTRFS is doing is better. However, most applications don’t handle I/O errors well, so the BTRFS behavior is a persistent denial of service.

QubesOS / qubes-issues