DemiMarie commented 3 years ago

The problem you're addressing (if any)

In R4.0, the default install uses LVM thin pools. However, LVM appears to be optimized for servers, which results in several shortcomings:

Space exhaustion is handled poorly, requiring manual recovery. This recovery may sometimes fail.
It is not possible to shrink a thin pool.
Thin pools slow down system startup and shutdown.

Additionally, LVM thin pools do not support checksums. This can be achieved via dm-integrity, but that does not support TRIM.

Describe the solution you'd like

I propose that R4.3 use BTRFS+reflinks by default. This is a proposal ― it is by no means finalized.

Where is the value to a user, and who might that user be?

BTRFS has checksums by default, and has full support for TRIM. It is also possible to shrink a BTRFS pool without a full backup+restore. BTRFS does not slow down system startup and shutdown, and does not corrupt data if metadata space is exhausted.

When combined with LUKS, BTRFS checksumming provides authentication: it is not possible to tamper with the on-disk data (except by rolling back to a previous version) without invalidating the checksum. Therefore, this is a first step towards untrusted storage domains. Furthermore, BTRFS is the default in Fedora 33 and openSUSE.

Finally, with BTRFS, VM images are just ordinary disk files, and the storage pool the same as the dom0 filesystem. This means that issues like #6297 are impossible.

Describe alternatives you've considered

None that are currently practical. bcachefs and ZFS are long-term potential alternatives, but the latter would need to be distributed as source and the former is not production-ready yet.

Additional context

I have had to recover manually from LVM thin pool problems (failure to activate, IIRC) on more than one occasion. Additionally, the only supported interface to LVM is the CLI, which is rather clumsy. The LVM pool requires nearly twice the amount of code as the BTRFS pool, for example.

Relevant documentation you've consulted

man lvm

Related, non-duplicate issues

5053

6297

6184

3244 (really a kernel bug)

5826

3230 ― since reflink files are ordinary disk files we could just rename them without needing a copy

3964

everything in https://github.com/QubesOS/qubes-issues/search?q=lvm+thin+pool&state=open&type=issues

Most recent benchmarks: https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1689640103

DemiMarie commented 5 months ago

@DemiMarie : Maybe OP should refer to those test results, kernel versions to state clearly what could be the causes of discrepancies between qubesos forum post showing better performance gains simply by switching to brtfs from thin lvm?

Would it be possible to re-run benchmarks now? I’d rather refer to up-to-date benchmarks than ones known to be stale.

no-usernames-left commented 5 months ago

openzfs shines in all aspect but licence

This is the perfect tl;dr.

Canonical has shown that this is a solved problem. We should do what they did and get on with other pressing issues such as Wayland, seL4, etc.

If that means we dump Fedora, even better; their release cadence is far too quick for dom0 IMHO. Debian would likely be a much better choice... or we could roll our own slim release for dom0, which would result in both a slower release cadence and a slimming of the TCB, both of which would be good.

no-usernames-left commented 5 months ago

Would it be possible to re-run benchmarks now?

Excellent idea — but let's include ZFS this time.

@Rudd-O you're probably the best-equipped for this, no?

no-usernames-left commented 5 months ago

As an aside, this mess of kernel vs userland vs filesystem, and the clusterfuck which is licensing, makes me appreciate FreeBSD all the more.

DemiMarie commented 5 months ago

We should do what they did and get on with other pressing issues such as Wayland, seL4, etc.

Wayland definitely needs to be implemented, and I’m going to be talking about GPU acceleration (which it will enable) at Xen Project Summit 2024. Right now, seL4 doesn’t provide sufficient protection against CPU vulnerabilities, so it isn’t an option.

tlaurion commented 5 months ago

@DemiMarie : Maybe OP should refer to those test results, kernel versions to state clearly what could be the causes of discrepancies between qubesos forum post showing better performance gains simply by switching to brtfs from thin lvm?

Would it be possible to re-run benchmarks now? I’d rather refer to up-to-date benchmarks than ones known to be stale.

@marmarek i guess this question from @DemiMarie was addressed to you for open-qa tests

@DemiMarie meanwhile, there is no cost at editing OP with past results so all people getting here are at least clear on the why's and versions related to brtfs having not been considered in the past because bad press, no?

tlaurion commented 5 months ago

On my side, I postponed building https://github.com/tlaurion/qubes-bees directly from qubes-builder v2 (which didn;t work last time I attempted that goal) to build the rpm directly under fedora-37 to produce an installable bees rpm. Will update https://forum.qubes-os.org/t/bees-and-brtfs-deduplication/20526/6 later on with some meat to chew on when I have it.

Testing to see and compare gains of deduped space with/without bees, as a start.

For those interested to test dedup gains on test machines only, here is the produced RPM (gzipped because github constraints).: bees-0.10-1.fc37.x86_64.rpm.gz

tlaurion commented 5 months ago

Some preliminary not so convincing bees results on x230 https://forum.qubes-os.org/t/bees-and-brtfs-deduplication/20526/10

tlaurion commented 5 months ago

The previously mentioned tests (the lower time the better):

Test suite run on LVM: 1:01:00

Test suite run on btrfs: 1:44:00 and in one test VM failed to start within 90s timeout

Test suite run in XFS: 1:01:00

Both tests were on the same real laptop, and same software stack (besides the partitioning). Furthermore, in another run, fstrim / timed out on btrfs after 2min (that was after installing updates in all templates, so there probably was quite a bit to trim, but still, that's a ~200GB SSD, so not that big and not that slow). Seeing this results, I've rerun it several times, but got similar results.

Different test, much less heavy on I/O:

LVM: 0:34:49

btrfs: 0:33:27

So, at least not a huge regression, but not a significant improvement either.

It is also worth noting that whatever benchmark that might have happened prior of kernel 6.1+ release might need to be redone, since whatever bottleneck on IO that affected the kernel prior has now vanished and penalized brtfs vs lvm which improvements is stalling.

Those were running on kernel-latest at that time, so at least 6.3.x. But even then, if only the very latest kernel version would start to work fine, it's not enough for switching the default. The LTS version needs to work well for that. In any case, it's way to late for this for R4.2. We may re-test, and re-consider for R4.3.

One really important question was not asked: this is real hardware testing. What hardware and related HCL those results were obtained from? @marmarek can you update that comment? @DemiMarie updated OP comment pointing to that text report. Would be nice that we understand once and for all what changed since then to replicate and understand the bottlenecks, if present explains something.

As stated in my last comment having tested bees in old hardware, old hardware ram speed <> pci <> SSD never bottlenecks to ssd. Recents experiments under heads with newer cryptsetup and dmsetup AND kernel changed a lot. So instead of chasing the white rabbit forever, we kinda need to know what was tested before planning retesting and understand what is wrong even in current and past defsukt configs to understand what went wrong and where.

I can only stare again that on old hardware, there is direct and massive improvement just by switching from TLVM to BTRFS but people prefer the defaults encouraged. But if those defaults are wrong in either case and not perceived on newer hardware... What exact are we testing and what are supposed to be experienced improvements?

fstrim called upon snapshot rotation in those tests? Not anymore?
discards enabled on fstab after those tests were made?
what were the luks/lvm passed down configs back then? Have they changed?
cryptsetup/kernel/lvm/dmsetup now uses async backend by default. Optimized defaults?

If changes benefit some hardware at the cost of others, we need to know.

An example of such improvement having made it's way upstream and downstream are Kernel Io queues being bypassed for read and write ops, from luks to lvm to kernel to ssd so that less overhead in IO overhead happens at each op https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ Those changes landed in Linux kernel 5.9+ but luks needs to be configured to apply the quirk.

Anyway. I think all of this would happen in testing section of the forum with ISOs to test applying different default options so willing testers can just run a script and give needed output. Otherwise I feel ticket won't go anywhere without diversified testing in link with HCL.

no-usernames-left commented 5 months ago

Might dedup become unnecessary if layered templates become a thing?

kocmo commented 1 month ago

Additionally, LVM thin pools do not support checksums. This can be achieved via dm-integrity, but that does not support TRIM.

Ext4 has metadata checksums enabled since e2fsprogs 1.43, so at least some filesystem integrity checking is happening inside VMs:

root@sys-firewall /h/user# dumpe2fs /dev/mapper/dmroot | grep metadata_csum
dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem features:      ... metadata_csum ...

Does Qubes have mechanisms to report kernel errors from VMs and dom0 to the user, via toast notifications or so?

In Qubes 4.2.1, 4.2.2 dom0 systemd journal continuously gets repeated PAM error messages :-/

dom0 pkexec[141170]: PAM unable to dlopen(/usr/lib64/security/pam_sss.so): /usr/lib64/security/pam_sss.so: cannot open shared object file: No such file or directory
dom0 pkexec[141170]: PAM adding faulty module: /usr/lib64/security/pam_sss.so

tlaurion commented 1 day ago

@DemiMarie https://github.com/tasket/wyng-backup/issues/211

With proper settings, I confirm btrfs to be way better performance wise then lvm2 with large qubes, clones+specialization (qusal used), where my tests of beesd have stopped momentarily by lack of time.

DemiMarie commented 1 day ago

@tlaurion Can you provide proper benchmarks? Last benchmarks by @marmarek found that BTRFS was not faster than LVM2, which is why LVM2 is still the default.

QubesOS / qubes-issues

Switch default pool from LVM to BTRFS-Reflink #6476

5053

6297

6184

3244 (really a kernel bug)

5826

3230 ― since reflink files are ordinary disk files we could just rename them without needing a copy

3964