DemiMarie commented 3 years ago

The problem you're addressing (if any)

In R4.0, the default install uses LVM thin pools. However, LVM appears to be optimized for servers, which results in several shortcomings:

Space exhaustion is handled poorly, requiring manual recovery. This recovery may sometimes fail.
It is not possible to shrink a thin pool.
Thin pools slow down system startup and shutdown.

Additionally, LVM thin pools do not support checksums. This can be achieved via dm-integrity, but that does not support TRIM.

Describe the solution you'd like

I propose that R4.3 use BTRFS+reflinks by default. This is a proposal ― it is by no means finalized.

Where is the value to a user, and who might that user be?

BTRFS has checksums by default, and has full support for TRIM. It is also possible to shrink a BTRFS pool without a full backup+restore. BTRFS does not slow down system startup and shutdown, and does not corrupt data if metadata space is exhausted.

When combined with LUKS, BTRFS checksumming provides authentication: it is not possible to tamper with the on-disk data (except by rolling back to a previous version) without invalidating the checksum. Therefore, this is a first step towards untrusted storage domains. Furthermore, BTRFS is the default in Fedora 33 and openSUSE.

Finally, with BTRFS, VM images are just ordinary disk files, and the storage pool the same as the dom0 filesystem. This means that issues like #6297 are impossible.

Describe alternatives you've considered

None that are currently practical. bcachefs and ZFS are long-term potential alternatives, but the latter would need to be distributed as source and the former is not production-ready yet.

Additional context

I have had to recover manually from LVM thin pool problems (failure to activate, IIRC) on more than one occasion. Additionally, the only supported interface to LVM is the CLI, which is rather clumsy. The LVM pool requires nearly twice the amount of code as the BTRFS pool, for example.

Relevant documentation you've consulted

man lvm

Related, non-duplicate issues

5053

6297

6184

3244 (really a kernel bug)

5826

3230 ― since reflink files are ordinary disk files we could just rename them without needing a copy

3964

everything in https://github.com/QubesOS/qubes-issues/search?q=lvm+thin+pool&state=open&type=issues

Most recent benchmarks: https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1689640103

iamahuman commented 3 years ago

It might be a good idea to compare performance (seq read, rand read, allocation, overwrite, discard) between the three backends. See: #3639

GWeck commented 3 years ago

With regard to VM boot time, LVM storage pool was slightly faster than BTRFS, but this may be still within the margin of error (LVM: 7.43 s versus BTRFS: 8.15 s for starting a debian-10-minimal VM).

DemiMarie commented 3 years ago

Marking as RFC because this is by no means finalized.

tlaurion commented 3 years ago

@DemiMarie following comment I'm posting deconstructed thoughts here.

No problem with QubesOS searching the best FS to switch for on 4.1 release, and questioning partition scheme, but i'm a bit lost on the direction of QubesOS 4.1 and the goals here (stability? performance? backups? portability? security?)

I was kind of against having dom0 having seperate LVM pool for space constrains resulting of the change, but agreed and accepted that the pool metadata exhaustion possibility was a real tangible issue that hit me a lot before, for which resolution is sketchy and still not advertised in widget correctly for users simply upgrading and being hit with.

The fix in new install resolved the issue, while QubesOS decided to split the dom0 pool out of main pool, so fixing pool issues on the system would be more easy for the end user or non existent.

I am just not so sure why switching filesystem is on point now, where LVM thin provisioning seems to fit the goal, but willing to hear more about the advantages.

I am interested into the reasoning for such a switch, and the probabilities of doing so, since I am really interested into pushing wyng-backups farther, inside/outside of Heads inside/outside of QubesOS, of grant/self funding the work so that QubesOS metadata would be included in wyng-backups, permitting restore/verification/fresh deployment/revert from local(oem recovery VM)/remote source, just applying diff where required from ssh remote red only mountpoint.

This filesystem choice seems to be less relevant then what can make those changes consume dom0 LVM which should be excludedof dom0 so that dmverity can be setuped under Heads/Safeboot. But this is irrelevant to this ticket.

DemiMarie commented 3 years ago

I am just not so sure why switching filesystem is on point now, where LVM thin provisioning seems to fit the goal, but willing to hear more about the advantages.

The advantages are listed above. In short, a BTRFS pool is more flexible, and it offers possibilities (such as whole-system snapshots) that I do not believe are possible with LVM thin provisioning. BTRFS also offers flexible quotas, and can always recover from out of space conditions provided that a small amount of additional storage (such as a spare partition set aside for the purpose) is available. Furthermore, BTRFS checksumming and scrubbing appear to be useful. Finally, new storage can be added to and removed from a BTRFS pool at any time, and the pool can be shrunk as well.

BTRFS also has disadvantages: its throughput is worse than LVM, and there are reports of bad performance on I/O heavy workloads such as QubesOS. Benchmarks and user feedback will be needed to determine which is better, which is why this is an RFC.

I am interested into the reasoning for such a switch, and the probabilities of doing so, since I am really interested into pushing wyng-backups farther, inside/outside of Heads inside/outside of QubesOS, of grant/self funding the work so that QubesOS metadata would be included in wyng-backups, permitting restore/verification/fresh deployment/revert from local(oem recovery VM)/remote source, just applying diff where required from ssh remote red only mountpoint.

I believe that btrfs send and btrfs receive offer the same functionality as wyng-backups, but am not certain as I never used either. As far as the probability: this is currently only a proposal, and I am honestly not sure if switching this close to the R4.1 release date is a good idea. In any case, LVM will continue to be fully supported ― this just flips the default in the installer.

tasket commented 3 years ago

@DemiMarie There are many questions swirling around advanced storage on Linux, but I think the main ones applicable here are about reliability and performance. Btrfs and Thin LVM appear to offer trade-offs on those qualities, and I don't think its necessarily a good move to switch the Qubes default for a slower storage scheme at this point; storage speed is critical for Qubes' usability and large disk image files with random write patterns are Btrfs' weakest point.

Running out of space is probably Thin LVM's weakest point, although this can be pretty easily avoided. For one, dom0 root is moving to a dedicated pool in R4.1, which will keep admin working in most situations. Adding more protections to the domU pool can also be done with some pretty simple userland code. (For those who are skeptical, note that this is the general approach taken by Stratis.)

The above mentioned Btrfs checksums is a nice-to-have feature against accidental damage, but it unfortunately does not come close to providing authentication. To my knowledge, no CRC mode can do that even if its encrypted. Any attacker able to induce some calculated change in an encrypted volume would probably find the malleability of encrypted CRCs to be little or no obstacle. IMHO, the authentication aspect of the proposal is a non-starter. (BTW, it looks like dm-integrity may be able to do this now along with discard support, if its journal mode supports internal tags.)

As for backups, Wyng basically exists because tools like btrfs send are constrained to using the same back end (Btrfs with admin privileges) which severely narrows the user's options for backup destinations. Wyng can also be adapted to any storage source that can create snapshots and report their deltas (Btrfs included).

The storage field also continues to evolve in interesting ways: Red Hat is creating Stratis while hardware manufacturers implemented NVMe objects and enhanced parallelism. Stratis appears to be based on none other than Thin LVM's main components (dm-thin, etc) in addition to dm-integrity, with XFS on top; all the layers are tied together to respond cohesively from a single management interface. This is being developed to avoid Btrfs maintenance and performance pitfalls.

I think some examination of Btrfs development culture may also be in order, as it has driven Red Hat to exasperation and a decision to drop Btrfs. I'm not sure just what it is about accepting Btrfs patches that presents a problem, but it makes me concerned that too much trust has been eroded and that Btrfs may become a casualty in 'storage wars' between an IBM / Red Hat camp and what I'd call an Oracle-centric camp.

FWIW, I was one of the first users to show how Qubes could take advantage of Btrfs reflinks for cloning and to request specific reflink support. Back in 2014, it was easy to assume Btrfs shortcomings would be addressed fairly soon, since those issues were so obvious. Yet they are still unresolved today.

My advice at this point is to wait and see – and experiment. There is an unfortunate dearth of comparison tests configured in a way that makes sense; they usually compare Btrfs to bare Ext4, for example, and almost always overlook LVM thin pools. So its mostly apples vs oranges. However, what little benchmarking I've seen of thin LVM suggests a performance advantage vs Btrfs that would be too large to ignore. There are also Btrfs modes of use we should explore, such as any performance gain from disabling CoW on disk images; if this were deemed desirable then the Qubes Btrfs driver would have to be refactored to use subvolume snapshots instead of reflinks. An XFS reflink comparison on Qubes would also be very interesting!

DemiMarie commented 3 years ago

@DemiMarie There are many questions swirling around advanced storage on Linux, but I think the main ones applicable here are about reliability and performance. Btrfs and Thin LVM appear to offer trade-offs on those qualities, and I don't think its necessarily a good move to switch the Qubes default for a slower storage scheme at this point; storage speed is critical for Qubes' usability and large disk image files with random write patterns are Btrfs' weakest point.

In retrospect, I agree. That said (as you yourself mention below) XFS also supports reflinks and lacks this problem.

Running out of space is probably Thin LVM's weakest point, although this can be pretty easily avoided. For one, dom0 root is moving to a dedicated pool in R4.1, which will keep admin working in most situations. Adding more protections to the domU pool can also be done with some pretty simple userland code. (For those who are skeptical, note that this is the general approach taken by Stratis.)

Will it be possible to reserve space for use by discards? A user needs to be able to free up space even if they make a mistake and let the pool fill up.

The above mentioned Btrfs checksums is a nice-to-have feature against accidental damage, but it unfortunately does not come close to providing authentication. To my knowledge, no CRC mode can do that even if its encrypted. Any attacker able to induce some calculated change in an encrypted volume would probably find the malleability of encrypted CRCs to be little or no obstacle. IMHO, the authentication aspect of the proposal is a non-starter. (BTW, it looks like dm-integrity may be able to do this now along with discard support, if its journal mode supports internal tags.)

The way XTS works is that any change (by an attacker who does not have the key) will completely scramble a 128-bit block; my understanding is that a CRC32 with a scrambled block will only pass with probability 2⁻³². That said, BTRFS also supports Blake2b and SHA256, which would be better choices.

As for backups, Wyng basically exists because tools like btrfs send are constrained to using the same back end (Btrfs with admin privileges) which severely narrows the user's options for backup destinations. Wyng can also be adapted to any storage source that can create snapshots and report their deltas (Btrfs included).

Good to know, thanks!

The storage field also continues to evolve in interesting ways: Red Hat is creating Stratis while hardware manufacturers implemented NVMe objects and enhanced parallelism. Stratis appears to be based on none other than Thin LVM's main components (dm-thin, etc) in addition to dm-integrity, with XFS on top; all the layers are tied together to respond cohesively from a single management interface. This is being developed to avoid Btrfs maintenance and performance pitfalls.

I think some examination of Btrfs development culture may also be in order, as it has driven Red Hat to exasperation and a decision to drop Btrfs. I'm not sure just what it is about accepting Btrfs patches that presents a problem, but it makes me concerned that too much trust has been eroded and that Btrfs may become a casualty in 'storage wars' between an IBM / Red Hat camp and what I'd call an Oracle-centric camp.

My understanding (which admittedly comes from a comment on Y Combinator) is that BTRFS moves too fast to be used in RHEL. RHEL is stuck on one kernel for an entire release, and rebasing BTRFS every release became too difficult, especially since Red Hat has no BTRFS developers.

FWIW, I was one of the first users to show how Qubes could take advantage of Btrfs reflinks for cloning and to request specific reflink support. Back in 2014, it was easy to assume Btrfs shortcomings would be addressed fairly soon, since those issues were so obvious. Yet they are still unresolved today.

My advice at this point is to wait and see – and experiment. There is an unfortunate dearth of comparison tests configured in a way that makes sense; they usually compare Btrfs to bare Ext4, for example, and almost always overlook LVM thin pools. So its mostly apples vs oranges. However, what little benchmarking I've seen of thin LVM suggests a performance advantage vs Btrfs that would be too large to ignore. There are also Btrfs modes of use we should explore, such as any performance gain from disabling CoW on disk images; if this were deemed desirable then the Qubes Btrfs driver would have to be refactored to use subvolume snapshots instead of reflinks. An XFS reflink comparison on Qubes would also be very interesting!

That it would be, especially when combined with Stratis. The other major problem with LVM2 (and possibly dm-thin) seems to be snapshot and discard speeds; I expect XFS reflinks to mitigate most of those problems.

tasket commented 3 years ago

Ah, new Btrfs feature... Great! I'd consider enabling one of its hashing modes as being able to support authentication.

I'd still consider the Stratis concept to be more interesting for now, as Qubes' current volume management is pretty similar but potentially even better and simpler due to having a privileged VM environment.

DemiMarie commented 3 years ago

Ah, new Btrfs feature... Great! I'd consider enabling one of its hashing modes as being able to support authentication.

Agreed. While I am not aware of any way to tamper with a LUKS partition without invalidating a CRC, Blake2b is by far the better choice.

I'd still consider the Stratis concept to be more interesting for now, as Qubes' current volume management is pretty similar but potentially even better and simpler due to having a privileged VM environment.

I agree, with one caveat: my understanding is that LUKS/AES-XTS-512 + BTRFS/Blake2b-256 is sufficient to protect against even malicious block devices, whereas dm-integrity is not. dm-integrity is vulnerable to a partial rollback attack: it is possible to rollback parts of the disk without dm-integrity detecting it. Therefore, dm-integrity is not (currently) sufficient for use with untrusted storage domains, which is a future goal of QubesOS.

DemiMarie commented 3 years ago

@tasket: what are your thoughts on using loop devices? That’s my biggest worry regarding XFS+reflinks, which seems to otherwise be a very good choice for QubesOS. Other approaches exist, of course; for instance, we could modify blkback to handle regular files as well as block devices.

0spinboson commented 3 years ago

I really wish the FS's name wasn't a misogynistic slur. That aside, my only experience with it, under 4.0, had my Qubes installation become unbootable, and I found it very difficult to fix, relative to a system built on LVM. And that does strike as relevant to the question whether Qubes switches, while imo this is only partly addressable via improving the documentation (since the other part is the software we have to use to restore).

DemiMarie commented 3 years ago

FS's name wasn't a misogynistic slur

@0spinboson would you mind clarifying which filesystem you are referring to?

tasket commented 3 years ago

Will it be possible to reserve space for use by discards? A user needs to be able to free up space even if they make a mistake and let the pool fill up.

Yes, its simple to allocate some space in a pool using a non-zero thin lv. Just reserve the lv name in the system, make it inactive, and check that it exists on startup.

Further, it would be easy to use existing space-monitoring components to also pause any VMs associated with a nearly-full pool and then show an alert dialog to the user.

it is possible to rollback parts of the disk without dm-integrity detecting it.

I thought the journal mode would prevent that? I don't know it in detail, but something like a hash of the hashes of the last changed blocks, computed with the prior journal entry, would have to be in each journal entry.

what are your thoughts on using loop devices? That’s my biggest worry regarding XFS+reflinks

I forgot they were a factor... its been so long since I've used Qubes in a file-backed mode. But this should be the same for Btrfs, I think.

FWIW, the XFS reflink suggestion was more speculative, along the lines of "What if we benchmark it for accessing disk images and its almost as fast as thin LVM?". The regular XFS vs Ext4 benchmarks I'm seeing suggest it might be possible. Its also not aligned with the Stratis concept, as that is closer to thin LVM with XFS just providing the top layer. (Obviously we can't use Stratis itself unless it supports a mode that accounts for the top layer being controlled by domUs.)

Also FWIW: XFS historically supported a 'subvolume' feature for accessing disk image files, instead of loopdevs. It requires certain IO sched conditions are met before it can be enabled.

0spinboson commented 3 years ago

FS's name wasn't a misogynistic slur

@0spinboson would you mind clarifying which filesystem you are referring to?

'Butterface', was intentional, afaik.

Rudd-O commented 2 years ago

No, it was not. The file system is named btrfs because it means B-tree FS. That the name is often pronounced with a hilarious word may or may not be seen as a pun, but that is on the beholder's eye.

dmoerner commented 2 years ago

Basic question: If I install R4.1 with BTRFS by selecting custom, and then using Anaconda to automatically create the Qubes partitions with BTRFS, is that sufficient for the default pool to use BTRFS-Reflink? Or do I have to do something extra for the "Reflink" part?

rustybird commented 2 years ago

If I install R4.1 with BTRFS by selecting custom, and then using Anaconda to automatically create the Qubes partitions with BTRFS, is that sufficient for the default pool to use BTRFS-Reflink?

Yes

noskb commented 2 years ago

Change defaults: Btrfs should use 'dup' metadata on encrypted devices #319

tasket commented 1 year ago

I don't know if there's a separate issue for this, but possible Btrfs + fscrypt integration in Fedora seem relevant here:

The system by default will be encrypted with an encryption key stored in the TPM and bound to the signatures used to sign the bootloader/kernel/initrd, https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/LYUABL7F2DENO7YIYVD6TRYLWBMF2CFI/

Rudd-O commented 1 year ago

Seconded!

brendanhoar commented 1 year ago

I don't know if there's a separate issue for this, but possible Btrfs + fscrypt integration in Fedora seem relevant here:

The system by default will be encrypted with an encryption key stored in the TPM and bound to the signatures used to sign the bootloader/kernel/initrd

[If a separate ticket/forum discussion is opened for this, I will move this comment there.]

FWIW, I would want to avoid requiring TPM-stored data encryption keys by default, as it ties the user's data to system hardware that can fail.

The approach does makes some sense in an enterprise setting, primarily ensuring that data on storage devices separated from the machine with the TPM are provably unrecoverable during reuse/e-cycling. Business data is often backed up off enterprise endpoints by tools such as One Drive (e.g. often subsuming the Documents folder on windows in recent deployments), so they usually have the hardware failure risk mitigated via backups by default.

For non-enterprise users, esp. the user audience for Qubes, there should be flexibility in how keys are handled. Storage-subsystem-detached keys are useful for some, but the user must make the choices on privacy/security vs data-loss risk.

B

DemiMarie commented 1 year ago

Confirmed: dm-thin is optimized for in-place overwrites of already-provisioned blocks, not breaking sharing or provisioning new blocks.

tasket commented 1 year ago

May also explain why the default metadata volume sizes seem insufficient.

Aside from performance, I'd have to say my experience with Btrfs has been more stable than with tLVM. When something does go wrong with Btrfs, its easier to diagnose and recover.

There is also the problem of extra wear from write-amplification. When I look at stats other people have posted for the nvme models I'm using, I'm seeing much higher rates of wear-out on my drives (that have had Qubes on tLVM). Compared to people reporting a similar amount of lifetime read-access GB, my own drives are seeing > 3X the wear-indicator values.

Edit: FWIW, any dynamic thin-provisioning system will have to do most of the allocation work that a full filesystem does. tLVM would have turned out better had they started with a filesystem model (like Ext4) and removed the bits that a volume manager didn't use.

kalkin commented 1 year ago

As far as I understand currently there is no code for BTRFS support? Or can we just use the file-image based code and patch it to use cp --reflink?

rustybird commented 1 year ago

@kalkin There's a newer file-reflink storage driver that's automatically used for the non-default Btrfs installation layout since R4.0.1.

tlaurion commented 10 months ago

EDIT: answer from Marek under https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1689640103 :

In any case, it's way to late for this for R4.2. We may re-test, and re-consider for R4.3.

@andrewdavidwong Release TBD means not planned for 4.2? This ticket title should be updated for 4.3 and performances comparisons for default installs should be taken into consideration as well. It is to be noted that most OSes are moving away of TLVM for XFS/BRTFS.

Some history:

Fedora 33 swtiched to BRTFS partitioning by default for workstations, having swap and rootfs in different LUKS containers and still default to BRTFS as Fedora 38.
Ubuntu Lunar Lobster followed steps and decided to switch to BRTFS as well.
OpenSuse followed step as well.
Debian still relies on a thin LVM.

Also, ext4 as fixed inodes at time of formatting the partition, as opposed to XFS/BRTFS which are dynamic, which Qubes extends causing issues for some users.

I interested into knowing what Heads should support in the future for space constaints reasons and to prepare for changes. Also, Qubes is first class citizen, but not the only OS deployed. I was wondering what are the directions of QubesOS considering that BRTFS is default since Fedora33.

cross-posts linking to each other:

https://forum.qubes-os.org/t/btrfs-and-qubes-os/6967/28?u=insurgo
https://github.com/osresearch/heads/issues/1474
Speed comparison of LUKS+TLVM+EXT4 vs LUKS+BRTFS: https://forum.qubes-os.org/t/ext4-vs-btrfs-performance-on-qubes-os-installs/13585

DemiMarie commented 10 months ago

@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.

tlaurion commented 10 months ago

@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.

@DemiMarie those were documented? They should be referenced here. From my experience, the benefits are definitely overpowering TLVM. Care to share some I/O intensive workloads examples?

andrewdavidwong commented 10 months ago

Release TBD means not planned for 4.2?

The "Release TBD" milestone no longer exists (explanation).

DemiMarie commented 10 months ago

@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.

@DemiMarie those were documented? They should be referenced here. From my experience, the benefits are definitely overpowering TLVM. Care to share some I/O intensive workloads examples?

@marmarek did the benchmarks. IIRC he found that BTRFS and XFS were not any faster than thin LVM in a workload (Qubes OS openQA tests, IIRC) that should have favored them.

tlaurion commented 10 months ago

@tlaurion BTRFS appears to have significant problems with I/O intensive workloads. There are known problems that can result in unbounded latency spikes.

@DemiMarie those were documented? They should be referenced here. From my experience, the benefits are definitely overpowering TLVM. Care to share some I/O intensive workloads examples?

@marmarek did the benchmarks. IIRC he found that BTRFS and XFS were not any faster than thin LVM in a workload (Qubes OS openQA tests, IIRC) that should have favored them.

I'm surprised and looking forward to read the benchmarks @marmarek.

Meanwhile Bing reports:

I searched the web for "Qubes OS benchmark Btrfs and XFS over LVM" and found :

BTRFS and Qubes OS: This is a forum discussion where users share their experiences and opinions about using Btrfs as the filesystem for Qubes OS. Some users report a dramatic improvement in performance, especially when shutting down or backing up large VMs, compared to using LVM+Ext4. Others compare Btrfs with XFS and discuss the trade-offs and benefits of each filesystem. One user explains that Btrfs is faster because it does not need to activate the volumes before reading or writing them, unlike LVM.

Does installing Qubes OS on a btrfs file system offer any advantage?: This is another forum discussion where a user asks whether using Btrfs for Qubes OS has any advantage over Ext4. The user lists some drawbacks of Btrfs, such as no compression, slower speed, and no self-healing. Other users reply with links to benchmarks, mailing list posts, and GitHub issues that compare Btrfs with other filesystems and storage drivers. Some users suggest using XFS or ZFS instead of Btrfs or Ext4.

BTRFS performance compared to LVM+EXT4 with regards to database workloads: This is a blog post that presents the results of a benchmark test that compares Btrfs and LVM+Ext4 in terms of database performance. The test uses PostgreSQL as the database engine and measures the read and write speeds before and after taking a snapshot. The post concludes that Btrfs outperformed LVM in every benchmark, with higher scores in both initial and post-snapshot reads and writes. The post also explains some of the features and limitations of Btrfs, such as copy-on-write, checksums, compression, and fragmentation.

I hope this helps you find the information you are looking for. 😊

Source: Conversation with Bing, 8/22/2023 (1) BTRFS and Qubes OS - General Discussion - Qubes OS Forum. https://forum.qubes-os.org/t/btrfs-and-qubes-os/6967. (2) Does installing Qubes OS on a btrfs file system offer any advantage?. https://forum.qubes-os.org/t/does-installing-qubes-os-on-a-btrfs-file-system-offer-any-advantage/12282. (3) BTRFS performance compared to LVM+EXT4 with regards to database .... https://blog.pythian.com/btrfs-performance-compared-lvmext4-regards-database-workloads/. (4) Examining btrfs, Linux’s perpetually half-finished filesystem. https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/.

And when asked to summarize those threads

From what I can gather from the threads, the conclusion between Btrfs and LVM is that Btrfs has some advantages over LVM in terms of performance, flexibility, and features, but also some drawbacks and risks. Here are some of the main points:

Btrfs is faster than LVM because it does not need to activate the volumes before reading or writing them, unlike LVM. This means that Btrfs can access the data more quickly and efficiently, especially when dealing with large or multiple VMs.

Btrfs also supports snapshots, which are a way of creating a copy of the data at a certain point in time. Snapshots can be used for backup, recovery, or testing purposes. Btrfs snapshots are more efficient than LVM snapshots because they use copy-on-write, which means that only the changes are stored, not the whole data. This saves space and reduces overhead.

Btrfs has some other features that LVM does not have, such as checksums, compression, deduplication, and RAID levels. These features can improve the reliability, security, and performance of the data. However, some of these features are still experimental or unstable and may cause data corruption or loss. Therefore, using Btrfs requires caution and regular backups.

Btrfs also suffers from fragmentation, which is when the data is scattered across different locations on the disk. Fragmentation can slow down the read and write speeds and affect the performance of the system. Btrfs has a defragmentation tool that can be used to reduce fragmentation, but it may also cause problems with snapshots or RAID.

LVM is more stable and mature than Btrfs and has been widely used for a long time. LVM is also compatible with most filesystems and operating systems. LVM has a simpler and more familiar interface and commands than Btrfs. LVM is less likely to cause data loss or corruption than Btrfs.

In summary, Btrfs and LVM have different strengths and weaknesses and there is no definitive answer to which one is better. It depends on the user's needs, preferences, and risk tolerance. Some users may prefer Btrfs for its speed and features, while others may prefer LVM for its stability and compatibility.

tlaurion commented 10 months ago

@DemiMarie @marmarek

It is also worth noting that whatever benchmark that might have happened prior of kernel 6.1+ release might need to be redone, since whatever bottleneck on IO that affected the kernel prior has now vanished and penalized brtfs vs lvm which improvements is stalling.

6.2 real life testing on database situation shows some interesting hypothesis/validation showing important boosts that needs to be read

https://blog.pythian.com/btrfs-performance-compared-lvmext4-regards-database-workloads/

Also:

According to Phoronix, Btrfs has enjoyed some performance improvements with Linux 6.x kernel after the major bug fixes. Some of the articles that report on these improvements are:

Btrfs Brings Some Great Performance Improvements With Linux 6.1: This article claims that Btrfs has introduced async buffered writes and IO_uring support, which increased the direct IO limit for read and improved the throughput by 3x on a sample workload. It also mentions a significant FIEMAP speed improvement and a change to the algorithm for how extents are enumerated that leads to orders of magnitude speed boost.

Btrfs Enjoys More Performance With Linux 6.3: This article states that Btrfs has added some performance optimizations and new features, such as big timestamps, FIEMAP speed improvement, and async buffered writes for compressed extents. It also reports some 3~10x speedups in some benchmarks.

Btrfs In Linux 6.5 May Bring A Cumulative Performance Improvement For Metadata-Heavy Operations: This article says that Btrfs has brought a cumulative performance improvement for metadata-heavy operations, such as reading extent buffer in one-go, simplifying IO tracking and bio submission, and avoiding unnecessary reads in scrub code.

These articles suggest that Btrfs has improved its performance by addressing some of the bugs and limitations that affected its previous versions. However, they also acknowledge that Btrfs still has some challenges and drawbacks, such as RAID56 issues, fragmentation, balance operations, and stability. Therefore, the performance of Btrfs may vary depending on the workload type, the disk layout, the compression algorithm, and the mount options. 😊

Source: Conversation with Bing, 8/22/2023 (1) Btrfs Brings Some Great Performance Improvements With Linux 6.1 - Phoronix. https://www.phoronix.com/news/Linux-6.1-Btrfs. (2) Btrfs Enjoys More Performance With Linux 6.3 - Phoronix. https://www.phoronix.com/forums/forum/software/general-linux-open-source/1374178-btrfs-enjoys-more-performance-with-linux-6-3-including-some-3~10x-speedups. (3) Btrfs In Linux 6.5 May Bring A Cumulative Performance ... - Phoronix. https://www.phoronix.com/news/Btrfs-Linux-6.5. (4) Btrfs - Phoronix. https://www.phoronix.com/search/Btrfs. (5) undefined. https://btrfs.wiki.kernel.org/index.php/Main_Page.

~I see that 4k templates are now available for testing under 4.2.~ Edit: No. They still don't exist, bas bing.

I will try to participate providing real life testing under relevant qubesos forum threads.

marmarek commented 10 months ago

The previously mentioned tests (the lower time the better):

Test suite run on LVM: 1:01:00
Test suite run on btrfs: 1:44:00 and in one test VM failed to start within 90s timeout
Test suite run in XFS: 1:01:00

Both tests were on the same real laptop, and same software stack (besides the partitioning). Furthermore, in another run, fstrim / timed out on btrfs after 2min (that was after installing updates in all templates, so there probably was quite a bit to trim, but still, that's a ~200GB SSD, so not that big and not that slow). Seeing this results, I've rerun it several times, but got similar results.

Different test, much less heavy on I/O:

LVM: 0:34:49
btrfs: 0:33:27

So, at least not a huge regression, but not a significant improvement either.

It is also worth noting that whatever benchmark that might have happened prior of kernel 6.1+ release might need to be redone, since whatever bottleneck on IO that affected the kernel prior has now vanished and penalized brtfs vs lvm which improvements is stalling.

Those were running on kernel-latest at that time, so at least 6.3.x. But even then, if only the very latest kernel version would start to work fine, it's not enough for switching the default. The LTS version needs to work well for that. In any case, it's way to late for this for R4.2. We may re-test, and re-consider for R4.3.

tasket commented 10 months ago

Thanks for the test results. TLVM has outsized lag times with snapshot rotation for larger-sized volumes (>32GB data content), so that should be factored in as well. This is probably why my subjective experience says that Btrfs feels faster than TLVM.

FWIW, fstrim should generally be avoided. Its default discarding pattern is aggressive and not used by the filesystems themselves in discard mode for a reason, as it creates higher fragmentation and greater demand on metadata resources. In my experience, fstrim use can also contribute to or trigger a TLVM failure.

I think stability is an overriding concern with choice of storage system, and I would put Btrfs head and shoulders above either XFS or TLVM in that regard. My PC experience over the past decade has been mostly on Btrfs and (because of Qubes) TLVM, and the difference has become stark.

OTOH, I realize performance is a sensitive issue for Qubes, which has suffered a negative efficiency trend. However, I'd also consider how that trend compares to the trend for PC systems in general when weighing performance vs. stability priorities.

DemiMarie commented 10 months ago

FWIW, fstrim should generally be avoided. Its default discarding pattern is aggressive and not used by the filesystems themselves in discard mode for a reason, as it creates higher fragmentation and greater demand on metadata resources. In my experience, fstrim use can also contribute to or trigger a TLVM failure.

What do you recommend instead? Should the filesystem be mounted with the discard option? Not discarding at all is not an option because it causes space leaks.

tasket commented 10 months ago

BTW, even when you want to run fstrim manually for some maintenance objective and you reduce the granularity with --minimum (which does help), it is still hitting metadata resources with new deltas in a very short period of time, and the volume(s) might not be small. Compare that to snapshot rotation, where deltas accumulate via routine user app use or system updates, and the sudden changes are mainly limited to removing deltas (freeing metadata space) when snapshots are removed.

All of the CoW/snapshot capable systems have issues with surges in metadata use. NTFS is famous for gradually degrading in performance and becoming unbearably slow for minutes or even days. Btrfs is a bit closer to NTFS in this, and I wonder if they have made a better trade-off for uptime vs performance. ZFS is reported to degrade as the number of snapshots increase. However, TLVM's issues haven't been tempered like these other systems; one gets strange inconsistencies that have to be diagnosed before applying a course of disjointed home remedies.

What do you recommend instead? Should the filesystem be mounted with the discard option?

@DemiMarie Definitely use the mount option, which I think is standard practice now. The mount option helps moderate fragmentation resulting from discards. I think fstrim should remain something that is only invoked manually when an admin wants to address a special case (such as having forgotten to mount with 'discard').

Some recommendations:

The Btrfs 'nodesize' setting affects contention and thus latency for metadata-intensive operations, of which our fragmented VM image files require in abundance. It is reasonable to assume the 16KB default is not optimized for active disk image files. Therefore, I suggest testing Qubes usage with 'nodesize' set to a smaller value like 8KB or 4KB. This can be taken together with other optimizations:

Btrfs 'no-holes' and 'skinny-metadata' options
4KB guest fs block size
Use F2FS or XFS for guest private vols, both of which have considerably lower write amplification rates vs EXT4
For guest root fs, compression should be considered

Some experimentation could be done beyond that, however. To reduce the level of data fragmentation resulting from random writes (not just discards), ideally we would want to steak out a middle-ground between the 4KB minimum extent size that Btrfs normally uses and the 64KB (but usually >256KB) minimums that TLVM uses. Also, active defragmentation and deduplication (the two often having opposite effects) should not be completely discounted; periodically running them both with extent-size thresholds that complement each other could improve overall efficiency and responsiveness.

Rudd-O commented 10 months ago

Would love to see a TLVM / btrfs / ZFS (ashift 12) performance comparison now that we have the three drivers in upstream.

tlaurion commented 10 months ago

Would love to see a TLVM / btrfs / ZFS (ashift 12) performance comparison now that we have the three drivers in upstream.

Would love to see that! Little experiment challenging Bing for an hour.. she chose brtfs first and then reconsidering the results compared against zfs and then chose zfs... When asked to produce a table of her findings:

File system	Pros	Cons	References
TLVM	- Offers more flexibility and scalability for creating and managing virtual disks for Qubes OS and the VMs. - Uses thin provisioning, which means it only allocates the space that is actually used by the VMs, instead of reserving all the space in advance.	- Adds more complexity and overhead to the system, and may not work well with some file systems that do not support discard. - Uses a lot of metadata to keep track of the changes in the data, especially when using features like snapshots, which can grow too large and slow down the system or cause errors. - May be more prone to corruption if the system is hard reset.	[1], [2], [3], [4]
btrfs	- Supports many advanced features that can improve the performance and reliability of Qubes OS and the VMs, such as compression, encryption, deduplication, checksums, snapshots, subvolumes, RAID, etc. - Supports resizing and thin provisioning natively, which means it does not need LVM layer to work with Qubes OS. - Has a good balance between large and small extent sizes, which can reduce fragmentation and metadata use.	- Has some issues with SELinux relabeling and root filesystem resizing, which may require some manual intervention or workaround to solve. - Uses a lot of metadata and space to keep track of the changes in the data, which may require more frequent maintenance and cleanup to avoid performance degradation or errors. - Is still relatively new and experimental compared to other file systems, which means it may have some unknown bugs or limitations that may cause problems.	[5], [6], [7], [8]
ZFS	- Supports many advanced features that can improve the performance and reliability of Qubes OS and the VMs, such as compression, encryption, deduplication, checksums, snapshots, subvolumes, RAID, etc. - Supports encryption natively, which means it does not need LUKS layer to work with Qubes OS. - Has some unique features that can enhance the caching and logging capabilities of the system, such as ARC, SLOG, L2ARC, etc. - Requires a custom kernel parameter (ashift) to match the disk block size, which can improve the performance and efficiency of ZFS by aligning the data with the physical structure of the disk. Ashift can also prevent wasting space on the disk by avoiding partial blocks.	- Uses a lot of memory and CPU resources to perform its operations, which may affect the performance and availability of other processes on the system. ZFS may also require more tuning and optimization to work well with Qubes OS and the VMs. - Is still relatively new and experimental for Linux systems, which means it may have some compatibility and stability issues with Qubes OS and its features. ZFS may also have some bugs or limitations that may cause problems.	[9], [10], [11], [12]

[1]: TLVM: Thinly-provisioned Logical Volume Manager - https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/thin-provisioning.html
[2]: TLVM: A New Logical Volume Manager for Linux - https://www.usenix.org/system/files/conference/atc16/atc16-paper-lee.pdf
[3]: TLVM: A New Logical Volume Manager for Linux - https://www.youtube.com/watch?v=0QZLw8nY4gk
[4]: TLVM: How to recover from a hard reset? - https://github.com/QubesOS/qubes-issues/issues/2729
[5]: btrfs: B-tree file system - https://btrfs.wiki.kernel.org/index.php/Main_Page
[6]: btrfs: SELinux relabeling issue - https://github.com/QubesOS/qubes-issues/issues/1817
[7]: btrfs: Performance tuning - https://btrfs.wiki.kernel.org/index.php/Performance_tuning
[8]: btrfs: Root filesystem resizing issue - https://github.com/QubesOS/qubes-issues/issues/1120
[9]: ZFS: Zettabyte File System - https://openzfs.github.io/openzfs-docs/
[10]: ZFS: ashift parameter - https://openzfs.github.io/openzfs-docs/Basic%20Concepts/vdevs.html#ashift
[11]: ZFS: Performance tuning - https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/index.html
[12]: ZFS: Compatibility and stability issues with Linux - https://github.com/openzfs/zfs/issues

(For what it's worth!)

DemiMarie commented 10 months ago

@tlaurion links 2 and 3 are 404, and links 4, 6, and 8 are completely irrelevant. Also ZFS is not experimental; there are plenty of production workloads (like Let’s Encrypt’s main databases!) that use it.

tlaurion commented 10 months ago

Also ZFS is not experimental; there are plenty of production workloads (like Let’s Encrypt’s main databases!) that use it.

@DemiMarie ashift and alignment to disk blocks is interesting property there, as well as dodging complexity of having LUKS enclosed BRTFS or current LUKS->TLVM->volumes complexity and failures. I'm not a FS expert, but I have suffered from TLVM qubes defaults in the past and really looking forward for a switch for whatever better. TLVM is known to fail the user, it's not a question of how, which is answered everywhere, but when. Your praises for BRTFS were proven by testing by the community, your call was heard by other projects, now the question is where we go to and what will be proposed to the user next. TLVM still in 4.3 or ZFS/BRTFS.

Only testing under different use case scenarios will answer that question, where let's encrypt database scenario doesn't correspond to server Xen virtualization servers, which our use case more correspond to. I wish we had some Xen infra admins of the sort chiming in to brain dump what failed and worked for them. That input would be significative. Other inputs are neither internally valid nor generalizable to quebesos, unfortunately, which is basically what internet answers to outside of quebesos direct use case testing today.

andrewdavidwong commented 10 months ago

@tlaurion: FWIW, I don't think it's very useful to post large quantities of ChatGPT output in qubes-issues, unless you're willing to say, at minimum, that you've personally vetted the output and agree with it.

DemiMarie commented 10 months ago

@tlaurion: FWIW, I don't think it's very useful to post large quantities of ChatGPT output in qubes-issues, unless you're willing to say, at minimum, that you've personally vetted the output and agree with it.

Exactly. Such vetting clearly was not done here, as shown by the broken and irrelevant links.

Rudd-O commented 10 months ago

Seconding Demi here. ZFS is not experimental — the only reason it's not included in mainline is because it's not GPL, not because it's "waiting for the bugs to shake out". Even though it's not in mainline, some distributors have been known to distribute ZFS in compliance to all licensing agreements (well, as compliant as it can be when the current practice is that distributing non-GPL code or derivatives is not okay, but loading non-GPL code into the kernel is long-standing practice that has given rise to no legal claims for decades).

no-usernames-left commented 5 months ago

Chiming in here to say that, IMHO, we should be aiming for ZFS instead of Btrfs, "Linux's perpetually half-finished filesystem".

Not only would this add robust data integrity guarantees (data+metadata checksumming as well as always being consistent on disk), it would make snapshots and backup/restore much easier and cleaner.

tasket commented 2 months ago

I usually avoid ZFS for the same reasons Linus Torvalds does; I simply don't trust Oracle and large corporations are getting bolder with their open source IP rug-pulls (see Red Hat, not to mention Oracle's past attempt with Java APIs).

For people who do trust Oracle, I remain open-minded about the possibility of supporting OpenZFS in Wyng backup. But I also don't see a clear cut way to obtain image file metadata there; in that case Wyng only performs as well as a typical incremental backup. Maybe someone could enlighten me, but AFAICT ZFS is not currently providing features that would enable efficient metadata-driven backups to non-ZFS filesystems.

The idea Btrfs is "perpetually unfinished" doesn't ring true; the basic extent and file allocation format has been described as finished by the Btrfs devs, while they also state upfront that changes are still being made. None of the popular OS-running filesystems have been finalized – see XFS as a particularly old example where significant features are still being added.

The Btrfs format has some amazing qualities, for example the ability to describe a volume (or sub-volume) in a way that can replicate deduplication/reflinks of file data on a different Btrfs filesystem (e.g. btrfs-send knows what data is shared by reflinked files). The reason why is that everything a file/inode directly references in Btrfs is considered a logical entity, so its easy to transfer while retaining its original identity (and when files reference them more than once, it follows that is easy to replicate as well). Very promising stuff, IMO.

Finally, versus thin-lvm, Btrfs has been a quantum leap in reliability for me; I've used and provided support for both for as long as each has had a Qubes driver (longer than that for Btrfs) and there is no question in my mind about the difference.

no-usernames-left commented 2 months ago

I simply don't trust Oracle and large corporations are getting bolder with their open source IP rug-pulls (see Red Hat, not to mention Oracle's past attempt with Java APIs).

I don't trust Oracle either, but the fact is that OpenZFS was forked from before Oracle acquired Sun and closed the source; they have no claim to OpenZFS and therefore no trust in Oracle is required.

And what happened with CentOS? It was immediately forked when Red Hat pulled the rug, with no significant loss other than a change of name.

The fact that there is no such thing as fsck.zfs is itself huge; being always consistent on-disk is a total gamechanger.

rustybird commented 2 months ago

One exciting feature coming up in Btrfs is fscrypt support ("file-based encryption"). The reason I find it exciting in a Qubes OS context is because they're making it extent-based, i.e. not just individual files but individual fragments of files can have their own encryption keys. Reading the tea leaves on their mailing list, I get the impression that this design will - eventually, but probably not from day one - allow reflinking an unencrypted (or differently encrypted) source file into a destination file where subsequent writes will then be encrypted with a new key. Which would be exactly what is needed to support ephemeral snap_on_start volumes, solving https://github.com/QubesOS/qubes-issues/issues/904#issuecomment-1929766110.

tasket commented 2 months ago

I think the facts show that the fate of OpenJDK and derivatives (what most mobile devices rely on) was left to the US courts – which are now very much in a mood to issue new rulings on supposedly settled case law. The whole point of Oracle's case was that forked and reverse-engineered projects could come under their control.

And CentOS was effectively disbanded. The replacements don't have access to RH "proprietary" fixes and modifications that they distribute to RHEL customers (a "new fact" about software that the legal world seems comfortable with). If corporations are ready to attempt de-facto nullification of the GPL then licensing is a top concern.

no-usernames-left commented 2 months ago

If corporations are ready to attempt de-facto nullification of the GPL then licensing is a top concern.

OpenZFS, somewhat infamously, is not GPL.

Another point in favour of OpenZFS is the ease of backing up with zfs send (and, thus, restoration with zfs recv).

tlaurion commented 2 months ago

From https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1689640103 :

The previously mentioned tests (the lower time the better):

Test suite run on LVM: 1:01:00

Test suite run on btrfs: 1:44:00 and in one test VM failed to start within 90s timeout

Test suite run in XFS: 1:01:00

Both tests were on the same real laptop, and same software stack (besides the partitioning). Furthermore, in another run, fstrim / timed out on btrfs after 2min (that was after installing updates in all templates, so there probably was quite a bit to trim, but still, that's a ~200GB SSD, so not that big and not that slow). Seeing this results, I've rerun it several times, but got similar results.

Different test, much less heavy on I/O:

LVM: 0:34:49

btrfs: 0:33:27

So, at least not a huge regression, but not a significant improvement either.

It is also worth noting that whatever benchmark that might have happened prior of kernel 6.1+ release might need to be redone, since whatever bottleneck on IO that affected the kernel prior has now vanished and penalized brtfs vs lvm which improvements is stalling.

Those were running on kernel-latest at that time, so at least 6.3.x. But even then, if only the very latest kernel version would start to work fine, it's not enough for switching the default. The LTS version needs to work well for that. In any case, it's way to late for this for R4.2. We may re-test, and re-consider for R4.3.

I would love to focus on currently referred testing results (and options/versions tested) as being the base for not going forward with BRTFS.

@DemiMarie : Maybe OP should refer to those test results, kernel versions to state clearly what could be the causes of discrepancies between qubesos forum post showing better performance gains simply by switching to brtfs from thin lvm?

@tasket suggested some optimizations on filesystems options to boost brtfs performance for better performance boost/comparison under qubesos.

@Rudd-O made really clear of performance and stability gains of openzfs vs TLVM and brtfs, where licensing still is an issue, even though Ubuntu dodged the problem altogether a while ago and never got sued. But Fedora won't follow.

My past attempts here were to recap on the current states of improvements of brtfs. This thread stated possible causes of performance degradation of TLVM over brtfs for large volumes, snapshot rotation, trimming etc.

It's currently hard to grasp what is the current state of things and how to get things forward to make it better. But it's unclear what was fixed, what is still problems and things that can't be fixed by tweaking accordingly.

The reason I insist on this is because downstream projects insist on cloning templates and diverging through salt recipes, which TLVM cannot support efficiently since no dedup is possible. Brtfs supports offline dedup, which bees permits to accomplish if deployed so that space consomption is highly reduced, backups can be restored without filling disks and compression reduces operation times, boots read times and offer performance gains that cannot be denied. Openzfs would fit the need even better by compressing and offering online deduplication, resolving space consomption extend life of SSD drives, which otherwise as been claimed again and again that qubesos over other OSes is a drive life shortener. But at the expense of higher memory usage for dom0, needing to keep track of all blocks to be able to live dedup. Bees does it more efficiently, but after the fact, meaning that wear of drives will not be optimized, while drive consumption and speed woukd be improved.

Users wants faster vm boot/shutdown time, proper backup/restore mechanisms (speed, space requirement upon restore not exploding), hardware life and battery efficiency.

Openzfs is known to be a ram hugger for live dedup but results in more stable, more efficient IO, reduced disk consumption, faster snapshot and faster IO. TLVM overhead is known, large disks qubes show that TLVM loses over brtfs, openzfs shines in all aspect but licence. Where to go next and how?

What is the current state of the art with current kernel versions and where/how to get this forward?

From what I get, brtfs proper filesystem creation and runtime options need to be defined properly to be compared correctly against current TLVM ones. Openzfs, if I understand well the situation, won't be part of the installer until Fedora does a move like Ubuntu did, otherwise QubesOS won't take the decision to do what Ubuntu did and take any risks for themselves.

So it leaves us to brtfs. So how to show without any doubt that brtfs can be a better candidate then TLVM (which most other OSes dodged being default installation deployments nowadays)?

If QubesOS stays with Fedora as dom0, which is not planned to change anytime soon unless I missed something, then how to reduce SSD wear, improve large vm shutdown/snapshot rotation time and embrace templates space deduplication from cloning/diverging/backup without switching to brtfs, while openzfs is out of scope?

Tldr: the discussion should resume from @tasket suggestions https://github.com/QubesOS/qubes-issues/issues/6476#issuecomment-1692530385

Once again, if some openqa testing iso images were produced to ask testers to report output on same hardware having only fs changes from OS install, reporting perf diffs could be asked, and provided easily from those wanting to move this important subject forward.

I could spare time reporting and so others having multiple identical machines, isolating changes to the fs changes and tweaks alone.

QubesOS / qubes-issues

Switch default pool from LVM to BTRFS-Reflink #6476

5053

6297

6184

3244 (really a kernel bug)

5826

3230 ― since reflink files are ordinary disk files we could just rename them without needing a copy

3964