QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Direct dm-thin storage driver #6607

Open DemiMarie opened 3 years ago

DemiMarie commented 3 years ago

The problem you're addressing (if any)

In Qubes OS, snapshot creation is an incredibly latency-sensitive operation. It is not uncommon for users to wait on VM startup, and each VM startup requires creating three snapshots.

Unfortunately, LVM is not optimized for snapshot creation speed. LVM is effectively stateless and must revalidate the current state of the system against its metadata on every invocation. We have absolutely no need for this: we do not need to synchronize with udev, and we know exactly what the state of the system should be at any given time. In short, LVM imposes a substantial amount of overhead. Each LVM command consumes somewhere between 0.2 and 1 second per invocation, and there are three invocations per VM start and per VM stop.
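A rough way to reproduce that measurement (the VG/LV names are placeholders; run in dom0 with care):

```python
# Rough harness to measure per-invocation lvm2 overhead in dom0.
# VG/LV names are placeholders; adjust before running.
import subprocess
import time

def timed(argv):
    t0 = time.monotonic()
    subprocess.run(argv, check=True, capture_output=True)
    return time.monotonic() - t0

# Create and remove a throwaway thin snapshot, timing each command.
print("lvcreate: %.3fs" % timed(
    ["lvcreate", "-s", "-n", "timing-test", "qubes_dom0/vm-work-private"]))
print("lvremove: %.3fs" % timed(
    ["lvremove", "-f", "qubes_dom0/timing-test"]))
```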

Describe the solution you'd like

We can substantially reduce overhead by bypassing LVM and using the underlying device-mapper ioctls directly. This should save several tenths of a second per VM launch, and another several tenths of a second per VM shutdown.
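For concreteness, here is a sketch of what snapshot creation looks like at the device-mapper level (shelling out to dmsetup for readability; a real driver would issue the equivalent ioctls via libdevmapper; the pool name, device ids, and sizes are made up):

```python
# Minimal sketch: create a dm-thin snapshot using the kernel's thin-pool
# messages directly, with no lvm2 involvement. Shelling out to dmsetup
# here for readability; a real driver would issue the equivalent DM_*
# ioctls. Pool name, device ids, and sizes are hypothetical.
import subprocess

POOL = "/dev/mapper/qubes-pool"  # hypothetical dm-thin pool device
SIZE = 2097152                   # origin size in 512-byte sectors

def dm(*args):
    subprocess.run(["dmsetup", *args], check=True)

def snapshot(origin_name, origin_id, snap_name, snap_id):
    dm("suspend", origin_name)   # quiesce I/O so the snapshot is consistent
    dm("message", POOL, "0", f"create_snap {snap_id} {origin_id}")
    dm("resume", origin_name)
    # Activate the snapshot as its own block device.
    dm("create", snap_name, "--table", f"0 {SIZE} thin {POOL} {snap_id}")

snapshot("vm-work-private", 17, "vm-work-private-snap", 18)
```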

Where is the value to a user, and who might that user be?

Users will benefit from improved VM start and shutdown times.

Describe alternatives you've considered

We could try to submit patches to reduce LVM's per-command startup time, but according to the mailing list post linked below, they would probably not be accepted upstream.

Additional context

We can still use LVM to manage the storage on which our thin pools reside. LVM’s performance hit is much more manageable there, as it is not on the critical path for VM startup.
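Concretely, lvm2 could keep managing two ordinary LVs (data and metadata), with a Qubes-managed thin-pool target stacked on them at boot. A sketch with illustrative names and sizes:

```python
# Sketch: activate a Qubes-managed thin-pool target on top of two
# ordinary LVs that lvm2 continues to manage. Names/sizes illustrative.
import subprocess

META = "/dev/qubes_dom0/pool-tmeta"  # metadata LV, still managed by lvm2
DATA = "/dev/qubes_dom0/pool-tdata"  # data LV, still managed by lvm2
DATA_SECTORS = 209715200             # size of DATA in 512-byte sectors

# thin-pool table: <start> <len> thin-pool <metadata dev> <data dev>
#                  <data block size (sectors)> <low water mark (blocks)>
subprocess.run(
    ["dmsetup", "create", "qubes-pool",
     "--table", f"0 {DATA_SECTORS} thin-pool {META} {DATA} 128 32768"],
    check=True)
```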

Relevant documentation you've consulted

Related, non-duplicate issues

https://lore.kernel.org/linux-lvm/a4708de8-93f9-e663-3184-099a8e40ac3c@redhat.com

DemiMarie commented 3 years ago

Further notes:

brendanhoar commented 3 years ago

Probably a non-starter, but... one could approach this the same way Microsoft approached it in Windows with Fast Startup (partially booting back up after shutdown, then hibernating). OK, not exactly, but similar: use the shutdown or post-shutdown phase of the VM lifecycle to create the snapshots. One could argue it's an illusory performance enhancement, but it does remove a user-facing latency... most of the time.

If the VM is based on a template and the template is modified while the VM is shut down, then Qubes could invalidate and remove the pre-made snapshots, taking the longer path to VM startup (a sketch follows below).

B
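A hypothetical sketch of that check at VM start (none of these attributes or helpers exist in Qubes today; they are invented for illustration):

```python
# Hypothetical sketch of the proposal: snapshots are pre-made at VM
# shutdown and reused at the next start, unless the template changed in
# between. The 'generation' bookkeeping is invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Template:
    generation: int = 0  # bumped whenever the template's root volume changes

@dataclass
class VM:
    template: Template
    prestaged: list = field(default_factory=list)  # snapshots made at shutdown
    prestaged_generation: int = -1

def snapshots_for_start(vm: VM) -> list:
    if vm.prestaged and vm.prestaged_generation == vm.template.generation:
        return vm.prestaged  # fast path: reuse shutdown-time snapshots
    vm.prestaged = []        # template moved on: invalidate and discard
    return ["root-snap", "private-snap", "volatile"]  # slow path, as today
```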

unman commented 3 years ago

On Tue, May 25, 2021 at 06:16:43PM -0700, Brendan Hoar wrote:

> Probably a non-starter, but... one could approach this the same way Microsoft approached it in Windows with Fast Startup (partially booting back up after shutdown, then hibernating). OK, not exactly, but similar: use the shutdown or post-shutdown phase of the VM lifecycle to create the snapshots. One could argue it's an illusory performance enhancement, but it does remove a user-facing latency... most of the time.
>
> If the VM is based on a template and the template is modified while the VM is shut down, then Qubes could invalidate and remove the pre-made snapshots, taking the longer path to VM startup.
>
> B

I don't think you should underestimate the power of illusion. I think it was Apple who first used this, showing a picture of the desktop before it was usable - people were convinced that Apple machines booted faster.

andrewdavidwong commented 3 years ago

Reminds me of this:

> A classic story illustrates very well the potential cost of placing a problem in a disciplinary box. It involves a multistoried office building in New York. Occupants began complaining about the poor elevator service provided in the building. Waiting times for elevators at peak hours, they said, were excessively long. Several of the tenants threatened to break their leases and move out of the building because of this…
>
> Management authorized a study to determine what would be the best solution. The study revealed that because of the age of the building no engineering solution could be justified economically. The engineers said that management would just have to live with the problem permanently.
>
> The desperate manager called a meeting of his staff, which included a young recently hired graduate in personnel psychology… The young man had not focused on elevator performance but on the fact that people complained about waiting only a few minutes. Why, he asked himself, were they complaining about waiting for only a very short time? He concluded that the complaints were a consequence of boredom. Therefore, he took the problem to be one of giving those waiting something to occupy their time pleasantly. He suggested installing mirrors in the elevator boarding areas so that those waiting could look at each other or themselves without appearing to do so. The manager took up his suggestion. The installation of mirrors was made quickly and at a relatively low cost. The complaints about waiting stopped.
>
> Today, mirrors in elevator lobbies and even on elevators in tall buildings are commonplace.

One lesson for us: It's not just about how long users have to wait; it's about their subjective experience while they wait. This is presumably why a wait with a progress indicator seems more tolerable than an equally long wait without one, even if the accuracy of the indicator is approximate at best.

DemiMarie commented 2 years ago

The obvious (to me!) way to manage the metadata is as an SQLite database on the root file system. Not sure what @marmarek will think about that, though.
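For illustration only (the path and schema below are guesses, not a design):

```python
# Illustration: persist dm-thin device ids and snapshot lineage in
# SQLite on the dom0 root filesystem. Path and schema are guesses.
import sqlite3

db = sqlite3.connect("/var/lib/qubes/thin-devices.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS thin_device (
    dev_id     INTEGER PRIMARY KEY,   -- device id inside the dm-thin pool
    name       TEXT UNIQUE NOT NULL,  -- e.g. 'vm-work-private'
    origin_id  INTEGER REFERENCES thin_device (dev_id),  -- NULL unless a snapshot
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")
db.commit()
```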

brendanhoar commented 2 years ago

Hmm, I believe using dm snapshot directly with LVs in thin pools will allow avoiding the LVM restriction that the snapshot origin must be stored in the same LVM VG as the target snapshot.

If so, this gives greater flexibility for VM-specific ephemeral and VM-specific permanent encryption.

Revisions management might be more difficult though.

B

DemiMarie commented 2 years ago

> Hmm, I believe using dm snapshot directly with LVs in thin pools will allow avoiding the LVM restriction that the snapshot origin must be stored in the same LVM VG as the target snapshot.

It does, though one could use an external snapshot as well.

> If so, this gives greater flexibility for VM-specific ephemeral and VM-specific permanent encryption.

Indeed it does, since data is written to a controlled location.

> Revisions management might be more difficult though.

Yeah :(

brendanhoar commented 2 years ago

> > Hmm, I believe using dm snapshot directly with LVs in thin pools will allow avoiding the LVM restriction that the snapshot origin must be stored in the same LVM VG as the target snapshot.
>
> It does, though one could use an external snapshot as well.

If you're referencing LVM external snapshots, then... nope.

Everything I have read states that "external origin" LVM snapshots must have a source located in the same VG as the target snapshot. Thin or thick, but in the same VG. Effectively, this prevents inserting a crypt layer.

E.g., an origin in either 1) "VG01 thick-LV" or 2) "VG01 thin-LV in thin-pool1", snapshotted to 3) "VG01 thin-LV in thin-pool2", will work.

Change the source or target to VG02, however, and it will not work. If you try to use a non-LVM source, it will not work.

Brendan

DemiMarie commented 2 years ago

> > > Hmm, I believe using dm snapshot directly with LVs in thin pools will allow avoiding the LVM restriction that the snapshot origin must be stored in the same LVM VG as the target snapshot.
> >
> > It does, though one could use an external snapshot as well.
>
> If you're referencing LVM external snapshots, then... nope.
>
> Everything I have read states that "external origin" LVM snapshots must have a source located in the same VG as the target snapshot. Thin or thick, but in the same VG. Effectively, this prevents inserting a crypt layer.
>
> E.g., an origin in either 1) "VG01 thick-LV" or 2) "VG01 thin-LV in thin-pool1", snapshotted to 3) "VG01 thin-LV in thin-pool2", will work.
>
> Change the source or target to VG02, however, and it will not work. If you try to use a non-LVM source, it will not work.
>
> Brendan

Ah, good point. You can do this with dm-snapshot, or with a dedicated (not managed by lvm2) dm-thin pool.
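To make that concrete: with a dedicated pool, the kernel's thin target accepts any block device as a read-only external origin, e.g. a dm-crypt mapping. A sketch (all names and sizes below are hypothetical):

```python
# Sketch: a thin device with an *external* origin. The kernel lets the
# origin be any block device (here a hypothetical dm-crypt mapping),
# which lvm2 forbids across VGs or for non-LVM sources.
import subprocess

POOL = "/dev/mapper/qubes-pool"
ORIGIN = "/dev/mapper/template-root-crypt"  # hypothetical non-LVM origin
SIZE = 20971520                             # origin size in 512-byte sectors

def dm(*args):
    subprocess.run(["dmsetup", *args], check=True)

dm("message", POOL, "0", "create_thin 42")  # allocate a fresh device id
# thin table with external origin: <start> <len> thin <pool> <dev_id> <origin>.
# Reads of unprovisioned blocks fall through to ORIGIN; writes go to the pool.
dm("create", "vm-work-root", "--table", f"0 {SIZE} thin {POOL} 42 {ORIGIN}")
```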

brendanhoar commented 2 years ago

> Ah, good point. You can do this with dm-snapshot, or with a dedicated (not managed by lvm2) dm-thin pool.

Agreed.

I think I can hear @marmarek groaning, of course.

B

DemiMarie commented 2 years ago

> > Ah, good point. You can do this with dm-snapshot, or with a dedicated (not managed by lvm2) dm-thin pool.
>
> Agreed.
>
> I think I can hear @marmarek groaning, of course.
>
> B

Yeah, the more I think of it, the more tempting reflink+XFS is. Unless there is a benchmark that shows dm-thin being faster than reflink+XFS, I’m inclined to recommend the latter.

tasket commented 1 year ago

My experience with thin snapshot management within wyng-backup is that the multi-step process of snapshot rotation is slow because these two operations are slow:

- lvremove
- lvrename

But lvcreate occurs without apparent penalty.

I made some LVM operations asynchronous in Wyng to avoid the delays; each volume group object has a process queue which is a simple list of unfinished LVM ops. Perhaps Qubes could handle that in a similar fashion? The snapshot naming and re-naming could be done in such a way that interrupted rotation would be easily detected and handled on the next VM startup (I believe Qubes already does this), so you only need to check that the LVM queue for that VM is empty before starting new write operations: start, delete, rename. This lets you quickly start/stop different VMs concurrently, and delays would only be experienced with rapid successive operations on single VMs.
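A minimal sketch of that queueing idea (not Wyng's actual implementation; the command in the comment is only an example):

```python
# Minimal sketch of the per-VG queue idea (not Wyng's actual code):
# slow lvm2 operations are drained in the background, and a VM start
# only has to wait if its own volume group's queue is non-empty.
import asyncio

class LvmQueue:
    def __init__(self):
        # Must be constructed from within a running event loop.
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker = asyncio.create_task(self._drain())

    async def _drain(self):
        while True:
            argv = await self._queue.get()
            proc = await asyncio.create_subprocess_exec(*argv)
            await proc.wait()
            self._queue.task_done()

    def submit(self, *argv):
        # e.g. queue.submit("lvremove", "-f", "qubes_dom0/vm-work-back")
        self._queue.put_nowait(argv)

    async def barrier(self):
        # Await before starting new write ops (start/delete/rename)
        # on volumes in this VG.
        await self._queue.join()
```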

From an engineering standpoint, this requires relatively few lines of code in only one or two Qubes components, and yet is about as sturdy as one could expect (e.g., the resulting state after a system crash looks about the same as it does now).