DisposableVMs: support for in-RAM execution only (for anti-forensics)

marmarek commented 9 years ago

Reported by joanna on 25 Sep 2014 20:26 UTC Currently volatile.img is being backed up on the fs. See: https://groups.google.com/forum/#!topic/qubes-devel/QwL5PjqPs-4/discussion

Migrated-From: https://wiki.qubes-os.org/ticket/904

marmarek commented 9 years ago

Two options are evaluated for it:

use RAM only (including turning off swap)
encrypt everything written to disk with disposable key

/cc @qubesuser

marmarek commented 9 years ago

Pasting @v6ak comment from linked discussion:

I've some scripts that I use for a temporary swapfile and temporary filesystem. They use a random key (from /dev/random). The tmp filesystem uses some configuration for better performance by disabling some features like journaling. (We don't need journaling for filesystem that are expected to be unreadable after reboot…)

My current usage:

The swapfile is attached automatically added in the background after 120 seconds in order not to block by reading from /dev/random during the system boot.

The largetmp is mounted only when needed, i.e. manually. (Yes, the usage of sudo suggests that…) I usually use tmpfs, but when I need something large, I mount the largetmp. I was thinking about automount of largetmp, but I was unsure about safety and some other potential issues. (Moreover, largetmp lies on rotational HDD, so using it instead of tmpfs could cause much more HDD usage and more power consumption.)

Some security considerations:

It is essential to handle safely situation when largetmp is not mounted. This is the reason why I use /tmp/large and not /large/tmp. If I forget to mount it, the worst thing that can happen is writing large amount of data to RAM. If it was in /large/tmp, accidental writing to a less protected partition (i.e. /large or /, which depends on the setup) may happen if the largetmp1 is not mounted. (Which is what I've once accidentally done. It was followed by several days of continuous wiping…)

The /dev/random is usually seeded from saved random seed. When some wear-levelling or relocation is used, the random seed might be available to local forensics, which could reduce the effective entropy of the key in the considered case. (Fortunatelly, the Qubes login screen requires some keystrokes, which adds some entrophy.)

There are the scripts and crypttab lines: https://gist.github.com/v6ak/3171313bc2c22efc263d

rootkovska commented 9 years ago

FWIW, I really like the idea of encrypting volatile.img with a one-time key and then throw it away after the DispVM shutdown. Can we ensure the dm will never write the key to dom0 fs anywhere? Do we need to disable swap in dom0 for that?

marmarek commented 9 years ago

No need to disable dom0 swap - VM memory is never written to disk (AFAIR it isn't even supported). The tricky part would be to integrate it with our DispVM implementation

based on savefiles, to each DispVM have different key. But that's surely doable.

And it should be trivial for normal AppVMs when #1308 got implemented.

Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

v6ak commented 9 years ago

First, I see two levels of anti-forensic DVMs. Attacker is not able to get any data from the DVM, if:

…the whole computer is shut down and he cannot get any RAM dump, but attacker knows the encryption password. (Supposing that RAM does not persist. AFAIK this is sometimes not kept on older RAM types (see cold boot attacks), but it is practically kept on modern RAMs.)
…the DVM is shut down. (But the instance of Qubes might be still running) Attacker is able to get RAM dump (and thus e.g. extract encryption keys for full-disk-encryption), but this is not useful for extracting information about the DVM that is not running. Attacker also knows the encryption password.

The level 1 protection can be implemented IMHO relatively easily in dom0:

All temporary files (e.g. volatile.img and COWs) are saved to some filesystem encrypted by a disposable key.
Any dom0 swap should be encrypted by a disposable key. (I am not sure about default setup on Qubes.) If this is not done, it might be hard to ensure that no related sensitive data (e.g. encryption keys of the temporary filesystem above) might be swapped on HDD and thus potentially exposed to the attacker. This is likely doable, but much harder.

Both of them this might be also useful for standard VMs and also or performance reasons, as the disposable FS might be configured not to persist data reliably (e.g. data=writeback, disabled metadata journal and so on). But this is rather a nice side effect than primary goal of this issue.

By design, hibernation can't be supported.

The level 2 protection seems to be much harder:

Even having some swap in dom0 is potentially an issue. Remember that encryption currently runs in dom0. Maybe this could be offloaded to some “stubvm” for these cases. (Note that enough entropy in such stubvm has to be ensured in some way.)
When DVM shuts down, the related memory must be wiped immediately. I am not sure if Xen does that. It is likely that it does that lazily (in the same way that Linux kernel without proper GRSec config does) for performance reasons. But doing that lazily implies a potential problem.
When the DVM releases some allocated memory (through Qmemman), it also must be wiped soon enough. (Theoretically, wiping might be delayed after VM shutdown, but not more.)
Maybe a proper DVM shutdown indication has to be implemented.

Rather a side note at this point: I have to correct my statement about /dev/random and relocation. With my today's knowledge, this issue is true only for /dev/urandom, not for /dev/random, at least if standard systemd is used.

marmarek commented 9 years ago

I was thinking about implementing that volatile.img encryption inside of VM, not in dom0. Dom0 would not know that encryption key. In this case we can be sure that encryption key will not land in dom0 swap. Generally this give us rather nice property - every data in DispVM is either in VM RAM (not dom0 RAM), or written to volatile.img, encrypted. This doesn't include some intentional leaks, cover channels etc, but this is offtopic here.

As said earlier - VM memory is never written to dom0 swap (or any other dom0 part). This, I think, fully solve the first case.

As for the second case, we "just" need to ensure that VM memory is wiped when VM is shut down. In practice, I think this is the case, because all the memory released by VM will be redistributed to other VMs. And before VM get such memory, it is scrubbed by Xen. But in theory qmemman leave some small part of memory (50MB) unassigned just for Xen internal use. Theoretically it can happen that Xen will give some part from that pool to the VM, instead of reusing just released memory - we do not control which memory page is assigned where. this pool and

Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

v6ak commented 9 years ago

The DVM can handle encryption of volatile.img (well, actually a block device backed by dom0:volatile.img), but I don't think it can properly handle encryption of COWs without some non-trivial tweaks.

When I run xentop, it seems to usually show several dozens of megabytes free, but I also remember having about 1.5 GiB free when runing few VMs on laptop with 16GiB RAM. I however agree with the statement that the one issue is likely much less serious than I initially thought.

qubesuser commented 9 years ago

For level 1 I think dom0 swap needs to be encrypted with a disposable key because it's possible for both keystrokes, window contents and audio data to be swapped to dom0 swap

It might be possible to completely avoid that by mlocking Xorg, kwin, pulseaudio, qubes-guid, pacat-simple-vchan, etc. but it seems a risky route.

Encrypting dom0 swap with a disposable key is not hard at all: all you need is to add an entry to /etc/crypttab with /dev/urandom as the key and point the swap to the resulting DM device.

For the same reason, I don't think "Level 2 protection" is easy.

You definitely need to at least restart the whole GUI since otherwise there could be window contents, keystrokes and audio data in the heap or unused stacks of all processes.

Fixing that is infeasible since it requires patching gcc to clear all parts of the stack when adjusting it, hacking glibc to scrub memory on free() and unused parts of all thread stacks on request

Even restarting the GUI you'd still need to patch the Linux kernel to scrub kernel memory and GPU memory.

I think it's probably worthwhile to do a "best effort level 2" where VM memory is not leaked, but the UI leaks are not plugged.

In this case, due to dom0 4GB mem max it is not true that qmemman will redistribute all memory, so it probably requires a simple patch to Xen to scrub memory no longer assigned to VMs (or maybe Xen can already do it?).

v6ak commented 9 years ago

The X11 leaks is a good point. And it is hard to get rid of that. I basically agree with the whole @qubesuser's post.

Restarting the GUI is theoretically feasible. When you restart X11, Qubes seems to remember some state and be able to continue working. (It is not perfect, though. For example, minimization states don't seem to be remembered.) With a separate GUI domain, a whole VM reboot could be probably implemented. I am, however, not sure if it is worth the work. It is just idea meaned: When you feel it is worth the work, you might find the idea useful.

Where have you found the 4GB mem max for dom0? My experience with Qubes does not confirm that, I can see higher memory amounts assigned to dom0. However, after stopping some VMs, I can see 1446384k free in xentop, which is much more than ~50MiB.

marmarek commented 9 years ago

On Mon, Oct 12, 2015 at 12:35:20PM -0700, qubesuser wrote:

Encrypting dom0 swap with a disposable key is not hard at all: all you need is to add an entry to /etc/crypttab with /dev/urandom as the key and point the swap to the resulting DM device.

When you do this manually, it's easy. When you need to script this (in installer), things get (slightly) more complicated. But still not that hard.

I think it's probably worthwhile to do a "best effort level 2" where VM memory is not leaked, but the UI leaks are not plugged.

Yes.

In this case, due to dom0 4GB mem max it is not true that qmemman will redistribute all memory, so it probably requires a simple patch to Xen to scrub memory no longer assigned to VMs (or maybe Xen can already do it?).

I'm not sure. By default VM have maxmem as half of physical memory, so if you have at least two running VMs (and haven't altered that default significantly), it shouldn't be needed.

Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

marmarek commented 9 years ago

On Mon, Oct 12, 2015 at 01:33:51PM -0700, Vít Šesták wrote:

Where have you found the 4GB mem max for dom0? My experience with Qubes does not confirm that, I can see higher memory amounts assigned to dom0. However, after stopping some VMs, I can see 1446384k free in xentop, which is much more than ~50MiB.

Done for R3.1: #1313

Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

v6ak commented 9 years ago

Some implementation for anti-forensic DVMs (for volatile.img): https://groups.google.com/forum/#!topic/qubes-users/X0BBZ-kfix0

Drawbacks:

Logs are not covered.
Not much fine-grained. You have to reboot in order to make the old volatile.img unreadable.
Some small part of current implementation (moving location of volatile.img) is a bit hacky at the moment.

Advantages unrelated to this goal:

Some anti-forensics for other VMs (e.g. for swap).
Likely some performance benefits (mainly for HDD) and less wear (mainly for SSD).

Rudd-O commented 8 years ago

From #1819 :

This is a feature request.

User requests DisposableVM via UX interacton.

dom0 script in charge of DisposableVM setup sets up the root file system as a device-mapper device. Then it sets up the swap device and the home directory device in the following manner:

do the exact same thing being done right now to create the block devices
generate cryptographically secure random key, 256 bits of entropy; this program must mlockall() to prevent that data being swapped
luksFormat and luksOpen those devices using the secure random key (which will be held in RAM and will never be swapped to disk); check that the assumption holds that invocation of these programs won't leak to swap
make the filesystems and swap devices atop those block devices

(To be honest, the swap devices of all VMs should be made atop that).

Teardown of devices is the exact opposite -- once the VM is dead, the devices must be luksClosed and then luksWiped.

Presto correcto mundo — unrecoverable devices associated with DisposableVMs, so long as the user does not write to anything other than /home.

This should not be too much of a complication compared to DisposableVM setup today.

andrewdavidwong commented 8 years ago

Just doing a routine check: Is it still correct that @rootkovska is assigned to this issue?

marmarek commented 8 years ago

As you've probably guessed, no.

andrewdavidwong commented 8 years ago

User suggestion regarding per-VM encryption:

https://groups.google.com/d/msgid/qubes-devel/38301f18-0d04-dad2-511f-e5c56f255135%40yahoo.com

jpouellet commented 8 years ago

@marmarek wrote:

I was thinking about implementing that volatile.img encryption inside of VM, not in dom0.

I think it is safe to say that users may have compromised VMs which they would still like to have resist local forensics. Putting the crypto inside makes violating confidentiality trivial for the adversary. Putting it outside makes it harder.

I had the same objection when reading the storage domain section of the arch spec.

Thoughts?

jpouellet commented 8 years ago

If we are trying to make the storage domain (thing touching disks) untrusted, then clearly it can not be allowed to handle keys, but neither should the VM we are trying to protect.

Consider also the case where you have an HVM with an OS that does not have disk encryption that you trust, or where disk unlocking can not be bootstrapped via kernel/initramfs fed from xen because it is not linux. IMO these scenarios still deserve ensured confidentiality as a feature, but can not provide it inside the VM.

To me, this suggests a middle crypto VM as a preferred option.

marmarek commented 8 years ago

Indeed middle crypto VM would solve both cases. Not sure if this should be per-VM crypto VM, or one for all. In any case, I'm worried about performance (yet another chain of xen-blkfront/xen-blkback in the storage path) and even bigger memory footprint (especially in case of per-VM crypto VM). For HVM domains, stubdomain (domain hosting qemu) could be used. At least for systems without PV drivers installed (which is the case for Windows - we exclude block PV driver there, for unrelated reason). This would not solve all the cases (for example running some live-cd Linux system there), but at least some.

In case of DispVM in non-storage-domain world, the problem is - it's hard to handle the key in dom0 and be sure it didn't landed in swap or such. Is relying on crypsetup properly mlock-ing memory good idea?

Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

jpouellet commented 8 years ago

On Mon, Nov 14, 2016 at 4:44 PM, Marek Marczykowski-Górecki wrote:

In case of DispVM in non-storage-domain world, the problem is - it's hard to handle the key in dom0 and be sure it didn't landed in swap or such.

This is also true for other user secrets such as keystrokes and frame/audio buffers. IMO that should be solved generally, and is not a disk-crypto-specific issue.

Is relying on crypsetup properly mlock-ing memory good idea?

Perhaps not, but I would definitely prefer mlocked cryptsetup in dom0 over compromised cryptsetup in domU.

Rudd-O commented 5 years ago

Here is a musing I came up with over the last few hours of insomnia:

Commentary that follows assumes either LVM or file pool (reflink pool or my experimental ZFS pool do not apply):

VM boots. Xen maps its private.img+private-cow.img / volatile.img+volatile-cow.img as /dev/sdb. All good. Writes make it to the COW device in every case.

To prevent live forensics, <device>-cow.img must be backed, not by a file / LVM logical volume, but by a ephemerally-in-RAM-keyed LUKS device (which in turn can then be backed by <device>-cow.img), and then do the snapshot combo of luksOpened COW image atop <device>.img. This would ensure that writes of in-VM data never make it to persistent storage to be later read by forensics tools.

(You can extend this principle to the template.img+template-cow.img pair.)

Then, at VM shutoff, in the case of private.img+private-cow.img device-mapper can merge the COW image onto the base image. This should work fine, except now you don't have any antiforensics because the data actually did "hit the disk" (as private.img is not encrypted).

However, things are different with DispVMs. volatile.img and its companion COW image get nuked at VM shutoff (unlike private.img, they do not get snapshot-merged), and even if they did not get nuked, the data written to the COW device has gone to disk encrypted by LUKS anyway. That puts us 80% close to the goal, because what people want most, is anti-forensics of DispVM uses.

EEEEEXCEPT... except for the key. DispVMs are resumed from suspension. The suspended disk images are created at DVM creation time. In other words: in order to boot the DVM from its "DVM template" -- which may happen days or weeks after creating the DVM -- the LUKS key which is supposed to be ephemeral must, unfortunately, be persisted somewhere... totally defeating the idea of antiforensics to begin with.

The two ways I see out of here:

Make DVMs resume with an entirely empty volatile.img (/dev/sdb in the VM), but Qubes absolutely wants to add some stuff to /rw (mountpoint of /dev/sdb, prepared during DVM savefile creation time).
Make a special case for DVMs' volatile-cow.img be backed by a just-in-time created LUKS device which in turn is what is fed into the libvirt configuration file (as opposed to the regular pair volatile.img+volatile-cow.img.

So, this is not at all an easy circle to square.

jpouellet commented 5 years ago

EEEEEXCEPT... except for the key. DispVMs are resumed from suspension. The suspended disk images are created at DVM creation time.

IIRC this was the case in R3.x, but is no longer the case in R4.0.

brendanhoar commented 5 years ago

From what I understand, and correct me if I am wrong, snapshots of either linear or thin LVs must be in a on the same VG as the origin. Is that correct?

I'm having trouble trying to understand where the opportunity to insert a LUKS layer will be using LVM thin pool for template storage.

marmarek commented 5 years ago

To prevent live forensics, <device>-cow.img must be backed, not by a file / LVM logical volume, but by a ephemerally-in-RAM-keyed LUKS device (which in turn can then be backed by <device>-cow.img), and then do the snapshot combo of luksOpened COW image atop <device>.img. This would ensure that writes of in-VM data never make it to persistent storage to be later read by forensics tools.

Yes, this is a very good approach to this problem. See also https://github.com/QubesOS/qubes-issues/issues/1819#issuecomment-499215410. In short: you can set root volume to read-only and let the VM use volatile volume as CoW backend (which is separate and can be encrypted).

I think it shouldn't be hard to write a "proxy storage pool" driver, that takes another storage pool as a backend and then, when requested a volume (for example for DispVM's volatile volume), it forward the request to backend pool to create a volume and then apply LUKS (or just plain dm-crypt) with ephemeral key. And only such volume then is attached to a DispVM.

There is "a little" problem with private volume. In R4.0, DispVMs have it cloned from DisposableVM Template (like fedora-29-dvm). But in practice, it isn't a copy, it is a snapshot of it. And in LVM thin pool you can't use arbitrary block device as a COW layer. I see two options for it:

do not use LVM thin pool snapshots for such anti-forensic DispVMs and construct standalone dm-snapshot using base private volume (the one from DisposableVM Template) and encrypted COW volume
set private volume as read-only, similar to root volume, and let the VM construct COW layer using volatile volume

Both options add some complexity to storage setup (and also slows down DispVM startup). But the second one move some of this complexity to the VM, reducing potential impact.

marmarek commented 5 years ago

Some clarified version of the above comment.

Goal: have DispVM have no persistent, non-encrypted (with ephemeral key) writable storage volume. This means, each volume needs to be either set read-only, or encrypted with ephemeral key (outside of reach of that VM).

Lets enumerate volumes. Below when I write "encrypted", I mean encrypted with ephemeral key.

root volume

Can be set read-only. Then in-VM initramfs will setup COW using volatile volume. Exactly as documented on https://www.qubes-os.org/doc/template-implementation/

private volume

In the current implementation, needs to be writable and initialized with a content from DisposableVM Template's private volume (which is persistent). This rule out initializing it as an empty independent encrypted volume. Unfortunately, LVM thin snapshots ("clone" operation in lvm thin storage pool) cannot use arbitrary block device (encrypted in this case) as a backend for COW. So, something else is needed here, at least for LVM thin storage pool (but file-reflink probably too). I see two options here:

do not use LVM thin pool snapshots for such anti-forensic DispVMs and construct standalone dm-snapshot using base private volume (the one from DisposableVM Template) and encrypted COW volume
set private volume as read-only, similar to root volume, and let the VM construct COW layer using volatile volume

The first one adds extra dm-snapshot in dom0, the second one do that in VM. In fact I've attempted to do theVM part of the second option already, here.

volatile volume

Needs to be writable, but doesn't need to have any initial data. Can be created as fresh encrypted volume.

modules volume

Read-only volume of linux kernel modules, for the kernel set from dom0.

encrypted storage pool driver

I think it shouldn't be hard to write a "proxy storage pool" driver, that takes another storage pool as a backend and then, when requested a volume (for example for DispVM's volatile volume), it forward the request to backend pool to create a volume and then apply LUKS (or just plain dm-crypt) with ephemeral key. And only such volume then is attached to a DispVM. In fact, very similar driver could be used also for #1293 , "just" a key management needs to be added (instead of using ephemeral random keys).

brendanhoar commented 5 years ago

This document -> https://www.qubes-os.org/doc/secondary-storage/ <- says "Qubes 4.0 is more flexible than earlier versions about placing different VMs on different disks. For example, you can keep templates on one disk and AppVMs on another, without messy symlinks."

First, is this really true? I have my doubts, but if it is correct...

I propose something new called "Ephemeral Disposable VMs."

When one is started, Qubes would automate creation of a very over-provisioned LV in the primary thin pool, slap ephemerally keyed LUKS on top of that LV, set up PV/VG on top and provision another thin pool out of that VG and then copy the entire dvm AppVM to that encrypted pool.

You'd only need to do that once per qubes OS session and then all disposable VMs started from that dvm AppVM would be full ephemerally encrypted.

And then, on shutdown or reboot, you'd remove all references in qubesdb and remove the very overprovisioned LV.

If that statement in the Qubes docs is wrong, then, well, nevermind.

B

marmarek commented 5 years ago

You'd only need to do that once per qubes OS session and then all disposable VMs started from that dvm AppVM would be full ephemerally encrypted.

This means effectively you need to shutdown the system to make prior DispVMs data unreachable. I'm not sure if that's enough to satisfy this task. See some summary in https://github.com/QubesOS/qubes-issues/issues/904#issuecomment-147487804

brendanhoar commented 5 years ago

How about slightly modified to also remove the stack/LV once all children (to dvm appvm) ephemeral disposable VMs are shutdown?

So instead of throwing it out once per Qubes session, it would also be thrown out once per "session" of using that particular dvm appVM as a source for disposable VMs?

If you keep a single child disposable VM open, then VM shutdown throws it out as well as the copy of the dvm appvm.

If you keep multiple children open, then a shutdown of one does not "erase" the activity, but a shutdown of all the sibling VMs does "erase" all the activty.

Whever the single disposable VM or all of the sibling VMs are thrown out, there's an understanding that starting one up again will take a bit longer (due to lvm build up and copying ~100 to 400MB).

I guess that only covers certain uses cases. And..of course that would be difficult for users to understand, and difficult to understand, esp. surrounding behavior, is anathema to security solutions. :/

B

brendanhoar commented 5 years ago

Ok, here's a script that creates ephemerally encrypted disposable VM using the default configuration of Qubes R4.0x on a single drive with a single LVM thin pool.

Utilizing the qvm-clone performance fix to lvm.py suggested in #5134 (adjusting the dd arguments to indicate blocksize), the total startup/shutdown time running an ephemeral disposable whonix is 54s (vs. 21s for a standard disposable whonix). Without the fix, well, it takes about 5 minutes.

Basically, the script creates a thin LV in pool00, layers dm-crypt on top of that, then uses that non-LVM device to build up another LVM stack, registers the new pool with qubes, then copies the VM there and executes the disposable VM based on that VM. After VM exits, the script tears it all down.

Warning: this is a barebones, non-error-checking experiment with poorly chosen variable names and key material sourced blindly from /dev/urandom. Use with caution. I was lucky that I had thought to remove the -f option from lvremove during iterative writing/testing: at one point lvremove asked if I wanted to remove all 114 volumes...(!!!)...oops, used the wrong variable. Fixed it though...I think. Anyway...

Lastly, modify sourceVMname for your use case.

#!/bin/bash

# Set variables
sourcevg=qubes_dom0
sourcepool=pool00
sourceVMname=whonix-ws-14-dvm
targetVMname=${sourceVMname}-ephemeral

lev1name=ephemeral-test # shows up in /dev/qubes_dom0/ (and is part of pool00)
lev1size=512 # in GB # replace with value that is %age of source thin pool size.
lev2name=${lev1name}-luks #shows up in /dev/mapper/ (not seen as part of lvm)
lev3name=${lev1name}-vg
lev4name=${lev1name}-thinpool #shows up in LVM as another pool
let lev4size=${lev1size}-1
lev5name=${lev1name}-LV-ephemeral
let lev5size=${lev4size}-1

# Get options and set variables
# TODO

# functions
function pause(){
    read -p "$*"
}

# Check for existing config that matches invocation and skip setup

# Use naming convention that can be easily rememdied after an unclean shutdown.
# activation/deactivation settings and timing/invocations matter.

# Setup
#   Create overallocated thin LV same size as enclosing pool (temp 512GB)
sudo lvcreate -T ${sourcevg}/${sourcepool} -kn -ay -n ${lev1name} -V ${lev1size}G 
#   Create LUKS *or* plain crypt Volume on top of this thin LV.
#     Add --keyfile-size= if you want less than 8MB from urandom
#     Utilized random key, enable discards, determine correct block size (4K? LV TP cluster size?), option to drop on dismount?
sudo cryptsetup open --allow-discards --type plain --key-file /dev/urandom /dev/${sourcevg}/${lev1name} ${lev2name}
#   Set up PV using entire LUKS or loop device.
sudo pvcreate /dev/mapper/${lev2name}
#   Set up VG using this single PV.
sudo vgcreate ${lev3name} /dev/mapper/${lev2name}
#   Set up thinpool LV using entire VG.
sudo lvcreate --type thin-pool --name ${lev4name} --size ${lev4size}G ${lev3name}
#   Set up LV in thinpool.
sudo lvcreate --type thin --name ${lev5name} --virtualsize ${lev5size}G --thinpool ${lev3name}/${lev4name}

# create qvm-pool object
qvm-pool -a ${lev4name} lvm_thin -o volume_group=${lev3name},thin_pool=${lev4name},revisions_to_keep=3
#   Copy DispVMTemplate AppVM to new thinpool via qubes tools. 
#      Note: Separate thin pool for template and appvm is documented as working in qubes, but launching DVM from AppVM has to be in the same pool.
#  THIS IS VERY SLOW, dd is used without any bs option - modify lvm.py dd invocation to fix.
time qvm-clone -P ${lev4name} ${sourceVMname} ${targetVMname}
#   Start dispVM in new thinpool.
qvm-run --dispvm ${targetVMname} konsole

# insert monitoring/wait here instead of BS
#
pause "VM halted. Press a key to tear down the ephemeral storage volumes & wipe the session."

# Shutdown
#   Ensure VM is no longer running. Use wait loop around qvm-ls ${targetVMname} looking at status and/or parseing output.
# ...
#   ADD -f option to all lv/vg/pv invocations once confident the syntax is correct.
#   Remove DispVMTemplate AppVM from thinpool via qubes tools.
qvm-remove -f ${targetVMname}
#   Remove qubes pool
qvm-pool -r ${lev4name}
#   Remove LV
sudo lvremove -f ${lev3name}/${lev5name}
#   Remove thinpool
sudo lvremove -f ${lev3name}/${lev4name}
#   Remove VG
sudo vgremove ${lev3name}
#   Remove PV
sudo pvremove /dev/mapper/${lev2name}
#   Remove LUKS volume (and loop device is applcable)
sudo cryptsetup close ${lev2name}
#   Remove overallocated thin LV
sudo lvremove -f ${sourcevg}/${lev1name}
# 
# end
#

Addendum:

Depending on threat model, one may consider cloning sys-whonix onto ephemeral storage along with whonix-ws VMs and call it sys-whonix-ephemeral. sys-whonix does see plaintext on the whonix-ws side and that could cause a plaintext leak into sys-whonix storage (e.g. swap). For non-networked ephemeral VMs, that would be unnecessary.

Also, extending that a bit farther (multiple VMs per nested LVM thinpool, but now non-ephemeral) the approach could support groupings of encrypted semi-permanant VMs (e.g. per contract-client VM groupings) all in separately-keyed-and-passworded-per-group pools. For example. this could help ensure discipline when working with client data. Qubes would have to add key management, session handling, etc.

Generally, government and enterprise workstation usage is always FDE, but there are many use cases to keep additional layers of encryption on top of the personal or enterprise-managed FDE.

B

tasket commented 5 years ago

@brendanhoar Interesting! But may I make a suggestion for a non-dd implementation?

IIRC another block driver exists that is related to thin lvm called dm-thin. You can think of it as thin provisioning snapshots without a strict need for lvm. I thought this could be used to create a COW snapshot on un-related devices (i.e. ro base image in qubes_dom0, COW deltas in wherever you setup the dm-crypt block device):

https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

If it works as that suggests, there would be no need for a temporary lvm stack and no need to copy a whole volume before using it.

brendanhoar commented 5 years ago

Hopefully that'll work, assuming the scope of External is global for dm-thin, which it isn't for LVM thin. That restriction on LVM thin makes sense from a...volume management...perspective, of course. :)

Anyway, as I mentioned above, this was more of a proof of concept. The (brittle*) script above simply uses the Qubes and LVM tools as-is, and is a temporary workaround until such time that Qubes supports encrypted disposable VMs or groupings of non-disposable encrypted VMs**.

I'll leave pursuit of doing it the right way to the developers, as a non-dd approach involves modification of the qubes VM management stack...

...and I don't know python. :)

B * in the sense that quite a bit of cleanup would be necessary if the tear down happened at the wrong time or didn't occur due to an unclean shutdown. ** "designing" a different, but also brittle, shell script for those.

brendanhoar commented 5 years ago

...and of course I just ran into two problems with my script.

It appears that when cloning a VM from Pool A (default pool) to Pool B (custom pool), a copy of the private LV is made in Pool B. However, there's a subtle difference in how the volatile volume is handled for the regular VM vs. the disposable VM...and there's also an issue regarding the not-copied root-snap (which you can guess already).

The volatile volume: a. If the VM that was started from Pool B was invoked as a regular VM, the volatile volume is created on Pool B during the time the VM is up and running. b. However, if the VM that was started from Pool B was invoked as a disposable VM, a temporary volatile volume on Pool A is created for the VM. c. Should we consider case "b" a bug in Qubes and report it, or is this expected behavior?
The root-snap volume: I noticed that the root-snap for any Pool B VM (regular or disposable) is created in Pool A. That makes sense, as that is where the Template is...but it really puts a ding in my approach, as the additional time required to clone the VM's template to Pool B is substantial (1-2 minutes).

UPDATE - cloning the template to Pool B and linking the cloned VMs in Pool B to that template addressed item number 2 above. Unfortunately it did not address issue 1b, the volatile volumes for disposable VMs launched from Pool B are still created in Pool A!

UPDATE2 - Pretty sure the cause of issue 1b is here: https://github.com/QubesOS/qubes-core-admin/blob/master/qubes/storage/__init__.py in _init_volume round about line 398 which says "# if pool still unknown, load default". I think that the routine is looking at the VM config and AppVMs or dispvm 'template's have an entry for the volatile volume in the config, but the just-in-time created disposable VM does not have an entry for the volatile volume in the config. Whether the code here need to be adjusted to special case volatile (or do something else entirely)...or whether the invocation of the disposable VM earlier in the process should push the value into the config earlier so that the unknown pool path isn't taken, I'm not sure.

But I'm not super comfortable with Qubes falling back to the "default" pool under these kinds of circumstances.

Let me know if I should open a defect for the issue where diposable VMs always use the default pool for their volatile volumes, no matter what pool they are started from.

UPDATE3 temporarily modifying default_pool_volatile via qubes-prefs, with the other adjustments above and then restoring it after VMs are up ensures all volumes are on Pool B.

B

Rudd-O commented 5 years ago

Strongly recommend reading the storage drivers from Qubes OS. In dom0:

find /usr/lib* | grep qubes | grep storage | grep '.py$'

As marmarek pointed out, a wrapper driver that wraps an existing one will do nicely and will be compatible with all storage techs in Qubes OS.

3hhh commented 4 years ago

As marmarek pointed out, a wrapper driver that wraps an existing one will do nicely and will be compatible with all storage techs in Qubes OS.

That's just what you can now find in https://github.com/QubesOS/qubes-core-admin/pull/354. Storage techs in the Linux kernel are designed in layers for good reasons...

Anyway it should work for the aforementioned use cases (you can find an example for the file pool driver with key in RAM as test scenario). If not, feedback could be valuable.

If anyone can spare some time to write unit tests for it, it might even go upstream. Otherwise I consider it completed unless someone finds bugs or bugs are revealed by the unit tests. If you disagree, feel free to fork.

dylangerdaly commented 4 years ago

What's the latest with this? Has it been merged into 4.1?

When I start a dispVM, it's still just using the default LVM pool, so it's still hitting my disk.

andrewdavidwong commented 4 years ago

What's the latest with this? Has it been merged into 4.1?

Looks like https://github.com/QubesOS/qubes-core-admin/pull/354 has not been merged yet.

mfc commented 3 years ago

@andrewdavidwong @marmarek since https://github.com/QubesOS/qubes-core-agent-linux/pull/258 is merged, can this issue be closed?

andrewdavidwong commented 3 years ago

@andrewdavidwong @marmarek since https://github.com/QubesOS/qubes-core-agent-linux/pull/258 is merged, can this issue be closed?

Based on the description, that doesn't sound like it's sufficient to close this issue, but I'll leave it to Marek to determine.

marmarek commented 3 years ago

That is a different feature, only slightly related.

brendanhoar commented 3 years ago

I like the direction these commits are going. 😃

brendanhoar commented 2 years ago

Background: I came to this simple proposal after some local LVM & device-mapper experiments.

After some consternation, I realized (and/or finally understood @marmarek's comments from a while ago...) that you can safely layer raw device mapper on top of LVM to work around some (data safety!) "limitations" in LVM.

So, after @DemiMarie's recent work on the volatile volumes, there are only a couple more changes needed to support fully ephemeral disposable VMs, which is: a similar strategy for root and a similar strategy for private.

Ephemerally encrypt root volume changes for disposable VMs (LVM version):

Create a snapshot of the source root volume (as Qubes does now) named vm-disp1234-root-snap. This is where reads of unchanged data will be sourced.
Get LV size and store in $size
Create another empty LVM volume of the same $size with appropriate name, e.g. vm-disp1234-root-snap-encrypted-writes. This is the backing store for encrypted reads/writes of changed data.
Perform cryptsetup plain with /dev/urandom as key layered on top of this new vm-disp1234-root-snap-encrypted-writes, which is called /dev/loopNN. This is the backing store for the snapshot which will provide for plaintext reads/writes of changed data. Using "plain" helps ensure one-to-one block number mappings as well as allowing on size for all volumes in the stack.
Write 32KB of zeros to header of /dev/loopNN before using it as the snapshot store.
Finally, create the non-LVM snapshot volume to pass to the VM: printf '%s\n' "0 $size snapshot vm-disp1234-root-snap /dev/loopNN N 128" | sudo dmsetup create "vm-disp1234-root-snap-cleartext" # device will show up in /dev/mapper
Pass /dev/mapper/vm-disp1234-root-snap-cleartext to VM as root volume.
The 'N' passed to dmsetup tells device mapper this is a non-permanent snapshot. Which means: a) snapshot metadata is in-memory only; b) a one-to-one block # relationships up and down the chain is kept (since there's no metadata header like with 'Y'); c) all volumes along the stack can be exactly the same size; d) there will be no attempt to recreate this device on the next boot.
The '128' is the number of 512 byte blocks per allocation, which is 64KB, as that matches my local thinpool chunk size. Probably should be computed from pool chunk size for LVM for best performance.
After shutdown of the VM, the storage stack would perform a reversal of above steps, disposing of the dm-snapshot device, closing cryptsetup, removing the two LVs.
As per usual with disposable VMs: dom0 boot time volume cleanup is likely necessary to handle unclean VM or dom0 shutdown.

Ephemerally encrypt private volume changes for disposable VMs (LVM version): UPDATE: removed content here because I realized that private volumes for disposable VMs generally aren't empty and should therefore follow the same pattern as root volumes above, sourced from the disposable template (and using different volume/device names of course).

Same as above, but change root to private in all device names.

Template-based encrypted AppVMs: Once something similar to this is implemented for disposables, you can extend to fully encrypted AppVMs, with an additional strategy for retaining the changes to the private volume of AppVMs. The AppVM private volume would not need a raw device mapper snapshot but would still require a cryptsetup layer, either plain (with password/key management handled by qubes) or luks (with minor password handling by qubes) and that device would be passed to the VM. Note that cryptsetup plain has the benefit of one-to-one block mapping and avoiding size adjustments for encrypted vs. non-encrypted volumes.

...and so on.

B

marmarek commented 2 years ago

Ephemerally encrypt root volume changes for disposable VMs (LVM version):

There is an easier way. Switch root volume to read-only - then the VM itself will construct appropriate dm-snapshot device, using (already encrypted) volatile volume for writes. You can do this with qvm-volume config <vmname>:root rw 0. A little problem with that is you need to do it before starting the VM. That's tricky for dynamic disposables, but trivial for static (aka named) ones.

I have some half-backed solution to do the same with private volume, but it requires more work, possibly even another approach.

brendanhoar commented 2 years ago

Hi @marmarek

[link is broken in your above message]

While I am not a fan of domU-controlled ephemeral encryption (in the qubes context, as I worry about compromised domU behavior), it appears that a read-only LV in dom0 prompting the domU to create an internal dm-snapshot does appear to be a safe approach, assuming all other writable volumes are encrypted.

However.

That approach does not handle the general case, e.g. that would fail for a Windows disposable domU, unless a snapshot driver or config were added to QWT, which seems unlikely in the near term. Similar for *BSD. That solution requires built in cooperation from the template.

I do think my proposal is also rather easy. Caveat: but I'm not the one implementing it, of course.

IMO, leaving all of the work to dom0 seems cleaner. It does mean a couple more thin/sparse devices/files in existence during runtime, dom0 creating and tracking them, and a little more post-domU-shutdown or startup cleanup. But it does not require the domU templates to cooperate.

The realization that cryptsetup plain, dmsetup snapshot, or both can be very simply layered on top of the existing/temporary LVs in the main pool, to reach the end result of ephemeral disposable VMs and/or fully encrypted AppVMs/StandaloneVMs was an aha moment earlier this week. The solution is regardless of pool type (thin LV, reflink, etc.).

Of course there could be a critical flaw that I haven't thought of yet.

B

marmarek commented 2 years ago

That approach does not handle the general case, e.g. that would fail for a Windows disposable domU, unless a snapshot driver or config were added to QWT, which seems unlikely in the near term. Similar for *BSD. That solution requires built in cooperation from the template.

That is true. While for some cases, it could be handled inside stubdomain, it doesn't cover all the cases, and is probably the same amount of work as doing it in dom0.

The reason I propose doing it this way, is to avoid even more complexity in dom0. Even with the current setup, the amount of corner cases we've run into while developing it is rather high (all the cases of something failing on cleanup, choosing device names to avoid clashes, correct order of setup and tear down etc). In case of in-VM device, we don't have most of those issues, because there is just one instance inside, and when the VM is off, the impact of something not cleaned up inside the VM is rather minimal (if not completely none). From dom0, we'd just ensure that all the devices are either:

read-only
ephemeral (via encryption)

PS fixed link

marmarek commented 2 years ago

To clarify: the main difference is the impact of a bug in this even more complex setup. In case of in-VM setup, the VM may fail to start, or you may loose some data you write (which isn't really an issue for DispVM). In case of dom0 setup, the impact may include writing plain-text to the wrong place, corrupting state of another VM or worse.

brendanhoar commented 2 years ago

@marmarek -

Under this arrangement, how would the approach be extended to the private volume for disposable VMs? That volume's content is based on the contents of the disposable VM private volume, which is not empty.

If the private snapshot that is passed to the disposable VM is set to readonly in dom0 and a boot-time overlay is also used for the contents of private then, again, a partition on the volatile volume would be the write target? Yes?

I would think, then, that the size of the thin/sparse volatile volume created before starting each disposable would need to be set based on a) an expected maximum swap partition size plus b) the total current size of private plus c) total current size for dom0?

B

anywaydense commented 2 years ago

Here is a possible solution for PVH DispVM https://github.com/anywaydense/QubesEphemerize

UndeadDevel commented 11 months ago

The currently most advanced implementation of this seems to be here.

Rudd-O commented 9 months ago

If the storage layer provides for it, I could add support for encrypted ZFS volumes such that disposable VMs can start with encrypted storage with a throwaway key.

rustybird commented 9 months ago

@Rudd-O

The storage layer already has generic support for encrypting a volume with an ephemeral key. In the context of this issue, the problem is that this not implemented for snap_on_start volumes like the 'root' and especially the 'private' volume of a DisposableVM. (For the 'root' volume the existing workaround is to set it as read-only, rely on in-VM support to redirect writes to it to the 'volatile' volume, and make the non-snap_on_start 'volatile' volume ephemeral instead of 'root'. But there is currently no such in-VM support for the 'private' volume.)

A storage driver can override this generic implementation of ephemeral encryption. Theoretically it could support even snap_on_start volumes. However that would require OpenZFS to be able to clone a zvol across an encryption boundary - i.e. can it clone from zvol A to zvol B where B is encrypted with an ephemeral key, but A is unencrypted (or encrypted with a different key)?

QubesOS / qubes-issues