dm-vdo / kvdo

A kernel module which provides a pool of deduplicated and/or compressed block storage.
GNU General Public License v2.0

wishlist: Encrypted blocks and HMAC-driven deduplication domains to provide volume snapshots. #45

Open ewheelerinc opened 2 years ago

ewheelerinc commented 2 years ago

It has been my goal to have a dm-target (or stack of targets) that provides the following features, and it seems that VDO is close. VDO already supports deduplication, compression, and thin provisioning; it would be nice to see per-volume encryption and snapshots as well, because that could provide encrypted thin volumes with snapshots that deduplicate and compress. Unfortunately, compression, deduplication, and encryption are at odds with each other.

To implement encrypted thin volumes (with different keys per volume) that are compressed throughout the lifecycle of their snapshot history, the possible existing topologies and their shortcomings are as follows:

  1. SCSI => dm-crypt -> dm-vdo -> dm-thin
    • Compresses and deduplicates, but has a single shared dm-crypt key at the bottom; we want per-volume encryption so volumes can be deactivated individually while still deduplicating against other volumes that share the same key.
  2. SCSI => dm-vdo -> dm-thin -> dm-crypt
    • Provides thin provisioning and encryption, but the value of VDO is nullified by the encryption above dm-thin.
  3. SCSI => dm-thin -> encryption -> dm-vdo
    • This is nearly the best option because it supports thin provisioning snapshots with per-volume encryption, deduplication and compression. Unfortunately deduplicated content from thin-snapshot divergence over time is lost because VDO is at the bottom. That is, if you:
      1. Write A
      2. Snapshot
      3. Delete A
      4. Snapshot
      5. Write A
    • Thus you will have two copies of A (when "A" is identical content) because VDO never sees the first and last A as being the same (indeed, no single instance of VDO is "seeing" them).
  4. SCSI => dm-crypt+dm-vdo -> dm-thin
    • If encryption+VDO were implemented above LVM's tdata volume such that each dm-thin pool was compressed and deduplicated by VDO, but LVM could still manage the per-customer pool, then you could have per-customer encryption with per-customer deduplication, but there is an issue that creates a disk usage inefficiency: dm-thin pools are allocated statically; multiple dm-thin pools themselves are not thin, they are statically allocated.

One way to solve this would be to implement everything in a single target (maybe within VDO) that adds snapshotting and encryption as additional features. This might look as follows:

  1. Incorporate thin provisioning using a btree or similar structure:
    • Similar to dm-thin, snapshots generate a snapshot ID
    • The snapshot ID references a logical-block to deduplicated-block mapping.
    • VDO target activation selects some snapshot ID from the same VDO pool
  2. Add encryption around compressed+hashed blocks:
    • Store the hash as an HMAC (or in the clear for higher performance/lower security)
    • Encrypt the data: use the hash as the CBC IV since it is never duplicated (or use some other means of extending a block cipher). Things like ESSIV are not needed because the data is de-duplicated so the mapping index may place the de-duplicated block in multiple volume locations anyway.
  3. All snapshots sharing the same HMAC deduplicate to the same deduplicated-blocks, thus all thin provisioned VDO volumes share the compressed+encrypted block.
  4. Theoretically, two different thin volumes could share the same HMAC for deduplication while having different encryption keys. However, both keys would need to be active, or one volume might fail to decrypt a block that was encrypted with the other volume's key. Practically speaking, each HMAC needs to be paired with its own cipher key.
  5. Different HMACs provide different deduplication domains: one customer might have multiple volumes with one shared key and they would all deduplicate with each other, whereas another customer would have another HMAC, and customer #1's data would be cryptographically isolated from customer #2's data. (See the toy illustration after this list.)
  6. It sounds like the previous list item is equivalent to "SCSI => dm-crypt -> dm-vdo -> dm-thin" above, but the additional benefit here is that the encrypted+deduplicated block-store is a shared resource. Thus we get isolation between organizations with deduplication domains and still get to share the same pool space without static pool allocation, and get all the advantages of deduplication with compression.
  7. Additional caching layers (e.g., dm-cache, dm-writeboost, bcache) can be placed above dm-vdo to hide the processing latency at lower levels if so desired.
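To make items 2–5 concrete, here is a toy userspace sketch using only openssl. Nothing here is VDO code; the keys, file names, and key-derivation choices are invented purely for illustration:

```sh
# Create a 4 KiB "block" of data to play with.
dd if=/dev/urandom of=block.bin bs=4096 count=1 2>/dev/null

# Each deduplication domain has its own HMAC key (e.g., per customer).
CUSTOMER1_HMAC_KEY="customer1-hmac-key"
CUSTOMER2_HMAC_KEY="customer2-hmac-key"

# The same plaintext block gets a different identity in each domain,
# so customer #1's data never deduplicates against customer #2's.
openssl dgst -sha256 -hmac "$CUSTOMER1_HMAC_KEY" block.bin
openssl dgst -sha256 -hmac "$CUSTOMER2_HMAC_KEY" block.bin

# Within one domain, reuse the HMAC to derive a deterministic CBC IV so
# identical blocks encrypt identically and can still deduplicate.
HMAC=$(openssl dgst -sha256 -hmac "$CUSTOMER1_HMAC_KEY" block.bin | awk '{print $NF}')
IV=${HMAC:0:32}                       # first 16 bytes of the HMAC as the IV
CIPHER_KEY=$(printf 'a%.0s' {1..64})  # placeholder 256-bit key in hex

# 4096 bytes is a multiple of the AES block size, so no padding is needed
# and the ciphertext stays the same size as the plaintext block.
openssl enc -aes-256-cbc -K "$CIPHER_KEY" -iv "$IV" -nopad -in block.bin -out block.enc
```

Running the two dgst commands shows two different digests for the same block, which is the cryptographic isolation between domains described in item 5.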

Anyway, this has been on my mind for literally years and I'm just now writing it up. It could be a neat project if someone is interested in extending VDO to support this.

-Eric

raeburn commented 2 years ago

It’s not clear to me what you’re trying to accomplish by using a single unified VDO storage device but separate encryption keys per volume. I expect this is probably for a storage system serving up storage or virtual machines for customers, but if you’re managing the encryption and the customers never get the keys, where does the requirement for different keys per volume (per customer?) come from? Some external contractual or legal requirement?

I’m also not sure I’m following your diagrams correctly for the various example stacks. #1 mentions dm-crypt (on the left) as being at the bottom; #2 mentions encryption (on the right) being above dm-thin (in the middle); #3 mentions vdo (on the right) being at the bottom; etc.

The VDO module itself currently has no concept of volumes or volume management; it has one storage device that it sits on top of, and provides one storage device as its interface. To do otherwise would require VDO to be aware of, at the very least, different regions and some sort of associated key (cryptographic or otherwise), and at the other extreme, a full-blown LVM and dm-crypt implementation integrated into VDO itself (or vice versa). And VDO is already complex enough by itself.

But perhaps the modular approach can still work. Like I said, I’m kind of fuzzy on exactly what your requirements are, but if you really require different encryption keys, perhaps it could work to define multiple VDO devices, each atop a different dm-crypt device. Either assign them their storage up front, or create them from a common LVM volume group with some moderate size and add space to each if it starts to get full. (Be aware that VDO doesn’t support shrinking.) That would increase the storage and memory requirements for the UDS deduplication indexes, unfortunately, as they’d be maintained independently for each volume, but it also means one busy client won’t impact another’s deduplication by writing too much data.

Since VDO is managed via LVM in the latest versions, and LVM doesn’t manage storage encryption directly, it would take a little work to assemble the stack, probably something like this (untested):

• one or more disks/partitions/RAIDs/etc (lowest level)
• format as LVM physical volumes
• LVM volume group
• LVM logical volumes
• for each logical volume: dm-crypt with unique key
• configure as an LVM physical volume and its own volume group
• create an LVM VDO pool and logical volume
• if you need to subdivide, another layer of PV/VG/LVs
• export via SCSI or whatever to clients (top)
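In LVM/cryptsetup terms that might look roughly like the following — equally untested, with made-up device names, VG/LV names, and sizes:

```sh
# Bottom: disks/partitions/RAIDs become LVM physical volumes in one VG.
pvcreate /dev/sdb /dev/sdc
vgcreate vg_backing /dev/sdb /dev/sdc

# One logical volume per encryption domain (per customer).
lvcreate -L 500G -n customer1 vg_backing

# Per-volume dm-crypt with its own key.
cryptsetup luksFormat /dev/vg_backing/customer1
cryptsetup open /dev/vg_backing/customer1 customer1_crypt

# The decrypted device becomes its own PV/VG holding an LVM VDO pool and
# a VDO logical volume; deduplication and compression happen below here.
pvcreate /dev/mapper/customer1_crypt
vgcreate vg_customer1 /dev/mapper/customer1_crypt
lvcreate --type vdo -n customer1_vdo -L 490G -V 2T vg_customer1/vdopool1

# If you need to subdivide, add another PV/VG/LV layer on top of
# /dev/vg_customer1/customer1_vdo, then export the result via SCSI or
# whatever to clients.
```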

Would something like this work for you?

raeburn commented 2 years ago

There’s apparently been some discussion in the Stratis project about letting it manage complex storage stacks that export block devices; currently it only presents file systems. If they extend it in that direction it might make managing a stack like I described a bit easier.

https://github.com/stratis-storage/project/issues/161

ewheelerinc commented 2 years ago

It’s not clear to me what you’re trying to accomplish by using a single unified VDO storage device but separate encryption keys per volume. I expect this is probably for a storage system serving up storage or virtual machines for customers, but if you’re managing the encryption and the customers never get the keys, where does the requirement for different keys per volume (per customer?) come from? Some external contractual or legal requirement?

It allows replication of devices from a dedicated customer's site to a shared multi-tenant site without re-keying and without transporting the plain-text data through a separate encrypted channel. Keying per customer creates an opaque view of the data, enabling it to be shipped without concern for transport or re-encapsulation. Right now we do this per thin LV, but that leaves no option for deduplication or compression without doing something like #1, and then it won't deduplicate across volumes.

In addition, per-volume crypto is precautionary to prevent re-used blocks from one user from being exposed to another user. dm-thin can accomplish this by zeroing newly-allocated blocks, but it is much slower to zero them than to encrypt the volumes with different keys. If a block is re-used for another customer then the decrypt with a different key presents no usable content. (There is a dm-devel list discussion about pre-zeroing blocks so they are quickly available for use without blocking the user, but I don't think anyone is working on that.)

I’m also not sure I’m following your diagrams correctly for the various example stacks. #1 mentions dm-crypt (on the left) as being at the bottom;

Well, it's above SCSI, but it is at the "bottom" of the dm stack, so maybe I could have been more precise.

#2 mentions encryption (on the right) being above dm-thin (in the middle);

Yes, dm-crypt is atop of the dm-thin volume, thus VDO at the "bottom" cannot deduplicate because the volume data is encrypted.

#3 mentions vdo (on the right) being at the bottom; etc.

Ok, I think I understand the confusion: I called the left-most "bottom" and the right-most "top" or "above". Sorry for the confusion.

The VDO module itself currently has no concept of volumes or volume management; it has one storage device that it sits on top of, and provides one storage device as its interface. To do otherwise would require VDO to be aware of, at the very least, different regions and some sort of associated key (cryptographic or otherwise), and at the other extreme, a full-blown LVM and dm-crypt implementation integrated into VDO itself (or vice versa). And VDO is already complex enough by itself.

True, VDO is quite the project.

How do deduplicated allocations map to logical volumes on the target that is exposed to the VDO block user?

What would be involved in having multiple maps?

I'm guessing adding encryption would be simple enough: add some dm target options and wrap the deduplicated "block" in a crypto mechanism that doesn't change the block size. (Of course this is only useful if VDO could support multiple logical volume maps; otherwise one may as well place dm-crypt under dm-vdo.) What do you think? You might even be able to hook existing dm-crypt calls for your blocks, not sure about that though.

By the way, so as to not confuse the meaning of "block" as in "sector", what do you call the chunk of data that is deduplicated and stored+compressed under the hood with VDO?

But perhaps the modular approach can still work. Like I said, I’m kind of fuzzy on exactly what your requirements are, but if you really require different encryption keys, perhaps it could work to define multiple VDO devices, each atop a different dm-crypt device.

That might be the best option, which is #4 from the original post: "If encryption+VDO were implemented above LVM's tdata volume such that each dm-thin pool was compressed and deduplicated by VDO, but LVM could still manage the per-customer pool, then you could have per-customer encryption with per-customer deduplication, but there is an issue that creates a disk usage inefficiency: dm-thin pools are allocated statically; multiple dm-thin pools themselves are not thin, they are statically allocated."

The only way to get thin allocation below VDO would be to have a volume manager within VDO that supports encryption (or stack the VDO volumes atop of thin, but dm-crypt->dm-thin->dm-vdo->dm-thin would be top heavy, slow, and maybe fragile).

Either assign them their storage up front, or create them from a common LVM volume group with some moderate size and add space to each if it starts to get full. (Be aware that VDO doesn’t support shrinking.) That would increase the storage and memory requirements for the UDS deduplication indexes, unfortunately, as they’d be maintained independently for each volume, but it also means one busy client won’t impact another’s deduplication by writing too much data.

Yep, was thinking the same thing. Maybe automate it to 80% full, but as you say, shrinking is not an option so that presents a possible issue for automation if they grow like crazy but didn't mean to. (*)

Since VDO is managed via LVM in the latest versions, and LVM doesn’t manage storage encryption directly, it would take a little work to assemble the stack, probably something like this (untested):

• one or more disks/partitions/RAIDs/etc (lowest level)
• format as LVM physical volumes
• LVM volume group
• LVM logical volumes
• for each logical volume: dm-crypt with unique key
• configure as an LVM physical volume and its own volume group
• create an LVM VDO pool and logical volume
• if you need to subdivide, another layer of PV/VG/LVs
• export via SCSI or whatever to clients (top)

Would something like this work for you?

I think so. The only issue, then, is noted above with the (*).

raeburn commented 2 years ago

It’s not clear to me what you’re trying to accomplish by using a single unified VDO storage device but separate encryption keys per volume. I expect this is probably for a storage system serving up storage or virtual machines for customers, but if you’re managing the encryption and the customers never get the keys, where does the requirement for different keys per volume (per customer?) come from? Some external contractual or legal requirement?

It allows replication of devices from a dedicated customer's site to a shared multi-tenant site without re-keying and without transporting the plain-text data through a separate encrypted channel. Keying per customer creates an opaque view of the data, enabling it to be shipped without concern for transport or re-encapsulation. Right now we do this per thin LV, but that leaves no option for deduplication or compression without doing something like #1, and then it won't deduplicate across volumes.

This sounds like it assumes direct access to the encrypted version of the storage (and perhaps snapshots?), as a discrete device. Which, if you’re bundling encryption together with the volume management and deduplication, might not be a given.

In addition, per-volume crypto is precautionary to prevent re-used blocks from one user from being exposed to another user. dm-thin can accomplish this by zeroing newly-allocated blocks, but it is much slower to zero them than to encrypt the volumes with different keys. If a block is re-used for another customer then the decrypt with a different key presents no usable content. (There is a dm-devel list discussion about pre-zeroing blocks so they are quickly available for use without blocking the user, but I don't think anyone is working on that.)

If you don’t expose any information about deduplication stats, what’s exposed? (VDO does confirm that the blocks match before unifying the mappings, it doesn’t just go by hashes.) If customer #1 stores a block “abc” and customer #2 stores a block “abc”, and VDO maps them to the same location, what information does either customer get? As far as I can see, there’s (1) possibly some subtle timing differences, though we acknowledge blocks before writing to disk (think “RAM cache”) so you’d need an fsync on each block (or opening O_SYNC) to observe it, and (2) success or failure of a write could differ if the VDO backing storage is full, no one else is using/freeing space, you’re not adding new storage to the system, etc.

The VDO module itself currently has no concept of volumes or volume management; it has one storage device that it sits on top of, and provides one storage device as its interface. To do otherwise would require VDO to be aware of, at the very least, different regions and some sort of associated key (cryptographic or otherwise), and at the other extreme, a full-blown LVM and dm-crypt implementation integrated into VDO itself (or vice versa). And VDO is already complex enough by itself.

True, VDO is quite the project.

How do deduplicated allocations map to logical volumes on the target that is exposed to the VDO block user?

VDO implements an address mapping layer that uses an on-disk tree structure (multiple parallel trees, actually, so each “zone” can be processed in a dedicated thread based on logical block address, without having to lock a single, global structure). There’s also a table of reference counts maintained per physical address. The tree structure is filled in as needed, but the allocations are permanent; empty parts of the tree (cleared out via trim/discard operations) aren’t removed.

What would be involved in having multiple maps?

Multiple address mapping trees with multiple roots on disk, carrying around some sort of map id number everywhere we carry an LBA now, or perhaps doing some sort of extra mapping layer where the map id number and LBA are combined into a single 64-bit value. The ability to prune parts of this uber-tree that have been marked for deletion (entire volumes), and actually free the storage, including recovering from crashes mid-operation. All the bookkeeping associated with keeping a list of volume names and ids and metadata. Probably more stuff. And that’s without snapshot support.

We’ve already got LVM which does some of this stuff (and dm-crypt for the encryption bits), but we can’t just grab the code and plug it in, as there’s no interface that we’re all coding to where the pieces all plug in. Or, well, there sort of is, but it’s the block device interface…

I'm guessing adding encryption would be simple enough: add some dm target options and wrap the deduplicated "block" in a crypto mechanism that doesn't change the block size. (Of course this is only useful if VDO could support multiple logical volume maps; otherwise one may as well place dm-crypt under dm-vdo.) What do you think? You might even be able to hook existing dm-crypt calls for your blocks, not sure about that though.

Then the volume management needs per-volume key storage too. And then you get questions of whether you want just one key, or the one “real” encryption key plus one or more user-facing password-based keys which unlock the real key via some table somewhere, dealing with changing passwords, etc. Many of the UI issues that cryptsetup has to deal with then become VDO’s as well.

By the way, so as to not confuse the meaning of "block" as in "sector", what do you call the chunk of data that is deduplicated and stored+compressed under the hood with VDO?

We refer to the 4kB data chunk we operate on for deduplication and compression as a “block”, made up of 8 “sectors”. Once upon a time VDO’s block size was variable, but for years it’s been fixed at 4kB, which I’m told experiments had indicated tended to be the optimal choice for deduplication.

But perhaps the modular approach can still work. Like I said, I’m kind of fuzzy on exactly what your requirements are, but if you really require different encryption keys, perhaps it could work to define multiple VDO devices, each atop a different dm-crypt device.

That might be the best option, which is #4 from the original post: "If encryption+VDO were implemented above LVM's tdata volume such that each dm-thin pool was compressed and deduplicated by VDO, but LVM could still manage the per-customer pool, then you could have per-customer encryption with per-customer deduplication, but there is an issue that creates a disk usage inefficiency: dm-thin pools are allocated statically; multiple dm-thin pools themselves are not thin, they are statically allocated."

The only way to get thin allocation below VDO would be to have a volume manager within VDO that supports encryption (or stack the VDO volumes atop of thin, but dm-crypt->dm-thin->dm-vdo->dm-thin would be top heavy, slow, and maybe fragile).

Ah. Yes.

Though, I think dm-thin is currently not recommended on top of VDO, or at least you have to be very careful with it, because it assumes it can always write to its backing storage. With VDO, whether you can write, even to a block you’ve previously written to, may depend on the content of the block you’re now writing, if you’re short on storage. If it’s a block of zeros, or it duplicates an existing block, you’re fine, but if it’s something new, you might find you can’t overwrite the previous content! If you ensure that you can always grow the VDO backing storage to keep ahead of dm-thin, then it should be okay. I could see a desire for ENOSPC to never be returned if there’s extra space available for growing the VDO device, but doing a generic sort of call-out during a low-level allocation failure is tricky. We can generate warnings in advance that can trigger actions, though.

Thin allocation below VDO is also not good -- if you give VDO storage of N gigabytes it assumes all of that storage is immediately available. You can extend it later, if you use something like LVM logical volumes (dm-linear), but if random blocks in the middle are found to be unavailable, VDO will be very unhappy.

Either assign them their storage up front, or create them from a common LVM volume group with some moderate size and add space to each if it starts to get full. (Be aware that VDO doesn’t support shrinking.) That would increase the storage and memory requirements for the UDS deduplication indexes, unfortunately, as they’d be maintained independently for each volume, but it also means one busy client won’t impact another’s deduplication by writing too much data.

Yep, was thinking the same thing. Maybe automate it to 80% full, but as you say, shrinking is not an option so that presents a possible issue for automation if they grow like crazy but didn't mean to. (*)

Yeah, it’s something I’ve pondered before, but since the address mapping is unidirectional, the only way to implement shrinking is to scan the logical address space for anything mapping into the space you want to clear out, and copy out any such blocks you find. That’s a slow operation, and not currently on our roadmap.

Since VDO is managed via LVM in the latest versions, and LVM doesn’t manage storage encryption directly, it would take a little work to assemble the stack, probably something like this (untested):

• one or more disks/partitions/RAIDs/etc (lowest level)
• format as LVM physical volumes
• LVM volume group
• LVM logical volumes
• for each logical volume: dm-crypt with unique key
• configure as an LVM physical volume and its own volume group
• create an LVM VDO pool and logical volume
• if you need to subdivide, another layer of PV/VG/LVs
• export via SCSI or whatever to clients (top)

Would something like this work for you?

I think so. The only issue, then, is noted above with the (*).

KJ7LNW commented 2 years ago

[...] per-volume crypto is precautionary to prevent re-used blocks from one user from being exposed to another user. dm-thin can accomplish this by zeroing newly-allocated blocks, but it is much slower to zero them than to encrypt the volumes with different keys. If a block is re-used for another customer then the decrypt with a different key presents no usable content. (There is a dm-devel list discussion about pre-zeroing blocks so they are quickly available for use without blocking the user, but I don't think anyone is working on that.)

If you don’t expose any information about deduplication stats, what’s exposed?

VDO may not have the same issue as dm-thin. I'm referring to this scenario in dm-thin when block-zeroing is disabled; block-zeroing zeroes a 64k block before allocating it in the btree and providing it to the user, but if turned off and:

  1. User writes a single sector to an unallocated 64k block
  2. User reads the whole 64k block
  3. User can sniff leaked data from reclaimed space originally written by other volumes outside of the single sector they wrote.

Placing dm-crypt atop of all dm-thin volumes solves this because the data reaped in #2 was encrypted with one key but decrypted with another key (plus ESSIV mismatch and such). Turning block-zeroing on fixes this too, but in our experience the 64k-zeroing churn IO overhead is far worse than the CPU overhead added by dm-crypt.

(VDO does confirm that the blocks match before unifying the mappings, it doesn’t just go by hashes.)

Heh, sounds like a heated discussion Linus Torvalds once had about git hash collisions. Anyway, I agree that it's possible, even intentionally with MD5. However, it might be a nice feature to turn off this validation to prevent the read-before-write overhead. If the hash is configurable, say SHA256, then the birthday problem says a collision only becomes likely after on the order of 2^128 blocks have been written---which would be a lot of writes!
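For reference, the standard birthday estimate for a b-bit hash is

$$\Pr[\text{collision among } N \text{ random blocks}] \approx 1 - e^{-N^2/2^{b+1}},$$

so with a 256-bit hash the probability only becomes appreciable somewhere around $N \approx 2^{128}$ distinct blocks.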

The VDO module itself currently has no concept of volumes or volume management; it has one storage device that it sits on top of, and provides one storage device as its interface. To do otherwise would require VDO to be aware of, at the very least, different regions and some sort of associated key (cryptographic or otherwise), and at the other extreme, a full-blown LVM and dm-crypt implementation integrated into VDO itself (or vice versa). And VDO is already complex enough by itself.

True, VDO is quite the project. How do deduplicated allocations map to logical volumes on the target that is exposed to the VDO block user?

VDO implements an address mapping layer that uses an on-disk tree structure (multiple parallel trees, actually, so each “zone” can be processed in a dedicated thread based on logical block address, without having to lock a single, global structure). There’s also a table of reference counts maintained per physical address. The tree structure is filled in as needed, but the allocations are permanent; empty parts of the tree (cleared out via trim/discard operations) aren’t removed.

Interesting.

I'm guessing adding encryption would be simple enough: add some dm target options and wrap the deduplicated "block" in a crypto mechanism that doesn't change the block size. (Of course this is only useful if VDO could support multiple logical volume maps; otherwise one may as well place dm-crypt under dm-vdo.) What do you think? You might even be able to hook existing dm-crypt calls for your blocks, not sure about that though.

Then the volume management needs per-volume key storage too. And then you get questions of whether you want just one key, or the one “real” encryption key plus one or more user-facing password-based keys which unlock the real key via some table somewhere, dealing with changing passwords, etc. Many of the UI issues that cryptsetup has to deal with then become VDO’s as well.

I just meant pass a key big enough for the cipher algorithm on the table line, no need for password hashing (that's for userspace to figure out). Counter-mode wouldn't grow the existing block size and it sounds like you might already have a monotonically-increasing counter if the tree is never freed.

For now dm-crypt is sufficient unless (someday) VDO decides to add multiple thin volumes per VDO backend, in which case it would be useful.

But perhaps the modular approach can still work. Like I said, I’m kind of fuzzy on exactly what your requirements are, but if you really require different encryption keys, perhaps it could work to define multiple VDO devices, each atop a different dm-crypt device. That might be the best option, which is #4 from the original post. The only way to get thin allocation below VDO would be to have a volume manager within VDO that supports encryption (or stack the VDO volumes atop of thin, but dm-crypt->dm-thin->dm-vdo->dm-thin would be top heavy, slow, and maybe fragile).

Ah. Yes.

Though, I think dm-thin is currently not recommended on top of VDO, or at least you have to be very careful with it, because it assumes it can always write to its backing storage. With VDO, whether you can write, even to a block you’ve previously written to, may depend on the content of the block you’re now writing, if you’re short on storage. If it’s a block of zeros, or it duplicates an existing block, you’re fine, but if it’s something new, you might find you can’t overwrite the previous content! If you ensure that you can always grow the VDO backing storage to keep ahead of dm-thin, then it should be okay. I could see a desire for ENOSPC to never be returned if there’s extra space available for growing the VDO device, but doing a generic sort of call-out during a low-level allocation failure is tricky. We can generate warnings in advance that can trigger actions, though.

Thin allocation below VDO is also not good -- if you give VDO storage of N gigabytes it assumes all of that storage is immediately available. You can extend it later, if you use something like LVM logical volumes (dm-linear), but if random blocks in the middle are found to be unavailable, VDO will be very unhappy.

Are you referring to any failure scenarios other than out of disk space issues? Of course that is always a concern with thin volumes and monitoring is critical.

DemiMarie commented 2 years ago

What would be involved in having multiple maps?

Multiple address mapping trees with multiple roots on disk, carrying around some sort of map id number everywhere we carry an LBA now, or perhaps doing some sort of extra mapping layer where the map id number and LBA are combined into a single 64-bit value. The ability to prune parts of this uber-tree that have been marked for deletion (entire volumes), and actually free the storage, including recovering from crashes mid-operation. All the bookkeeping associated with keeping a list of volume names and ids and metadata. Probably more stuff. And that’s without snapshot support.

If one implements all of this, how much more is there left to do before one winds up with a complete filesystem? Serious question, because this seems to be practically the same feature set as ZFS zvols or of loop devices on top of bcachefs.

KJ7LNW commented 2 years ago

If one implements all of this, how much more is there left to do before one winds up with a complete filesystem? Serious question, because this seems to be practically the same feature set as ZFS zvols or of loop devices on top of bcachefs.

Interesting that you should say that: Looking seriously at bcachefs, but loop devices are serialized through a single work queue which isn't very good for performance. Native bcachefs block device exports would be ideal.

Perhaps it's time to look at ZFS again, too. I've not played with it in years but it's come a long way since 0.6.

DemiMarie commented 2 years ago

Main problem with ZFS is that (IIRC) the entire deduplication index has to fit in memory.

DemiMarie commented 2 years ago

If one implements all of this, how much more is there left to do before one winds up with a complete filesystem? Serious question, because this seems to be practically the same feature set as ZFS zvols or of loop devices on top of bcachefs.

Interesting that you should say that: Looking seriously at bcachefs, but loop devices are serialized through a single work queue which isn't very good for performance. Native bcachefs block device exports would be ideal.

Could VDO replace dm-thin and dm-crypt? Both seem to be special cases of an enhanced VDO to me: dm-thin would only use thin provisioning and snapshotting, and dm-crypt would only use encryption.

sweettea commented 2 years ago

What would be involved in having multiple maps?

Multiple address mapping trees with multiple roots on disk, carrying around some sort of map id number everywhere we carry an LBA now, or perhaps doing some sort of extra mapping layer where the map id number and LBA are combined into a single 64-bit value. The ability to prune parts of this uber-tree that have been marked for deletion (entire volumes), and actually free the storage, including recovering from crashes mid-operation. All the bookkeeping associated with keeping a list of volume names and ids and metadata. Probably more stuff. And that’s without snapshot support.

If one implements all of this, how much more is there left to do before one winds up with a complete filesystem? Serious question, because this seems to be practically the same feature set as ZFS zvols or of loop devices on top of bcachefs.

It strikes me as also similar to the featureset of btrfs: as far as I know, btrfs has 'offline' dedupe, integrated checksumming, compression, and is working on gaining encryption support.

The remaining difference between a theoretical VDO with those additional features and a VDO filesystem would indeed be the ability to store files, I think.

KJ7LNW commented 2 years ago

Snapshotted files in btrfs can be pretty slow because of COW; we snapshot hourly. If VDO could support multiple volume mappings for snapshots then it might be a good alternative to dm-thin and btrfs+loop.

We tried btrfs+subvolumes+loop in place of dm-thin but it wasn't usable. Ultimately dm-thin provides the best COW snapshot performance we have found, though I've not tried ZFS zvols since 0.6.x so I'm not sure how it performs these days. As @DemiMarie points out, if the dedupe index has to be in-memory then that would be too much memory overhead for our use case.

corwin commented 2 years ago

Sorry for the late reply, but I was just reading back through this thread and saw some things that could use some clarification:

[...] per-volume crypto is precautionary to prevent re-used blocks from one user from being exposed to another user. dm-thin can accomplish this by zeroing newly-allocated blocks, but it is much slower to zero them than to encrypt the volumes with different keys. If a block is re-used for another customer then the decrypt with a different key presents no usable content. (There is a dm-devel list discussion about pre-zeroing blocks so they are quickly available for use without blocking the user, but I don't think anyone is working on that.)

If you don’t expose any information about deduplication stats, what’s exposed?

VDO may not have the same issue as dm-thin. I'm referring to this scenario in dm-thin when block-zeroing is disabled; block-zeroing zeroes a 64k block before allocating it in the btree and providing it to the user, but if turned off and:

1. User writes a single sector to an unallocated 64k block

2. User reads the whole 64k block

3. User can sniff leaked data from reclaimed space originally written by other volumes outside of the single sector they wrote.

Depending upon how you are using VDO, it may or may not have this issue. In a scenario where you have a single VDO with multiple logical volumes on top of it, and in which deleting one of those logical volumes does not issue discards for the address space of the deleted volume, and then a subsequent logical volume reuses the (VDO) logical address space for a new volume, a user could attempt to simply read their new volume and would see whatever data had been written there before.

In a scenario more like what Ken has been proposing, where each volume has its own VDO, this will never be a problem. If you make a VDO on some storage, and then destroy it and make a new one without zeroing the storage, the new VDO will not think any of the old blocks are mapped and will always return zeros.

(VDO does confirm that the blocks match before unifying the mappings, it doesn’t just go by hashes.)

Heh, sounds like a heated discussion Linus Torvalds once had about git hash collisions. Anyway, I agree that it's possible, even intentionally with MD5. However, it might be a nice feature to turn off this validation to prevent the read-before-write overhead. If the hash is configurable, say SHA256, then the birthday problem says a collision only becomes likely after on the order of 2^128 blocks have been written---which would be a lot of writes!

Relying on the birthday paradox to save you from collisions has two big performance issues. The first is that it means you need to use a hash like SHA256 rather than the MurmurHash3 we use now. This is a much more expensive hash to compute.

The bigger issue occurs the moment you overwrite a block. Now the index has two entries, for two different hashes, that point at the same physical block: one for the old data and one for the new. If you then write a new copy of the old data, the index will point it at that physical block, which now holds the new data. This is bad, and doesn't involve a hash collision. The way VDO solves this problem is by treating the answers from the index as "advice" and verifying that the data matches before actually sharing a reference.

There are two other ways we could have chosen to deal with this. One would be to read and recompute the hash of the old data on overwrite, and then remove the old entry from the index. This is worse than the choice we made since it results in reads for every overwrite instead of only for overwrites where the index returns a match. The other solution is to write down the full hashes somewhere. The deduplication index VDO uses is a one-way mapping; there's no way to look up a hash by address, and the particular details of what make that index performant can't provide a reverse mapping. Because of the limitations of block storage, the hashes can't be directly attached to the data either. So there would need to be another metadata structure mapping VDO physical blocks to content hashes, and this structure would need to be updated on every write. Having to make additional metadata updates on every write, rather than only when the index returns advice for a given block, is both more complicated and often less performant than the verification which VDO does.

The VDO module itself currently has no concept of volumes or volume management; it has one storage device that it sits on top of, and provides one storage device as its interface. To do otherwise would require VDO to be aware of, at the very least, different regions and some sort of associated key (cryptographic or otherwise), and at the other extreme, a full-blown LVM and dm-crypt implementation integrated into VDO itself (or vice versa). And VDO is already complex enough by itself.

True, VDO is quite the project. How do deduplicated allocations map to logical volumes on the target that is exposed to the VDO block user?

VDO implements an address mapping layer that uses an on-disk tree structure (multiple parallel trees, actually, so each “zone” can be processed in a dedicated thread based on logical block address, without having to lock a single, global structure). There’s also a table of reference counts maintained per physical address. The tree structure is filled in as needed, but the allocations are permanent; empty parts of the tree (cleared out via trim/discard operations) aren’t removed.

Interesting.

* Is the usable disk space reclaimed?

Mostly. To clarify, VDO uses a radix tree to maintain logical-to-physical block mappings. This tree and the physical blocks it describes are both allocated from the same storage. Once a block is allocated as part of the block map, it will never be freed, even if all of the logical addresses it describes get discarded. The storage from this pool that is used to hold actual user data (rather than the mapping tree) does get reused when that data is overwritten or discarded.

* Does this mean the tree can run out of room over time?

Yes, but only in scenarios where the VDO is extremely close to or already out of space, in which case the writes which fail for lack of block map allocations would have failed anyway.

corwin commented 2 years ago

If one implements all of this, how much more is there left to do before one winds up with a complete filesystem? Serious question, because this seems to be practically the same feature set as ZFS zvols or of loop devices on top of bcachefs.

Interesting that you should say that: Looking seriously at bcachefs, but loop devices are serialized through a single work queue which isn't very good for performance. Native bcachefs block device exports would be ideal.

Could VDO replace dm-thin and dm-crypt? Both seem to be special cases of an enhanced VDO to me: dm-thin would only use thin provisioning and snapshotting, and dm-crypt would only use encryption.

From a purely theoretical standpoint, this is certainly possible.

From a practical standpoint, it isn't a good idea. Each of these layers does different things and they are designed to do those things rather than doing lots of things together. It is very hard to combine all of this disparate functionality into a single layer and still maintain correctness and performance. Rather than combining things into a single layer, I think adding mechanisms which allow different layers of the stack to communicate better with each other would be more fruitful.

From a political/social standpoint, I think this is (nearly) impossible. The Linux block device community has been committed to a modular approach for a very long time. I think it would be very difficult to convince that community to adopt a new paradigm at this juncture, particularly one which leads to bigger and more complicated pieces.

DemiMarie commented 2 years ago

Sorry for the late reply, but I was just reading back through this thread and saw some things that could use some clarification:

[...] per-volume crypto is precautionary to prevent re-used blocks from one user from being exposed to another user. dm-thin can accomplish this by zeroing newly-allocated blocks, but it is much slower to zero them than to encrypt the volumes with different keys. If a block is re-used for another customer then the decrypt with a different key presents no usable content. (There is a dm-devel list discussion about pre-zeroing blocks so they are quickly available for use without blocking the user, but I don't think anyone is working on that.)

If you don’t expose any information about deduplication stats, what’s exposed?

VDO may not have the same issue as dm-thin. I'm referring to this scenario in dm-thin when block-zeroing is disabled; block-zeroing zeroes a 64k block before allocating it in the btree and providing it to the user, but if turned off and:

  1. User writes a single sector to an unallocated 64k block

  2. User reads the whole 64k block

  3. User can sniff leaked data from reclaimed space originally written by other volumes outside of the single sector they wrote.

Depending upon how you are using VDO, it may or may not have this issue. In a scenario where you have a single VDO with multiple logical volumes on top of it, and in which deleting one of those logical volumes does not issue discards for the address space of the deleted volume, and then a subsequent logical volume reuses the (VDO) logical address space for a new volume, a user could attempt to simply read their new volume and would see whatever data had been written there before.

The problem that arises with dm-thin is a combination of slow (and possibly unreliable) discards and a 64K minimum block size. The first means that zero-on-free is unreliable. The second means that writes can be smaller than the block size, so the rest of the block needs to be zeroed. This results in a performance hit. Qubes OS is also affected by this.

In a scenario more like what Ken has been proposing, where each volume has its own VDO, this will never be a problem. If you make a VDO on some storage, and then destroy it and make a new one without zeroing the storage, the new VDO will not think any of the old blocks are mapped and will always return zeros.

The problem with that is that it loses system-wide thin provisioning. One would need to put dm-thin underneath VDO.

(VDO does confirm that the blocks match before unifying the mappings, it doesn’t just go by hashes.)

Heh, sounds like a heated discussion Linus Torvalds once had about git hash collisions. Anyway, I agree that it's possible, even intentionally with MD5. However, it might be a nice feature to turn off this validation to prevent the read-before-write overhead. If the hash is configurable, say SHA256, then the birthday problem says a collision only becomes likely after on the order of 2^128 blocks have been written---which would be a lot of writes!

Relying on the birthday paradox to save you from collisions has two big performance issues. The first is that it means you need to use a hash like SHA256 rather than the MurmurHash3 we use now. This is a much more expensive hash to compute.

It is also very much desirable in some applications, and may be necessary for multi-tenant use. (Otherwise you run into side-channel attacks via hash collisions.) SHA256 is hardware accelerated on some architectures, which should help.

DemiMarie commented 2 years ago

From a practical standpoint, it isn't a good idea. Each of these layers does different things and they are designed to do those things rather than doing lots of things together. It is very hard to combine all of this disparate functionality into a single layer and still maintain correctness and performance. Rather than combining things into a single layer, I think adding mechanisms which allow different layers of the stack to communicate better with each other would be more fruitful.

What would these mechanisms look like?

ewheelerinc commented 2 years ago

In a scenario more like what Ken has been proposing, where each volume has its own VDO, this will never be a problem. If you make a VDO on some storage, and then destroy it and make a new one without zeroing the storage, the new VDO will not think any of the old blocks are mapped and will always return zeros.

The problem with that is that it loses system-wide thin provisioning. One would need to put dm-thin underneath VDO.

Hi @DemiMarie,

You could put VDO under the dm-thin pool, but then you lose per-volume encryption (because encryption invalidates dedupe and compression); you would still have the dm-thin block-size and discard issues.

FYI in case it is useful: while dm-thin has its issues with block zeroing, we have had success with per-volume LUKS encryption above each thin volume, using --allow-discards in LUKS and turning off block zeroing in dm-thin. This provides a dm-thin performance boost by avoiding the zero-before-use issue (which is synchronous and slow, not pre-zeroed).
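A minimal sketch of that kind of arrangement (placeholder names and sizes, not our exact configuration) would be something like:

```sh
# Thin pool with block zeroing disabled (-Zn) to avoid the synchronous
# zero-before-use cost discussed above.
lvcreate --type thin-pool -L 1T -Zn -n pool0 vg0

# One thin volume per customer.
lvcreate -V 100G -n customer1 vg0/pool0

# Per-volume LUKS above the thin LV, with discards passed down so freed
# space can still be reclaimed by the pool.
cryptsetup luksFormat /dev/vg0/customer1
cryptsetup open --allow-discards /dev/vg0/customer1 customer1_crypt

# The filesystem (or exported guest disk) then sits on
# /dev/mapper/customer1_crypt.
```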

It avoids data leak when you use different keys per volume (and, even with the same key, the re-used XTS tweak or CBC-ESSIV is very likely to be re-allocated to a different logical volume sector so it wouldn't line up). However, be aware of the snapshot issues detailed in this old mailing list that require serializing dm-crypt: https://listman.redhat.com/archives/dm-devel/2016-September/028125.html

See "item number 3" in the original post (this github #45 issue) which started this whole discussion about wanting per-volume encrypted+deduplicated+compressed thin volumes.

You might try ZFS zvols (which we would prefer not to use since they are out of tree), or jump on the bcachefs mailing list and encourage native bcachefs volumes since bcachefs will likely find its way upstream someday. Here is the beginning of an idea, but it's not yet implemented:

DemiMarie commented 2 years ago

When it comes to snapshots, there are a couple of areas where I would like to see different design choices made than those made by dm-thin:

corwin commented 2 years ago

Depending upon how you are using VDO, it may or may not have this issue. In a scenario where you have a single VDO with multiple logical volumes on top of it, and in which deleting one of those logical volumes does not issue discards for the address space of the deleted volume, and then a subsequent logical volume reuses the (VDO) logical address space for a new volume, a user could attempt to simply read their new volume and would see whatever data had been written there before.

The problem that arises with dm-thin is a combination of slow (and possibly unreliable) discards and a 64K minimum block size. The first means that zero-on-free is unreliable. The second means that writes can be smaller than the block size, so the rest of the block needs to be zeroed. This results in a performance hit. Qubes OS is also affected by this.

VDO's discards are not unreliable, and the minimum block size is either 4K or 512 bytes in 512e mode. However, VDO's discards can be quite slow (something which is definitely on our road map to fix).

In a scenario more like what Ken has been proposing, where each volume has its own VDO, this will never be a problem. If you make a VDO on some storage, and then destroy it and make a new one without zeroing the storage, the new VDO will not think any of the old blocks are mapped and will always return zeros.

The problem with that is that it loses system-wide thin provisioning. One would need to put dm-thin underneath VDO.

Not necessarily. In a scenario like Ken's where each encryption domain has its own VDO, there's no reason why all of the system's storage needs to be allocated when each VDO is created. Each of them can be created minimally, and as they fill up, storage can be added as needed. It would require some careful monitoring to ensure that none of the VDOs runs out of space as long as there is some storage available, but it should be a workable solution. And each of the individual VDOs will likely get more data reduction from deduplication than a single VDO processing simultaneous inputs from different encryption domains would, since the different domains can't deduplicate against each other.

(VDO does confirm that the blocks match before unifying the mappings, it doesn’t just go by hashes.)

Heh, sounds like a heated discussion Linus Torvalds once had about git hash collisions. Anyway, I agree that it's possible, even intentionally with MD5. However, it might be a nice feature to turn off this validation to prevent the read-before-write overhead. If the hash is configurable, say SHA256, then the birthday problem says a collision only becomes likely after on the order of 2^128 blocks have been written---which would be a lot of writes!

Relying on the birthday paradox to save you from collisions has two big performance issues. The first is that it means you need to use a hash like SHA256 rather than the MurmurHash3 we use now. This is a much more expensive hash to compute.

It is also very much desirable in some applications, and may be necessary for multi-tenant use. (Otherwise you run into side-channel attacks via hash collisions.) SHA256 is hardware accelerated on some architectures, which should help.

While some hardware has SHA256 acceleration, that isn't universally true. Furthermore, the cost of having to read and rehash every block you overwrite or delete is much more expensive than the hash itself, and no hardware accelerator will help with that.

Given that VDO does not rely on the index for correctness, what type of side-channel attack do you fear? (This is something we've put a lot of thought into, and the only attack we are aware of is that it is possible to knowingly generate data blocks with hash collisions, which could allow an attacker to ruin a VDO's dedupe efficiency. However, to conduct that attack, the attacker would need to have write access to the VDO, and in that case, they could just as easily ruin the dedupe efficiency by just writing random data.)

corwin commented 2 years ago

From a practical standpoint, it isn't a good idea. Each of these layers does different things and they are designed to do those things rather than doing lots of things together. It is very hard to combine all of this disparate functionality into a single layer and still maintain correctness and performance. Rather than combining things into a single layer, I think adding mechanisms which allow different layers of the stack to communicate better with each other would be more fruitful.

What would these mechanisms look like?

That is a very good question for which I don't currently have a good answer. I have started to have conversations with some of the people who work on other parts of the stack about what sorts of things would be useful to communicate between different layers.

I can say, for example, that from VDO's point of view, it would be useful to be able to tell the layers above that when the VDO is full, not only can they not issue writes to previously unallocated space, but they can't assume that overwrites of existing allocations will succeed.

It would also be useful for layers above VDO to be able to tell VDO, "don't bother trying to optimize this write," either because it is something like journal metadata, or because some layer (probably an application or filesystem) knows what type of data it is and that it isn't likely to deduplicate or compress.

DemiMarie commented 2 years ago

That is a very good question for which I don't currently have a good answer. I have started to have conversations with some of the people who work on other parts of the stack about what sorts of things would be useful to communicate between different layers.

One thing that I have wanted to see is “here is how much space you actually have”, so that thin pools can fail gracefully.

DemiMarie commented 2 years ago

It avoids data leak when you use different keys per volume (and, even with the same key, the re-used XTS tweak or CBC-ESSIV is very likely to be re-allocated to a different logical volume sector so it wouldn't line up). However, be aware of the snapshot issues detailed in this old mailing list that require serializing dm-crypt:

Does this problem only arise with snapshots, or can it happen without snapshots?

KJ7LNW commented 2 years ago

It avoids data leak when you use different keys per volume (and, even with the same key, the re-used XTS tweak or CBC-ESSIV is very likely to be re-allocated to a different logical volume sector so it wouldn't line up). However, be aware of the snapshot issues detailed in this old mailing list that require serializing dm-crypt: https://listman.redhat.com/archives/dm-devel/2016-September/028125.html

Does this problem only arise with snapshots, or can it happen without snapshots?

It may not trigger a problem in all cases. Somewhere along the line, when REQ_FLUSH was refactored as REQ_OP_FLUSH, the kernel stopped making REQ_OP_FLUSH a write barrier, so it can be re-ordered w.r.t. writes.

While I've not tested this, there is a possibility that when:

  1. A virtual machine runs a pre-barrier kernel (circa 2016), and
  2. the hypervisor runs a post-barrier kernel and re-orders flushes with writes, and
  3. the hypervisor snapshots dm-thin or another thin technology (ZFS?), (or hard-crashes, because a snapshot is effectively a moment-in-time filesystem crash) then:

a filesystem in the VM might get confirmation of a flush before the writes intended to be isolated by the barrier have completed.

In the list email linked above this resulted in journal entries failing to commit before the metadata change and created orphaned files in ext4.

Theoretically, any post-barrier filesystem in a VM should work fine without serialized dm-crypt options...but again, I've not tested.

If you would like to test, that would be awesome!

  1. Create a VM and pass a direct-io disk to it that is backed by dm-thin.
  2. Format the whole volume directly so it is easy to mount in the hypervisor (i.e., don't format a partition).
  3. Do lots of file create/delete IOs simultaneously on the VM. There used to be some good NNTP/SMTP IO benchmarks out there for this, haven't looked in a while.
  4. Run several snapshots on the hypervisor
  5. Check for corruption on each snapshot (a rough command sketch follows this list):
    1. Mount the filesystem to replay the journal
    2. umount the filesystem
    3. fsck the snapshot and see if there are any errors. If it works, then there should NOT be any fsck errors like "deleted inode referenced" in ANY of the snapshots.
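A rough, untested sketch of the hypervisor-side commands for steps 4 and 5, assuming the guest disk is a thin LV called vg0/vmdisk formatted ext4 as a whole disk:

```sh
mkdir -p /mnt/snap

for i in 1 2 3; do
    # Step 4: snapshot the thin volume while the guest keeps writing.
    lvcreate -s -n vmdisk_snap$i vg0/vmdisk
    lvchange -ay -K vg0/vmdisk_snap$i    # thin snapshots skip activation by default

    # Step 5: mount once to replay the journal, unmount, then fsck.
    mount /dev/vg0/vmdisk_snap$i /mnt/snap
    umount /mnt/snap
    fsck.ext4 -f -n /dev/vg0/vmdisk_snap$i   # should show no errors such as
                                             # "deleted inode referenced"
    sleep 60                                 # spread the snapshots out a bit
done
```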

Does this problem only arise with snapshots, or can it happen without snapshots?

To directly answer your question since the post above is getting long: If this bug does still exist, then it can happen without snapshots in the case of a hard crash (reset, power loss, kernel panic, etc).

DemiMarie commented 1 year ago

The bigger issue occurs the moment you overwrite a block.

The only other open-source storage stack I know of that provides deduplication is ZFS, and it does so by never overwriting blocks in-place. Instead, ZFS is permanently copy-on-write. This is also a requirement for performant parity-based RAID (otherwise one needs a data journal) and for temporal (as opposed to merely spacial) integrity protection.