all: add 'copy' mount option

allisonkarlitskaya commented 2 months ago

This copies the content of the named composefs image into a sealed memfd and mounts that, instead of using the underlying file.

The main benefit here is that the underlying filesystem isn't pinned by having one of its files used as the basis of the loopback device. This is useful if you want to embed a composefs into an initramfs, for example.

There are also integrity benefits. Without this option, the fsverity of the file is verified at mount time, but nothing stops the file from being modified after the fact. With the copy option, we measure and compare the fsverity signature after copying into and sealing the memfd.

Right now this is implemented as a straight copy, but we may also add support for decompression at some point. The erofs images compress reasonably well using general-purpose compression algorithms: a 52MB image of /usr on my system compresses down to 14MB with zstd and 13MB with xz.

allisonkarlitskaya commented 2 months ago

This is not currently intended for merging, but might end up being part of the work we need to realize #332.

allisonkarlitskaya commented 2 months ago

The verity stuff is wrong here, mostly because I didn't understand how it worked: it's enforced by the filesystem and we just have to verify that the fingerprint is the one we expected. After we do that, we can't possibly read incorrect data.

Also: memfd doesn't support it, which means we'll end up with ENOTTY if we try that.

Instead, if we really want to inspect the data in the memfd, we should probably use the userspace lcfs-fsverity stuff to calculate the checksum for ourselves. If we do that after we seal, we can be confident that nothing changes. This provides us with a nice way to enforce a particular image, even if fsverity isn't enabled on the underlying filesystem. Although: for that particular scenario, I wonder if it wouldn't be easier to just use the sha256sum...

I'm not sure if there's a real benefit to pursuing this, but in any case, if we do want to allow mixing copy and digest options, we need to retrieve the verity digest before we do the copy.

cgwalters commented 2 months ago

The verity stuff is wrong here, mostly because I didn't understand how it worked: it's enforced by the filesystem and we just have to verify that the fingerprint is the one we expected. After we do that, we can't possibly read incorrect data.

Right, we should be invoking lcfs_validate_verity_fd before copying (and then skip that verification on the memfd) I think.

cgwalters commented 2 months ago

The main benefit here is that the underlying filesystem isn't pinned by having one of its files used as the basis of the loopback device. This is useful if you want to embed a composefs into an initramfs, for example.

I think at least historically some initramfs code basically solved this with rm -rf /* after the pivot? But I guess it is perhaps cleaner to have a memfd instead of a tmpfs with one unlinked file.

travier commented 2 months ago

I've not tested this change but I really like the approach. 👍🏻

cgwalters commented 2 months ago

BTW we hard require clang-format, it looks like e.g. ninja -C target clang-format (depending on your builddir)

alexlarsson commented 2 months ago

So, yeah, the integrity parts here are not correct, but I think this makes sense in some cases to eg. allow the source fs to be unmounted, but also as a general robustness thing. Like, its nice that we can guarantee "right data or EIO", but with this, we can guarantee "right data" (at least for the fs metadata), at the cost of slower mount and more memory use.

alexlarsson commented 2 months ago

I'm not sure how much this matters, but we can also with this get fs-verity validation if the file on disk doesn't have fs-verity (i.e. if it is on FAT or whatever), because we can manually valdiate the digest when copying into the memfd. Of course, we would still need fs-verity for the backing files.

Again, not sure if this is practically useful, but worth considering.

allisonkarlitskaya commented 2 months ago

I'm not sure how much this matters, but we can also with this get fs-verity validation if the file on disk doesn't have fs-verity (i.e. if it is on FAT or whatever), because we can manually valdiate the digest when copying into the memfd. Of course, we would still need fs-verity for the backing files.

Again, not sure if this is practically useful, but worth considering.

It's absolutely my plan to do the fsverity checking in userspace on the already-sealed memfd.

allisonkarlitskaya commented 2 months ago

It's absolutely my plan to do the fsverity checking in userspace on the already-sealed memfd.

...until we find some kernel hacker who is willing to add support for fsverity to tmpfs/memfd...

alexlarsson commented 2 months ago

It's absolutely my plan to do the fsverity checking in userspace on the already-sealed memfd.

I don't think this is necessary if we validated the digest on the fd after we opened it. We just copied data that was already validated (by the read call). How could it have become modified after that?

I mean, in theory it could somehow become modified in memory after reading by some ptrace attacker, but that is also true for userspace validation after the fact (i.e. the code that checks the fsverity digest of the data read from the sealed fd could have been modified). Really, the security model is that mount.composefs comes from a signed trusted initrd, and runs only code that we trust, so there is no "attacker".

alexlarsson commented 2 months ago

Btw, I would do the fsverity manual computation in the copy-to-memfd loop, then you better parallelize the "read next block" and the "compute digest of block".

allisonkarlitskaya commented 2 months ago

I think someone could probably also open("/proc/($pidof mount.composefs)/fd/3") or whatever the memfd is and modify it like that. I'm not super-concerned about that case. Supporting filesystems that lack their own fsverity is more what I'm aiming for here, and I agree that this could be done while streaming the file from the disk...

alexlarsson commented 2 months ago

I think someone could probably also open("/proc/($pidof mount.composefs)/fd/3") or whatever the memfd is and modify it like that. I'm not super-concerned about that case. Supporting filesystems that lack their own fsverity is more what I'm aiming for here, and I agree that this could be done while streaming the file from the disk...

Not if it is sealed. Oh, or i guess you mean while its happening, similar to the ptrace attack. And yeah, that is not interesting.

alexlarsson commented 2 months ago

That said, you will still definitely need fs-verity for the filesystem backing the file objects. No way out of that.

allisonkarlitskaya commented 2 months ago

That said, you will still definitely need fs-verity for the filesystem backing the file objects. No way out of that.

This much is clear. What we're imagining, though, is that the erofs layer lives in the initramfs, where it doesn't get fsverity. The digest store is on a "real" filesystem, though.

alexlarsson commented 2 months ago

This much is clear. What we're imagining, though, is that the erofs layer lives in the initramfs, where it doesn't get fsverity. The digest store is on a "real" filesystem, though.

This sounds very interesting. However, in that case the file is typically trusted by being on the initrd itself, which is signed, so fs-verity of it is not strictly necessary then. (In particular, because generally all fs-verity digests we verify against are generally trustworthy due to having the public key in the initrd.)

cgwalters commented 2 months ago

Yes agreed, there is no need for us to enforce fsverity of the erofs in this model.

Maybe we change lcfs_validate_verity_fd() to check if the source fd is a memfd, and if it is then we require that it's sealed?

Actually though...will Linux ever swap out a memfd contents? If it can, then we do need to enforce verity for it right?

cgwalters commented 2 months ago

Digging a bit there's nowadays memfd_secret() which is defined to disallow swapping...we could use that. Although, it seems possible the "The memory region is removed from the kernel page tables" part would break us loopback mounting it?

cgwalters commented 2 months ago

Of course, the other alternative design to think about here is basically: "don't put an EROFS in the initramfs, just put it unpacked in e.g. /rootfs in the initramfs". No loopback mounts, memfd etc. needed. We'd just have something like LCFS_MOUNT_DIR or so that tells the mounting code it's a plain old directory file descriptor.

However, the swapping thing still seems like a concern to me...actually look, there's already a noswap option for tmpfs, so we'd just need to be sure that the initramfs tmpfs is set up with that option.

alexlarsson commented 1 month ago

Of course, the other alternative design to think about here is basically: "don't put an EROFS in the initramfs, just put it unpacked in e.g. /rootfs in the initramfs". No loopback mounts, memfd etc. needed. We'd just have something like LCFS_MOUNT_DIR or so that tells the mounting code it's a plain old directory file descriptor.

Does this work? For example, does the initramfs tar unpacking support full xattrs? Also, my guess is that a full in-memory tmpfs of the rootfs metadata will use a lot more ram than just the erofs image.

allisonkarlitskaya commented 1 month ago

I think we've decided not to do this.

containers / composefs

all: add 'copy' mount option #334