containers / composefs

a file system for mounting container images
Other
436 stars 32 forks source link

mount: add support for optionally skipping loopback device #361

Open allisonkarlitskaya opened 2 weeks ago

allisonkarlitskaya commented 2 weeks ago

This is cool: https://lore.kernel.org/lkml/CAMuHMdVqa2Mjqtqv0q=uuhBY1EfTaa+X6WkG7E2tEnKXJbTkNg@mail.gmail.com/T/

allisonkarlitskaya commented 2 weeks ago

The complexity of the mount code is growing a lot:

So this storing the fd in the state thing is probably not gonna survive... There's just too many fds now...

The thing about opening vs. getting an fd from the user is pretty boring and is well handled in the two separate entrypoint APIs for each of those cases, but then we need to go on a path:

plus all the cleanups of optionally closing fds if we were the ones that opened it. I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one.

Basically, the handling of trying to make sure we rmdir() and close() stuff as appropriate is getting a bit too complicated... particularly when mixed with various early error exits...

alexlarsson commented 2 weeks ago

This is definitely cool, and will help tons to avoid loads of loopback devices e.g. in the container image case. I agree that the mount code complexity is growing, in particular since we need to support legacy mount APIs, etc. But, this need only be included in the modern API mount codepath, and I believe the complexity is worth it.

allisonkarlitskaya commented 2 weeks ago

@alexlarsson what's the story behind needing to support the legacy mount APIs?

alexlarsson commented 1 week ago

@allisonkarlitskaya Its needed on e.g. rhel9 at least I belive.

cgwalters commented 1 week ago

As I understand things today, if we go with the current design in https://github.com/containers/composefs/issues/332 then we don't need the memfd path, right?

Skipping the loopback though would be quite valuable down the line, but is IMO not critical right now again versus just fleshing out the status quo.

I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one.

That seems fine to me if it simplifies the code though.

allisonkarlitskaya commented 1 week ago

Ya, I feel like the memfd thing was an interesting idea that got us on the right path but maybe not all that useful in the end...

allisonkarlitskaya commented 1 week ago

ps: skipping the loopback device still doesn't let us mount in containers, unfortunately. fsconfig(3, FSCONFIG_CMD_CREATE, ...) fails with EPERM because even though we're root in our own mount namespace we're not the "real root". That's down to the call to mount_capable() in vfs_cmd_create() in fsopen.c which checks that we're either "real root" or that our chosen filesystem has FS_USERNS_MOUNT (which few do, and erofs is not among them). Incidentally, overlayfs is (which makes sense since containers depend on this)...

allisonkarlitskaya commented 1 week ago

A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel. I wonder if we could convince erofs (and vfs) folks to trust filesystem images that come from non-mutable sources, eg:

cgwalters commented 1 week ago

ps: skipping the loopback device still doesn't let us mount in containers,

This subthread is I think covered by https://github.com/containers/storage/issues/2099 (also xref https://github.com/cgwalters/composefs-oci-experimental?tab=readme-ov-file#mounting-composefs-as-non-root )

A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel.

That's I think nowadays covered by https://lwn.net/Articles/941764/ - but no the real problem remains (mostly for more complex filesystems) that offline crafted input can result in memory corruption - the XFS maintainer mentioned things like having file blocks point to metadata as one that would start to be quite expensive to check. That's mainly a concern for writable filesystems - erofs is regularly fuzzed and is I think fairly robust but...

Anyways I think the "composefs as a service" is the best fix here.

alexlarsson commented 1 week ago

Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen.

I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files.

hsiangkao commented 1 week ago

Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen.

I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files.

As I've said in https://github.com/containers/storage/issues/2099#issuecomment-2362570162, personally it won't be hard to enable EROFS with FS_USERNS_MOUNT (e.g. disable compression). From the centos 7 backport, currently the core part of EROFS (un-encoded metadata/data path and xattr support) is 2150 LOC (https://github.com/erofs/kmod-erofs/tree/main/src). So on the simplicity side, I think un-encoded EROFS is simple enough. Yet I don't think VFS maintainer will accept this (regardless of fuzzing, loc or others), also they provide some alternative ways to achieve that too.

j1nx commented 5 days ago

Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage.

Does this help? https://www.phoronix.com/news/FUSE-IDMAPPED-Mounts-6.12

giuseppe commented 4 days ago

Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage.

Does this help? https://www.phoronix.com/news/FUSE-IDMAPPED-Mounts-6.12

I think that won't help with rootless idmapped mounts. We either need FUSE passthrough, or overlay in a user namespace needs to honor the .redirect extended attribute