Closed allisonkarlitskaya closed 1 month ago
One note about performance/memory trade-offs: having the erofs as part of the UKI (and then permanently stored in RAM) would mean that the entire metadata of the system partition is in RAM. ls -lR /usr would always happen without touching the disk. It's more data to load when booting the kernel image, but having that data pre-loaded as a small blob up front seems like it should probably be a net win. It would have to be measured. It also means that we have a chunk of RAM that we've "wasted"...
Another requirement of the "UKI inside the OCI container" approach (and maybe the "UKI generated locally" approach as well): we'd probably want a tool that could scan the UKI to find out which blobs it refers to in the digest store. This is important for pruning the store when removing old images.
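A minimal sketch of the pruning step such a tool would feed, in Python. It assumes the scanner has already produced, for each installed image, the set of object digests its UKI refers to. The function name `unreferenced_objects` and the sample digests are hypothetical and are not part of any existing composefs tool.

```python
from typing import Dict, Iterable, Set


def unreferenced_objects(
    store: Iterable[str],
    refs_by_image: Dict[str, Set[str]],
) -> Set[str]:
    """Return digests present in the store that no remaining image refers to."""
    live: Set[str] = set()
    for digests in refs_by_image.values():
        live |= digests
    return set(store) - live


# Hypothetical store contents and per-image reference sets (placeholder digests).
store = {"aa11", "bb22", "cc33", "dd44"}
refs = {
    "image-old": {"aa11", "bb22"},
    "image-new": {"bb22", "cc33"},
}

# Removing the old image means only image-new keeps objects alive.
del refs["image-old"]
print(sorted(unreferenced_objects(store, refs)))
```

Objects shared between images (here `bb22`) survive as long as any remaining image still refers to them, which is why the scan has to cover every installed UKI before anything is deleted.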
One part of implementing this idea is to adapt https://github.com/ostreedev/ostree/blob/main/src/switchroot/ostree-prepare-root.c to use this EROFS instead of looking at the sysroot.
Here is a potential flow where we could use that feature, which would help us work around SELinux issues and remove the need for build time commits:
# "Normal" build part where you customize your image
FROM base-image as target
RUN Make changes here as needed
# Use a side image to build the composefs & UKI
FROM target as builder
RUN Rebuild SELinux policy
RUN - Do an ostree commit with the changes (i.e. we need to figure out what changed)
- using the context from the updated SELinux policy
- and get the full composefs EROFS for the final root
RUN Compress and append the EROFS blob to the initramfs in a pre-defined place
RUN Install ukify & Secure Boot signing tools
RUN Build a UKI with the kernel, initramfs, command line config from the container image and sign it, output to /uki
# Go back to the final image and include just the UKI
FROM target
COPY --from=builder /uki /uki
ostree container image pull which will import all the objects from the "target" image, including the UKI. We will just ignore the xattrs and SELinux labels.
We tried something similar while prototyping: https://github.com/travier/fedora-coreos-uki/blob/main/fcos-uki/Containerfile
The major change with this approach is that we clearly split the file content from the metadata and the container becomes a way to only transport object data plus a UKI which includes all the metadata. Thus the deployed rootfs becomes an object store only and we don't "care" about ostree commits anymore as we don't need to sign them or use them to regenerate the composefs metadata on the systems.
Another requirement of the "UKI inside the OCI container" approach (and maybe the "UKI generated locally" approach as well): we'd probably want a tool that could scan the UKI to find out which blobs it refers to in the digest store. This is important for pruning the store when removing old images.
Yes. Combining with this comment in general it argues for some new tooling - not too large or complex tooling but new tooling nevertheless. One option is to implement it in this repo as a build-time option - a variant of that is to implement it in Rust (also in this repo). Maybe something like a composefs-boot crate?
I chatted with @allisonkarlitskaya about this and there's a lot to like about the simplicity of this approach - I'm 100% on board with continuing investigation of this direction.
My biggest concern was that I'd also really like to build the story of using composefs for apps/extensions/configmaps etc. and this model reduces the alignment between those two approaches.
Combining, this issue also intersects strongly with https://github.com/containers/composefs/issues/294 where I was trying hard to think of a way to bring OCI metadata under verity protection. Hmmm...I guess probably the simplest variant that would work for this is to require the UKI to always be in a distinct layer (with a special annotation like composefs.boot or something), and the manifest that gets included inside the image doesn't have that layer.
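As a rough illustration of that variant, the Python sketch below derives the "inner" manifest by dropping any layer that carries the boot annotation. The annotation key `composefs.boot` is only the placeholder floated in the comment, the layer digests are shortened stand-ins, and `manifest_without_boot_layer` is a hypothetical helper name.

```python
import copy
from typing import Any, Dict

BOOT_ANNOTATION = "composefs.boot"  # hypothetical key, taken from the comment above


def manifest_without_boot_layer(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an OCI manifest with UKI-carrying layers dropped."""
    result = copy.deepcopy(manifest)
    result["layers"] = [
        layer
        for layer in result.get("layers", [])
        if BOOT_ANNOTATION not in layer.get("annotations", {})
    ]
    return result


# Placeholder manifest; real layer digests are full sha256 values.
manifest = {
    "schemaVersion": 2,
    "layers": [
        {"digest": "sha256:base", "size": 100},
        {"digest": "sha256:app", "size": 20},
        {"digest": "sha256:uki", "size": 50, "annotations": {BOOT_ANNOTATION: "true"}},
    ],
}

inner = manifest_without_boot_layer(manifest)
print([layer["digest"] for layer in inner["layers"]])
```

Because the inner manifest never mentions the UKI layer, it can be embedded in the image that the UKI itself vouches for without creating a circular reference.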
Also worth thinking about here is the related issue I was thinking about around how we store individual layers. We must support only fetching changed layers across upgrades.
In https://github.com/containers/composefs/issues/332#issuecomment-2337480631, I forgot that we still need to do the 3-way merge for /etc so we still need a "deployment" of it, so this is a bit more complex.
We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.
(when using LUKS on the rootfs)
In #332 (comment), I forgot that we still need to do the 3-way merge for /etc so we still need a "deployment" of it, so this is a bit more complex.
For ostree yes, though we also support etc.transient where that wouldn't be needed.
I think in theory we could ship initramfs glue in this project such that "mount composefs from initramfs" logic could in theory be very agnostic, i.e. we have:
- /sysroot with a composefs setup (with backing objects in something native like /composefs/objects maybe? But the backing store can be configured in some way (an xattr on the cfs? a config file?))
- /etc and /var (in the way ostree does it today using the physical root, which also note the intersection with https://github.com/containers/composefs/issues/280 ), but the ostree bits could obviously be replaced with something else for non-ostree consumers
We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.
Instead of "public" I would say "not encrypted on disk" to be clear. "public" often implies to me "accessible to the whole Internet" but for images generated on premise and deployed to servers that are physically secured, I wouldn't say the UKIs here are "public".
That said...AFAIK there's nothing that would block someone from encrypting the erofs in the initramfs, and decrypting using e.g. a key stored in the TPM or something.
As regards composefs/erofs inside of a UKI, this wouldn't work so well for CentOS Automotive Stream Distribution/Red Hat In-Vehicle OS. Two reasons.
We spent a lot of time minimising initramfs for super-fast boots, we are talking < 10M in size and < 2 seconds in boot time. Now we do have to read the whole composefs eventually for verification. During the initial read of the UKI, userspace cannot proceed with anything until the whole UKI is read, decompressed and the kernel populates the initramfs filesystem.
The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.
In Automotive we can fork a little from the technique decided on here; we already do that as one of the users of composefs.
All we need to do is store a digest in initramfs to ensure what we are booting is what we intend and many of these concerns go away.
Also tagging @alexlarsson he'd likely be interested in a read here.
In fact, I've discussed this with the systemd guys once or twice and they agree: initramfs is a dated filesystem, and we should keep it as small (and as irrelevant) as possible. There are more efficient ways of creating volatile throwaway verified filesystems these days (composefs, erofs, overlayfs, fs-verity, dm-verity, etc.). Also, if one is referencing erofs inside you cannot unmount the initramfs.
We're not going to break C9S auto. We will continue to support the way things work today. A big advantage of composefs is flexibility - there's multiple ways to do things (at the same time of course we don't want to support too many paths).
The advantage of the "rootfs-meta-in-initramfs" model in a nutshell is that there are no extra keys/signatures required other than the Secure Boot one. But again, the existing way ostree+composefs works will clearly continue to work - and isn't specific to ostree, it's just "key in the initramfs verifies signature of digest of composefs".
The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.
That said I think a general approach many use cases (including yours) should be going to is keeping the main root small anyways and having most of the bits in containers, i.e. mount real root, pivot, then go into the real root, mount further container images via composefs dynamically (verifying their signature however one wants...etc.)
Let's be a bit more specific: ~how big is your initramfs today?~ (nevermind, < 10M), How big is the composefs for it?
+Kuznetsov, Vitaly @.> +Daniel Berrange @.>
I recommend people take a look at the Android Boot Image and composefs implementation in cs9 auto FWIW. Android Boot Image is a kernel+dtb+cmdline+initramfs blob, it's very similar to UKI.
We're not going to break C9S auto. We will continue to support the way things work today. A big advantage of composefs is flexibility - there's multiple ways to do things (at the same time of course we don't want to support too many paths).
Understood, this feedback is not intended to block any efforts.
The advantage of the "rootfs-meta-in-initramfs" model in a nutshell is that there are no extra keys/signatures required other than the Secure Boot one. But again, the existing way ostree+composefs works will clearly continue to work - and isn't specific to ostree, it's just "key in the initramfs verifies signature of digest of composefs".
I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.
The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.
That said I think a general approach many use cases (including yours) should be going to is keeping the main root small anyways and having most of the bits in containers, i.e. mount real root, pivot, then go into the real root, mount further container images via composefs dynamically (verifying their signature however one wants...etc.)
We should favour containers when possible, but there are cases where we cannot use containers. I think there are more scalable solutions than this.
Let's be a bit more specific: ~how big is your initramfs today?~ (nevermind, < 10M), How big is the composefs for it?
I'll build an OS image sometime for exact measurements, need to leave early today for a wedding, so it will be Monday...
There are also no definite sizes for these things in the Automotive OS, but I'll post the minimal sizes anyway. A partner may want to add something to initramfs, may want to add a camera application to composefs (these advanced camera applications can be huge and are not suitable for containers).
I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.
One detail here is that assuming you do the "transient key" model, you throw away reproducible builds - was mentioned in an ASG talk. The "static key" model solves that, but...hmm, I think has other problems.
Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that and then we know to look in /objects/<digest>, verify its digest against the expected and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.
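The lookup described above can be sketched in Python as follows. This is an illustration under loud assumptions: a plain SHA-256 of the file contents stands in for the fs-verity digest (which is computed differently and would be queried from the kernel in practice), the object store is assumed to be a flat directory named by digest, and `find_verified_image` is a hypothetical helper name.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    # Stand-in only: a real implementation would obtain the fs-verity digest
    # of the file, which is not a plain SHA-256 of its bytes.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_verified_image(digest_file: Path, objects_dir: Path) -> Path:
    """Resolve the expected digest to an object and check it before mounting."""
    expected = digest_file.read_text().strip()
    candidate = objects_dir / expected
    if not candidate.is_file():
        raise FileNotFoundError(f"no object for digest {expected}")
    actual = content_digest(candidate)
    if actual != expected:
        raise ValueError(f"digest mismatch: expected {expected}, got {actual}")
    return candidate  # the caller would now mount this image


# Demo with a throwaway directory standing in for the initramfs and /objects.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    objects = root / "objects"
    objects.mkdir()
    image_bytes = b"pretend this is a composefs erofs image"
    digest = hashlib.sha256(image_bytes).hexdigest()
    (objects / digest).write_bytes(image_bytes)
    digest_file = root / "rootfs.digest"
    digest_file.write_text(digest + "\n")

    print(find_verified_image(digest_file, objects).name == digest)
```

The only trusted input is the digest file, which in the proposal lives inside the signed UKI; everything in the object store is treated as untrusted until its digest matches.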
^ This is what I mean by the "rootfs.digest" file concept... That's basically what we do in the automotive distro, it scales better...
That's basically what we do in the automotive distro, it scales better...
Is it though? Aren't you using the ostree+composefs integration which does this with signature covering the ostree commit, which has the composefs digest? That's all that ostree-prepare-root.service does today...and hence it requires a key.
I think what happened is probably a conceptual overlap between the ostree commit and the composefs. Today composefs is just awkwardly glued onto the side of ostree (not a criticism, doing more starts to get hard, but now we're at that point where doing the hard things is worth it for a cleanup).
But yes we could change ostree-prepare-root.service to look for /usr/lib/ostree/composefs.meta in the initramfs which would be a pair of:
Hmm maybe yes the conceptual conflict was between ostree commits and the composefs blob, but if we're treating it as canonical then yeah my instinct here is:
- /composefs/objects as a recommended standard thing (or well, I guess it could be /usr/lib/composefs/objects in the physical root...dunno)
- link() the ostree-composefs into that directory based on its fsverity digest
- /usr/lib/composefs/rootfs.digest per above
Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that and then we know to look in /objects/<digest>, verify its digest against the expected and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.
Generally this doesn't work with ostree because the UKI is stored in the ostree tree, so it becomes a recursive cycle. We break the cycle by using the one-time key.
It could work in a system where the UKI and the rootfs are completely independent though.
Generally this doesn't work with ostree because the UKI is stored in the ostree tree, so it becomes a recursive cycle
Yeah, I remember this. But we can also break that cycle via just excluding the UKI from the composefs. That seems quite simple to do.
But we can also break that cycle via just excluding the UKI from the composefs. That seems quite simple to do.
That is the conclusion we came to with @travier . It's okay because the uki is signed so not having it covered by fsverity does not matter
It could work in a system where the UKI and the rootfs are completely independent though.
This is the situation I have in my head. I imagine that we have a system image in the form of a container and a UKI somewhere in a "special" path in that image that does not become part of the composefs, but goes directly into the EFI ESP.
Ok, so the plan is something like this:
I think this works, although I would also like to propose this alternative that may work better for the automotive usecase where we want to minimize the initrd:
In fact, I can imagine mount.cfs supporting this mode of "composefs image is in object store" approach natively, so you just say "here is the object store, mount $digest at $path".
This approach does store the metadata for the rootfs twice in the tarball, so we could alternatively regenerate the composefs from the tarball metadata if we just skip the uki.
Ok, so the plan is something like this:
That "composefs in UKI" was indeed something like what was originally proposed here, although I think the "digest in UKI" is not much harder to implement.
I think this works, although I would also like to propose this alternative that may work better for the automotive usecase where we want to minimize the initrd:
Yes, let's call this "digest in UKI"
Convert to OCI tarball
Not sure what you mean here, I don't like the implication it's just one tarball - that's throwing away a huge advantage of OCI. Let me try outlining the steps as I see them:
- [...] /usr/lib/composefs/sysroot.digest to the initramfs. We have a composefs-mount.service with ConditionPathExists=/usr/lib/composefs/sysroot.digest that keys off the presence of that and mounts it at /sysroot in the initramfs.
- Stitch the UKI together, make it a new layer in the OCI image that stores it in the standard place. That layer also has an annotation like composefs.uki-digest=<digest> for the rootfs, and must be the final layer in order to be used
On deploy, extract UKI into ESP partition, and all files except /.boot.uki into the object store (note, this will include the composefs image in the object store).
Right, although I'd clarify that's just because the process will regenerate the composefs, not because it was extracted. For GC purposes, it will be handy to have the digest linked from the manifest.
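The deploy step quoted above (UKI to the ESP, everything else into the object store) could look roughly like this in Python. This is a sketch under assumptions: the image arrives as a single tar stream, SHA-256 stands in for the fs-verity digest used to name objects, and `split_image`, `normalize` and `make_tar` are hypothetical helpers written for the demo.

```python
import hashlib
import io
import tarfile
from typing import Dict, Optional, Tuple

UKI_NAME = ".boot.uki"  # path used in the quoted plan; not a settled location


def normalize(name: str) -> str:
    """Strip leading './' and '/' so tar member names compare consistently."""
    while name.startswith("./"):
        name = name[2:]
    return name.lstrip("/")


def split_image(tar_bytes: bytes) -> Tuple[Optional[bytes], Dict[str, bytes]]:
    """Separate the UKI from regular files, which are keyed by content digest."""
    uki: Optional[bytes] = None
    objects: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories, symlinks etc. carry no object data
            data = tar.extractfile(member).read()
            if normalize(member.name) == UKI_NAME:
                uki = data  # would be written to the ESP
            else:
                # SHA-256 is a stand-in for the fs-verity digest here.
                objects[hashlib.sha256(data).hexdigest()] = data
    return uki, objects


def make_tar(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


image = make_tar({
    "./.boot.uki": b"signed UKI bytes",
    "./usr/bin/tool": b"tool contents",
    "./usr/bin/tool-copy": b"tool contents",
})
uki, objects = split_image(image)
print(uki is not None, len(objects))
```

Files with identical contents collapse into a single object, and the UKI never enters the store, which is what keeps it out of the regenerated composefs.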
In fact, I can imagine mount.cfs supporting this mode of "composefs image is in object store" approach natively, so you just say "here is the object store, mount $digest at $path".
Yeah.
This approach does store the metadata for the rootfs twice in the tarball, so we could alternatively regenerate the composefs from the tarball metadata if we just skip the uki.
My proposed variant doesn't, we regenerate the composefs locally. I don't see the need to ship it explicitly in this model (any more than we do with ostree today).
EDIT: I should emphasize none of this whole design requires or is tied to OCI - one could replicate something like this with DDIs or whatever else too. It's just useful to use OCI as a reference design target.
- Stitch the UKI together, make it a new layer in the OCI image that stores it in the standard place. That layer also has an annotation like composefs.uki-digest=<digest> for the rootfs, and must be the final layer in order to be used
Isn't a layer like this problematic when combining images, like in a Dockerfile? Each bootable image (including a bootc base image) must have a layer like this, and when you use Dockerfiles to create new layers you will end up with multiple layers like this, no?
Otherwise I think that sounds fine.
For this case let's assume that the build system is capable of full control over the final image structure; this means that people building OCI images for this won't look like a plain Dockerfile. The two paths for full control are the FROM oci-archive trick or a process which accepts an existing already built container image as input and rewrites it.
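To make the "must be the final layer" rule concrete, here is a small Python sketch of how a deploy tool might decide whether an image's UKI layer is usable, and how it might spot stale UKI layers inherited from a base image. The annotation key is the one floated in this thread rather than any standard, and both function names are hypothetical.

```python
from typing import Any, Dict, List, Optional

UKI_DIGEST_ANNOTATION = "composefs.uki-digest"  # key proposed in the thread, not a standard


def usable_rootfs_digest(manifest: Dict[str, Any]) -> Optional[str]:
    """Return the annotated rootfs digest only if the final layer carries it."""
    layers = manifest.get("layers", [])
    if not layers:
        return None
    return layers[-1].get("annotations", {}).get(UKI_DIGEST_ANNOTATION)


def stale_uki_layers(manifest: Dict[str, Any]) -> List[int]:
    """Indexes of non-final layers that still carry a UKI annotation."""
    layers = manifest.get("layers", [])
    return [
        index
        for index, layer in enumerate(layers[:-1])
        if UKI_DIGEST_ANNOTATION in layer.get("annotations", {})
    ]


# A base image with a UKI layer, then a plain derived layer added on top of it.
derived = {
    "layers": [
        {"digest": "sha256:base"},
        {"digest": "sha256:uki-old", "annotations": {UKI_DIGEST_ANNOTATION: "1111"}},
        {"digest": "sha256:custom"},
    ]
}
# The same image after the build system appends a freshly built UKI layer.
rebuilt = {
    "layers": derived["layers"]
    + [{"digest": "sha256:uki-new", "annotations": {UKI_DIGEST_ANNOTATION: "2222"}}]
}

print(usable_rootfs_digest(derived), usable_rootfs_digest(rebuilt), stale_uki_layers(rebuilt))
```

In this model a plain Dockerfile-derived image is simply not bootable until the UKI layer is regenerated, and a rewriting build step could use the stale indexes to drop outdated UKI layers.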
I'm fine with having a digest or the entire file in the UKI. I think I mostly want to kill this signing-key idea and turn it into some kind of a definitive hard link where we don't have to trust that the signing key was only used once.
I think we maybe think too much about what's in the object database and what's not. I'd be happy putting the erofs image itself into the object database, or also the UKI. At the end of the day, these things are just blobs of data which have identifiers and some of those blobs can sometimes refer to other blobs in certain ways. I think the (only) thing we need to abandon to eliminate the dependency loop is the idea that the kernel image will appear as part of the root filesystem at runtime.
I think we've decided not to do this.
Could you clarify why do you think we should not do this?
Never mind, I read the backlog of comments that I had missed. Sorry.
https://github.com/containers/composefs/issues/332#issuecomment-2383156923 looks like a good plan. Note that the actual place where the UKI is stored in the image does not really matter as it won't be part of the final rootfs as it won't be in the composefs, but putting it in a standard place is also good. We however will have to make sure that we don't include the content of this special layer when regenerating the composefs blob.
This is a really vague idea that I discussed with @cgwalters and @travier today. They both said it belongs here as an issue. At this point this is little more than a raw braindump. There's a lot to think through and discuss.
The erofs produced by mkcomposefs on a reasonably complete /usr is on the order of double digits MB. I've seen ~50MB generally, and it compresses well (down to more like 10MB). The initramfs+kernel on my Silverblue system is low triple digits (~150MB, most of which is the initramfs).
It wouldn't be completely unreasonable, then, to have a complete static copy of the composefs "upper layer" erofs image inside of the UKI. This would completely side-step quite a lot of thorny issues around binding the UKI to the correct deployment: all you'd need is the kernel image and the digest store.
How we get a UKI with this erofs inside of it could go two ways:
- generate this on the end-user system by (deterministic magic) which lets us get a UKI which is bit-for-bit the same as the one we were expecting it to be. We'd have some out-of-band signature somewhere (in some metadata that doesn't become part of the image) that we could then use for signing this.
- push everything to the container image creation: the kernel image would be created as the last step of the image creation process. This would involve running mkcomposefs inside of the container, on the contents of the container itself, and embedding the resulting blob into the UKI, which we'd then write to the container image at a well-known path. Any signing that we might want to do as part of creating the image could happen at this point, inside of the image (or in another build stage and copied back into the final image).
The second approach has an extremely simple deployment strategy: just extract the container .tar directly into a composefs digest store (without creating the erofs). The backing store should now contain all of the files that the erofs referred to. Install the kernel image into the EFI ESP and you're done.
The second way seems wonderfully simple until you realize that there are some very serious drawbacks there:
I think the second approach could be extremely nice for specific deployment scenarios, but it's a very different flavour than what has been promised for the "FROM fedora / ADD / RUN / ..." approach to OS customization.
So that takes us back to a reality where we probably want to support the first scenario of building the composefs and assembling the UKI on the end system. That needs a lot of thinking...
This also intersects with the question about what a signature from an OS vendor on a particular kernel image means. Today it's possible to have a signed kernel boot an unsigned root filesystem. Tomorrow we seem to want to go in a direction where there are additional assurances about the root filesystem contents as well, but if it remains possible to continue booting arbitrary root filesystems with a different version of the same kernel, then this promise is a whole lot less meaningful. In fact, the entire "look how easy it is to customize your system!" bootc ideal sort of depends on being able to modify the root filesystem without needing to re-sign the kernel... @travier mentioned that we can support both scenarios with kernel variations which produce unique PCR measurements, allowing the data partition to be encrypted by a key that will only be available if we boot a "trusted" rootfs. There are some very deep product-level decisions here...