Add container-images to the compose / treefile

w4tsn commented 3 years ago

I'm currently trying to integrate specific container images into a read-only portion of the ostree, e.g. /usr/containers. Using it in podman with additionalimagestores: [] in storage.conf.

The use case is to:

pin the container image to a ostree commit
make the container image available through the rpm-ostree upgrade, instead of podman pull
enable container deltas through rpm-ostree deltas

Having the images as part of the ostree is especially important in my use-case because I'm operating in a very restricted resource area where updates should be deltas and most importantly image download should not occur through podman itself and I'm serving applications where RPMs are either hard to build or only available as containers.

Alternatives I have looked at, but not tried yet, are:

build RPMs containing the container image binary - rather strange
re-design the applications as flatpaks - how to incorporate flatpaks? Even possible / desired?

With rpm-ostree and the treefile I've already tried to use podman in the post process script, which doesn't work since it's a restricted, unrecommended and most importantly network-free environment.

Next thing I'll try is to do it manually using the rpm-ostree compose and commit tools.

Eventually this got me thinking, why can't I define a set of registry:auth:image entries in the treefile, e.g. in container-images that are integrated into the ostree under e.g. /usr/containers so the images are pulled as part of the compose and eventually accessible read-only. This would give them the following properties, if I'm correct:

reproducability, since my containers don't operate / update independently
delta updates
avoid storage corruption due to read-only permissions
always have one version ready to start containers from

What do you guys think of this use-case? Is it way too niche? Also this might be way off-topic since this is RPM-OSTree not Container-OSTree...

cgwalters commented 3 years ago

Copy-pasting some of my reply from IRC (edited for format):

The container lifecycle thing is a super interesting topic; in OpenShift 4 we (someone else, not me) invented this concept of a "release image" which is like a super-container that exactly pins the sha256 of a bunch of other containers we tested together as a single unit; we don't want each little bit of the platform (which is a lot of containers) updating independently.

e.g. https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-03-18-075013 is a recent one

So lifecycle-binding containers with the host gives you that sanity of knowing "we tested version X" and you get exactly X on each device. This topic also came up in https://github.com/openshift/enhancements/pull/560 and I could imagine at some point we want to do this for OpenShift 4 too for that single node case; but doing so gets messy, would need to teach some of the container stack about these read-only images etc.

As rpm-ostree upstream I will say for sure if you hit any bugs in dropping container images into /usr/share/containers at build time that'd absolutely be treated as an important bug; at some point I am hopeful when we dedup the osbuild/cosa/etc stuff we have a high level opinionated tool that does this declaratively, including things like managing the podman systemd units.

ostree is explicitly designed to be not opinionated about what you put in it; it's not a build system, there's no required equivalent of the dpkg/rpm databases. And rpm-ostree ultimately generates ostree commits using RPMs as input, but we do ship a bit of not-RPM content today in FCOS. So this "container+OS binding" isn't a use case of CoreOS or OpenShift today but may be in the future, and I definitely want to support it.

(the mind-bending thing for me is if we try to add any container related to stuff natively to rpm-ostree the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).

cgwalters commented 3 years ago

Eventually this got me thinking, why can't I define a set of registry:auth:image entries in the treefile, e.g. in container-images that

The above hopefully answers this - we may eventually do container native stuff in this repository but it would greatly increase the scope of the project.

For now, the most "native" support for non-RPM content is using the ostree-layers functionality (doc entry). But obviously that's pretty "raw" - it's up to you to extract container images and commit them. But on the plus side, you can put anything in an ostree layer; this functionality is specifically used by CoreOS today with coreos-assembler to directly commit content from the config git repository.

Specifically, coreos-assembler auto-generates ostree commits from https://github.com/coreos/fedora-coreos-config/tree/testing-devel/overlay.d

cgwalters commented 3 years ago

Here's an old discussion on exactly this topic too: https://mail.gnome.org/archives/ostree-list/2017-October/msg00009.html

cgwalters commented 3 years ago

(Deleted some comments filed on wrong issue)

cgwalters commented 3 years ago

(the mind-bending thing for me is if we try to add any container related to stuff natively to rpm-ostree the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).

To give just one example of this, rpm-ostree has extensive "diffing" logic around things like printing the rpm-level diff between two ostree commits, but we'd have to invent something custom to do that for containers. Of course, the diffing logic isn't required; rpm-ostree will happily upgrade things that come from not-RPM.

w4tsn commented 3 years ago

The container lifecycle thing is a super interesting topic; in OpenShift 4 we (someone else, not me) invented this concept of a "release image" which is like a super-container that exactly pins the sha256 of a bunch of other containers we tested together as a single unit; we don't want each little bit of the platform (which is a lot of containers) updating independently.

Essentially what you say in that whole comment, I'm currently using Fedora IoT and I'm building my own Remix of it using a very simple cosa inspired tool (might switch over to cosa at some point) to get a relatively fixed, reproducable system for edge devices in environments with paid-per-megabyte contracts or very slow connections. Also the whole system as you say is developed and tested as a whole at some point and apart from the development of single applications and containers the whole package is what is interesting for the prod deployment. So yeah, it would be very neat to also have the ability to pin containers in such a situation.

To give just one example of this, rpm-ostree has extensive "diffing" logic around things like printing the rpm-level diff between two ostree commits, but we'd have to invent something custom to do that for containers. Of course, the diffing logic isn't required; rpm-ostree will happily upgrade things that come from not-RPM.

I think showing some simple to retrieve information like create/update dates, image hashes, layer numbers or size would be a good first diff between two commits.

So for now I'll have a look at cosa, overlay.d and ostree-layers to see how I can incorporate images in /usr/share/containers in my build tool / pipeline and test this.

(the mind-bending thing for me is if we try to add any container related to stuff natively to rpm-ostree the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).

Actually I'm not quite sure which project would be better fitted for this. I suppose it would be rpm-ostree, since the specifics of containers pull, placement, diffs etc. in the OSTree is quite raw and low-level like managing RPMs is and cosa / osbuild etc. make use of this low-level stuff and manage many more things around that. So yeah, I suppose building it into my build tool or eventually cosa would be more of a work-around and rpm-ostree seems to me the better place for this

w4tsn commented 3 years ago

Documenting my current approach / findings:

I'm currently working on creating a second commit with various container images under /usr/containers like this:

# first create a regular rpm-ostree tree commit
rpm-ostree compose tree --unified-core --cachedir="$CACHE_DIR" --repo="$BUILD_REPO" --write-commitid-to="$COMMIT_FILE" "$WK_DIR/$OSTREE_FILE"

# just checkout /usr to a temp, empty sysroot
ostree --repo="$BUILD_REPO" checkout --verbose --subpath=/usr "$(cat "$COMMIT_FILE")" "$WK_DIR"/sysroot/usr

mkdir "$WK_DIR"/sysroot/usr/containers
podman --root "$WK_DIR"/sysroot/usr/containers pull docker.io/library/alpine
podman --root "$WK_DIR"/sysroot/usr/containers pull docker.io/nodered/node-red:1.2.9

# create an orphan commit based on the rpm-ostree compose using the sysroot dir only containing /usr including /usr/containers created previously
# specify the selinux policy necessary so follow up commits won't complain about missing selinux policies
new_commit=$(ostree --repo="$BUILD_REPO" --parent="$(cat "$COMMIT_FILE")" commit --tree=dir="$WK_DIR"/sysroot -s "$COMMIT_SUBJECT" --orphan --selinux-policy=/usr "$WK_DIR"/sysroot/usr)

# Create the commit in the actual branch using both of the previous commits layered over each other
ostree --repo="$BUILD_REPO" commit -b "$OSTREE_REF" -s "$COMMIT_SUBJECT" --tree=ref="$(cat "$COMMIT_FILE")" --tree=ref="$new_commit"

What I receive is the desired ostree, with selinux labels and containers etc.

Currently the next problem is, that some container images work while others more complex container images don't. E.g. alpine is not a problem to include. node-red however causes ostree to complain: "error: Not a regular file or symlink: node". I have no idea how to resolve this yet, other than choosing a slightly different approach: pulling in compressed images, importing them with a service from the read-only part to e.g. memory in /run/containers

jlebon commented 3 years ago

I'm currently working on creating a second commit with various container images under /usr/containers like this:

I would suggest using ostree-layers instead as Colin mentioned higher up in https://github.com/coreos/rpm-ostree/issues/2675#issuecomment-802139453. That way you avoid having to do a secondary ostree commit --parent step, which also loses a bunch of metadata that doesn't get carried over.

Another approach is to use rpm-ostree compose install which creates the rootfs, then podman to pull down the containers into it, and then rpm-ostree compose postprocess and rpm-ostree compose commit. That approach takes farther from the CoreOS model though, so that might make it harder for you to leverage cosa if you want to do that down the road.

Currently the next problem is, that some container images work while others more complex container images don't.

I'm not a container runtime SME, but you can probably just nuke any non-regfile. Container runtimes should populate /dev with the bare necessities for the container to function properly.

w4tsn commented 3 years ago

I would suggest using ostree-layers instead as Colin mentioned higher up in #2675 (comment). That way you avoid having to do a secondary ostree commit --parent step, which also loses a bunch of metadata that doesn't get carried over.

tbh. I don't quite understand how that is supposed to work yet. This config option takes a number of string refs to already existing commits to add them as layers in the compose tree step, right? This means I'd do the same commit command as above, just before running compose tree and without the --parent option - saving metadata in the process?

EDIT

I'm building this in CI so in order for this to work I'd have to automatically update the treefile after the container image commit is created. Little inconvenience.

And a key-learning from this comment I did not get from the docs is that compose install does not only install single RPMs but is the start / setup for manual alteration of the rootfs.

ENDEDIT

Another approach is to use rpm-ostree compose install which creates the rootfs, then podman to pull down the containers into it, and then rpm-ostree compose postprocess and rpm-ostree compose commit. That approach takes farther from the CoreOS model though, so that might make it harder for you to leverage cosa if you want to do that down the road.

I already had a hard time understanding those commands from the man page and docs. Is this meant to be used in addition to the compose tree I already do or does it replace this with more fine-grained control of the process but I'll have to rework the single compose tree step and additionally do the podman / images steps? And if the former do I use this before or after the compose tree step? I'm struggling a bit since I understand in ostree that --parent, --tree, --orphan etc. gives me control on the relationship of subsequent commits. Is this implicitly handled by rpm-ostree? So if I do compose tree and then install, postprocess, commit afterwards will it know to build on the previous one?

I'm not a container runtime SME, but you can probably just nuke any non-regfile. Container runtimes should populate /dev with the bare necessities for the container to function properly.

The problem is more that ostree seems to have a problem with the node executable placed in the container image storage overlay. I'm not sure if that's easy to fix. Do you think the alternative approaches you mentioned will mitigate / solve this?

cgwalters commented 3 years ago

Yeah, we're missing docs around injecting non-rpm content into rpm-ostree. Will look at this somewhat soon.

w4tsn commented 3 years ago

I've now replaced the approach from above completely by starting with compose install then pulling the images into the rootfs and finishing this up with a compose commit as suggested. While this indeed feels like a more clean and slick approach, it still throws this error when trying to embed the nodejs based container image for node-red: error: While writing rootfs to mtree: Not a regular file or symlink: node.

@cgwalters I think you mentioned that I should be able to commit such things, so I suspect this to be some sort of unexpected behavior or even a bug?

cgwalters commented 3 years ago

Tangentially related to this issue - one might wonder why it doesn't work to just ship /var/lib/containers or ~/.local/share/containers in ostree today. First, note that ~/.local is really /var/home/$username/.local - they're both under /var. And ostree explicitly does not "own" or modify data in /var.

Conceptually a file should have a single "owner" (or multiple but with locking). In this particular case, either podman (containers/storage) should own it, or ostree should own it. If on an upgrade e.g. ostree started deleting/changing files underneath podman while containers were running, that would lead to chaos and a mess of ill-defined behavior.

This is why the right answer to having containers "lifecycle bound" with ostree is to ship them in /usr/share - that way they are clearly owned by ostree, and also the read-only bind mount will ensure that tools like podman/systemd-nspawn/bwrap/etc know they don't own the data and should just be reading it.

Now, one middle ground model is to ship the data in /usr/share but have a systemd unit that copies them into containers/storage location like /var/lib/containers. But IMO this mostly mixes the disadvantages of the two more than the advantages.

cgwalters commented 3 years ago

To further elaborate on this, another hybrid model that I'd like to enable with e.g. Fedora CoreOS is where we provide a tool to "preload" containers in the live ISO - here the ISO would install a combination of the OS + your containers as a versioned snapshot, but in-place updates for the OS and containers would be separate.

w4tsn commented 3 years ago

To further elaborate on this, another hybrid model that I'd like to enable with e.g. Fedora CoreOS is where we provide a tool to "preload" containers in the live ISO - here the ISO would install a combination of the OS + your containers as a versioned snapshot, but in-place updates for the OS and containers would be separate.

Actually this is a neat idea. We are building our own images and one of the irritating parts is if a system starts for the first time and behaves slightly differently / false because no container is actually present to run and there always has to be a download first which does not always work due to network restrictions. So we could potentially start by pulling in container images into the raw image. An iso installer would then be even more systematic.

Now, one middle ground model is to ship the data in /usr/share but have a systemd unit that copies them into containers/storage location like /var/lib/containers. But IMO this mostly mixes the disadvantages of the two more than the advantages.

I'd also like to avoid this. As I've checked libostree does not yet seem to be able to handle the images overlayfs correctly or at least some implementation is missing which prevents committing /usr/share/containers/storage. There are apparently work-arounds by removing certain file attributes / bits, re-applying them after a checkout / commit apply - but well, that's also not what I want. I have to take a look if I'm able to implement this in libostree.

w4tsn commented 3 years ago

Remotely related to this issue - is it possible to package a container image within an RPM? Just thought about layering an application with podman-systemd file but including the container image overlay such that I can rpm-ostree install it without the container being downloaded separately into read-write storage. Or to put it differently, I'd like the container image to be added to the system into the read-only portion under /usr/share as part of the installation process.

I could use this on the fly in a live system or in the tree-compose when it installs the specified RPM packages. Either with the image as part of the RPM or the image download into /usr/share as part of the installation process of the RPM.

jlebon commented 11 months ago

xref: https://github.com/containers/bootc/issues/128

mtalexan commented 1 week ago

I know this is a very old issue, but we've been trying to solve this problem for a few years already for a similar use case so I thought I'd chime in for future readers.

A few more recent things worth mentioning on this topic that have happened since the original discussion:

Cosa is already planning to be deprecated and replaced with osbuild.
- They aren't accepting or planning any new significant features or work on it as a result.
- Osbuild doesn't have you provide a treefile, you have to define the equivalent info in an osbuild-specific syntax ("blueprint"?) instead.
- Currently osbuild is "released" but is still missing a lot of the important bits of treefile syntax
Bootc is an alternative for building ostrees instead of rpm-ostree compose, and is effectvely just a container image specified via Containerfile starting from a bootc base image that has some extra tools and bits.
- The ostree layers are generated based on a specific command used in the Containerfile and provided by the base bootc image.

As a point of reference, osbuild has an Automotive SIG set of repos hosted on GitLab instead of GitHub, which includes instructions for how to embed container images in the output. Unfortunately this effectively just breaks down to what was being asked for here, and runs skopeo copy on a list of images specified in the osbuild-equivalent of treefile syntax. The actual images being copied into the /usr/share/containers/storage have to be available from a registry or in the rootfs somewhere already, and don't have a clear way to be coupled with an RPM holdling associated non-container files. That step also happens as part of the overall ostree build, so it needs to occur during ostree composing, looks like it may create a new ostree commit for each image added (unclear, I can't find any of the server-side code for ostree composition in the osbuild organization's repos), and means you can't do something like copying an RPM via USB stick to a development system and installing an updated/additional container image (because manual post-RPM-install steps on an unlocked rootfs need to be performed as part of an ostree compose).

Remotely related to this issue - is it possible to package a container image within an RPM? Just thought about layering an application with podman-systemd file but including the container image overlay such that I can rpm-ostree install it without the container being downloaded separately into read-write storage. Or to put it differently, I'd like the container image to be added to the system into the read-only portion under /usr/share as part of the installation process.

I could use this on the fly in a live system or in the tree-compose when it installs the specified RPM packages. Either with the image as part of the RPM or the image download into /usr/share as part of the installation process of the RPM.

We originally planned to do this by stuffing the images into the RPMs since our images have to run in an air-gapped local-only k3s cluster as well, and have some other configuration files that need to go along with the images.

A few major factors that affect this are:

What format are the image files stored in the RPM?
Where do you store the image files?
What do you do with the image files during install (e.g. %post scriptlet)?
How do you make use of the images at run-time?

Pretty much any solution involving RPMs requires some pre-SRPM step to get the images into a format that makes them part of a Source#: you can work with in the RPM. Your RPM specfile has to then put the content wherever it needs to go and in whatever format it needs to be as part of one of the build-time scriptlets, and you may need to do some post-install fixups/translations in a %post or %posttrans scriptlet.

The initial solution we had was to spin up a little localhost registry that has its backing storage bind-mounted to a local folder. We push our images into it, shut down the registry, and save the registry storage folder as part of the Source#: used by our RPM. Within the RPM specfile, we extract and copy these files to some pre-defined system location shared by all the RPMs providing images. At run-time our system spins up a localhost registry container and mounts that system folder as its backing store. All our images now have to reference the localhost registry's domain name to pull the images into the image cache (or mirrors for all possible domain names defined to point to that regsitry).
Pros: No network access required. All images can be in the ostree if that's the "system location" you specify. Possible to setup multiple local mirror registries so some are readable and some writeable. Shared layers between images are de-duped. Image cache doesn't need to be a containers-storage backend (i.e. containerd or docker engines can be used).
Cons: Have 2x the images on your system, one in the registry and another in the image cache. Your image cache also has to be run-time writeable. Maintenance of the run-time-writeable cache has to be linked to the ostree deployment (simple solution is to blow it away and re-pull for each reboot, but that causes lots of disk churn). "Pulling" of images has to be allowed, even though it should only be going to your localhost, but if you make a mistake it could end up trying to go external for missing images.

Our next better solution focused in on the containers-storage specific tools, and attempted to reduce the number of copies of images on the system. To do that we had to find some way to populate a containers-storage location at RPM-install time.
Pros: Single copy of images. Images tracked with ostree deployment. Layer file contents implicitly de-dupped by ostree format of rootfs. No network access required. No "pull" permissions required. Can be split read-only and read-write with storage.conf additioanlImageStore setting.
Cons: Limited to containers-storage backend. Unclear how to populate it.

First we tried creating a new podman --root=... location as part of the pre-RPM formatting, and just packing that into the RPM. Pretty quickly you find out though that the underlying format has all files from all layers uncompressed in place for each of the layers. That gives all kinds of conflicts with RPM builds, which try to parse everything being installed and sanitize/fix-up content. We disabled basically everything RPM builds automatically try to do and it still wouldn't let the raw files be included in the RPM without corrupting them or erroring out during the build.

Next we tried using the oci-dir (oci:) format, which is a directory with each layer as a gzip compressed blob instead of a directory full of the raw contents, and supports having multiple tagged images in the same oci-dir folder. We could easily get this included in the RPM and installed somewhere, but it's not in a format that containers-storage: can directly make use of. So we tried using a %post or %posttrans scriptlet to run skopeo copy oci:... containers-storage:... or podman pull oci:... to get it into the container cache.
In both these cases we were redirecting the storage root to /usr/share/containers/storage to keep it within the ostree (rather than the default /var/lib/containers), however it ran afoul of how rpm-ostree handles RPM installs.
We learned that rpm-ostree runs the scriptlets in a bwrap/bubblewrap instance. This makes most of the system inaccessible and many of the defaults from the /usr/share/containers/storage.conf don't work. You have to redirect the runroot setting to point to a writeable ephemeral location in the bwrap, and you have to manually (re)populate the /etc/sub{u,g}id because it's mounted as a blank ephemeral folder by bwrap and both podman and skopeo use user namespaces for populating the containers-storage. Furthermore, podman is simply unrunnable in the bwrap environment, something in its internal pre-run check always errors out saying "file not found" with no details whatsoever (even with --log-level=trace), and occurs even if just running podman info. Skopeo is able to be run, but ends up running into other problems depending on which storage driver you're using.
When using the vfs storage driver, skopeo errors out during the extraction of the gzip layers into the image cache location. It seems to succeed on some layers, but inevitably will hit something where the gzip extraction of the oci-dir blob suddenly errors out saying it can't get lgetxattrs for a file in one of the layers. We weren't able to get any further info to figure out what was going on with it.
When using the overlay storage driver, which is probably what you'd want anyway, you run into the fact that bwrap is apparently using an overlayfs for the file system isolation and mounting. That's a known problem when trying to create or copy images with the overlay storage driver, but the podman-in-a-container seems to have a solution for it: use fuse-overlayfs as the mount_program. Even still however, something in the rpm-ostree bwrap configuration is preventing it from working (maybe a CAP_? is missing?) and it just keeps reporting filesystem type 0x65735546 reported for /usr/share/containers/storage is not supported with 'overlay': backing file system is unsupported for this graph driver.

There isn't any other daemonless way to populate the containers-storage: that we can find than to call podman or skopeo with the graphroot redirected, and the location has to be part of the ostree that's not writeable after RPM install is complete or it fails to satisfy the intended purpose. So as far as we can tell, rpm-ostree's bwrap configuration directly blocks any interaction with containers-storage:, thereby inhibiting any RPM-based deployment of images into a read-only cache.

cgwalters commented 1 week ago

Hi, as of recently the focus of the maintainers of this project is now (EDIT: I meant now not not) on bootc, and on that topic one thing you may have not seen that's :new: is https://containers.github.io/bootc/logically-bound-images.html which I think is often what is desired here.

mtalexan commented 1 week ago

Hi, as of recently the focus of the maintainers of this project is now (EDIT: I meant now not not) on bootc, and

Good to know. For someone on the outside trying to figure out what's going to work tomorrow, it's very good to know which way things are leaning.

on that topic one thing you may have not seen that's 🆕 is https://containers.github.io/bootc/logically-bound-images.html which I think is often what is desired here.

I forgot I'd seen that and should have mentioned it in the new info.
However this original ticket was about physically bound images and had a use case where logically bound images explicitly won't work. I happen to be in the same situation myself and was looking for work arounds. I've seen a lot of people trying to figure out a similar situation (it seems the primary uptake for Atomic in industry has been regulated and IoT where physically bound images are the only option?).

coreos / rpm-ostree

Add container-images to the compose / treefile #2675