coreos / rpm-ostree

⚛📦 Hybrid image/package system with atomic upgrades and package layering
https://coreos.github.io/rpm-ostree

Add container-images to the compose / treefile #2675

Open w4tsn opened 3 years ago

w4tsn commented 3 years ago

I'm currently trying to integrate specific container images into a read-only portion of the ostree, e.g. /usr/containers, and to use them in podman via additionalimagestores: [] in storage.conf.
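As a rough illustration of the storage.conf setup mentioned above, the following sketch writes a minimal config that lists a read-only image store. The store path /usr/containers is taken from the comment. The helper name write_storage_conf, the target file location, and the choice of the overlay driver are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: write a minimal storage.conf that registers a read-only
# additional image store. The function name and the "overlay" driver are
# assumptions for illustration, not something stated in the thread.
write_storage_conf() {
    conf_path=$1    # where to write the config, e.g. <rootfs>/etc/containers/storage.conf
    store_path=$2   # read-only store as seen on the running system, e.g. /usr/containers

    mkdir -p "$(dirname "$conf_path")"
    cat > "$conf_path" <<EOF
[storage]
driver = "overlay"

[storage.options]
# Stores listed here are only read from; pulls still go to the writable store.
additionalimagestores = [
  "$store_path",
]
EOF
}
```

It could be called as write_storage_conf "$ROOTFS/etc/containers/storage.conf" /usr/containers during the build, where ROOTFS is a hypothetical variable pointing at the tree being composed.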

The use case is to:

Having the images as part of the ostree is especially important in my use case. I'm operating in a very resource-restricted environment where updates should be deltas and, most importantly, image downloads should not go through podman itself. I'm also serving applications for which RPMs are either hard to build or which are only available as containers.

Alternatives I have looked at, but not tried yet, are:

With rpm-ostree and the treefile I've already tried to use podman in the post process script, which doesn't work since it's a restricted, unrecommended and most importantly network-free environment.

Next thing I'll try is to do it manually using the rpm-ostree compose and commit tools.

Eventually this got me thinking, why can't I define a set of registry:auth:image entries in the treefile, e.g. in container-images that are integrated into the ostree under e.g. /usr/containers so the images are pulled as part of the compose and eventually accessible read-only. This would give them the following properties, if I'm correct:

What do you guys think of this use-case? Is it way too niche? Also this might be way off-topic since this is RPM-OSTree not Container-OSTree...

cgwalters commented 3 years ago

Copy-pasting some of my reply from IRC (edited for format):

The container lifecycle thing is a super interesting topic; in OpenShift 4 we (someone else, not me) invented this concept of a "release image" which is like a super-container that exactly pins the sha256 of a bunch of other containers we tested together as a single unit; we don't want each little bit of the platform (which is a lot of containers) updating independently.

e.g. https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-03-18-075013 is a recent one

So lifecycle-binding containers with the host gives you that sanity of knowing "we tested version X" and you get exactly X on each device. This topic also came up in https://github.com/openshift/enhancements/pull/560 and I could imagine at some point we want to do this for OpenShift 4 too for that single node case; but doing so gets messy, would need to teach some of the container stack about these read-only images etc.

As rpm-ostree upstream I will say for sure if you hit any bugs in dropping container images into /usr/share/containers at build time that'd absolutely be treated as an important bug; at some point I am hopeful when we dedup the osbuild/cosa/etc stuff we have a high level opinionated tool that does this declaratively, including things like managing the podman systemd units.

ostree is explicitly designed to be not opinionated about what you put in it; it's not a build system, there's no required equivalent of the dpkg/rpm databases. And rpm-ostree ultimately generates ostree commits using RPMs as input, but we do ship a bit of not-RPM content today in FCOS. So this "container+OS binding" isn't a use case of CoreOS or OpenShift today but may be in the future, and I definitely want to support it.

(the mind-bending thing for me is if we try to add any container-related stuff natively to rpm-ostree the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).

cgwalters commented 3 years ago

Eventually this got me thinking, why can't I define a set of registry:auth:image entries in the treefile, e.g. in container-images that

The above hopefully answers this - we may eventually do container native stuff in this repository but it would greatly increase the scope of the project.

For now, the most "native" support for non-RPM content is using the ostree-layers functionality (doc entry). But obviously that's pretty "raw" - it's up to you to extract container images and commit them. But on the plus side, you can put anything in an ostree layer; this functionality is specifically used by CoreOS today with coreos-assembler to directly commit content from the config git repository.

Specifically, coreos-assembler auto-generates ostree commits from https://github.com/coreos/fedora-coreos-config/tree/testing-devel/overlay.d
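A minimal sketch of the ostree-layers route described above, under some assumptions: the extracted content already sits in a directory laid out like the target filesystem (e.g. usr/share/containers/... inside it), and the ref name, file names, and helper names are hypothetical. The first helper commits that directory as its own ref in the build repository. The second writes a small wrapper treefile that includes the existing treefile and lists the ref under ostree-layers.

```shell
#!/bin/sh
# Sketch only: helper names, ref names and file names are illustrative.

# Commit a prepared directory as a standalone ref in the build repo.
# The directory must mirror the target layout (usr/share/containers/...).
commit_overlay() {
    repo=$1; ref=$2; dir=$3
    ostree --repo="$repo" commit \
        --branch="$ref" \
        --owner-uid=0 --owner-gid=0 \
        --subject="container image overlay" \
        --tree=dir="$dir"
}

# Write a wrapper treefile that pulls in the base treefile and adds the
# overlay ref as an ostree layer.
write_layered_treefile() {
    out=$1; base=$2; ref=$3
    cat > "$out" <<EOF
include: $base
ostree-layers:
  - $ref
EOF
}
```

With these, a build might run commit_overlay "$BUILD_REPO" overlay/containers "$OVERLAY_DIR" and then pass the generated wrapper treefile to rpm-ostree compose tree instead of the base one. Because the layer is referenced by ref name rather than by commit checksum, the treefile would not need to change when the overlay content does.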

cgwalters commented 3 years ago

Here's an old discussion on exactly this topic too: https://mail.gnome.org/archives/ostree-list/2017-October/msg00009.html

cgwalters commented 3 years ago

(Deleted some comments filed on wrong issue)

cgwalters commented 3 years ago

(the mind-bending thing for me is if we try to add any container-related stuff natively to rpm-ostree the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).

To give just one example of this, rpm-ostree has extensive "diffing" logic around things like printing the rpm-level diff between two ostree commits, but we'd have to invent something custom to do that for containers. Of course, the diffing logic isn't required; rpm-ostree will happily upgrade things that come from not-RPM.

w4tsn commented 3 years ago

The container lifecycle thing is a super interesting topic; in OpenShift 4 we (someone else, not me) invented this concept of a "release image" which is like a super-container that exactly pins the sha256 of a bunch of other containers we tested together as a single unit; we don't want each little bit of the platform (which is a lot of containers) updating independently.

Essentially what you say in that whole comment applies to me. I'm currently using Fedora IoT and building my own Remix of it using a very simple cosa-inspired tool (might switch over to cosa at some point) to get a relatively fixed, reproducible system for edge devices in environments with paid-per-megabyte contracts or very slow connections. Also, as you say, the whole system is developed and tested as a whole at some point, and apart from the development of single applications and containers, the whole package is what is interesting for the prod deployment. So yeah, it would be very neat to also have the ability to pin containers in such a situation.

To give just one example of this, rpm-ostree has extensive "diffing" logic around things like printing the rpm-level diff between two ostree commits, but we'd have to invent something custom to do that for containers. Of course, the diffing logic isn't required; rpm-ostree will happily upgrade things that come from not-RPM.

I think showing some simple to retrieve information like create/update dates, image hashes, layer numbers or size would be a good first diff between two commits.

So for now I'll have a look at cosa, overlay.d and ostree-layers to see how I can incorporate images in /usr/share/containers in my build tool / pipeline and test this.

(the mind-bending thing for me is if we try to add any container-related stuff natively to rpm-ostree the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).

Actually I'm not quite sure which project would be better fitted for this. I suppose it would be rpm-ostree, since the specifics of containers pull, placement, diffs etc. in the OSTree is quite raw and low-level like managing RPMs is and cosa / osbuild etc. make use of this low-level stuff and manage many more things around that. So yeah, I suppose building it into my build tool or eventually cosa would be more of a work-around and rpm-ostree seems to me the better place for this

w4tsn commented 3 years ago

Documenting my current approach / findings:

I'm currently working on creating a second commit with various container images under /usr/containers like this:

# first create a regular rpm-ostree tree commit
rpm-ostree compose tree --unified-core --cachedir="$CACHE_DIR" --repo="$BUILD_REPO" --write-commitid-to="$COMMIT_FILE" "$WK_DIR/$OSTREE_FILE"

# just checkout /usr to a temp, empty sysroot
ostree --repo="$BUILD_REPO" checkout --verbose --subpath=/usr "$(cat "$COMMIT_FILE")" "$WK_DIR"/sysroot/usr

mkdir "$WK_DIR"/sysroot/usr/containers
podman --root "$WK_DIR"/sysroot/usr/containers pull docker.io/library/alpine
podman --root "$WK_DIR"/sysroot/usr/containers pull docker.io/nodered/node-red:1.2.9

# create an orphan commit based on the rpm-ostree compose using the sysroot dir only containing /usr including /usr/containers created previously
# specify the selinux policy necessary so follow up commits won't complain about missing selinux policies
new_commit=$(ostree --repo="$BUILD_REPO" --parent="$(cat "$COMMIT_FILE")" commit --tree=dir="$WK_DIR"/sysroot -s "$COMMIT_SUBJECT" --orphan --selinux-policy=/usr "$WK_DIR"/sysroot/usr)

# Create the commit in the actual branch using both of the previous commits layered over each other
ostree --repo="$BUILD_REPO" commit -b "$OSTREE_REF" -s "$COMMIT_SUBJECT" --tree=ref="$(cat "$COMMIT_FILE")" --tree=ref="$new_commit"

What I receive is the desired ostree, with selinux labels and containers etc.

The next problem is that some container images work while other, more complex ones don't. E.g. alpine is not a problem to include. node-red, however, causes ostree to complain: "error: Not a regular file or symlink: node". I have no idea how to resolve this yet, other than choosing a slightly different approach: pulling in compressed images and importing them with a service from the read-only part into e.g. memory under /run/containers.

jlebon commented 3 years ago

I'm currently working on creating a second commit with various container images under /usr/containers like this:

I would suggest using ostree-layers instead as Colin mentioned higher up in https://github.com/coreos/rpm-ostree/issues/2675#issuecomment-802139453. That way you avoid having to do a secondary ostree commit --parent step, which also loses a bunch of metadata that doesn't get carried over.

Another approach is to use rpm-ostree compose install which creates the rootfs, then podman to pull down the containers into it, and then rpm-ostree compose postprocess and rpm-ostree compose commit. That approach takes you farther from the CoreOS model though, so that might make it harder for you to leverage cosa if you want to do that down the road.
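The install / pull / postprocess / commit sequence could be scripted roughly as follows. This is a sketch under assumptions, not a verified recipe: the argument order of the three compose subcommands and the rootfs location under the destination directory reflect my reading of the rpm-ostree compose help output and should be checked with --help on your version. The helper names, the DRY_RUN switch, and the store path under /usr/share/containers/storage are made up for this example.

```shell
#!/bin/sh
# Sketch only. Verify subcommand arguments with `rpm-ostree compose <cmd> --help`.

# Run a command, or just print it when DRY_RUN is set (handy in CI reviews).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# compose_with_images REPO TREEFILE WORKDIR IMAGE...
compose_with_images() {
    repo=$1; treefile=$2; workdir=$3
    shift 3
    # Assumption: compose install creates the tree at WORKDIR/rootfs.
    rootfs=$workdir/rootfs
    # Assumption: images go into a store below /usr/share/containers.
    store=$rootfs/usr/share/containers/storage

    # 1. Install the RPM content into a plain directory.
    run rpm-ostree compose install --repo="$repo" "$treefile" "$workdir"
    # 2. Pull the images straight into the rootfs (needs network here).
    run mkdir -p "$store"
    for image in "$@"; do
        run podman --root "$store" pull "$image"
    done
    # 3. Postprocess and commit the combined tree.
    run rpm-ostree compose postprocess "$rootfs" "$treefile"
    run rpm-ostree compose commit --repo="$repo" "$treefile" "$rootfs"
}
```

Setting DRY_RUN=1 before calling compose_with_images prints the planned commands without running them, which makes it easier to compare the sequence against the documentation before a real build.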

Currently the next problem is, that some container images work while others more complex container images don't.

I'm not a container runtime SME, but you can probably just nuke any non-regfile. Container runtimes should populate /dev with the bare necessities for the container to function properly.
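One way to try the "nuke any non-regfile" suggestion is sketched below. It lists, and optionally removes, block devices, character devices, FIFOs and sockets under a directory, leaving regular files, symlinks and directories alone. The function names are hypothetical, and whether the resulting image store still behaves correctly is untested here.

```shell
#!/bin/sh
# Sketch only: function names are illustrative.

# Print every block device, character device, FIFO and socket below DIR.
list_special_files() {
    find "$1" \( -type b -o -type c -o -type p -o -type s \) -print
}

# Delete the same set of files. Regular files, symlinks and directories
# are not matched and therefore stay in place.
remove_special_files() {
    find "$1" \( -type b -o -type c -o -type p -o -type s \) -exec rm -f -- {} +
}
```

Running list_special_files first is probably wise. In overlayfs, a character device with device number 0:0 is a whiteout that marks a file as deleted by an upper layer, so removing such entries from an overlay-based image store may change what the image appears to contain.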

w4tsn commented 3 years ago

I would suggest using ostree-layers instead as Colin mentioned higher up in #2675 (comment). That way you avoid having to do a secondary ostree commit --parent step, which also loses a bunch of metadata that doesn't get carried over.

tbh. I don't quite understand how that is supposed to work yet. This config option takes a number of string refs to already existing commits to add them as layers in the compose tree step, right? This means I'd do the same commit command as above, just before running compose tree and without the --parent option - saving metadata in the process?

EDIT

I'm building this in CI so in order for this to work I'd have to automatically update the treefile after the container image commit is created. Little inconvenience.

And a key-learning from this comment I did not get from the docs is that compose install does not only install single RPMs but is the start / setup for manual alteration of the rootfs.

ENDEDIT

Another approach is to use rpm-ostree compose install which creates the rootfs, then podman to pull down the containers into it, and then rpm-ostree compose postprocess and rpm-ostree compose commit. That approach takes you farther from the CoreOS model though, so that might make it harder for you to leverage cosa if you want to do that down the road.

I already had a hard time understanding those commands from the man page and docs. Is this meant to be used in addition to the compose tree I already do or does it replace this with more fine-grained control of the process but I'll have to rework the single compose tree step and additionally do the podman / images steps? And if the former do I use this before or after the compose tree step? I'm struggling a bit since I understand in ostree that --parent, --tree, --orphan etc. gives me control on the relationship of subsequent commits. Is this implicitly handled by rpm-ostree? So if I do compose tree and then install, postprocess, commit afterwards will it know to build on the previous one?

I'm not a container runtime SME, but you can probably just nuke any non-regfile. Container runtimes should populate /dev with the bare necessities for the container to function properly.

The problem is more that ostree seems to have a problem with the node executable placed in the container image storage overlay. I'm not sure if that's easy to fix. Do you think the alternative approaches you mentioned will mitigate / solve this?

cgwalters commented 3 years ago

Yeah, we're missing docs around injecting non-rpm content into rpm-ostree. Will look at this somewhat soon.

w4tsn commented 3 years ago

I've now replaced the approach from above completely by starting with compose install then pulling the images into the rootfs and finishing this up with a compose commit as suggested. While this indeed feels like a more clean and slick approach, it still throws this error when trying to embed the nodejs based container image for node-red: error: While writing rootfs to mtree: Not a regular file or symlink: node.

@cgwalters I think you mentioned that I should be able to commit such things, so I suspect this to be some sort of unexpected behavior or even a bug?

cgwalters commented 3 years ago

Tangentially related to this issue - one might wonder why it doesn't work to just ship /var/lib/containers or ~/.local/share/containers in ostree today. First, note that ~/.local is really /var/home/$username/.local - they're both under /var. And ostree explicitly does not "own" or modify data in /var.

Conceptually a file should have a single "owner" (or multiple but with locking). In this particular case, either podman (containers/storage) should own it, or ostree should own it. If on an upgrade e.g. ostree started deleting/changing files underneath podman while containers were running, that would lead to chaos and a mess of ill-defined behavior.

This is why the right answer to having containers "lifecycle bound" with ostree is to ship them in /usr/share - that way they are clearly owned by ostree, and also the read-only bind mount will ensure that tools like podman/systemd-nspawn/bwrap/etc know they don't own the data and should just be reading it.

Now, one middle ground model is to ship the data in /usr/share but have a systemd unit that copies them into containers/storage location like /var/lib/containers. But IMO this mostly mixes the disadvantages of the two more than the advantages.

cgwalters commented 3 years ago

To further elaborate on this, another hybrid model that I'd like to enable with e.g. Fedora CoreOS is where we provide a tool to "preload" containers in the live ISO - here the ISO would install a combination of the OS + your containers as a versioned snapshot, but in-place updates for the OS and containers would be separate.

w4tsn commented 3 years ago

To further elaborate on this, another hybrid model that I'd like to enable with e.g. Fedora CoreOS is where we provide a tool to "preload" containers in the live ISO - here the ISO would install a combination of the OS + your containers as a versioned snapshot, but in-place updates for the OS and containers would be separate.

Actually this is a neat idea. We are building our own images and one of the irritating parts is if a system starts for the first time and behaves slightly differently / false because no container is actually present to run and there always has to be a download first which does not always work due to network restrictions. So we could potentially start by pulling in container images into the raw image. An iso installer would then be even more systematic.

Now, one middle ground model is to ship the data in /usr/share but have a systemd unit that copies them into containers/storage location like /var/lib/containers. But IMO this mostly mixes the disadvantages of the two more than the advantages.

I'd also like to avoid this. As far as I've checked, libostree does not yet seem to be able to handle the images' overlayfs correctly, or at least some implementation is missing, which prevents committing /usr/share/containers/storage. There are apparently work-arounds that remove certain file attributes / bits and re-apply them after a checkout / commit apply, but that's also not what I want. I have to take a look at whether I'm able to implement this in libostree.

w4tsn commented 3 years ago

Remotely related to this issue - is it possible to package a container image within an RPM? Just thought about layering an application with podman-systemd file but including the container image overlay such that I can rpm-ostree install it without the container being downloaded separately into read-write storage. Or to put it differently, I'd like the container image to be added to the system into the read-only portion under /usr/share as part of the installation process.

I could use this on the fly in a live system or in the tree-compose when it installs the specified RPM packages. Either with the image as part of the RPM or the image download into /usr/share as part of the installation process of the RPM.

jlebon commented 9 months ago

xref: https://github.com/containers/bootc/issues/128