w4tsn opened this issue 3 years ago
Copy-pasting some of my reply from IRC (edited for format):
The container lifecycle thing is a super interesting topic; in OpenShift 4 we (someone else, not me) invented this concept of a "release image" which is like a super-container that exactly pins the sha256 of a bunch of other containers we tested together as a single unit; we don't want each little bit of the platform (which is a lot of containers) updating independently.
e.g. https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-03-18-075013 is a recent one
So lifecycle-binding containers with the host gives you that sanity of knowing "we tested version X" and you get exactly X on each device. This topic also came up in https://github.com/openshift/enhancements/pull/560 and I could imagine at some point we want to do this for OpenShift 4 too for that single node case; but doing so gets messy, would need to teach some of the container stack about these read-only images etc.
Speaking as rpm-ostree upstream, I will say: if you hit any bugs dropping container images into /usr/share/containers at build time, that'd absolutely be treated as an important bug. At some point I am hopeful, when we dedup the osbuild/cosa/etc. stuff, that we'll have a high-level opinionated tool that does this declaratively, including things like managing the podman systemd units.
ostree is explicitly designed to be not opinionated about what you put in it; it's not a build system, there's no required equivalent of the dpkg/rpm databases. And rpm-ostree ultimately generates ostree commits using RPMs as input, but we do ship a bit of not-RPM content today in FCOS. So this "container+OS binding" isn't a use case of CoreOS or OpenShift today but may be in the future, and I definitely want to support it.
(the mind-bending thing for me is that if we try to add any container-related stuff natively to rpm-ostree, the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).
Eventually this got me thinking, why can't I define a set of registry:auth:image entries in the treefile, e.g. in container-images, that are integrated into the ostree under e.g. /usr/containers …
The above hopefully answers this - we may eventually do container native stuff in this repository but it would greatly increase the scope of the project.
For now, the most "native" support for non-RPM content is the ostree-layers functionality (doc entry). But obviously that's pretty "raw" - it's up to you to extract container images and commit them. But on the plus side, you can put anything in an ostree layer; this functionality is specifically used by CoreOS today with coreos-assembler to directly commit content from the config git repository.
Specifically, coreos-assembler auto-generates ostree commits from https://github.com/coreos/fedora-coreos-config/tree/testing-devel/overlay.d
Here's an old discussion on exactly this topic too: https://mail.gnome.org/archives/ostree-list/2017-October/msg00009.html
(Deleted some comments filed on wrong issue)
(the mind-bending thing for me is that if we try to add any container-related stuff natively to rpm-ostree, the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).
To give just one example of this, rpm-ostree has extensive "diffing" logic around things like printing the rpm-level diff between two ostree commits, but we'd have to invent something custom to do that for containers. Of course, the diffing logic isn't required; rpm-ostree will happily upgrade things that come from not-RPM.
The container lifecycle thing is a super interesting topic; in OpenShift 4 we (someone else, not me) invented this concept of a "release image" which is like a super-container that exactly pins the sha256 of a bunch of other containers we tested together as a single unit; we don't want each little bit of the platform (which is a lot of containers) updating independently.
Essentially what you say in that whole comment. I'm currently using Fedora IoT and I'm building my own remix of it using a very simple cosa-inspired tool (might switch over to cosa at some point) to get a relatively fixed, reproducible system for edge devices in environments with paid-per-megabyte contracts or very slow connections. Also, as you say, the whole system is developed and tested as a whole at some point, and apart from the development of single applications and containers, the whole package is what is interesting for the prod deployment. So yeah, it would be very neat to also have the ability to pin containers in such a situation.
To give just one example of this, rpm-ostree has extensive "diffing" logic around things like printing the rpm-level diff between two ostree commits, but we'd have to invent something custom to do that for containers. Of course, the diffing logic isn't required; rpm-ostree will happily upgrade things that come from not-RPM.
I think showing some simple-to-retrieve information like creation/update dates, image hashes, layer counts or sizes would be a good first diff between two commits.
So for now I'll have a look at cosa, overlay.d and ostree-layers to see how I can incorporate images in /usr/share/containers in my build tool / pipeline and test this.
(the mind-bending thing for me is that if we try to add any container-related stuff natively to rpm-ostree, the name suddenly becomes nonsensical...but...that's a bridge we may have to cross).
Actually I'm not quite sure which project would be a better fit for this. I suppose it would be rpm-ostree, since the specifics of container pull, placement, diffs etc. in the OSTree are quite raw and low-level, like managing RPMs is, and cosa / osbuild etc. make use of this low-level stuff and manage many more things around it. So yeah, I suppose building it into my build tool, or eventually cosa, would be more of a work-around, and rpm-ostree seems to me the better place for this.
Documenting my current approach / findings:
I'm currently working on creating a second commit with various container images under /usr/containers, like this:
# first create a regular rpm-ostree tree commit
rpm-ostree compose tree --unified-core --cachedir="$CACHE_DIR" --repo="$BUILD_REPO" --write-commitid-to="$COMMIT_FILE" "$WK_DIR/$OSTREE_FILE"
# just checkout /usr to a temp, empty sysroot
ostree --repo="$BUILD_REPO" checkout --verbose --subpath=/usr "$(cat "$COMMIT_FILE")" "$WK_DIR"/sysroot/usr
mkdir "$WK_DIR"/sysroot/usr/containers
podman --root "$WK_DIR"/sysroot/usr/containers pull docker.io/library/alpine
podman --root "$WK_DIR"/sysroot/usr/containers pull docker.io/nodered/node-red:1.2.9
# create an orphan commit based on the rpm-ostree compose, using the sysroot dir that contains only /usr (including the /usr/containers created above)
# specify the selinux policy so follow-up commits won't complain about missing selinux policies
new_commit=$(ostree --repo="$BUILD_REPO" commit --parent="$(cat "$COMMIT_FILE")" --tree=dir="$WK_DIR"/sysroot -s "$COMMIT_SUBJECT" --orphan --selinux-policy="$WK_DIR"/sysroot/usr)
# Create the commit in the actual branch using both of the previous commits layered over each other
ostree --repo="$BUILD_REPO" commit -b "$OSTREE_REF" -s "$COMMIT_SUBJECT" --tree=ref="$(cat "$COMMIT_FILE")" --tree=ref="$new_commit"
What I get is the desired ostree commit, with SELinux labels, containers, etc.
Currently the next problem is that some container images work while other, more complex, container images don't. E.g. alpine is no problem to include. node-red, however, causes ostree to complain: "error: Not a regular file or symlink: node". I have no idea how to resolve this yet, other than choosing a slightly different approach: pulling in compressed images and importing them with a service from the read-only part to e.g. memory in /run/containers.
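A quick way to at least see which entries ostree is rejecting - the error suggests it can only commit regular files, symlinks and directories, and overlay-format storage also contains things like char-device whiteouts:

# list everything in the pulled storage that is not a regular file,
# directory or symlink; these are the entries ostree refuses to commit
find "$WK_DIR"/sysroot/usr/containers ! -type f ! -type d ! -type l -exec ls -l {} +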
I'm currently working on creating a second commit with various container images under /usr/containers like this:
I would suggest using ostree-layers instead, as Colin mentioned higher up in https://github.com/coreos/rpm-ostree/issues/2675#issuecomment-802139453. That way you avoid having to do a secondary ostree commit --parent step, which also loses a bunch of metadata that doesn't get carried over.
Another approach is to use rpm-ostree compose install, which creates the rootfs, then podman to pull down the containers into it, and then rpm-ostree compose postprocess and rpm-ostree compose commit. That approach takes you farther from the CoreOS model though, so it might make it harder for you to leverage cosa if you want to do that down the road.
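Roughly like this - a sketch, untested, reusing the variables from your script above; check the compose manpages for the exact flags and arguments:

# split the compose into separate install / pull / postprocess / commit steps
rpm-ostree compose install --unified-core --cachedir="$CACHE_DIR" "$WK_DIR/$OSTREE_FILE" "$WK_DIR"/build
# the assembled rootfs should end up under $WK_DIR/build/rootfs
podman --root "$WK_DIR"/build/rootfs/usr/share/containers/storage pull docker.io/library/alpine
rpm-ostree compose postprocess "$WK_DIR"/build/rootfs "$WK_DIR/$OSTREE_FILE"
rpm-ostree compose commit --repo="$BUILD_REPO" "$WK_DIR/$OSTREE_FILE" "$WK_DIR"/build/rootfs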
Currently the next problem is that some container images work while other, more complex, container images don't.
I'm not a container runtime SME, but you can probably just nuke any non-regfile. Container runtimes should populate /dev with the bare necessities for the container to function properly.
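e.g. something like this (destructive, obviously, and assuming the storage path from your script):

# delete every device node, fifo and socket from the pulled storage
find "$WK_DIR"/sysroot/usr/containers \( -type c -o -type b -o -type p -o -type s \) -delete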
I would suggest using ostree-layers instead as Colin mentioned higher up in #2675 (comment). That way you avoid having to do a secondary ostree commit --parent step, which also loses a bunch of metadata that doesn't get carried over.
tbh I don't quite understand how that is supposed to work yet. This config option takes a number of string refs to already-existing commits to add them as layers in the compose tree step, right? This means I'd do the same commit command as above, just before running compose tree and without the --parent option - saving the metadata in the process?
EDIT
I'm building this in CI so in order for this to work I'd have to automatically update the treefile after the container image commit is created. Little inconvenience.
And a key learning from this comment that I did not get from the docs is that compose install does not only install single RPMs, but is the start / setup for manual alteration of the rootfs.
ENDEDIT
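So something like this, I suppose (ref name invented):

# 1. commit the container images to a dedicated ref, no --parent needed
ostree --repo="$BUILD_REPO" commit -b overlay/containers --tree=dir="$WK_DIR"/sysroot -s "container images"
# 2. have CI point the treefile at that ref before composing:
#      ostree-layers:
#        - overlay/containers
rpm-ostree compose tree --unified-core --repo="$BUILD_REPO" "$WK_DIR/$OSTREE_FILE"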
Another approach is to use rpm-ostree compose install which creates the rootfs, then podman to pull down the containers into it, and then rpm-ostree compose postprocess and rpm-ostree compose commit. That approach takes you farther from the CoreOS model though, so that might make it harder for you to leverage cosa if you want to do that down the road.
I already had a hard time understanding those commands from the man page and docs. Is this meant to be used in addition to the compose tree I already do, or does it replace it with more fine-grained control of the process, so that I'll have to rework the single compose tree step and additionally do the podman / image steps? And if the former, do I use this before or after the compose tree step? I'm struggling a bit, since I understand that in ostree --parent, --tree, --orphan etc. give me control over the relationship of subsequent commits. Is this implicitly handled by rpm-ostree? So if I do compose tree and then install, postprocess, commit afterwards, will it know to build on the previous one?
I'm not a container runtime SME, but you can probably just nuke any non-regfile. Container runtimes should populate /dev with the bare necessities for the container to function properly.
The issue is more that ostree seems to have a problem with the node executable placed in the container image storage overlay. I'm not sure if that's easy to fix. Do you think the alternative approaches you mentioned will mitigate / solve this?
Yeah, we're missing docs around injecting non-rpm content into rpm-ostree. Will look at this somewhat soon.
I've now replaced the approach from above completely by starting with compose install, then pulling the images into the rootfs, and finishing up with a compose commit as suggested. While this indeed feels like a cleaner and slicker approach, it still throws this error when trying to embed the nodejs-based node-red container image: error: While writing rootfs to mtree: Not a regular file or symlink: node.
@cgwalters I think you mentioned that I should be able to commit such things, so I suspect this to be some sort of unexpected behavior or even a bug?
Tangentially related to this issue - one might wonder why it doesn't work to just ship /var/lib/containers or ~/.local/share/containers in ostree today. First, note that ~/.local is really /var/home/$username/.local - they're both under /var. And ostree explicitly does not "own" or modify data in /var.
Conceptually a file should have a single "owner" (or multiple but with locking). In this particular case, either podman (containers/storage) should own it, or ostree should own it. If on an upgrade e.g. ostree started deleting/changing files underneath podman while containers were running, that would lead to chaos and a mess of ill-defined behavior.
This is why the right answer to having containers "lifecycle bound" with ostree is to ship them in /usr/share - that way they are clearly owned by ostree, and the read-only bind mount will ensure that tools like podman/systemd-nspawn/bwrap/etc. know they don't own the data and should just be reading it.
Now, one middle-ground model is to ship the data in /usr/share but have a systemd unit that copies them into a containers/storage location like /var/lib/containers. But IMO this mostly mixes the disadvantages of the two approaches more than the advantages.
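To illustrate (paths and names invented; this variant ships each image as an archive under the read-only /usr and imports it into the default writable storage on boot):

# hypothetical oneshot unit importing an image archive shipped in /usr/share
cat > /etc/systemd/system/preload-images.service <<'EOF'
[Unit]
Description=Import container images shipped in /usr/share
Before=my-app.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/podman load -i /usr/share/containers/images/my-app.tar

[Install]
WantedBy=multi-user.target
EOF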
To further elaborate on this, another hybrid model that I'd like to enable with e.g. Fedora CoreOS is where we provide a tool to "preload" containers in the live ISO - here the ISO would install a combination of the OS + your containers as a versioned snapshot, but in-place updates for the OS and containers would be separate.
To further elaborate on this, another hybrid model that I'd like to enable with e.g. Fedora CoreOS is where we provide a tool to "preload" containers in the live ISO - here the ISO would install a combination of the OS + your containers as a versioned snapshot, but in-place updates for the OS and containers would be separate.
Actually this is a neat idea. We are building our own images, and one of the irritating parts is when a system starts for the first time and behaves slightly differently / incorrectly because no container is actually present to run, and there always has to be a download first, which does not always work due to network restrictions. So we could potentially start by pulling container images into the raw image. An ISO installer would then be even more systematic.
Now, one middle ground model is to ship the data in /usr/share but have a systemd unit that copies them into containers/storage location like /var/lib/containers. But IMO this mostly mixes the disadvantages of the two more than the advantages.
I'd also like to avoid this. As far as I've checked, libostree does not yet seem to be able to handle the images' overlayfs correctly, or at least some implementation is missing, which prevents committing /usr/share/containers/storage. There are apparently work-arounds involving removing certain file attributes / bits and re-applying them after a checkout / commit - but well, that's also not what I want. I'll have to take a look at whether I can implement this in libostree.
Remotely related to this issue - is it possible to package a container image within an RPM? I'm thinking about layering an application with a podman-systemd file but including the container image overlay, such that I can rpm-ostree install it without the container being downloaded separately into read-write storage. Or to put it differently, I'd like the container image to be added to the read-only portion of the system under /usr/share as part of the installation process.
I could use this on the fly in a live system or in the tree-compose when it installs the specified RPM packages - either with the image as part of the RPM, or with the image download into /usr/share happening as part of the installation process of the RPM.
I know this is a very old issue, but we've been trying to solve this problem for a few years already for a similar use case so I thought I'd chime in for future readers.
A few more recent things worth mentioning on this topic that have happened since the original discussion:
bootc has largely superseded rpm-ostree compose, and is effectively just a container image specified via Containerfile starting from a bootc base image that has some extra tools and bits.
As a point of reference, osbuild has an Automotive SIG set of repos hosted on GitLab instead of GitHub, which includes instructions for how to embed container images in the output. Unfortunately this effectively just breaks down to what was being asked for here, and runs skopeo copy on a list of images specified in the osbuild equivalent of treefile syntax. The actual images being copied into /usr/share/containers/storage have to be available from a registry or in the rootfs somewhere already, and don't have a clear way to be coupled with an RPM holding associated non-container files. That step also happens as part of the overall ostree build, so it needs to occur during ostree composing, looks like it may create a new ostree commit for each image added (unclear; I can't find any of the server-side code for ostree composition in the osbuild organization's repos), and means you can't do something like copying an RPM via USB stick to a development system and installing an updated/additional container image (because manual post-RPM-install steps on an unlocked rootfs would need to be performed as part of an ostree compose).
Remotely related to this issue - is it possible to package a container image within an RPM? I'm thinking about layering an application with a podman-systemd file but including the container image overlay, such that I can rpm-ostree install it without the container being downloaded separately into read-write storage. Or to put it differently, I'd like the container image to be added to the read-only portion of the system under /usr/share as part of the installation process. I could use this on the fly in a live system or in the tree-compose when it installs the specified RPM packages - either with the image as part of the RPM, or with the image download into /usr/share happening as part of the installation process of the RPM.
We originally planned to do this by stuffing the images into the RPMs since our images have to run in an air-gapped local-only k3s cluster as well, and have some other configuration files that need to go along with the images.
A few major factors that affect this are: what format do the images need to be in to be packaged? Where do they need to end up on the installed system? And what is allowed to run at install time (e.g. in a %post scriptlet)?
Pretty much any solution involving RPMs requires some pre-SRPM step to get the images into a format that makes them part of a Source#: you can work with in the RPM. Your RPM specfile then has to put the content wherever it needs to go, in whatever format it needs to be, as part of one of the build-time scriptlets, and you may need to do some post-install fixups/translations in a %post or %posttrans scriptlet.
The initial solution we had was to spin up a little localhost registry that has its backing storage bind-mounted to a local folder. We push our images into it, shut down the registry, and save the registry storage folder as part of the Source#: used by our RPM. Within the RPM specfile, we extract and copy these files to some pre-defined system location shared by all the RPMs providing images. At run-time our system spins up a localhost registry container and mounts that system folder as its backing store. All our images now have to reference the localhost registry's domain name to pull the images into the image cache (or mirrors for all possible domain names are defined to point to that registry).
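The build-time half of that looked roughly like this (container name, port and image invented):

# run a throwaway registry with its storage bind-mounted to a local folder
mkdir -p ./registry-data
podman run -d --name build-reg -p 127.0.0.1:5000:5000 -v "$PWD/registry-data":/var/lib/registry docker.io/library/registry:2
# push the images we want to ship into it
skopeo copy --dest-tls-verify=false docker://docker.io/library/alpine:latest docker://localhost:5000/alpine:latest
podman stop build-reg && podman rm build-reg
# ./registry-data then becomes part of the Source#: tarball for the RPM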
Pros: No network access required. All images can be in the ostree if that's the "system location" you specify. Possible to set up multiple local mirror registries so some are readable and some writeable. Shared layers between images are de-duped. Image cache doesn't need to be a containers-storage backend (i.e. containerd or docker engines can be used).
Cons: Have 2x the images on your system, one in the registry and another in the image cache. Your image cache also has to be run-time writeable. Maintenance of the run-time-writeable cache has to be linked to the ostree deployment (simple solution is to blow it away and re-pull for each reboot, but that causes lots of disk churn). "Pulling" of images has to be allowed, even though it should only be going to your localhost, but if you make a mistake it could end up trying to go external for missing images.
Our next, better solution focused in on the containers-storage-specific tools, and attempted to reduce the number of copies of images on the system. To do that we had to find some way to populate a containers-storage location at RPM-install time.
Pros: Single copy of images. Images tracked with the ostree deployment. Layer file contents implicitly de-duped by the ostree format of the rootfs. No network access required. No "pull" permissions required. Can be split read-only and read-write with the storage.conf additionalimagestores setting.
Cons: Limited to the containers-storage backend. Unclear how to populate it.
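For reference, that split looks roughly like this in storage.conf (a minimal sketch, not our exact config):

cat > /etc/containers/storage.conf <<'EOF'
[storage]
driver = "overlay"
# writable run-time cache stays under /var
graphroot = "/var/lib/containers/storage"
runroot = "/run/containers/storage"

[storage.options]
# read-only store shipped inside the ostree
additionalimagestores = [ "/usr/share/containers/storage" ]
EOF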
First we tried creating a new podman --root=... location as part of the pre-RPM formatting, and just packing that into the RPM. Pretty quickly you find out, though, that the underlying format has all files from all layers uncompressed in place for each of the layers. That gives all kinds of conflicts with RPM builds, which try to parse everything being installed and sanitize/fix up content. We disabled basically everything RPM builds automatically try to do, and it still wouldn't let the raw files be included in the RPM without corrupting them or erroring out during the build.
Next we tried using the oci-dir (oci:) format, which is a directory with each layer as a gzip-compressed blob instead of a directory full of the raw contents, and which supports having multiple tagged images in the same oci-dir folder. We could easily get this included in the RPM and installed somewhere, but it's not in a format that containers-storage: can directly make use of. So we tried using a %post or %posttrans scriptlet to run skopeo copy oci:... containers-storage:... or podman pull oci:... to get it into the container cache.
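The scriptlet attempts boiled down to commands of this shape (image names invented; the [driver@graphroot+runroot] specifier redirects the storage root away from the defaults):

# copy from the oci-dir installed by the RPM into a redirected containers-storage
skopeo copy oci:/usr/share/myapp/oci-dir:myapp \
  "containers-storage:[overlay@/usr/share/containers/storage+/run/containers/storage]localhost/myapp:1.0"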
In both these cases we were redirecting the storage root to /usr/share/containers/storage to keep it within the ostree (rather than the default /var/lib/containers); however, it ran afoul of how rpm-ostree handles RPM installs.
We learned that rpm-ostree runs the scriptlets in a bwrap/bubblewrap instance. This makes most of the system inaccessible, and many of the defaults from /usr/share/containers/storage.conf don't work. You have to redirect the runroot setting to point to a writeable ephemeral location in the bwrap, and you have to manually (re)populate /etc/sub{u,g}id, because it's mounted as a blank ephemeral folder by bwrap and both podman and skopeo use user namespaces for populating the containers-storage. Furthermore, podman is simply unrunnable in the bwrap environment: something in its internal pre-run check always errors out saying "file not found" with no details whatsoever (even with --log-level=trace), and this occurs even if just running podman info. skopeo is able to run, but ends up running into other problems depending on which storage driver you're using.
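Concretely, the fix-ups in the scriptlet looked something like this (uid ranges and image names invented; all of this runs inside rpm-ostree's bwrap during %post/%posttrans):

# repopulate the blank ephemeral /etc/sub{u,g}id that bwrap mounts over
echo "root:100000:65536" >> /etc/subuid
echo "root:100000:65536" >> /etc/subgid
# redirect the runroot to a writable ephemeral location via the storage
# specifier, since the storage.conf defaults don't exist in the bwrap
skopeo copy oci:/usr/share/myapp/oci-dir:myapp \
  "containers-storage:[vfs@/usr/share/containers/storage+/tmp/containers-runroot]localhost/myapp:1.0"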
When using the vfs storage driver, skopeo errors out during the extraction of the gzip layers into the image cache location. It seems to succeed on some layers, but inevitably hits something where the gzip extraction of the oci-dir blob suddenly errors out saying it can't get lgetxattrs for a file in one of the layers. We weren't able to get any further info to figure out what was going on.
When using the overlay storage driver, which is probably what you'd want anyway, you run into the fact that bwrap is apparently using an overlayfs for the filesystem isolation and mounting. That's a known problem when trying to create or copy images with the overlay storage driver, but the podman-in-a-container setups seem to have a solution for it: use fuse-overlayfs as the mount_program. Even still, however, something in the rpm-ostree bwrap configuration is preventing it from working (maybe a CAP_? is missing?) and it just keeps reporting filesystem type 0x65735546 reported for /usr/share/containers/storage is not supported with 'overlay': backing file system is unsupported for this graph driver.
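(For what it's worth, 0x65735546 is the FUSE superblock magic, so the storage path itself appears to sit on a FUSE mount inside the bwrap - likely rpm-ostree's rofiles-fuse mount of /usr.) The mount_program workaround we tried corresponds to a storage.conf fragment roughly like this:

cat > /etc/containers/storage.conf <<'EOF'
[storage]
driver = "overlay"
graphroot = "/usr/share/containers/storage"
runroot = "/tmp/containers-runroot"

[storage.options.overlay]
# the usual podman-in-container workaround for FUSE/overlay backing filesystems
mount_program = "/usr/bin/fuse-overlayfs"
EOF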
There isn't any other daemonless way that we can find to populate the containers-storage: than to call podman or skopeo with the graphroot redirected, and the location has to be part of the ostree that's not writeable after the RPM install is complete, or it fails to satisfy the intended purpose. So as far as we can tell, rpm-ostree's bwrap configuration directly blocks any interaction with containers-storage:, thereby inhibiting any RPM-based deployment of images into a read-only cache.
Hi, as of recently the focus of the maintainers of this project is now on bootc, and on that topic one thing you may have not seen that's new is https://containers.github.io/bootc/logically-bound-images.html which I think is often what is desired here.
Hi, as of recently the focus of the maintainers of this project is now on bootc, and
Good to know. For someone on the outside trying to figure out what's going to work tomorrow, it's very good to know which way things are leaning.
on that topic one thing you may have not seen that's new is https://containers.github.io/bootc/logically-bound-images.html which I think is often what is desired here.
I forgot I'd seen that and should have mentioned it in the new info.
However, this original ticket was about physically bound images, and had a use case where logically bound images explicitly won't work. I happen to be in the same situation myself and was looking for workarounds. I've seen a lot of people trying to figure out a similar situation (it seems the primary uptake for Atomic in industry has been regulated industries and IoT, where physically bound images are the only option?).
I'm currently trying to integrate specific container images into a read-only portion of the ostree, e.g. /usr/containers, using it in podman with additionalimagestores: [] in storage.conf.
Having the images as part of the ostree is especially important in my use-case because I'm operating in a very restricted resource area where updates should be deltas and most importantly image download should not occur through podman itself and I'm serving applications where RPMs are either hard to build or only available as containers.
There are alternatives I have looked at, but not tried yet.
With rpm-ostree and the treefile I've already tried to use podman in the post-process script, which doesn't work since it's a restricted, unrecommended and most importantly network-free environment.
Next thing I'll try is to do it manually using the rpm-ostree compose and commit tools.
Eventually this got me thinking, why can't I define a set of registry:auth:image entries in the treefile, e.g. in container-images, that are integrated into the ostree under e.g. /usr/containers, so the images are pulled as part of the compose and are eventually accessible read-only?
What do you guys think of this use-case? Is it way too niche? Also this might be way off-topic since this is RPM-OSTree, not Container-OSTree...