CentOS / centos-bootc

Create and maintain base bootable container images from Fedora ELN and CentOS Stream packages
https://centos.github.io/centos-bootc

Tracker for support for nested containers #282

Open cgwalters opened 4 months ago

cgwalters commented 4 months ago

This relates to https://github.com/containers/bootc/issues/128 - but isn't quite the same thing. Let's use this as a tracker for supporting "nesting" container images.

We should ideally support something like this:

FROM quay.io/centos-bootc/centos-bootc:stream9
RUN podman --storage-driver=vfs --root=/usr/share/containers/storage pull <someimage>
COPY somecontainer.container /usr/share/containers/systemd

Where somecontainer.container is a podman systemd unit that also uses:

PodmanArgs=--root=/usr/share/containers/storage
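For illustration, here is a minimal sketch of what such a unit might look like, written out by a small shell helper. Only the PodmanArgs= line comes from the comment above. The helper name write_quadlet, the placeholder image quay.io/example/someimage:latest, and the [Unit] and [Install] sections are assumptions made for the example. Whether podman accepts --root in this position has not been verified here.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal quadlet unit into the directory given as $1.
write_quadlet() {
    dest_dir=$1
    mkdir -p "$dest_dir"
    cat > "$dest_dir/somecontainer.container" <<'EOF'
[Unit]
Description=Example container run from the image-embedded store

[Container]
# Placeholder image reference (assumption, not from the thread)
Image=quay.io/example/someimage:latest
# From the thread: point podman at the store baked into the image
PodmanArgs=--root=/usr/share/containers/storage

[Install]
WantedBy=multi-user.target
EOF
}
```

Calling write_quadlet with a scratch directory produces the file that the COPY line in the Containerfile above would then place under /usr/share/containers/systemd.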

The reason I mentioned --storage-driver=vfs is to avoid overlayfs and nested whiteouts...I think as of recent overlayfs this is supported at runtime, but...I can't make a whiteout in a default podman run invocation; I think the device cgroup may be coming into play?

$ cat Containerfile
FROM quay.io/centos/centos:stream9
RUN mknod somewh c 0 0
$ podman build -t localhost/test .
STEP 1/2: FROM quay.io/centos/centos:stream9
STEP 2/2: RUN mknod somewh c 0 0
mknod: somewh: Operation not permitted
Error: building at STEP "RUN mknod somewh c 0 0": while running runtime: exit status 1
$

Even if we could make the whiteout, I think we'd run into problems because there's no standard for nesting them at the OCI level. Also xref https://www.spinics.net/lists/linux-unionfs/msg11253.html
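To see whether a given storage tree actually contains whiteouts of the kind the mknod test tries to create (character devices with device number 0/0), something like the following sketch could be used. The function name list_whiteouts is made up for the example, and it assumes GNU find and GNU stat.

```shell
#!/bin/sh
# List overlayfs-style whiteouts (character devices with device number 0/0)
# under the directory given as $1. Prints one path per line and prints
# nothing if there are none. Relies on GNU find and GNU stat.
list_whiteouts() {
    # %t and %T are the major and minor device numbers in hex
    find "$1" -type c -exec stat -c '%t:%T %n' {} + 2>/dev/null |
        sed -n 's/^0:0 //p'
}
```

For example, list_whiteouts /usr/share/containers/storage would print any whiteout entries that ended up in the embedded store. Other character devices such as /dev/null are not reported because their device numbers are not 0/0.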

DanielFroehlich commented 4 months ago

It is also important that the reference to <someimage> works unmodified with the runtime, e.g. when used in a systemd unit file, in scripts calling podman, in MicroShift, etc., no matter what kind of reference it is (labels, SHA digest, etc.). Digests in particular used to be a problem because they could change when moving or embedding the OCI container image. MicroShift/OpenShift release images rely on digest references.
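As a small aid for this requirement, a build script could check up front that an image reference is pinned by digest before embedding it. This is only a sketch: the function name is_digest_pinned and the example reference are hypothetical, and it only checks the textual form name@sha256:<64 hex characters>, not that the digest still matches after embedding.

```shell
#!/bin/sh
# Return 0 if the image reference in $1 is pinned by a sha256 digest
# (name@sha256:<64 lowercase hex chars>), 1 otherwise.
is_digest_pinned() {
    printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
}
```

A script could use it as: is_digest_pinned "$ref" || echo "warning: $ref is not a digest reference".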

rhatdan commented 4 months ago

Why not use additional image stores for this? The latest containers-common sets up Podman and Buildah to automatically look for an additional store in /usr/lib/containers/storage. If images are pulled into this store, Podman will use it as a read-only store and /var/lib/containers/storage as the read/write store.

alexlarsson commented 4 months ago

I think using vfs backend is a bad idea btw, at least if you run non-readonly containers, because the vfs driver cannot use overlayfs for the container upper layer. The ideal approach would be to use the overlayfs backend with composefs enabled, because then there will be no whiteout files in the container storage (they are all inside the composefs blob in the storage).

rhatdan commented 3 months ago

Adding in CAP_SYS_ADMIN seems to allow this to work?

$ podman build --cap-add SYS_ADMIN /tmp
STEP 1/2: FROM quay.io/centos-bootc/centos-bootc:stream9
STEP 2/2: RUN podman --root=/usr/share/containers/storage pull alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob sha256:4abcf20661432fb2d719aaf90656f55c287f8ca915dc1c92ec14ff61e67fbaf8
Copying config sha256:05455a08881ea9cf0e752bc48e61bbd71a34c029bb13df01e40e3e70e0d007bd
Writing manifest to image destination
05455a08881ea9cf0e752bc48e61bbd71a34c029bb13df01e40e3e70e0d007bd
COMMIT
--> c8edcbce04cd
c8edcbce04cda8c52eb2043f9bcd23c74cb6a1e90948bb08dde27f2bfd31b7bd

rhatdan commented 3 months ago

Here is a little test I did to make this work.

$ cat /tmp/Containerfile
FROM quay.io/centos-bootc/centos-bootc:stream9
RUN sed -e '/additionalimage.*/a "/usr/lib/containers/storage",' -i /etc/containers/storage.conf
RUN podman --root=/usr/lib/containers/storage pull alpine

$ podman build -t bootc --cap-add SYS_ADMIN /tmp
STEP 1/3: FROM quay.io/centos-bootc/centos-bootc:stream9
STEP 2/3: RUN podman --root=/usr/lib/containers/storage pull alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob sha256:4abcf20661432fb2d719aaf90656f55c287f8ca915dc1c92ec14ff61e67fbaf8
Copying config sha256:05455a08881ea9cf0e752bc48e61bbd71a34c029bb13df01e40e3e70e0d007bd
Writing manifest to image destination
05455a08881ea9cf0e752bc48e61bbd71a34c029bb13df01e40e3e70e0d007bd
--> b4e3d3d3506b
STEP 3/3: RUN sed -e '/additionalimage.*/a "/usr/lib/containers/storage",' -i /etc/containers/storage.conf
COMMIT bootc
--> a94b77143258
Successfully tagged localhost/bootc:latest

$ podman run -ti --cap-add SYS_ADMIN bootc podman images
REPOSITORY                TAG     IMAGE ID      CREATED      SIZE     R/O
docker.io/library/alpine  latest  05455a08881e  8 weeks ago  7.67 MB  true

In order to use Overlay within a container you need to run the container with CAP_SYS_ADMIN or play with rootless containers.
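To make the effect of the sed line in the test above easier to see, here is a sketch that applies the same expression to a scratch file instead of /etc/containers/storage.conf. The helper name add_additional_store is hypothetical, the three-line stanza is an assumption about what the relevant part of the stock storage.conf looks like, and the one-line form of the sed 'a' command assumes GNU sed.

```shell
#!/bin/sh
# Apply the sed expression from the test above to the storage.conf given as $1.
# Assumes GNU sed (one-line form of the 'a' command, and -i).
add_additional_store() {
    sed -e '/additionalimage.*/a "/usr/lib/containers/storage",' -i "$1"
}

# Demo on a scratch file that mimics the relevant stanza of storage.conf
# (the exact stock layout is an assumption).
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
[storage.options]
additionalimagestores = [
]
EOF
add_additional_store "$demo_conf"
cat "$demo_conf"
```

With that layout, the scratch file should end up with "/usr/lib/containers/storage", as the only entry inside the additionalimagestores array, which is what lets podman images list the pulled image as read-only.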

cgwalters commented 3 months ago

> In order to use Overlay within a container you need to run the container with CAP_SYS_ADMIN or play with rootless containers.

We're having a realtime conversation about this and I think there's general agreement that if the problem is that podman pull is trying to do an overlayfs mount, then the fix would be for podman to stop doing that.

I'm still uncertain about whiteouts. I agree with Alex that they would be much better handled by composefs, which avoids the need to write metadata directly into the container image filesystem in general.

sallyom commented 3 months ago

Cross-building from an ARM M2 machine for x86_64 (after adding --cap-add SYS_ADMIN) hits an issue:

$ cat Containerfile
FROM quay.io/centos-bootc/centos-bootc:stream9
RUN podman pull alpine && podman pull busybox

This builds fine from my ARM M2 machine:

podman build --arch aarch64 -t myimage:arm --cap-add SYS_ADMIN .

This fails from my ARM M2 machine:

podman build --arch x86_64 -t myimage:amd64 --cap-add SYS_ADMIN .

and here's the weird error:

STEP 2/2: RUN podman pull alpine && podman pull busybox
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob sha256:4abcf20661432fb2d719aaf90656f55c287f8ca915dc1c92ec14ff61e67fbaf8
Error: copying system image from manifest list: writing blob: adding layer with blob "sha256:4abcf20661432fb2d719aaf90656f55c287f8ca915dc1c92ec14ff61e67fbaf8": processing tar file(Error: unrecognized command `podman /`

Did you mean this?
    cp
    ps
    rm

Try 'podman --help' for more information
): exit status 125
Error: building at STEP "RUN podman pull alpine && podman pull busybox": while running runtime: exit status 125

DanielFroehlich commented 3 months ago

Thx for progressing on this!

I would feel better with some automated CI test cases that mimic the actual use case as a smoke test: a container image with whiteouts (!!!) referenced by SHA digest in the Containerfile. Then boot the resulting image with bootc and ensure that the image referenced by the same digest as in the Containerfile comes up and works correctly. The reason: we had the same situation with Blueprints and image builder. It initially looked like it was working, but it actually was not. And this is a must-have feature for MicroShift / edge deployments in air-gapped / disconnected use cases.

And to add an additional requirement: building these images has to work on OpenShift in a CI/CD pipeline without cluster-admin privileges.

rhatdan commented 3 months ago

The issue seems to be that podman without CAP_SYS_ADMIN falls back to setting up a user namespace with a single mapping. I am talking to @giuseppe about whether this is required or how we could work around it. For now this will work fine with CAP_SYS_ADMIN added to the build. I don't see any issues with the whiteouts being stored in the images, as they normally are on a host. Running containers on containers blocks overlay on overlay, but I don't think this is an issue we would see here.

giuseppe commented 3 months ago

When we configure the user namespace we don't know what command is going to be executed by Podman, so we don't check for that combination (and we possibly also need CAP_SETFCAP); we only check for CAP_SYS_ADMIN.

I think it is correct this way because even if you pull the images in that environment, you won't be able to use them until you gain CAP_SYS_ADMIN, and setting the user namespace will probably use different mappings.
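To check from inside a build step which of these situations applies, the following sketch inspects /proc directly. It assumes that "single mapping" means a uid_map with one line mapping exactly one ID. The function names classify_uid_map and has_cap_sys_admin are made up for the example. Bit 21 is CAP_SYS_ADMIN in the Linux capability numbering.

```shell
#!/bin/sh
# Classify a uid_map file ($1, default /proc/self/uid_map).
# Prints "single" for one line mapping exactly one ID, "full" for the
# identity map of the initial user namespace, and "ranges" otherwise.
classify_uid_map() {
    awk '
        { n++; inside = $1; outside = $2; count = $3 }
        END {
            if (n == 1 && count == 1)
                print "single"
            else if (n == 1 && inside == 0 && outside == 0 && count == 4294967295)
                print "full"
            else
                print "ranges"
        }
    ' "${1:-/proc/self/uid_map}"
}

# Return 0 if the CapEff line of a status file ($1, default /proc/self/status)
# has bit 21 (CAP_SYS_ADMIN) set, 1 otherwise.
has_cap_sys_admin() {
    cap_hex=$(awk '/^CapEff:/ { print $2 }' "${1:-/proc/self/status}")
    [ -n "$cap_hex" ] && [ $(( (0x$cap_hex >> 21) & 1 )) -eq 1 ]
}
```

Adding a RUN step that calls both functions without arguments, once with and once without --cap-add SYS_ADMIN, would show whether the build container is in the fallback user namespace described above.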

cgwalters commented 2 months ago

Also relevant is https://github.com/ostreedev/ostree/pull/2722