bootc integration tracker

cgwalters commented 5 months ago

Feature request description

:wave: the https://github.com/containers/bootc project is intending to closely tie to podman. We actually have https://github.com/containers/bootc/pull/215 which we queued up to make this an explicitly hard dependency.

I was working on https://github.com/containers/bootc/pull/559 and it'd probably be better to have most of this logic live in podman. For that feature it'd be something like podman pull --all-images-from-root <dirfd> where <dirfd> is an openat2() fd for the mounted new root. Actually when we do this we probably need to design something more general where images can be "pinned" in this case by...hmm, well, we could just teach podman to ask bootc what active roots there are? So it'd be denied by default podman rmi an image referenced in a root. Though there are some fun details here around how possibly floating tags behave.

Anyways implementing stuff like this is going to start to involve cross-calling between podman and bootc in general. Here's my initial thought: we add something like podman bootc-internals which is intended to be a hidden implementation detail of the backend of some things that are logically bootc features to start.

(One side challenge I'm trying to handle here is that while so far bootc is all Rust, I think as folks here know, there's a lot more people that can write Go for better or worse, and having a "land the code in podman" as a generic escape hatch might help...that has its own tradeoffs of course)

Suggest potential solution

A new podman bootc-backend API entrypoint.

Have you considered any alternatives?

We could of course try to make things pluggable on either end...in theory of course, some of what I'm talking about here could apply to other image-based update systems. And in theory, bootc could learn to speak to docker too...

But of course, the tradeoff there is a lot more complexity.

Luap99 commented 5 months ago

Keep in mind I have not looked at bootc at all. (I know what it is broadly speaking but not what features it actually needs)

Also, has this been discussed with other podman maintainers and I am just out of the loop or is this your starting point?

I was working on https://github.com/containers/bootc/pull/559 and it'd probably be better to have most of this logic live in podman. For that feature it'd be something like podman pull --all-images-from-root where is an openat2() fd for the mounted new root. Actually when we do this we probably need to design something more general where images can be "pinned" in this case by...hmm, well, we could just teach podman to ask bootc what active roots there are? So it'd be denied by default podman rmi an image referenced in a root. Though there are some fun details here around how possibly floating tags behave.

I do not understand this at all, what is -all-images-from-root <dirfd> supposed to do? And what does pinning mean, not being able to delete an image even when the users asks for it?

Anyways implementing stuff like this is going to start to involve cross-calling between podman and bootc in general. Here's my initial thought: we add something like podman bootc-internals which is intended to be a hidden implementation detail of the backend of some things that are logically bootc features to start.

I am not sure what the end solution should look like but I really dislike the idea of a special bootc command to handle opinionated communication between both tools. I much more like public interfaces that can be used by anyone and not have any bootc specific things in podman. Of course I am open to add new interfaces that bootc needs but they should be restricted to stuff that cannot be done on the bootc side or are to complex to do there.

If we add code in podman that depends on bootc and bootc depends on podman we end up with a circular dependency which is a real headache to work with.

(One side challenge I'm trying to handle here is that while so far bootc is all Rust, I think as folks here know, there's a lot more people that can write Go for better or worse, and having a "land the code in podman" as a generic escape hatch might help...that has its own tradeoffs of course)

I don't think this is great reason to move code into podman. Sure short term it might help but long term people need to get used to rust anyway.

cgwalters commented 5 months ago

Also, has this been discussed with other podman maintainers and I am just out of the loop or is this your starting point?

This is the starting point :smile: I try to default to public async discussion.

I do not understand this at all, what is -all-images-from-root supposed to do? And what does pinning mean, not being able to delete an image even when the users asks for it?

I guess I'm seeing a few levels; we could treat pinned images as if they have a running container (even if technically they don't). Maybe that'd be a good way to do it; create "dummy/virtual" stopped containers for pinned images, that the admin could explicitly podman rm if desired, but we're just adding a speed bump.

I am not sure what the end solution should look like but I really dislike the idea of a special bootc command to handle opinionated communication between both tools. I much more like public interfaces that can be used by anyone and not have any bootc specific things in podman.

Yes, fair. For the image pinning I think we can define a standard interface right? This also relates to https://github.com/openshift/enhancements/pull/1481 which IMO should have been in podman aka containers/* from the start.

cgwalters commented 5 months ago

I do not understand this at all, what is -all-images-from-root supposed to do? And what does pinning mean, not being able to delete an image even when the users asks for it?

To elaborate on this (I realize the context wasn't obvious in my initial description; there are a lot more details in https://github.com/containers/bootc/pull/559 ) but in a nutshell here:

In a bootc world, there are (at least) two root filesystems, which may have different versions of podman-systemd .image files (and .container) etc. See https://docs.fedoraproject.org/en-US/bootc/running-containers/#_lifecycling_and_updating_containers_separate for an example where the base OS container image has a .image file which points to a specific tag, and then there's the possibility of a new OS version pointing to a new tag
There is only one (by default) podman container storage in /var/lib/containers
What we want is when bootc upgrade happens (queuing, but not by default applying) an OS update, for the new referenced containers from the .image files to be downloaded, and be ready when we reboot (or in general, "apply", which may be a systemd live restart e.g.) into the new OS version
It'd also be important (but not strictly required) if the images were protected against GC by default

The key advantage here is that we avoid a potentially long latency after a reboot into the new OS when we download new containers.

This problem exists today in the real world in OpenShift: etcd is a static pod defined in /etc/kubernetes, and when etcd changes the controlplane node has to wait for a new etcd container to be fetched on OS upgrade. It'd just be Better to fetch the new image before the reboot, actually before we even stop the current etcd instance, to maximize availability.

mheon commented 5 months ago

I tend to agree with @Luap99 on not doing specific bootc integrations. Feature requests, sure, but I do not want to evolve a specific API for bootc.

It sounds like what's being asked for here is some kind of image pinning - a way to ensure an image is not pruned (and presumably not removed without --force). We could work around that right now with temporary containers using the image, but the request seems perfectly reasonable.

It would be on Bootc to unpin the images for the old root when it gets removed, though. Probably not a problem, I'd imagine?

Luap99 commented 5 months ago

@cgwalters Thanks for the details.

I think there are several issues with such a approach, first podman has no idea about the quadlet files. quadlet parses them and creates systemd units with podman commands in them, this is a very clear separation of concern. As such I don't want podman to ever parse the files directly. However I could imagine adding a special quadlet subcommand/flag that outputs all the known images that could then be passed to podman pull. If we want to get fancy we could even make podman pull read such list from stdin so one could easily pipe the output from quadlet into it. The main advantage is that it is possible to execute quadlet in a new mounts with the new rootfs while running the podman pull command in the current active rootfs. The podman logic is quite fragile and executing in from different roots (like you do right now) will likely cause a lot of unwanted side consequences, i.e. from your PR

bootc-image-builder errors out with: Error: mkdir /etc/containers/networks: read-only file system - need to fix podman to not try to create that directory

As such I strongly recommend that you never execute podman from two different roots, you would need to make sure all tmp paths would be available as well not just /var if you try this. And if the new root changes some containers.conf or other config that things can end up in a very bad state. Overall something that is not supportable in any way IMO.

Second, the pulling all quadlet images and not pruning part is also complicated because technically you can have these files but not have them enabled. And given these are fully fledged system unit dependencies you can have them hook into other units, etc... so knowing when one is started at boot or not is not easy at all and I doubt we would like to pre-pull images there are never used. Likewise we shouldn't prevent images from being pruned just because there is quadlet file referring to it, i.e. if the service was started once but then will never be used again in a long time it would just eat up space for nothing.

cgwalters commented 5 months ago

I think there are several issues with such a approach, first podman has no idea about the quadlet files.

I understand that there is an internal code separation, and that somewhat leaks out to users in how it basically boils down to a bunch of ExecStart=podman ... - but also I think someone who wasn't aware of the history might very reasonably perceive that it is simply part of "podman". It's in the same codebase...

It sounds like what's being asked for here is some kind of image pinning

Improved image pinning is clearly something that can be made generically applicable. However, it's just one of several things actually, which is why I started out with a more generic issue. However, we can also just try to tackle them separately. bootc is inherently doing some container things, but for the booted host - we share code today indirectly via skopeo, but that's pretty ad-hoc.

As such I strongly recommend that you never execute podman from two different roots, you would need to make sure all tmp paths would be available as well not just /var if you try this. And if the new root changes some containers.conf or other config that things can end up in a very bad state. Overall something that is not supportable in any way IMO.

The bootc code is always executing podman from the running system, just looking for the unit files in the new root.

Second, the pulling all quadlet images and not pruning part is also complicated because technically you can have these files but not have them enabled.

Yes, this is one reason I made the binding opt-in. However, in practice I think this would be a sufficient corner case that people may also be fine with a global opt-in, with the ability to do conditional opt-outs.

Likewise we shouldn't prevent images from being pruned just because there is quadlet file referring to it, i.e. if the service was started once but then will never be used again in a long time it would just eat up space for nothing.

Again I think we'd require opt-in, so that's not a concern.

Luap99 commented 5 months ago

As such I strongly recommend that you never execute podman from two different roots, you would need to make sure all tmp paths would be available as well not just /var if you try this. And if the new root changes some containers.conf or other config that things can end up in a very bad state. Overall something that is not supportable in any way IMO.

The bootc code is always executing podman from the running system, just looking for the unit files in the new root.

Ok this is good, not sure why you would get read only /etc problems then? Either /etc is always read only so any podman command fails or it is read write then it should just work? (Not you can change the default network config dir to another location in containers.conf of course, but I guess this specific issue is not really relevant for this context here)

Second, the pulling all quadlet images and not pruning part is also complicated because technically you can have these files but not have them enabled.

Yes, this is one reason I made the binding opt-in. However, in practice I think this would be a sufficient corner case that people may also be fine with a global opt-in, with the ability to do conditional opt-outs.

Yeah any opt-in works for me. I find your idea to pre-pull the images for the new root to minimize downtime very good as this matches what we do with podman auto-update, pull the new image then restart the service.

cgwalters commented 5 months ago

Ok this is good, not sure why you would get read only /etc problems then? Either /etc is always read only so any podman command fails or it is read write

Because bootc-image-builder (a separate project) is basically running the container images in a custom way (not via podman, but via bubblewrap) and it seems to default to a readonly rootfs as opposed to a writable overlayfs by default.

That said, there's some code somewhere in podman or maybe netavark (didn't dig yet) that basically unconditionally does a mkdir("/etc/containers/networks") even during a plain podman pull and seems to want to put state/lockfiles there; I'd say that's a bug.

Luap99 commented 5 months ago

That said, there's some code somewhere in podman or maybe netavark (didn't dig yet) that basically unconditionally does a mkdir("/etc/containers/networks") even during a plain podman pull and seems to want to put state/lockfiles there; I'd say that's a bug.

Sure it is fixable but there is a ton of initialization work done for every podman command so this isn't a bug IMO. Every podman commands creates a runtime internally that also has to initialize the network code. The fact that we pre create these directory is mostly a historical thing as we used to have the lockfile stored there. It is definitely possible to move the mkdir calls later in the stack only when it is needed at the cost of doing the mkdir every time we create a network. Not bad but a trade off overall and if we start writing in a another place we now have to remember to do the mkdir there as well.

So yes we can fix this specific case for the network code but there is a lot more initialization code that runs for every podman pull command no matter if is needed or not.

vrothberg commented 5 months ago

I have troubles following the conversation but I assume the following is the problem we want to solve (?):

What we want is when bootc upgrade happens (queuing, but not by default applying) an OS update, for the new referenced containers from the .image files to be downloaded, and be ready when we reboot (or in general, "apply", which may be a systemd live restart e.g.) into the new OS version

Assuming that is the problem we want to solve, than I would recommend considering podman auto-update. The --dry-run can spit out which images are being used in systemd units (configured for auto updates). It could also be extended to fit other needs. But it effectively does what's being described plus restarting the systemd units.

Luap99 commented 5 months ago

podman auto-update does not work for this use case I think, the described scenario assumes quadlet files shipped as part of the OS image ,i.e. /usr/share/containers/systemd/, that contain new images there with different tags. auto-update only works if the exact same image reference is updated on the registry side.

vrothberg commented 5 months ago

podman auto-update does not work for this use case I think, the described scenario assumes quadlet files shipped as part of the OS image ,i.e. /usr/share/containers/systemd/, that contain new images there with different tags. auto-update only works if the exact same image reference is updated on the registry side.

That is a good point, thanks!

cgwalters commented 5 months ago

cc https://github.com/containers/bootc/pull/215/files#r1631710498 in case people here have thoughts on that

github-actions[bot] commented 4 months ago

A friendly reminder that this issue had no activity for 30 days.

cgwalters commented 3 months ago

https://github.com/containers/bootc/pull/724#discussion_r1700535722 is the kind of big picture thing I was thinking about that comes to the fore as we¹ start to intersect the projects more.

¹well so far mostly I but hopefully that changes...

containers / podman