Open cgwalters opened 2 months ago
Note that this tool would also need to paper over https://github.com/containers/buildah/issues/5592 (which would still be great to fix).
ostree-rs-ext today caps at 64...
Bazzite uses 70 layers
70 layers is a sound maximum for a large OS image, although for OCI images you'd probably rather stay under 40 (until composefs solves the layering issue).
Seeing the popularity sched_ext has in the Linux community, where everyone can now make their own scheduler, I think the logical first step would be to embed the basic functionality for this into buildah. Once there is a good algorithm, the next logical step would be buildah rechunk.
For me, this boils down to the following right now: 1) Allow reflinking existing files from container storage into a new container
Given my familiarity with OSTree, I know that it can stage a commit in around 20 seconds with hard links.
If 1 is implemented, this means that the rechunk process would take around 50 seconds for an image with a very large number of files, which is negligible, and result in very minimal thrashing. This is around 12x faster than the current ostree-rs-ext process, given the image is placed in containers storage again.
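To make the reflink idea in point 1 concrete, here is a minimal sketch assuming Linux and a filesystem that may or may not support reflinks (btrfs and XFS do). The function name `reflink_or_copy` is hypothetical and not part of buildah or containers/storage. It only illustrates the FICLONE ioctl with a fallback to a plain copy.

```python
import fcntl
import shutil

# FICLONE ioctl request number from <linux/fs.h>: _IOW(0x94, 9, int).
FICLONE = 0x40049409


def reflink_or_copy(src: str, dst: str) -> str:
    """Clone src to dst via reflink, falling back to a byte copy.

    Hypothetical helper for illustration. Returns "reflink" or "copy"
    so the caller can see which path was taken.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to share src's extents with dst (no data copied).
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            method = "reflink"
        except OSError:
            # No reflink support here, or src and dst are on different filesystems.
            shutil.copyfileobj(fsrc, fdst)
            method = "copy"
    # Carry over permission bits; ownership and xattrs are ignored in this sketch.
    shutil.copymode(src, dst)
    return method
```

A real implementation would also have to preserve ownership, xattrs and timestamps, which this sketch leaves out.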
2 would then be the natural extension, allowing reordering the tar stream based on the previous manifest which is needed for zstd:chunked.
Even with zstd:chunked, layer invalidation remains important as every changed layer needs to be staged and stored in the registry. Composefs can partly deal with the former but not the latter.
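One possible reading of "reordering the tar stream based on the previous manifest" is sketched below: entries that existed in the previous build keep their old relative order, and new entries are appended sorted by name. The function `reorder_tar` and the `previous_order` list are assumptions for illustration; this is not how buildah or zstd:chunked tooling is known to work, and it ignores ordering constraints such as hard links needing to follow their targets.

```python
import io
import tarfile


def reorder_tar(src: bytes, previous_order: list[str]) -> bytes:
    """Rewrite a tar so entries follow previous_order; unseen entries go last, sorted."""
    rank = {name: i for i, name in enumerate(previous_order)}
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src)) as tin, \
            tarfile.open(fileobj=out, mode="w") as tout:
        members = tin.getmembers()
        # Known paths keep their old position; new paths sort after them by name.
        members.sort(key=lambda m: (rank.get(m.name, len(rank)), m.name))
        for m in members:
            # Only regular files carry a data payload.
            tout.addfile(m, tin.extractfile(m) if m.isfile() else None)
    return out.getvalue()
```

Keeping a stable order across builds means unchanged files land at similar positions in the stream, which is the property the comment above is after.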
Sidenote: on Fedora-based containers the image has to be squashed before pushing to the registry anyway, since every dnf command updates the rpm database, adding 40 MB to 150 MB to every layer, so I do not know how important preserving the layer structure is.
While this is another discussion, it may be worth considering whether to "brick" the rpm database in containers, forcing dnf to write into the WAL, which may be much smaller.
On the general topic of the rpm database and containers,
In the very short term I think what makes the most sense is for us to just carry forward with "rechunking" work on the ostree-container side.
Generalizing it - what it would look like with buildah/podman in general is a big topic, and it really does snowball quickly into the general "reproducible builds" problem which in turn quickly snowballs into Intelligent Build System territory, which is not what podman or buildah really expose today; it looks more like things listed off of https://github.com/moby/buildkit?tab=readme-ov-file#used-by
So probably we can just keep this on the back burner unless there is an interested and motivated person to contribute here.
@nalind has started working on designing rechunking into buildah and has some preliminary code for it. How exactly the rechunking would work is still being discussed.
See https://github.com/ostreedev/ostree-rs-ext/issues/69 and specifically this blog post I found very inspirational: https://grahamc.com/blog/nix-and-layered-docker-images
A lot more work on this happened in https://github.com/hhd-dev/rechunk/
This issue is about adding generic support for something like this to buildah. What might that look like? One problem with rechunk (which builds on what rpm-ostree does today) is that it currently has an RPM dependency, which gets messy for buildah integration.
Here's a strawman: we create buildah rechunk (again, name totally subject to bikeshedding)... maybe it's "merge"? This could start by accepting an existing OCI image as input and applying some heuristics to it (splitting large files into their own layers, etc.).
Alternatively, a lower-level entrypoint could accept a large list of OCI images and "merge" them using many of the same approaches that the Nix builder does, mapping them to some configurable number of higher-level layers (ostree-rs-ext today caps at 64... whole other thread to discuss going higher than that).
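As a rough illustration of mapping many inputs onto a capped number of layers, here is a deliberately simplified sketch. It is not the Nix algorithm (which ranks by popularity in the dependency graph) and not what ostree-rs-ext implements. It only shows the general shape: the largest components get their own layer and everything else shares a final one. The name `plan_layers` and the size-only heuristic are assumptions, and the default of 64 simply mirrors the cap mentioned above.

```python
def plan_layers(sizes: dict[str, int], max_layers: int = 64) -> list[list[str]]:
    """Give the largest components their own layer; merge the rest into one.

    sizes maps a component name (e.g. a package or input image) to its size
    in bytes. Hypothetical helper for illustration only.
    """
    if max_layers < 1:
        raise ValueError("max_layers must be at least 1")
    # Largest first; ties broken by name so the plan is deterministic.
    ordered = sorted(sizes, key=lambda name: (-sizes[name], name))
    if len(ordered) <= max_layers:
        return [[name] for name in ordered]
    head = ordered[: max_layers - 1]
    tail = sorted(ordered[max_layers - 1:])
    return [[name] for name in head] + [tail]
```

For example, `plan_layers({"kernel": 100, "glibc": 50, "bash": 5, "sed": 1}, max_layers=3)` puts `kernel` and `glibc` in their own layers and groups `bash` and `sed` into a third. A more realistic heuristic would also weigh how often each component changes, since a frequently updated component invalidates whatever layer it shares.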