ckyrouac opened 5 months ago
The latest version of containers-common found in the Fedora 39/40 repos sets `pull_options.enable_partial_images=true` in `/usr/share/containers/storage.conf`. This is the change that started causing this error. Toggling `enable_partial_images` to false resolves the error.
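For reference, the relevant stanza looks roughly like this (a sketch only; the file shipped by containers-common differs per distro, and the other `pull_options` keys shown here are assumed from containers/storage defaults):

```toml
# Excerpt sketch of /usr/share/containers/storage.conf (not the full file).
[storage]
driver = "overlay"

[storage.options]
# pull_options is an inline table; flipping enable_partial_images to "false"
# is the toggle described above.
pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos = ""}
```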
Ugh. Fun...thanks for finding and debugging this.
It's actually really embarrassing that this wasn't caught by our CI, needs fixing
Actually wait, this is the `-dev` image, which is intentionally tracking git main; I don't think this has hit f40 or stream9 proper yet. I see a lot of activity in https://src.fedoraproject.org/rpms/containers-common/commits/rawhide and what I bet is happening here is those spec files are being pulled into the copr.
cc @rhatdan @lsm5
And yes, we need to add bootc test gating to containers-common and skopeo pretty soon.
hmm interesting, earlier in the week this was happening regardless of which base image I used. Just went to verify that and now this bug only happens with the `-dev` base image.
> Actually wait, this is the `-dev` image, which is intentionally tracking git main; I don't think this has hit f40 or stream9 proper yet. I see a lot of activity in https://src.fedoraproject.org/rpms/containers-common/commits/rawhide and what I bet is happening here is those spec files are being pulled into the copr. cc @rhatdan @lsm5
The last build of containers-common on the podman-next copr was an automatic rebuild of the rawhide sources from sometime back. I disabled this automatic rebuild after we got rawhide to a sane-enough state.
Let me know if you need an update to the fedora or copr rpm. I can do a one-off build.
We're currently working on a packit workflow from upstream c/common to downstream containers-common rpm, like we have for podman and the rest, with automatic builds going to podman-next right after every upstream commit to main. I'm hoping that change will land early next week.
So this works now using any base image. I'm not sure what changed; I guess something in the base images or in quay.io?
@vrothberg
Hi @cgwalters. We encountered this issue in QE CI environment many times in two days.
All `bootc install to-existing-root` tests failed due to this error. https://artifacts.osci.redhat.com/testing-farm/10a21cd4-b029-48fa-9c23-9848288a7065/
This issue can be found in `fedora-bootc:40` image testing, but can't be found in CS9 and RHEL 9.4/9.5 bootc image testing.
```
ERROR: Installing to filesystem: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:f34e1c1a6f0f3ac0450db999825e519b67ac7c36697ad80ecfa3672ff285dbbc: Failed to invoke skopeo proxy method FinishPipe: remote error: expected 69427364 bytes in blob, got 72333312
```
~Ugh man, I think this is fallout from https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb which rolls in https://github.com/containers/storage/commit/23ff5f8c5723724e110eb4b086aead6167e3dc8c which somehow breaks things...~
EDIT: No, I was wrong, `enable_partial_images = true` is also in containers-common-0.58.0-2.fc40.
And yes, the fact that there is no gating CI anywhere that covers the ostree-container path let this all sail right through.
OH!!! Since yesterday (Saturday), I can't run a container inside `quay.io/fedora/fedora:40`. It reports an `Error: configure storage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver` error. I have to use the `fedora:39` image.
Hmm, at this very moment `quay.io/fedora/fedora-bootc:40` with `version=40.20240606.0` has containers-common-0.58.0-2.fc40.noarch, which predates that change (as the date stamp implies).
So...hum, this must somehow relate to the host environment version. Ah yes, if we look at the logs from that test run I can see that inside the Fedora cloud AMI we have `Installed: containers-common-5:0.59.1-1.fc40.noarch`.
~@henrywang can you try patching the tests to do something like this as a sanity test:~
$ sed -i -e 's/enable_partial_images = "true"/enable_partial_images = "false"/' /usr/share/containers/storage.conf
EDIT: See above, I'm no longer confident the relevant change here was in containers-common.
I'm trying to reproduce this locally initially by hacking up my podman-machine environment, but no luck yet.
Another thing that actually changed pretty recently too is there's a new podman: https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb And we're also now getting that in the host environment. Can you play with downgrading that in the host environment too?
Sure. I'll run that tomorrow.
@cgwalters I re-ran the test with an old Fedora 40 runner and the tests passed. I checked the log; the difference I found is that the latest containers-common-5:0.59.1-1.fc40 added composefs. That might be the root cause.
I've added comments to https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb and I think the root cause is that image builds started defaulting to zstd:chunked. I still need to dig in and see if that's what's causing the "remote error: expected 69427364 bytes in blob, got 72333312", but I'd bet so.
I think this is a two-fold issue, but the end-user impact is only seen if you have a btrfs containers-storage. In my testing I used a DigitalOcean f39 system, which uses btrfs.
1) The default image from quay.io/fedora/fedora-bootc:41 doesn't see this problem performing a `bootc install` (but the base image is missing cloud-init).
2) A system that is using btrfs appears impacted / not able to `bootc install` when using personally built bootc images. This issue doesn't occur on an f40/rawhide system using xfs for containers-storage (podman graphDriverName: overlay).
Furthermore, the simple act of pulling a personally built bootc image on an f39 (or f40 or rawhide) system that uses btrfs as the containers-storage will cause the machine to wedge/freeze up when you have small resources (1c/1g). Adding swap to the system prevented the freezing, but didn't produce a more reliable / predictable `podman pull` behavior; if you run the `podman pull` in a loop 10 times it keeps on attempting to re-sync data.
Making the suggested change to /usr/share/containers/storage.conf of `enable_partial_images = "false"` allowed both a predictable `podman pull` and a `bootc install to-existing-root` to succeed when graphDriverName is btrfs.
Once bootc is running, the underlying containers-storage reverts to overlay.
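As a sketch, the workaround above can be exercised safely on a throwaway copy of the config (the real file is /usr/share/containers/storage.conf; the scratch file here is just for experimentation):

```shell
# Flip enable_partial_images in a scratch copy of storage.conf.
conf=$(mktemp)
printf '[storage.options]\npull_options = {enable_partial_images = "true"}\n' > "$conf"
sed -i -e 's/enable_partial_images = "true"/enable_partial_images = "false"/' "$conf"
grep enable_partial_images "$conf"
# -> pull_options = {enable_partial_images = "false"}
```

On a real system, an override in /etc/containers/storage.conf typically takes precedence over the /usr/share copy, so editing a local override is less intrusive than patching the packaged file.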
@hanulec Is your input image in zstd:chunked format? Try `podman inspect` and look at the layers (you'll see `zstd` instead of `gzip`).
The image I built had the newest items from my Containerfile added in zstd format. I needed to use `skopeo inspect` to see this. The image was built on a default config from a fresh rawhide image (version: 41.20240530.0).
```
root@bootc:~/240617-bootc# skopeo inspect docker://quay.io/fedora/fedora-bootc:41 | grep MIMEType | sort | uniq -c
     65 "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
root@bootc:~/240617-bootc# skopeo inspect docker://
```
And the more I look / re-test, it's the `podman push` action that is changing the MIMEType from "application/vnd.oci.image.layer.v1.tar" to either "+gzip" or "+zstd".
@cgwalters centos-bootc c10s `bootc install` tests started to fail in the past week. The error output is the same as in this issue, and only derived images are affected. It could be related to the host containers-common being updated from 0.57.3-4.el10 to 0.60.2-3.el10 (https://github.com/containers/common/pull/2048). How could I work around this issue, and is this a bug we need to fix?
I only recently realized on this issue why this may be happening. When I was testing https://github.com/ostreedev/ostree-rs-ext/pull/622 I did it via a registry.
But this bug is about "bootc install" where we're pulling from containers-storage: (unpacked representation) and as part of that we ask it to regenerate a tarball from the unpacked files, and by design today that tarball must be bit-for-bit compatible with the descriptor. It would not surprise me at all if there were corner cases where that breaks today. Inherently this "copy from c/storage" model is going through a different codepath than what is used by podman for skopeo today where it drives the copying.
The whole "synthesize a temporary tarball" is really lame of course, what we want instead is https://github.com/containers/storage/issues/1849
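As an illustration of why regenerating a tarball is fragile (a made-up minimal example, not bootc's actual code path): a tar archive's bytes depend on file metadata such as mtimes, not just file contents, so a blob rebuilt from unpacked files need not match the size/digest in the original descriptor.

```shell
# Two tars of identical file *content* but different mtimes produce different bytes.
dir=$(mktemp -d)
echo hello > "$dir/file"
tar -C "$dir" -cf "$dir/a.tar" file
sleep 1
touch "$dir/file"                     # contents unchanged, mtime bumped
tar -C "$dir" -cf "$dir/b.tar" file
sha256sum "$dir/a.tar" "$dir/b.tar"   # the two digests differ
```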
I probably hit https://github.com/containers/podman/issues/22813 and I've modified the runtime container.conf as a workaround.
This took a while to track down. I'm going to continue investigating, but I wanted to document what I've found so far.
The failure happens when attempting a `bootc install to-disk` using an image built from a base image with at least one extra layer. If the image is built locally, `bootc install to-disk` works correctly. The failure happens when pushing the image to a repo (only tested with quay.io), clearing out the image from local storage via `podman system prune --all`, then running `bootc install to-disk`. Here's example output of the failure:

So, the OpenImage call to the skopeo proxy is failing.
The latest version of containers-common found in the Fedora 39/40 repos sets `pull_options.enable_partial_images=true` in `/usr/share/containers/storage.conf`. This is the change that started causing this error. Toggling `enable_partial_images` to false resolves the error. I'm not familiar enough with this stack to know the root cause of this yet. I'll continue digging, but I'm sure someone else would be able to track this down a lot quicker if you think it's urgent.