containers / bootc

Boot and upgrade via container images
https://containers.github.io/bootc/
Apache License 2.0

zstd:chunked issues #509

Open ckyrouac opened 5 months ago

ckyrouac commented 5 months ago

This took a while to track down. I'm going to continue investigating, but I wanted to document what I've found so far.

The failure happens when attempting a bootc install to-disk using an image built from a base image with at least one extra layer, e.g.

FROM quay.io/centos-bootc/centos-bootc-dev:stream9
RUN dnf install -y tmux

If the image is built locally, bootc install to-disk works correctly. The failure happens when pushing the image to a registry (only tested with quay.io), clearing the image from local storage via podman system prune --all, and then running bootc install to-disk. Here's example output of the failure:

[test@fedora-39 ~]$ sudo podman run --pid=host --network=host --privileged --security-opt label=type:unconfined_t -v /var/lib/containers:/var/lib/containers -v .:/output -v /dev:/dev -e RUST_LOG=debug quay.io/ckyrouac/bootc-lldb bootc install to-disk --via-loopback --generic-image --skip-fetch-check /output/test.raw
Trying to pull quay.io/ckyrouac/bootc-lldb:latest...
Getting image source signatures
...
ERROR Installing to disk: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:5d35bfe747b2c76a01310e69a14daa90baf352a11f990f73d4ce3e1917668719: Failed to invoke skopeo proxy method FinishPipe: remote error: corrupted blob, expecting sha256:dede69b8181937a545b87707fbe4ace2ee9936459ffd43b7ba957324861992a0
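
Condensed, the failing sequence is roughly the following (a sketch; the image name is hypothetical, and in practice the bootc invocation runs inside the privileged container as shown above):

# Build a derived image, push it, then force a fresh registry pull for the install.
podman build -t quay.io/example/bootc-derived:latest .
podman push quay.io/example/bootc-derived:latest
podman system prune --all --force
bootc install to-disk --via-loopback --generic-image --skip-fetch-check /output/test.raw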

So, the OpenImage call to the skopeo proxy is failing.

The latest version of containers-common found in the Fedora 39/40 repos sets pull_options.enable_partial_images=true in /usr/share/containers/storage.conf. This is the change that started causing this error. Toggling enable_partial_images to false resolves the error. I'm not familiar enough with this stack to know the root cause yet. I'll continue digging, but I'm sure someone else would be able to track this down a lot quicker if you think it's urgent.
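
A minimal sketch of that toggle, assuming the default Fedora layout:

# Check the current setting, then disable partial (zstd:chunked) pulls.
grep enable_partial_images /usr/share/containers/storage.conf
sudo sed -i 's/enable_partial_images = "true"/enable_partial_images = "false"/' /usr/share/containers/storage.conf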

cgwalters commented 5 months ago

The latest version of containers-common found in the Fedora 39/40 repos sets pull_options.enable_partial_images=true in /usr/share/containers/storage.conf. This is the change that started causing this error. Toggling enable_partial_images to false resolves the error.

Ugh. Fun...thanks for finding and debugging this.

cgwalters commented 5 months ago

It's actually really embarrassing that this wasn't caught by our CI; that needs fixing.

cgwalters commented 5 months ago

Actually, wait: this is the -dev image, which is intentionally tracking git main; I don't think this has hit f40 or stream9 proper yet. I see a lot of activity in https://src.fedoraproject.org/rpms/containers-common/commits/rawhide, and what I bet is happening here is that those spec files are being pulled into the copr.

cc @rhatdan @lsm5

cgwalters commented 5 months ago

And yes, we need to add bootc test gating to containers-common and skopeo pretty soon.

ckyrouac commented 5 months ago

Hmm, interesting. Earlier in the week this was happening regardless of which base image I used. I just went to verify that, and now the bug only happens with the -dev base image.

lsm5 commented 5 months ago

Actually, wait: this is the -dev image, which is intentionally tracking git main; I don't think this has hit f40 or stream9 proper yet. I see a lot of activity in https://src.fedoraproject.org/rpms/containers-common/commits/rawhide, and what I bet is happening here is that those spec files are being pulled into the copr.

cc @rhatdan @lsm5

The last build of containers-common on the podman-next copr was an automatic rebuild of the rawhide sources from sometime back. I disabled this automatic rebuild after we got rawhide to a sane-enough state.

Let me know if you need an update to the Fedora or copr rpm. I can do a one-off build.

We're currently working on a packit workflow from upstream c/common to downstream containers-common rpm, like we have for podman and the rest, with automatic builds going to podman-next right after every upstream commit to main. I'm hoping that change will land early next week.

ckyrouac commented 4 months ago

So this works now with any base image. I'm not sure what changed; I guess something in the base images or on quay.io?

cgwalters commented 4 months ago

@vrothberg

henrywang commented 3 months ago

Hi @cgwalters. We've encountered this issue in the QE CI environment many times over the past two days. All bootc install to-existing-root tests failed due to this error: https://artifacts.osci.redhat.com/testing-farm/10a21cd4-b029-48fa-9c23-9848288a7065/ The issue shows up in fedora-bootc:40 image testing, but not in CS9 or RHEL 9.4/9.5 bootc image testing.

ERROR: Installing to filesystem: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:f34e1c1a6f0f3ac0450db999825e519b67ac7c36697ad80ecfa3672ff285dbbc: Failed to invoke skopeo proxy method FinishPipe: remote error: expected 69427364 bytes in blob, got 72333312

cgwalters commented 3 months ago

~Ugh man, I think this is fallout from https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb which rolls in https://github.com/containers/storage/commit/23ff5f8c5723724e110eb4b086aead6167e3dc8c which somehow breaks things...~

EDIT: No, I was wrong; enable_partial_pulls = true is also set in containers-common-0.58.0-2.fc40.


And yes, the fact that there is no gating CI in any of these repos that covers the ostree-container path let this all sail right through.

henrywang commented 3 months ago

OH!!! Since yesterday (Saturday), I can't run containers inside quay.io/fedora/fedora:40. It reports: Error: configure storage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver. I have to use the fedora:39 image.
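
As the error text itself suggests, one hedged workaround is configuring a mount_program so the overlay driver can run on top of overlayfs (a sketch; assumes fuse-overlayfs can be installed in the container):

# Point the overlay storage driver at fuse-overlayfs.
dnf install -y fuse-overlayfs
cat >> /etc/containers/storage.conf <<'EOF'
[storage.options.overlay]
mount_program = "/usr/bin/fuse-overlayfs"
EOF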

cgwalters commented 3 months ago

Hmm, at this very moment quay.io/fedora/fedora-bootc:40 with version=40.20240606.0 has containers-common-0.58.0-2.fc40.noarch, which predates that change (as the date stamp implies).

So...hum, this must somehow relate to the host environment version. Ah yes, if we look at the logs from that test run, I can see that inside the fedora cloud AMI we have 'Installed: containers-common-5:0.59.1-1.fc40.noarch'.

~@henrywang can you try patching the tests to do something like this as a sanity test:~

$ sed -ie 's/enable_partial_images = "true"/enable_partial_images = "false"/' /usr/share/containers/storage.conf

EDIT: See above, I'm no longer confident the relevant change here was in containers-common.

cgwalters commented 3 months ago

I'm trying to reproduce this locally, initially by hacking up my podman-machine environment, but no luck yet.

Another thing that changed pretty recently is that there's a new podman: https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb And we're also now getting that in the host environment. Can you play with downgrading that in the host environment too?
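
A sketch of what that downgrade could look like on the host (exact NVRs depend on what the repos still carry):

# Roll the host's podman stack back, then confirm the versions.
sudo dnf downgrade -y podman containers-common
rpm -q podman containers-common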

henrywang commented 3 months ago

And we're also now getting that in the host environment. Can you play with downgrading that in the host environment too?

Sure. I'll run that tomorrow.

henrywang commented 3 months ago

@cgwalters I re-ran the test with an old Fedora 40 runner and the tests passed. I checked the log and found that the difference is that the latest containers-common-5:0.59.1-1.fc40 added composefs. That might be the root cause.

cgwalters commented 3 months ago

I've added comments to https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb and I think the root cause is that image builds started defaulting to zstd:chunked. I still need to dig in and confirm that's what's causing the "remote error: expected 69427364 bytes in blob, got 72333312", but I'd bet so.

hanulec commented 3 months ago

I think this is a two-fold issue, but the end-user impact is only seen if containers-storage is on btrfs. My testing was on a DigitalOcean f39 system, which uses btrfs.

1) The default image from quay.io/fedora/fedora-bootc:41 doesn't hit this problem when performing a 'bootc install' (but the base image is missing cloud-init).

2) A system using btrfs appears impacted / unable to bootc install when using personally built bootc images. The issue doesn't occur on an f40/rawhide system using xfs for containers-storage (podman graphDriverName: overlay).

Furthermore, the simple act of pulling a personally built bootc image onto an f39 (or f40 or rawhide) system that uses btrfs for containers-storage will wedge/freeze the machine when resources are small (1c/1g). Adding swap prevented the freezing, but didn't make 'podman pull' any more reliable or predictable: running the pull in a loop 10 times, it keeps attempting to re-sync data.

Making the suggested change to /usr/share/containers/storage.conf, enable_partial_images = "false", allowed both a predictable 'podman pull' and a bootc install to-existing-root to succeed with graphDriverName: btrfs. A quick sanity check for the active backend is sketched below.

Once bootc is running, the underlying containers-storage reverts to overlay.
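
A hedged one-liner to confirm which storage backend a host is using:

# Prints the active graph driver, e.g. "btrfs" or "overlay".
podman info --format '{{.Store.GraphDriverName}}'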

cgwalters commented 3 months ago

@hanulec Is your input image in zstd:chunked format? Try podman inspect and look at the layers (you'll see zstd instead of gzip).

hanulec commented 3 months ago

@hanulec Is your input image in zstd:chunked format? Try podman inspect and look at the layers (you'll see zstd instead of gzip).

The image I built had the newest items from my Containerfile added in zstd format. I needed to use skopeo inspect to see this. The image was built with the default config on a fresh rawhide image (version: 41.20240530.0).

root@bootc:~/240617-bootc# skopeo inspect docker://quay.io/fedora/fedora-bootc:41 | grep MIMEType | sort | uniq -c
     65 "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
root@bootc:~/240617-bootc# skopeo inspect docker:///f40jump05:240615-0058 | grep MIMEType | sort | uniq -c
     65 "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
      4 "MIMEType": "application/vnd.oci.image.layer.v1.tar+zstd",
root@bootc:~/240617-bootc#

hanulec commented 3 months ago

And the more I look / re-test: it's the podman push action that changes the MIMEType from "application/vnd.oci.image.layer.v1.tar" to either "+gzip" or "+zstd".
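
If push-time compression is indeed what flips the layers to zstd, one hedged mitigation is to pin the format explicitly at push time (a sketch; the image name is hypothetical):

# Force classic gzip layers instead of zstd on push.
podman push --compression-format gzip quay.io/example/bootc-derived:latest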

shi2wei3 commented 3 weeks ago

@cgwalters centos-bootc c10s bootc install tests started failing over the past week, with the same error output as this issue; only derived images are affected. It could be related to the host containers-common being updated from 0.57.3-4.el10 to 0.60.2-3.el10 (https://github.com/containers/common/pull/2048). How can I work around this, and is this a bug we need to fix?

cgwalters commented 2 weeks ago

I only recently realized on this issue why this may be happening. When I was testing https://github.com/ostreedev/ostree-rs-ext/pull/622 I did it via a registry.

But this bug is about "bootc install", where we're pulling from containers-storage: (the unpacked representation), and as part of that we ask it to regenerate a tarball from the unpacked files; by design today, that tarball must be bit-for-bit compatible with the descriptor. It would not surprise me at all if there were corner cases where that breaks today. Inherently, this "copy from c/storage" model goes through a different codepath than what podman and skopeo use today, where they drive the copying.

The whole "synthesize a temporary tarball" approach is really lame of course; what we want instead is https://github.com/containers/storage/issues/1849
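
For context, a sketch of the transport in question (image name hypothetical); this is the path where each layer's tar stream is re-synthesized from unpacked files rather than served as the original blob:

# Inspect an image via the containers-storage transport, as bootc's install path reads it.
skopeo inspect containers-storage:quay.io/example/bootc-derived:latest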

shi2wei3 commented 1 week ago

I probably hit https://github.com/containers/podman/issues/22813 and I've modified the runtime containers.conf as a workaround.
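
The exact setting isn't spelled out above; one plausible sketch of such a workaround, assuming the goal is to avoid producing zstd:chunked layers, is pinning the engine compression format in containers.conf:

# Assumption: forcing gzip sidesteps the zstd:chunked path for pushed images.
cat >> /etc/containers/containers.conf <<'EOF'
[engine]
compression_format = "gzip"
EOF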