containers / buildah

A tool that facilitates building OCI images.
https://buildah.io
Apache License 2.0

Buildah version 1.29.1 broken fuse-overlayfs in gitlab runner #4715

Closed DaanGebraad closed 1 year ago

DaanGebraad commented 1 year ago

Description

After our buildah image was upgraded to v1.29.1, we noticed that all our pipelines using buildah build started to fail on our gitlab runners. It seems to be an issue with the fuse-overlayfs package. We're using the quay.io/buildah/stable:latest image; after downgrading to v1.29.0, buildah worked again.

Steps to reproduce the issue:

  1. Upgrade buildah-stable to v1.29.1
  2. Run buildah build in gitlab runner

Describe the results you received: Buildah build failing to unmount and mount

Command used:

$ buildah build -q -f .docker/Dockerfile -t $CI_REGISTRY_IMAGE:$BUILD_TAG -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA .

Output:

time="2023-04-06T08:20:26Z" level=error msg="Unmounting /var/lib/containers/storage/overlay/de00ab76e27e966fa7c0c0b79a5ad1247cdf765946bc05ada5b6d99be3a42be5/merged: invalid argument"
Error: mounting new container: mounting build container "9c4bcdfd7a1a64ae4bb1d399c930ae9a29e52d240c1405e00bf7638cde951901": creating overlay mount to /var/lib/containers/storage/overlay/de00ab76e27e966fa7c0c0b79a5ad1247cdf765946bc05ada5b6d99be3a42be5/merged, mount_data="lowerdir=/var/lib/containers/storage/overlay/l/X3DKLX3WMGVRBGZJYZ4MSULUEY:/var/lib/containers/storage/overlay/l/VLKLEB5WCVLZFQUO2W3ANMY7ZY:/var/lib/containers/storage/overlay/l/YWHPWE4M7KVTL6WXCXO6LBJW52:/var/lib/containers/storage/overlay/l/RJM72NK2LQWGBJU2CB2BWSTGBT:/var/lib/containers/storage/overlay/l/6TW2TMTSHXLBWS4KEJYUFY73CA:/var/lib/containers/storage/overlay/l/MTOGBS4AKRVUFTNZYIFGJLFZCX:/var/lib/containers/storage/overlay/l/ZZXGS22J43ZTSNW3SH2S2L4WRU,upperdir=/var/lib/containers/storage/overlay/de00ab76e27e966fa7c0c0b79a5ad1247cdf765946bc05ada5b6d99be3a42be5/diff,workdir=/var/lib/containers/storage/overlay/de00ab76e27e966fa7c0c0b79a5ad1247cdf765946bc05ada5b6d99be3a42be5/work,nodev,fsync=0,volatile": invalid argument

Describe the results you expected: A working buildah build command

flouthoc commented 1 year ago

Could you try setting graph options to null in your storage.conf?

Tiscs commented 1 year ago

Same issue for me, and I rolled back to v1.29.0 instead of the latest version.

elacheche commented 1 year ago

Hello @flouthoc

I have the same issue. I did some investigating and noticed that the latest image was (re)uploaded 18 hours ago with a missing config!

Do you have any idea why and how that is possible?

Luckily, 7 days ago I created a custom image based on the latest 1.29.1; below are details from both:

Edit: The output below is the result of a clean version of /etc/containers/storage.conf (no comments and no empty lines) and buildah version
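For anyone who wants to repeat this comparison, here is a minimal sketch of how such a cleaned dump could be produced. It assumes a POSIX shell with grep; the helper name clean_conf is hypothetical and not part of buildah.

```shell
#!/bin/sh
# Print a config file without comment lines and without blank lines.
# clean_conf is a hypothetical helper name used only for this sketch.
clean_conf() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Usage inside the image (requires buildah to be installed):
#   { clean_conf /etc/containers/storage.conf; buildah version; } > buildah_latest.txt
```

Running the same command in two images and diffing the resulting files gives a comparison like the one further down.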

custom

[storage]
driver = "overlay"
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"
[storage.options]
additionalimagestores = [
"/var/lib/shared",
]
pull_options = {enable_partial_images = "false", use_hard_links = "false", ostree_repos=""}
[storage.options.overlay]
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev,fsync=0"
[storage.options.thinpool]
Version:         1.29.1                           
Go Version:      go1.19.5
Image Spec:      1.0.2-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.24.1
Git Commit:
Built:           Fri Feb 17 10:05:41 2023
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Today's latest

[storage]
driver = "overlay"
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"
[storage.options]
additionalimagestores = [
"/var/lib/shared",
]
pull_options = {enable_partial_images = "false", use_hard_links = "false", ostree_repos=""}
[storage.options.overlay]
mountopt = "nodev,fsync=0"
[storage.options.thinpool]
Version:         1.29.1
Go Version:      go1.19.5
Image Spec:      1.0.2-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.24.1
Git Commit:
Built:           Fri Feb 17 10:05:41 2023
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

diff

$ git diff buildah_custom.txt buildah_latest.txt
diff --git a/buildah_custom.txt b/buildah_latest.txt
index 1c7966c..27435fa 100644
--- a/buildah_custom.txt
+++ b/buildah_latest.txt
@@ -8,7 +8,6 @@ additionalimagestores = [
 ]
 pull_options = {enable_partial_images = "false", use_hard_links = "false", ostree_repos=""}
 [storage.options.overlay]
-mount_program = "/usr/bin/fuse-overlayfs"
 mountopt = "nodev,fsync=0"
 [storage.options.thinpool]
 Version:         1.29.1

Did the image build workflow change?

Thanks in advance

ziouf commented 1 year ago

It looks like something is triggering a rebuild of the images, overwriting the latest, v1, v1.29 and v1.29.1 tags on a daily basis.

https://quay.io/repository/buildah/stable?tab=history

I assume that v1.29.1 should be an immutable tag

Blaimi commented 1 year ago

same issue here. I wrote a minimal working example at https://gitlab.com/Blaimi/buildah-bughunt.

Blaimi commented 1 year ago

I assume that v1.29.1 should be an immutable tag

They are all built daily according to the readme. v1.29.0 no longer seems to be built daily because it is outdated

ziouf commented 1 year ago

I assume that v1.29.1 should be an immutable tag

They are all built daily according to the readme. v1.29.0 no longer seems to be built daily because it is outdated

It makes no sense to me that a stable release doesn't have stable tags ...

elacheche commented 1 year ago

same issue here. I wrote a minimal working example at https://gitlab.com/Blaimi/buildah-bughunt.

In my case, I have a lot of pipelines (different projects with multiple branches), so my workaround is to define a Gitlab Group CI/CD variable STORAGE_DRIVER=vfs
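A minimal sketch of the same workaround applied inside a single job script rather than as a group variable. It assumes the containers tooling in the image honors the STORAGE_DRIVER environment variable; build_image is a hypothetical helper, and the Dockerfile path is the one from the original report.

```shell
#!/bin/sh
# Force the vfs storage driver for every buildah call in this job.
# vfs needs no overlay mount at all, so the missing mount_program setting
# no longer matters. Expect slower builds and higher disk usage.
export STORAGE_DRIVER=vfs

# Hypothetical wrapper around the build command from the report.
build_image() {
    buildah build -q -f .docker/Dockerfile -t "$1" .
}

# In the job script:
#   build_image "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```

A group-level CI/CD variable has the same effect without touching each job, which is why it suits setups with many projects.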

Blaimi commented 1 year ago

my workaround is to define a Gitlab Group CI/CD variable STORAGE_DRIVER=vfs

I extended my example with this variable in the matrix-builds and set an hourly scheduler on the build.

It makes no sense to me that a stable release doesn't have stable tags …

I wrote #4717 for that :smile_cat:.

TomSweeneyRedHat commented 1 year ago

@giuseppe might we get lucky and have a fix in the newly released fuse-overlayfs v1.11?

flouthoc commented 1 year ago

I think nothing's wrong in fuse-overlayfs; it's just that the config was removed here: https://github.com/containers/buildah/pull/4699

flouthoc commented 1 year ago

@giuseppe @rhatdan maybe we will need to revert this PR for users running builds on old kernels.

elacheche commented 1 year ago

@giuseppe @rhatdan maybe we will need to revert this PR for users running builds on old kernels.

Yes, this confirms my analysis https://github.com/containers/buildah/issues/4715#issuecomment-1498948921

But the real question here is: why did code merged two days ago trigger a rebuild of a release that is more than a month old, with the same version/tag?

This is also a CI/CD bug.

@flouthoc, can you please share more details about what you mean by "old kernels"? I am interested in learning more about that and why my Amazon Linux 2 is using an "old kernel", or maybe it's not and I just need to enable some extra modules. Thx

flouthoc commented 1 year ago

@elacheche native overlay is easily supported on rootless setups with kernel 5.13 and above (it was added in 5.11, but I think there were some bugs in 5.11). Folks running older kernels therefore have no option but to fall back to fuse-overlayfs for rootless builds. OTOH, for users running newer kernels, buildah will automatically use native overlay on rootless setups.

This is also a CI/CD bug.

Indeed, CI/CD has an issue if it's modifying older tags :)
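As a rough way to check which side of that 5.13 threshold a runner is on, here is a sketch that compares the running kernel version against it. It assumes a `sort` that supports `-V` (GNU coreutils does); version_ge is a hypothetical helper. Other reports in this thread (kernels 5.15 and 6.1) suggest that a new enough kernel alone does not guarantee a working build, so treat the result as a hint only.

```shell
#!/bin/sh
# True if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) placing the smaller version first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distro suffix, e.g. "5.15.0-67-generic" -> "5.15.0".
kernel="$(uname -r | cut -d- -f1)"

if version_ge "$kernel" "5.13"; then
    echo "kernel $kernel: native rootless overlay may be available"
else
    echo "kernel $kernel: fuse-overlayfs is likely needed for rootless builds"
fi
```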

hkrutzer commented 1 year ago

I have a Gitlab runner with kernel version 5.15 and I'm also seeing this issue.

TomSweeneyRedHat commented 1 year ago

@rhatdan PTAL. Should we roll https://github.com/containers/buildah/pull/4699 back?

TomSweeneyRedHat commented 1 year ago

@cevich some CI questions in here for the quay container images, in case you didn't see this.

cevich commented 1 year ago

Ironically I too ran into this issue :disappointed:

Indeed, CI/CD has an issue if it's modifying older tags :)

As y'all found in the readme, the builds happen daily from main to incorporate updates (esp security) for all packages in the image. The image tags are simply extracted from the RPM versions. In the case of the v1.29.1 RPM, tags would be pushed for latest, v1.29.1, v1.29, and v1 - all with the exact same contents.

Since it's a Containerfile change and assuming #4722 is the fix (I haven't looked deeply), then as soon as it's merged, the daily builds will push out new latest, v1.29.1, v1.29, and v1. On the other hand, for users looking for truly immutable/unchanging images, you need to reference the image by sha256 or use the "n-1" tags that aren't updated daily.
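A minimal sketch of what referencing the image by sha256 could look like. It assumes skopeo is installed and the registry is reachable; pin_ref is a hypothetical helper, and the digest must be looked up rather than copied from this example.

```shell
#!/bin/sh
# Build a digest-pinned image reference from a repository and a digest.
# pin_ref is a hypothetical helper name used only for this sketch.
pin_ref() {
    printf '%s@%s\n' "$1" "$2"
}

# Look up the digest a tag currently points at (needs skopeo and network access):
#   digest="$(skopeo inspect --format '{{.Digest}}' docker://quay.io/buildah/stable:v1.29.1)"
#   pin_ref quay.io/buildah/stable "$digest"
```

The pinned reference keeps resolving only while the registry retains that manifest.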

Just for some history: There was a great design debate among the containers team, on which approach to take. We decided that since the tag represents the buildah-version, it was better to keep the images continuously updated on the off-chance some non-buildah critical security fix was released. Or in this case, a Containerfile bug.

dhduvall commented 1 year ago

#4717 is probably where the image tag stability discussion belongs, but the problem there (as I see it) is not that the underlying OS bits are getting updated daily (that seems perfectly fine), but that the buildah bits are being rebuilt against main as well. And the workaround/even-more-stable option of using hashes doesn't work because old manifests are discarded.

cevich commented 1 year ago

but that the buildah bits are being rebuilt against main as well.

Only the upstream flavor of the image does that. The other two (stable and testing) install from the distro RPMs. If the RPM versions change, the image tags will change as well (since they're extracted from the RPM version).

dhduvall commented 1 year ago

I see my confusion:

The manifest from the latest image shows org.opencontainers.image.version=1.29.1 and org.opencontainers.image.revision=b80da50..., where that commit ID is the one from #4722, the current HEAD. That's really only true for the bits that build the container; like you say, buildah itself comes from the distro RPM, and although the commit it was built from isn't captured, the date does demonstrate it's not recent:

[root@1b888e6da80e /]# buildah version
Version:         1.29.0
Go Version:      go1.19.5
Image Spec:      1.0.2-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.24.0
Git Commit:
Built:           Tue Jan 31 12:06:15 2023
OS/Arch:         linux/arm64
BuildPlatform:   linux/arm64/v8

So I went down the wrong path because of that label and the fact that it was the build that caused the problem and not changes to the executable. I don't know if there's a way to avoid that confusion, or even if it's worth trying.

cevich commented 1 year ago

Oh! hang on a sec...you're right! That's a bug in the build script. Those labels are completely wrong when the image uses RPMs. I'll open an issue on that and get on about fixing it. Thanks for pointing out the mismatch.

nolange commented 1 year ago

@giuseppe @rhatdan maybe we will need to revert this PR for users running builds on old kernels.

I am on linux 6.1 and I still need this config, at least when buildah is invoked via a gitlab-runner; see #4669

rhatdan commented 1 year ago

You should be able to turn it on if we disable fuse-overlayfs by default.
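A sketch of how the setting could be turned back on in a derived image or at the start of a job, assuming the config lives at /etc/containers/storage.conf and the binary at /usr/bin/fuse-overlayfs, as in the dumps earlier in this thread. enable_fuse_overlayfs is a hypothetical helper, not something buildah ships.

```shell
#!/bin/sh
# Add mount_program under [storage.options.overlay] unless an uncommented
# mount_program line already exists. Safe to run more than once.
enable_fuse_overlayfs() {
    conf="$1"
    if grep -Eq '^[[:space:]]*mount_program[[:space:]]*=' "$conf"; then
        return 0
    fi
    tmp="$(mktemp)"
    # Copy every line and insert the setting right after the section header.
    awk '{ print }
         /^\[storage\.options\.overlay\]/ { print "mount_program = \"/usr/bin/fuse-overlayfs\"" }' \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Usage (as root inside the image):
#   enable_fuse_overlayfs /etc/containers/storage.conf
```

Writing back with cat rather than mv keeps the original file's ownership and permissions.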