Closed cevich closed 2 years ago
A lot of historical issues suggest that this is how podman says "out of disk space". Is it possible to check df on the new f37 VM?
CI captures df output at the end of the integration (and system?) task, but doesn't log it during. That said, the VMs should start with 80+ GB free space, but it's possible the disk expansion mechanism broke...
...heh, no I don't think this is the problem (df output), but it was wise to suggest checking, thanks!
+ df -lhTx tmpfs
------------------------------------------------------------
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
/dev/sda5      btrfs     199G  3.4G  195G   2% /
/dev/sda5      btrfs     199G  3.4G  195G   2% /home
/dev/sda2      ext4      966M   99M  802M  11% /boot
/dev/sda3      vfat      100M  9.9M   90M  10% /boot/efi
overlay        overlay   199G  3.4G  195G   2% /tmp/podman_test1390407731/server_root/overlay/12c842d64ba1a9b9d48f2d7f63656001f20429cdafea877049bb714d93d9cd7b/merged
This part of the error is curious:
* docker-archive: loading tar component manifest.json: archive/tar: invalid tar header
Maybe something changed in go-land (archive/tar presumably) that removed some former "un-gzip it first" behavior?
@mtrmac knows about image stuff...does this ring any bells for you?
It doesn’t really ring any bells, and I don’t much understand how changing the build environment (without changing the Podman or c/image codebase at all, AFAICS) could make a difference here.
If I had to guess, missing gzip command in the image?
> how changing the build environment could make a difference here.
Fair point. Hmmm. It's possible gzip is missing, I can check that. Thanks for the suggestion.
No, it's not missing gzip. I would go as far as to say that a UNIX system without gzip is impossible in 2022.
This reproduces super-trivially:
# podman pull quay.io/libpod/alpine:latest
...
# podman save -o /tmp/foo.tar alpine
...
# gzip /tmp/foo.tar
# podman load -i /tmp/foo.tar.gz
Error: payload does not match any of the supported image formats:
* oci: initializing source oci:/tmp/foo.tar.gz:: open /tmp/foo.tar.gz/index.json: not a directory
* oci-archive: loading index: open /var/tmp/oci3715535303/index.json: no such file or directory
* docker-archive: loading tar component manifest.json: archive/tar: invalid tar header
* dir: open /tmp/foo.tar.gz/manifest.json: not a directory
podman-4.3.0~rc1-1.fc37.x86_64 gzip-1.12-2.fc37.x86_64
Damn...well that probably rules out easy fixes I can do.
Fails even with gzip-1.11-1.fc36.x86_64. Fails with podman-4.2.1-2.fc37.x86_64. Fails with podman built from source.
Should we be asking: How did this ever pass? Maybe the test was disabled and recently re-enabled by accident or fluke?
> This reproduces super-trivially:
You’re right, this is clearly broken in c/image. Looking…
It passed because we don't do CI on f37, and e2e tests don't run as part of Fedora gating, only system tests. (Oh hi. This is me with a friendly reminder that test/system exists and feels neglected). So, something changed in f37, and the question is, what?
For context, I caught this in the e2e tests. What I meant in my comment above is...if it's a problem in the podman code (or a dependency), it seems strange it's been passing all along in F36. Since it has been, the problem must somehow be influenced by both the code AND the environment...which is hard to wrap my head around (in the context of the error).
@mtrmac I wouldn't be so sure it's c/image. podman @ main on my f36 laptop passes happily. Only f37 fails. This strongly smells like a problem in f37.
Does c/image interact with the system's gzip/tar or kernel in some way? Oof. I'm glad Dr. Miloslav is on the case :grinning:
Looking at the generated file, it seems almost exactly the first 1 MB of the decompressed data is just missing.
Replacing the use of github.com/klauspost/pgzip with standard library’s compress/gzip makes things work.
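For anyone wanting to check the "first 1 MB is missing" symptom independently of Podman, here is a minimal sketch using only the standard library's compress/gzip. The helper names (gzipBytes, gunzipBytes, missingPrefix) are made up for illustration and are not part of c/image or pgzip; the test data is synthetic.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes compresses data with the standard library's compress/gzip.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipBytes decompresses a gzip stream and returns the full payload.
func gunzipBytes(compressed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// missingPrefix reports how many leading bytes of want are absent from got,
// assuming got is a suffix of want. It returns -1 if got is not a suffix.
func missingPrefix(want, got []byte) int {
	if len(got) > len(want) || !bytes.HasSuffix(want, got) {
		return -1
	}
	return len(want) - len(got)
}

func main() {
	// 3 MiB of synthetic, non-constant data.
	want := make([]byte, 3<<20)
	for i := range want {
		want[i] = byte(i % 251)
	}

	compressed, err := gzipBytes(want)
	if err != nil {
		panic(err)
	}
	got, err := gunzipBytes(compressed)
	if err != nil {
		panic(err)
	}
	fmt.Println("missing bytes after round trip:", missingPrefix(want, got))

	// Simulate the symptom: a copy with the first 1 MiB dropped.
	fmt.Println("missing bytes in truncated copy:", missingPrefix(want, want[1<<20:]))
}
```

Swapping gunzipBytes for a pgzip-based reader and copying with io.Copy would be the natural next step for comparing the two libraries, but that part is left out here to keep the sketch dependency-free.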
No difference if I downgrade to glibc-2.35.9000-32.fc37.x86_64, make clean, and make.
Here's the last bit of strace:
openat(AT_FDCWD, "/tmp/foo.tar.gz/manifest.json", O_RDONLY|O_CLOEXEC) = -1 ENOTDIR (Not a directory)
write(2, "Error: payload does not match an"..., 410) = 410
@edsantiago There’s a bit of confusing Podman behavior: “payload does not match any of the supported image formats:” i.e. it blindly tries four different formats. The last one is definitely going to fail like that; we are interested in the docker-archive: failure.
Huh... okay, trimmed a little, istm that the untar is failing:
newfstatat(AT_FDCWD, "/tmp/foo.tar.gz", {st_mode=S_IFREG|0644, st_size=2715866, ...}, 0) = 0
openat(AT_FDCWD, "/tmp/foo.tar.gz", O_RDONLY|O_CLOEXEC) = 3
epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3004675888, u64=140156377478960}}) = -1 EPERM (Operation not permitted)
read(3, "\37\213\10\10\3\3635c", 8) = 8
mmap(0xc000c00000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000c00000
epoll_pwait(4, [], 128, 0, NULL, 0) = 0
read(3, "\0\3foo.tar\0\354\234\17|T\325\225\307/\210\20\4\333\200(hU\302Z-\376\1"..., 4096) = 4096
openat(AT_FDCWD, "/var/tmp/docker-tar3612857519", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 7
.....
read(3, "\234m!`201\367U\24\230\344\342\233\r\354\364h\347\272_\vL\1_1\357;v\334Q\345"..., 4096) = 4096
write(7, "\2\0\2\0\2\0\2\0\2\0\2\0\2\0\2\0\2\0\2\0\v\0\2\0\2\0\2\0\v\0\2\0"..., 1048576) = 1048576
futex(0xc00043c548, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 125 +++
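A side note on reading the trace: the first read(3, ...) returns "\37\213\10\10...", which in octal is 0x1f 0x8b 0x08 0x08, i.e. the gzip signature, the deflate method byte, and a flags byte with FNAME set (hence the "foo.tar" name in the next read). So the input is at least recognized as gzip before anything goes wrong. A small sketch of that check follows; looksLikeGzip and hasFileName are illustrative names, not functions from Podman or c/image.

```go
package main

import (
	"bytes"
	"fmt"
)

// gzipMagic is the two-byte signature that starts every gzip member (RFC 1952).
var gzipMagic = []byte{0x1f, 0x8b}

// flagFNAME is the FLG bit saying an original file name follows the fixed header.
const flagFNAME = 0x08

// looksLikeGzip reports whether header starts with the gzip signature
// followed by the deflate compression method byte (8).
func looksLikeGzip(header []byte) bool {
	return len(header) >= 3 && bytes.HasPrefix(header, gzipMagic) && header[2] == 0x08
}

// hasFileName reports whether a gzip header has the FNAME flag set.
func hasFileName(header []byte) bool {
	return looksLikeGzip(header) && len(header) >= 4 && header[3]&flagFNAME != 0
}

func main() {
	// The first 8 bytes from the trace, "\37\213\10\10\3\3635c", as octal literals.
	header := []byte{0o37, 0o213, 0o10, 0o10, 0o3, 0o363, '5', 'c'}
	fmt.Println("gzip:", looksLikeGzip(header), "has file name:", hasFileName(header))
}
```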
My current hypothesis is that this is Go 1.19 and its change "NopCloser's result now implements WriterTo whenever its input does" triggering a bug in pgzip’s WriteTo.
Would it be trivially easy to try downgrading to Go 1.18 to see if that makes a difference? (If not, don’t do anything complicated, I can continue working from my end.)
Oh, good thinking. With golang-1.18.4-1.fc37.x86_64 it works again.
Downgrading in CI is a "hard sell" for me with the F37 images. I feel it's important that we're running the newer/native distro toolchain to catch things like this (since it will be used at rpmbuild time). I'm okay with leaving the test skip() in place for now, though the underlying issue will certainly affect end-users of F37 in a few weeks' time if it's not fixed :cry:
@mtrmac is this going to break your back? Should we start a conversation with upstream Fedora about "Don't use golang 1.19" in 37?
> triggering a bug in pgzip’s WriteTo.
Seems like the obvious solution is to fix pgzip. I trust that is what @mtrmac's first approach will be.
I think the ultimately correct fix will be some variant of https://github.com/klauspost/pgzip/pull/50 . There might be smaller workarounds we can do with more confidence in the meantime (e.g. wrap the code to make the WriteTo method unreachable, falling back to the pre-Go 1.19 code path).
FYI @mheon: current Podman is succeeding in tests, but podman load is broken when compiled with Go 1.19 (which we don’t yet test). I have no idea what that should mean for Podman 4.3, I’ll leave that to you and others.
(A hypothetical bigger worry is that the pgzip package is used in other decompression paths, notably the primary pull code. I don’t immediately know if it’s possible for the problematic call combination to happen there; I suppose if it happened on the primary code paths, we would have seen much more breakage, so we are probably fine.)
FWIW https://github.com/klauspost/pgzip/pull/50 was merged; so we need it in all the top-level commands.
Okay, so it sounds like we're likely to hit the same/similar/related issues bringing F37 into other repos' CI. Dang :disappointed:
Oops, didn't refresh before commenting. I'll rebase my PR w/o the test-skip and cross my fingers :crossed_fingers:
Rebased my PR with these changes and force-pushed #15760. CI is running now. Thanks a bunch @mtrmac
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
While attempting to update the CI VM images to F37 (beta), found this test is failing on most test-configurations.
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Test should pass
Additional information you deem important (e.g. issue happens only occasionally):
Reproduces on:
int remote fedora-37 root host
int podman fedora-37 rootless host
int podman fedora-37 root host
int podman fedora-37 root container
Output of podman version:
See below
Output of podman info:
Package info (e.g. output of rpm -q podman or apt list podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
Annotated log