dustymabe closed this issue 5 months ago.
Some more info: it appears something on the ext4 filesystem is getting corrupted when OSBuild runs qemu-img convert to go from raw to qcow2. I converted the image back into a raw image from the qcow2, mounted the third partition, and when I ls the ostree/ dir I see:
bash-5.2# ls /mnt/ostree/
[ 7220.329345] EXT4-fs error (device loop0p3): ext4_lookup:1855: inode #65538: comm ls: iget: checksum invalid
ls: cannot access '/mnt/ostree/': Bad message
OK, further update: I became increasingly suspicious that the corruption was happening earlier, in the OSBuild steps themselves, rather than when copying the file out of the supermin VM. I decided to add a sanity check to the org.osbuild.qemu OSBuild stage that runs qemu-img compare immediately after it creates the qcow2 from the raw image. Here's the patch:
diff --git a/stages/org.osbuild.qemu b/stages/org.osbuild.qemu
index 642b5146..54e707d4 100755
--- a/stages/org.osbuild.qemu
+++ b/stages/org.osbuild.qemu
@@ -219,22 +219,34 @@ def main(inputs, output, options):
     if coroutines:
         print(f"qemu-img coroutines: {coroutines}")
         extra_args += ["-m", coroutines]

     cmd = [
         "qemu-img", "convert",
         "-O", fmt["type"],
         *extra_args,
         source, target
     ]

     subprocess.run(
         cmd, check=True
     )

+    # Sanity check that the image is 100% the same as the source
+    cmd = [
+        "qemu-img", "compare",
+        "-f", "raw",
+        "-F", fmt["type"],
+        source, target
+    ]
+    subprocess.run(
+        cmd, check=True
+    )
+
+
     return 0


 if __name__ == '__main__':
     args = osbuild.api.arguments()
     r = main(args["inputs"], args["tree"], args["options"])
     sys.exit(r)
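For reference, the added logic boils down to this small standalone sketch (the function and argument names here are illustrative, not OSBuild's actual API):

```python
import subprocess

def qemu_img_cmds(source, target, fmt="qcow2"):
    # Build the convert command plus the sanity-check compare that
    # verifies the qcow2 against the source raw image.
    convert = ["qemu-img", "convert", "-O", fmt, source, target]
    compare = ["qemu-img", "compare", "-f", "raw", "-F", fmt, source, target]
    return convert, compare

def convert_and_verify(source, target, fmt="qcow2"):
    # check=True turns a content mismatch (non-zero exit status from
    # qemu-img compare) into a CalledProcessError, failing the stage early.
    for cmd in qemu_img_cmds(source, target, fmt):
        subprocess.run(cmd, check=True)
```

Failing right at the stage that produced the image is what localizes the corruption, instead of discovering it much later at boot time.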
and sure enough one of the first runs I went through complained:
⏱ Duration: 0s
Pipeline qemu: dd43158cea505a60ebd511dc09e0fba0247cef4903ebf9c5b2c70e813b54a2cf
Build
root: <host>
runner: org.osbuild.fedora38 (org.osbuild.fedora38)
org.osbuild.qemu: dd43158cea505a60ebd511dc09e0fba0247cef4903ebf9c5b2c70e813b54a2cf {
"filename": "fedora-coreos-39.20240214.dev.0-qemu.x86_64.qcow2.tmp",
"format": {
"type": "qcow2",
"compression": false,
"compat": "1.1"
}
}
[ 97.999904] audit: type=1400 audit(1707944575.724:72): avc: denied { mounton } for pid=961 comm="mount" path="/srv/cache/osbuild/store/tmp/buildroot-tmp-bzjrq14z/inputs/image/disk.img" dev="vdb1" ino=17809672 scontext=system_u:system_r:ke1
Failed to open file "/sys/fs/selinux/checkreqprot": Read-only file system
Content mismatch at offset 403718656!
Traceback (most recent call last):
File "/run/osbuild/bin/org.osbuild.qemu", line 251, in <module>
r = main(args["inputs"], args["tree"], args["options"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/run/osbuild/bin/org.osbuild.qemu", line 241, in main
subprocess.run(
File "/usr/lib64/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['qemu-img', 'compare', '-f', 'raw', '-F', 'qcow2', '/run/osbuild/inputs/image/disk.img', '/run/osbuild/tree/fedora-coreos-39.20240214.dev.0-qemu.x86_64.qcow2.tmp']' returned non-zero exit status 1.
⏱ Duration: 4s
Failed
From the "Content mismatch at offset 403718656!" line we can see that offset 403718656 is ~385 MiB, which places it squarely inside partition 3, the /boot/ partition.
Talked to @jlebon on this. We think (actually more @jlebon, I was just nodding my head to everything he said) that this might be a bug with reflinks on XFS. The cache qcow2 is an XFS filesystem, and when we switch the org.osbuild.qemu stage to not using compression, we think that enables qemu-img convert to use reflinks for some of the data ranges in the image (i.e. to share them between the source raw image and the newly created qcow2 image). In other words, this is a theory that could explain why we see issues only after switching off compression. This particular issue reminded @jlebon of https://github.com/coreos/coreos-assembler/pull/935, which we never fully got to the bottom of.
So I don't know if it's this, but the classic failure mode when you create a disk image and then boot from it shortly afterwards involves cache=none mode (which uses O_DIRECT). If this is the scenario, then you need to modify your image creation pipeline so that it does an fsync or similar on the disk image before qemu runs. Here's how virt-builder does that:
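Independent of virt-builder's actual implementation (link above), the fsync-before-qemu idea can be sketched in Python as:

```python
import os

def flush_to_disk(path: str) -> None:
    # Force all cached writes for this file down to stable storage so a
    # subsequent O_DIRECT reader (e.g. qemu with cache=none) sees the
    # final data rather than whatever happens to be on disk already.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

On Linux, fsync on a read-only descriptor is sufficient to flush the file's dirty pages; the build pipeline would call this on the disk image right before launching qemu.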
@rwmjones in the failure I detailed in https://github.com/coreos/coreos-assembler/issues/3728#issuecomment-1944607900 all that is happening is:
qemu-img convert input.raw output.qcow2
qemu-img compare -f raw -F qcow2 input.raw output.qcow2
and you can see the failure:
Content mismatch at offset 403718656!
Would you anticipate this to be a kernel page cache issue?
No, that wouldn't be a kernel page cache issue.
It's extremely suspicious though. Usually qcow2 and qemu-img are rock solid tools.
Yes :). I'd like to take a moment here and thank you and all the other maintainers of those tools over the years. You've truly built something that is load bearing for half of the internet and the world's economy at this point.
I currently don't think it's an issue in qemu-img, but will let you know if that changes.
I do appreciate you weighing in here, though.
OK - I was asked by @sandeen to provide a disk image containing an XFS filesystem with the good and bad generated disk images on it.
I had to make a slight modification to OSBuild to make it save off the bad disk image on failure:
diff --git a/osbuild/pipeline.py b/osbuild/pipeline.py
index af4c3944..93184bf6 100644
--- a/osbuild/pipeline.py
+++ b/osbuild/pipeline.py
@@ -358,6 +358,8 @@ class Pipeline:
                 results["stages"].append(r)

                 if not r.success:
+                    print(f"XXXX failed tree object at {tree.path} committing")
+                    object_store.commit(tree, stage.id)
                     cleanup(build_tree, tree)
                     results["success"] = False
                     return results
Then I ran this in a loop:
while cosa buildextend-qemu --force; do
echo "COUNT is $count";
count=$((count+1));
rm -f cache/cache2.qcow2;
done
It took a couple of iterations, but it did fail with:
⏱ Duration: 0s
Pipeline qemu: e0ab569c34dd263e53d28e56e0b2f9d6f27c4cb7c4fe5d7177f57f8b719fd229
Build
root: <host>
runner: org.osbuild.fedora38 (org.osbuild.fedora38)
org.osbuild.qemu: e0ab569c34dd263e53d28e56e0b2f9d6f27c4cb7c4fe5d7177f57f8b719fd229 {
"filename": "qemu.qcow2",
"format": {
"type": "qcow2",
"compression": false,
"compat": "1.1"
}
}
Failed to open file "/sys/fs/selinux/checkreqprot": Read-only file system
Content mismatch at offset 403718656!
Traceback (most recent call last):
File "/run/osbuild/bin/org.osbuild.qemu", line 251, in <module>
r = main(args["inputs"], args["tree"], args["options"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/run/osbuild/bin/org.osbuild.qemu", line 241, in main
subprocess.run(
File "/usr/lib64/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['qemu-img', 'compare', '-f', 'raw', '-F', 'qcow2', '/run/osbuild/inputs/image/disk.img', '/run/osbuild/tree/qemu.qcow2']' returned non-zero exit status 1.
⏱ Duration: 5s
XXXX failed tree object at /srv/cache/osbuild/store/stage/uuid-66d99595d47645869c04b0608ce618f9/data committing
Failed
+ rm -rf /srv/tmp/build.qemu/supermin.out /srv/tmp/build.qemu/supermin.prepare /srv/tmp/build.qemu/supermin.build
+ '[' '!' -f /srv/tmp/build.qemu/rc ']'
++ cat /srv/tmp/build.qemu/rc
+ rc=1
+ '[' -n '' ']'
+ return 1
+ rm -f /srv/builds/39.20240216.dev.0/x86_64/.qemu.building
failed to execute cmd-buildextend-qemu: exit status 1
[coreos-assembler]$
[coreos-assembler]$ rpm -q kernel-core xfsprogs
kernel-core-6.7.3-200.fc39.x86_64
xfsprogs-6.4.0-1.fc39.x86_64
Here's a link to the disk image: cache2.qcow2.zst.
[dustymabe@media cache]$ md5sum cache2.qcow2*
99e49544af7c10ed419afa6040cf3a7e cache2.qcow2
48718e271056210d82e9ecf04fa7f624 cache2.qcow2.zst
Within the disk's XFS filesystem, the input and output files of qemu-img convert are located at:

raw image: osbuild/store/objects/0a4be93ab2bb5c17851a57574f3af0cffe135c58ccfc840e4f0ea37dda36f77c/data/tree/disk.img
qcow2 image: osbuild/store/objects/e0ab569c34dd263e53d28e56e0b2f9d6f27c4cb7c4fe5d7177f57f8b719fd229/data/tree/qemu.qcow2
I was also asked to run xfs_bmap -vvp against the source raw image. Here is the output of that:
As a lazy way to test this in other environments I opened a test PR against OSBuild that would then run in their CI
Sure enough CI failed on two environments:
Thanks. So the disk.img does indeed have many reflinked extents, for starters (extents with flags 100000). When I compare the original disk image to the converted-back-to-raw qcow2 image, the qcow2 image seems to have many ranges of zeros rather than the correct data:
# qemu-img convert osbuild/store/objects/e0ab569c34dd263e53d28e56e0b2f9d6f27c4cb7c4fe5d7177f57f8b719fd229/data/tree/qemu.qcow2 ../disk.img
# cmp -l ./osbuild/store/objects/0a4be93ab2bb5c17851a57574f3af0cffe135c58ccfc840e4f0ea37dda36f77c/data/tree/disk.img ../disk.img | awk '{print $3}' | sort | uniq
0
Not sure what to make of this yet... my first thought is that it's possibly related to preallocated/unwritten extents in the original disk.img but that's just a guess.
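One way to quantify that "ranges of zeros" observation is a chunked byte comparison that reports the offsets where the copy reads back as zeros while the original holds real data. A sketch (not the exact cmp/awk pipeline used above):

```python
def zeroed_ranges(orig_path, copy_path, chunk=1 << 20):
    # Compare two equal-sized files chunk by chunk; collect (start, end)
    # offsets where the copy is all zeros but the original is not.
    ranges = []
    with open(orig_path, "rb") as a, open(copy_path, "rb") as b:
        off = 0
        while True:
            ca, cb = a.read(chunk), b.read(chunk)
            if not ca:
                break
            if ca != cb and cb == b"\0" * len(cb):
                ranges.append((off, off + len(ca)))
            off += len(ca)
    return ranges
```

If the corruption is really holes being reported where data exists, every mismatching range should show up here as a zeroed region of the copy.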
Hi,
Can you upload the pre-convert raw image somewhere, or is there an easy way for me to get it from some pipeline (like https://github.com/osbuild/osbuild/pull/1594)? (I’ve tried downloading the cache2.qcow2.zst that you linked, but the download is very slow for me (~30 kB/s), so would take more than a day to download.)
Talked with @XanClic and she was able to get the file downloaded.
Some more context here about what OSBuild is doing that might be unique. OSBuild builds things in stages. Each stage does one or two things, but its entire output gets saved and copied into the next stage for the next work item. A simplified version of what is going on here is:

tree
  raw-image
  raw-qemu-image
  qemu (qcow2)

So what's happening here: there is a disk.img from the raw-image stage that gets copied into the raw-qemu-image stage and modified slightly, and that copy then gets qemu-img converted in the qemu stage. That means the original disk.img from the raw-image pipeline probably shares reflinks with the disk.img from the raw-qemu-image pipeline, which in turn shares reflinks with the qemu.qcow2 from the qemu pipeline.
This series of steps may be what is contributing to the problem here.
Also note this isn't 100% reproducible. Sometimes I can reproduce it every time and other times only 10% or so.
More data:
I patched OSBuild so that cp doesn't use --reflink=auto when it copies the first disk.img to the second disk.img in the workflow described above:
diff --git a/stages/org.osbuild.copy b/stages/org.osbuild.copy
index 5bc9067f..a7570ee7 100755
--- a/stages/org.osbuild.copy
+++ b/stages/org.osbuild.copy
@@ -157,7 +157,7 @@ def main(args, options):
         print(f"copying '{src}' -> '{dst}'")

-        cmd = ["cp", "-a", "--reflink=auto"]
+        cmd = ["cp", "-a", "--reflink=never"]
         if remove_destination:
             cmd.append("--remove-destination")

         subprocess.run(cmd + [src, dst], check=True)
I ran the loop 30 times and didn't see any failures.
So far, I haven’t found any reproducer on my system (and I’d like to have some simple local reproducer :)).
What I’ve tried that didn’t work (50 iterations for most items, 1000 for the last):
However, I haven’t scrubbed the image between runs (i.e. basically rm cache2.qcow2 && zstd -d cache2.qcow2.zst), that’s one thing I still ought to try (but will of course take much longer).
I’ve done that now for the VM case (doesn’t take that long with hot-plugging and -unplugging), and still couldn’t reproduce it (50 runs).
OK, I might have a more minimal reproducer that doesn't include OSBuild. I took a look at exactly what the OSBuild stages were doing when copying things around and wrote a script to simulate it more closely. In summary, we are now:

1. copying disk.img into a new disk.img with --reflink=always
2. running qemu-img convert to generate the qcow2

Using the filesystem I uploaded before, mount it, cd into the root of that filesystem, and then run this script:
[builder@2f7c40c67cf8 srv]$ cat repro.sh
#!/usr/bin/bash
set -eux -o pipefail
[ ! -e /dev/loop-control ] && mknod /dev/loop-control c 10 237
mkdir -p mnt
count=1
while true; do
echo "COUNT is $count"; count=$((count+1))
cp --remove-destination --reflink=always \
osbuild/store/objects/06ead42642e12a8f1ab109bf29935661ed4060a0bc59accbae7549a2f0cd900f/data/tree/disk.img ./
truncate --size=$((10 * 1024 * 1024 * 1024)) disk.img
loopdev=$(losetup -f --show --partscan disk.img)
sleep 1
mount ${loopdev}p3 ./mnt/
sed -i -E 's/^options (.*)$/options \1 ignition.platform.id=qemu console=ttyS0 console=ttyS0,115200n8 ignition.platform.id=qemu/' \
./mnt/loader/entries/*.conf
umount ./mnt
qemu-img convert -O qcow2 -o compat=1.1 disk.img qemu.qcow2
qemu-img compare -f raw -F qcow2 disk.img qemu.qcow2
done
bash-5.2# bash /srv/repro.sh
+ '[' '!' -e /dev/loop-control ']'
+ mkdir -p mnt
+ count=1
+ true
+ echo 'COUNT is 1'
COUNT is 1
+ count=2
+ cp --remove-destination --reflink=always osbuild/store/objects/06ead42642e12a8f1ab109bf29935661ed4060a0bc59accbae7549a2f0cd900f/data/tree/disk.img ./
+ truncate --size=10737418240 disk.img
++ losetup -f --show --partscan disk.img
[ 234.459387] loop0: detected capacity change from 0 to 20971520
[ 234.461171] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 234.462675] GPT:5142527 != 20971519
[ 234.463235] GPT:Alternate GPT header not at the end of the disk.
[ 234.463979] GPT:5142527 != 20971519
[ 234.464393] GPT: Use GNU Parted to correct GPT errors.
[ 234.465016] loop0: p1 p2 p3 p4
+ loopdev=/dev/loop0
+ sleep 1
+ mount /dev/loop0p3 ./mnt/
[ 235.488936] EXT4-fs (loop0p3): mounted filesystem 96d15588-3596-4b3c-adca-a2ff7279ea63 r/w with ordered data mode. Quota mode: none.
+ sed -i -E 's/^options (.*)$/options \1 ignition.platform.id=qemu console=ttyS0 console=ttyS0,115200n8 ignition.platform.id=qemu/' ./mnt/loader/entries/ostree-1-fedora-coreos.conf
+ umount ./mnt
[ 235.515810] EXT4-fs (loop0p3): unmounting filesystem 96d15588-3596-4b3c-adca-a2ff7279ea63.
+ qemu-img convert -O qcow2 -o compat=1.1 disk.img qemu.qcow2
+ qemu-img compare -f raw -F qcow2 disk.img qemu.qcow2
Content mismatch at offset 403718656!
Nice job on the reproducer! If I omit the step where it modifies the copied disk.img via sed, it passes. If I md5sum the copied disk.img before mounting it, it passes.
Not sure what to make of this yet.
Thanks, that works indeed!
With debugging information put into qemu-img, I can see that it believes the data at that offset is zero, whereas it is not zero when you actually inspect it. qemu-img gets this information via SEEK_HOLE/SEEK_DATA.
There seems to be some inconsistency in this hole information. Fully reading disk.img seems to update it, so putting a cat disk.img > /dev/null before the qemu-img convert makes the test pass.
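SEEK_HOLE/SEEK_DATA can be exercised directly from Python. This sketch enumerates the (start, end) data extents a file reports, which is the same hole map qemu-img consults when deciding which ranges it needs to copy:

```python
import os

def data_extents(path):
    # Walk the file with SEEK_DATA/SEEK_HOLE and collect (start, end)
    # ranges the filesystem claims contain data; everything outside
    # these ranges is a hole, i.e. reported as reading back zeros.
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: no more data past pos
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            pos = end
    finally:
        os.close(fd)
    return extents
```

On a filesystem reporting stale extent information, running this before and after a full read of the file would show the hole map changing, which is exactly the inconsistency described above.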
Consequently, neither qcow2 nor qemu-img nor the target filesystem you use seems to be of importance. But beware that if you use qemu-img compare, it will again check the hole information, so the following passes for me:
qemu-img convert -f raw -O raw disk.img /tmp/qemu.img
qemu-img compare -f raw -F raw disk.img /tmp/qemu.img
(Side note: I actually have no idea why it does report a mismatch when using qcow2. The hole information in disk.img should still be wrong, making qemu-img compare skip the same areas as convert did. Perhaps the qcow2 clustering changes something about the compare granularity.)
So this says both images are identical, but if you use diff -q instead of qemu-img compare (or compare a hexdump), you will indeed see that they differ, because diff actually reads the whole image. The same holds if you put a cat disk.img > /dev/null between qemu-img convert and qemu-img compare (because this seems to update the hole information).
So replacing qemu-img convert + qemu-img compare with the following reproduces the problem completely without qemu or qcow2:
cp --sparse=always disk.img /tmp/qemu.img
diff -q disk.img /tmp/qemu.img
(Again, note that using qemu-img compare instead of diff would report both images to be identical, because it checks the hole information and skips everything that's reportedly zero. But if you run it after something has read the whole image, e.g. after diff, which does that, it reports a difference.)
A lot to unpack there. Thanks @XanClic! I'm not sure quite what to make of it all, but there is one thing you mention in there that seems interesting:
Consequently, neither qcow2 nor qemu-img nor what target filesystem you use seem to be of importance
I'd argue that the target filesystem is of importance here. We switched from xfs to ext4 to get unblocked on this, and that is working fine in our pipeline. I've also been running builds on Fedora Cloud VMs (BTRFS, which also supports reflinks) and haven't seen any issues there.
I'm not saying it is 100% filesystem related, but there is at least some data pointing toward it.
Oh, absolutely, the (source) filesystem does matter. I just meant the target filesystem of the copy/convert operation (for qemu.qcow2), which I changed to tmpfs, and still saw the problem.
I do believe that the (seemingly) incorrect hole information for disk.img is XFS-related.
Patch from dchinner:
https://lore.kernel.org/linux-xfs/20240220224928.3356-1-david@fromorbit.com/T/#u
Still trying to craft a simple reproducer though.
Landed upstream in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4b2f459d86252619448455013f581836c8b1b7da
Hasn't been backported to any stable branches yet.
This was backported to Fedora 6.7 series in https://gitlab.com/cki-project/kernel-ark/-/commit/c0412c5250f7bdb2ea61b61f27eada0d8a135692 and should land in F39 with kernel-6.7.11-200
(https://bodhi.fedoraproject.org/updates/FEDORA-2024-2fcce4ffb7)
With the new kernel landed and our cache qcow switched back to XFS we can now close this.
When we switched OSBuild over to not compressing, we started getting failures where images wouldn't boot, like: