Closed: dustymabe closed this issue 2 years ago
I did a local build with the rawhide branch of the fcos configs repo, ran the coreos.boot-mirror.luks test with kola, and it passed.
There is something different between the logs; maybe the reason for the failure is that when running ignition-ostree-transposefs, it looks for the root device /dev/disk/by-label/root, but the actual device is /dev/dm-0.
Failed logs:
Reached target - Initrd Root Device.
[ 106.786833] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed condition check (ConditionPathExists=!/run/ignition-ostree-transposefs).
[ 106.816385] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...
Starting ignition-ostree-t…tion OSTree: Restore Partitions...
[ 106.873485] ignition-ostree-transposefs[1852]: Restoring rootfs from RAM...
[ 106.983108] ignition-ostree-transposefs[1856]: realpath: /dev/disk/by-label/root: No such file or directory
Successful logs:
[ 143.842513] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.
[ 143.859420] ignition-ostree-transposefs[1825]: Expected /dev/disk/by-label/root to point to /dev/dm-0, but points to /dev/disk/by-label/root; triggering udev
[ 143.876207] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed condition check (ConditionPathExists=!/run/ignition-ostree-transposefs).
[ 143.898409] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...
[ 143.991660] ignition-ostree-transposefs[1825]: Mounting /dev/disk/by-label/root (/dev/dm-0) to /sysroot
If you run it ten times does it pass every time? cosa kola run coreos.boot-mirror.luks --multiply=10 should do it.
Yes, all passed after running it 20 times in total.
Hmm. Maybe the issue only happens in our CI environment. Regardless we can remove the snooze and see if it happens again.
Looking at the error, this patch may help:
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
index 18224c36..4f72aacb 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
@@ -56,7 +56,7 @@ udev_trigger_on_label_mismatch() {
local expected_dev=$1; shift
local actual_dev
expected_dev=$(realpath "${expected_dev}")
- actual_dev=$(realpath "/dev/disk/by-label/$label")
+ actual_dev=$(realpath "/dev/disk/by-label/$label" || :)
if [ "$actual_dev" != "$expected_dev" ]; then
echo "Expected /dev/disk/by-label/$label to point to $expected_dev, but points to $actual_dev; triggering udev"
udevadm trigger --settle "$expected_dev"
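To illustrate why the guard matters, here is a toy sketch (the path below is hypothetical, and it assumes the transposefs script runs with `set -e`, which is what makes an unguarded realpath failure fatal):

```shell
#!/bin/bash
# Hedged sketch: under `set -e`, an unguarded realpath on a missing
# symlink aborts the whole script, matching the status=1/FAILURE above.
set -euo pipefail

# Hypothetical stand-in for /dev/disk/by-label/root when the symlink is
# gone; the parent directory does not exist, so realpath fails.
missing="/no-such-dir-$$/root"

# Unguarded form (commented out): under `set -e` this would abort here.
#   actual_dev=$(realpath "${missing}")

# Guarded form from the patch: a failure yields an empty string instead,
# so the label-mismatch check below it can still run and trigger udev.
actual_dev=$(realpath "${missing}" 2>/dev/null || :)
echo "actual_dev='${actual_dev}'"
# prints: actual_dev=''
```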
Though it would be good to narrow down what's happening there at the udev level and possibly file another RHBZ before working around it. The udevadm trigger was initially added because of issues in RHCOS, not FCOS.
This is back in 37.20220217.91.0 (build#80). I was able to reproduce locally pretty easily. @HuijingHei - can you try with this qcow? (You will need to uncompress it first.)
Tested with fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2; it failed into emergency.target in the initramfs 2 times (out of 10 total), but the error logs are different from the first description.
Details
[ 204.645515] systemd[1]: Timed out waiting for device dev-disk-by\x2dpartlabel-root\x2d2.device - /dev/disk/by-partlabel/root-2.
[ 204.695643] ignition[861]: disks: createRaids: op(a): [failed] waiting for devices [/dev/disk/by-partlabel/boot-1 /dev/disk/by-partlabel/boot-2 /dev/dist
[ 204.788764] multipathd[552]: exit (signal)
[ 206.313988] systemd[1]: dev-disk-by\x2dpartlabel-root\x2d2.device: Job dev-disk-by\x2dpartlabel-root\x2d2.device/start failed with result 'timeout'.
[ 206.361174] ignition[861]: Ignition failed: failed to create raids: failed to wait on raids devs: device unit dev-disk-by\x2dpartlabel-root\x2d2.device tit
[ 206.423551] ignition[861]: }
[ 206.431028] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE
[ 206.459967] systemd[1]: ignition-disks.service: Failed with result 'exit-code'.
@dustymabe is there any way I can find the console logs for coreos.boot-mirror.luks in build#80?
It looks like maybe the logs got cleaned up from that run already :(
I just ran this locally against the ISO from https://github.com/coreos/fedora-coreos-tracker/issues/1092#issuecomment-1047592867 and it failed:
[ 81.712037] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...
Starting ignition-ostree-t…tion OSTree: Restore Partitions...
[ 81.733279] ignition-ostree-transposefs[1856]: Restoring rootfs from RAM...
[ 81.802958] ignition-ostree-transposefs[1860]: realpath: /dev/disk/by-label/root: No such file or directory
[ 81.805947] systemd[1]: ignition-ostree-transposefs-restore.service: Main process exited, code=exited, status=1/FAILURE
[ 81.806792] systemd[1]: ignition-ostree-transposefs-restore.service: Failed with result 'exit-code'.
[FAILED] Failed to start ignition-o…nition OSTree: Restore Partitions.
So... it indeed looks like your failure is different, because mine found those devices:
[ 39.664588] ignition[855]: disks: createRaids: op(a): [finished] waiting for devices [/dev/disk/by-partlabel/boot-1 /dev/disk/by-partlabel/boot-2 /dev/disk/by-partlabel/root-1 /dev/disk/by-partlabel/root-2]
I can reproduce with fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2 1 time (out of 20 total); the error logs are the same as in the description. But I have no idea how to debug this, any suggestions? Thanks!
The snooze for coreos.boot-mirror.luks was extended in https://github.com/coreos/fedora-coreos-config/pull/1543.
@HuijingHei One thing I find helpful when debugging udev-related things is to boot with rd.udev.log_level=debug
. That will give you debug output for systemd-udevd and you should be able to see for example at what point the root symlink goes away.
Thanks @jlebon for your suggestion.
I appended rd.udev.log_level=debug to the kernel arguments and ran coreos.boot-mirror.luks; here is the console log, hope it will be helpful:
console.txt
@HuijingHei The debug logs are in the journal. If you want it on the console, you can add systemd.log_target=console
but because it's a lot of data, the printing may influence timing and make the bug not appear.
Hmm but actually, you're triggering this by running the test and not manually doing cosa run
with a similar Butane config, right? In that case, you should have the initrd journal dump in the test output directory. The file is called ignition-virtio-dump.txt.
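For example, to pull out just the udev activity around the root label symlink (a rough sketch; the grep pattern is my guess at what to look for, not an established recipe):

```shell
# Hedged sketch: filter the initrd journal dump for systemd-udevd messages
# and any mention of the by-label/root symlink.
grep -E 'systemd-udevd|by-label/root' ignition-virtio-dump.txt
```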
Thanks @jlebon, I will try tomorrow.
I appended AppendKernelArgs: "rd.udev.log_level=debug", to
https://github.com/coreos/coreos-assembler/blob/d5f1623aad6d133b2c7c00e784c04ab6828450c1/mantle/kola/tests/misc/boot-mirror.go#L144, rebuilt kola, and ran cosa kola run --qemu-image fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2 coreos.boot-mirror.luks --multiply=10. Are the steps correct? (Uploaded the ignition-virtio-dump.txt.)
Just noting here that I see a similar error in the ignition-ostree-transposefs service in the ext.config.root-reprovision.filesystem-only failure we just saw over in branched (build#201).
The snooze for coreos.boot-mirror.luks was extended in https://github.com/coreos/fedora-coreos-config/pull/1576.
I went back to see when this problem was introduced: it appeared between 36.20220205.91.0 and 36.20220208.91.0.
Added:
openldap-compat-2.6.1-2.fc36.x86_64
Upgraded:
NetworkManager 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
NetworkManager-cloud-setup 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
NetworkManager-libnm 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
NetworkManager-team 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
NetworkManager-tui 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
btrfs-progs 5.16-2.fc36.x86_64 → 5.16.1-1.fc36.x86_64
containerd 1.5.8-1.fc36.x86_64 → 1.6.0~rc.2-2.fc36.x86_64
containers-common 4:1-47.fc36.noarch → 4:1-50.fc36.noarch
crypto-policies 20220119-1.git50109e7.fc36.noarch → 20220203-1.git112f859.fc36.noarch
kernel 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
kernel-core 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
kernel-modules 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
krb5-libs 1.19.2-4.fc36.1.x86_64 → 1.19.2-5.fc36.x86_64
libgcc 12.0.1-0.4.fc36.x86_64 → 12.0.1-0.6.fc36.x86_64
libgomp 12.0.1-0.4.fc36.x86_64 → 12.0.1-0.6.fc36.x86_64
libibverbs 38.1-3.fc36.x86_64 → 39.0-1.fc36.x86_64
libicu 69.1-3.fc36.x86_64 → 69.1-5.fc36.x86_64
libstdc++ 12.0.1-0.4.fc36.x86_64 → 12.0.1-0.6.fc36.x86_64
nftables 1:1.0.1-2.fc36.x86_64 → 1:1.0.1-3.fc36.x86_64
openldap 2.6.1-1.fc36.x86_64 → 2.6.1-2.fc36.x86_64
rpm-ostree 2022.1-3.fc36.x86_64 → 2022.2-1.fc36.x86_64
rpm-ostree-libs 2022.1-3.fc36.x86_64 → 2022.2-1.fc36.x86_64
squashfs-tools 4.5-6.20220202git11c9591.fc36.x86_64 → 4.5-7.20220204git8a9d02e.fc36.x86_64
Here are the fedora-coreos-config changes between those two: f4cf5f9..b1fa0e2
Doing a local rawhide build with all current (as of 20220309) packages, except for reverting the kernel to kernel-5.17.0-0.rc2.83.fc36, makes the issue unreproducible for me. The root cause looks to be something in this kernel transition:
kernel 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
IIUC the set of changes between those two can be seen by comparing v5.17-rc2..dcb85f85fa6f
in the kernel-ark repo.
Skimmed that diff, but nothing jumped out related to storage devices, mounts, udev, etc. I think we'd need to bisect.
$ git log --pretty=oneline --no-merges v5.17-rc2..dcb85f85fa6f | wc -l
143
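A sketch of how that bisect could be set up (repo URL and refs are taken from this discussion; building and booting each candidate kernel in the test VM is left abstract):

```shell
# Hedged sketch: bisect the kernel-ark range between the last known-good
# tag and the first known-bad snapshot from the package diff above.
git clone https://gitlab.com/cki-project/kernel-ark
cd kernel-ark
git bisect start
git bisect bad dcb85f85fa6f   # first known-bad snapshot
git bisect good v5.17-rc2     # last known-good tag
# For each commit git checks out: build the kernel, boot it in the test
# environment, run coreos.boot-mirror.luks, then report the result with
#   git bisect good    # test passed
#   git bisect bad     # test failed
```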
I do not know why I cannot reproduce this locally; I'm not sure my steps are correct:
cosa init -b rawhide https://github.com/coreos/fedora-coreos-config
cosa fetch
cosa build
kola run coreos.boot-mirror.luks --multiply=10
Most likely it's a race condition and my system is either faster or slower than yours :)
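If it is a race, looping the run may make it easier to catch locally (a sketch reusing the kola invocation from above; on failure, the most recent kola output directory holds the logs):

```shell
# Hedged sketch: rerun the flaky test until the first failure, so the
# failing run's artifacts are the newest kola output directory.
for i in $(seq 1 20); do
    if ! cosa kola run coreos.boot-mirror.luks; then
        echo "failed on iteration ${i}"
        break
    fi
done
```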
I'm no longer seeing this issue with the latest kernels in rawhide. It appears it stopped showing up in the following kernel transition:
kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
OK, more information here. I was misled in my previous understanding that the issue was fixed. It turns out that kernel NVR versions that include ${date}git${hash} have debug turned ON, whereas the kernels that don't include it do NOT.
This problem actually only shows itself when debug is turned on. I was able to reproduce this issue on testing-devel
based on Fedora Linux 35 when I used the debug kernel:
$ git diff
diff --git a/manifests/bootable-rpm-ostree.yaml b/manifests/bootable-rpm-ostree.yaml
index 784acd4..d23d3a6 100644
--- a/manifests/bootable-rpm-ostree.yaml
+++ b/manifests/bootable-rpm-ostree.yaml
@@ -7,7 +7,7 @@
packages:
# Kernel + systemd. Note we explicitly specify kernel-{core,modules}
# because otherwise depsolving could bring in kernel-debug.
- - kernel kernel-core kernel-modules systemd
+ - kernel-debug kernel-debug-core kernel-debug-modules systemd
# linux-firmware now a recommends so let's explicitly include it
# https://gitlab.com/cki-project/kernel-ark/-/commit/32271d0cd9bd52d386eb35497c4876a8f041f70b
# https://src.fedoraproject.org/rpms/kernel/c/f55c3e9ed8605ff28cb9a922efbab1055947e213?branch=rawhide
NOTE: For now I needed this patch to rpm-ostree
to do the build.
The problem actually goes back much farther. I chased it for a while but then gave up. Here are the debug kernels I tried and confirmed the issue occurs on (all built on top of testing-devel as of today):
kernel-5.16.0-60.fc36
kernel-5.15.18-200.fc35
kernel-5.14.18-300.fc35
kernel-5.13.16-200.fc34
kernel-5.12.19-300.fc34
kernel-5.10.19-200.fc33
So this problem has been around over a year in the kernel. I imagine we only started hitting it this year because some of our code introduced some path which made the race easier to hit.
Rather than waste much more time on this issue (since it only happens with debug kernels) we should consider ways to make the test artificially not fail if we're on a debug kernel.
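One way a test could do that (a sketch only, based on the NVR observation above that git snapshot kernels have debug on, plus the +debug suffix that kernel-debug builds carry; real detection may need to be more robust):

```shell
# Hedged sketch: skip the known-flaky test when the running kernel looks
# like a debug-enabled build, judged from the uname -r string.
case "$(uname -r)" in
    *git*|*+debug)
        echo "debug-enabled kernel detected; skipping test"
        exit 0
        ;;
esac
echo "non-debug kernel; running test"
```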
I decided for now to punt on this issue with a workaround since I've spent way too much time on it and the fact that it only presents itself with a debug kernel, which we'll never ship to our prod streams.
Someone else is welcome to pick up and investigate further and open bugs if they find enough information to open a BZ, but I need to move on.
The coreos.boot-mirror.luks test just started failing semi-consistently in the rawhide stream. We first saw it in CI and now it's happening in the pipeline fedora-coreos-pipeline#19. Cracking open the console.txt for the test shows us:
Full console.txt file: coreos-boot-mirror-luks-console.txt