coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
264 stars 59 forks source link

rawhide: kola: 20220205: `coreos.boot-mirror.luks` test enters emergency.target #1092

Closed dustymabe closed 2 years ago

dustymabe commented 2 years ago

The coreos.boot-mirror.luks just started failing semi consistently in the rawhide stream. We first saw it in CI and now it's happening in the pipeline fedora-coreos-pipeline#19. Cracking open the console.txt for the test shows us:

  106.786833] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed c
[  106.816385] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...^M
         Starting ^[[0;1;39mignition-ostree-t…tion OSTree: Restore Partitions...^M
[  106.873485] ignition-ostree-transposefs[1852]: Restoring rootfs from RAM...^M
[  106.983108] ignition-ostree-transposefs[1856]: realpath: /dev/disk/by-label/root: No such file or directory^M
[  106.991587] systemd[1]: ignition-ostree-transposefs-restore.service: Main process exited, code=exited, status=1/FAILURE^M
[  106.994594] systemd[1]: ignition-ostree-transposefs-restore.service: Failed with result 'exit-code'.^M
[  106.999748] systemd[1]: Failed to start ignition-ostree-transposefs-restore.service

Full console.txt file: coreos-boot-mirror-luks-console.txt

HuijingHei commented 2 years ago

I do a local build with the rawhide branch of the fcos configs repo, run coreos.boot-mirror.luks test with kola, and result is passed

There are something different between the logs, maybe the failed reason is: when running ignition-ostree-transposefs, find root device /dev/disk/by-label/root, but actual is /dev/dm-0

Failed logs:

Reached target  - Initrd Root Device.
[  106.786833] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed condition check (ConditionPathExists=!/run/ignition-ostree-transposefs).
[  106.816385] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...
         Starting mignition-ostree-t…tion OSTree: Restore Partitions...
[  106.873485] ignition-ostree-transposefs[1852]: Restoring rootfs from RAM...
[  106.983108] ignition-ostree-transposefs[1856]: realpath: /dev/disk/by-label/root: No such file or directory

Successful logs:

[  143.842513] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.^M
[  143.859420] ignition-ostree-transposefs[1825]: Expected /dev/disk/by-label/root to point to /dev/dm-0, but points to /dev/disk/by-label/root; triggering udev^M
[  143.876207] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed condition check (ConditionPathExists=!/run/ignition-ostree-transposefs).^M
[  143.898409] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...^M
[  143.991660] ignition-ostree-transposefs[1825]: Mounting /dev/disk/by-label/root (/dev/dm-0) to /sysroot^M
dustymabe commented 2 years ago

If you run it ten times does it pass every time? cosa kola run coreos.boot-mirror.luks --multiply=10 should do it.

HuijingHei commented 2 years ago

If you run it ten times does it pass every time? cosa kola run coreos.boot-mirror.luks --multiply=10 should do it.

Yes, all passed after run totally 20 times

dustymabe commented 2 years ago

Hmm. Maybe the issue only happens in our CI environment. Regardless we can remove the snooze and see if it happens again.

dustymabe commented 2 years ago

This is back in 37.20220217.91.0 (build#80).

I was able to reproduce locally pretty easily. #HuijingHei - can you try with this qcow (will need to uncompress first).

jlebon commented 2 years ago

Looking at the error, this patch may help:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
index 18224c36..4f72aacb 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
@@ -56,7 +56,7 @@ udev_trigger_on_label_mismatch() {
     local expected_dev=$1; shift
     local actual_dev
     expected_dev=$(realpath "${expected_dev}")
-    actual_dev=$(realpath "/dev/disk/by-label/$label")
+    actual_dev=$(realpath "/dev/disk/by-label/$label" || :)
     if [ "$actual_dev" != "$expected_dev" ]; then
         echo "Expected /dev/disk/by-label/$label to point to $expected_dev, but points to $actual_dev; triggering udev"
         udevadm trigger --settle "$expected_dev"

Though would be good to narrow down on what's happening there at the udev level and possibly file another RHBZ before working around it. The udevadm trigger was initially added because of issues in RHCOS, not FCOS.

HuijingHei commented 2 years ago

This is back in 37.20220217.91.0 (build#80).

I was able to reproduce locally pretty easily. #HuijingHei - can you try with this qcow (will need to uncompress first).

Test with fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2, failed to entered emergency.target in initramfs 2 times(total 10), but the error logs are different with the first description

Details

[  204.645515] systemd[1]: Timed out waiting for device dev-disk-by\x2dpartlabel-root\x2d2.device - /dev/disk/by-partlabel/root-2.
[  204.695643] ignition[861]: disks: createRaids: op(a): [failed]   waiting for devices [/dev/disk/by-partlabel/boot-1 /dev/disk/by-partlabel/boot-2 /dev/dist
[  204.788764] multipathd[552]: exit (signal)
[  206.313988] systemd[1]: dev-disk-by\x2dpartlabel-root\x2d2.device: Job dev-disk-by\x2dpartlabel-root\x2d2.device/start failed with result 'timeout'.
[  206.361174] ignition[861]: Ignition failed: failed to create raids: failed to wait on raids devs: device unit dev-disk-by\x2dpartlabel-root\x2d2.device tit
[  206.423551] ignition[861]: }
[  206.431028] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE
[  206.459967] systemd[1]: ignition-disks.service: Failed with result 'exit-code'.
HuijingHei commented 2 years ago

@dustymabe is there anyway I can find console logs about coreos.boot-mirror.luks in build#80?

dustymabe commented 2 years ago

It looks like maybe the logs got cleaned up from that run already :(

dustymabe commented 2 years ago

I just ran this locally against the ISO from https://github.com/coreos/fedora-coreos-tracker/issues/1092#issuecomment-1047592867 and it failed:

[   81.712037] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...^M  
         Starting ^[[0;1;39mignition-ostree-t…tion OSTree: Restore Partitions...^M                                          
[   81.733279] ignition-ostree-transposefs[1856]: Restoring rootfs from RAM...^M                                            
[   81.802958] ignition-ostree-transposefs[1860]: realpath: /dev/disk/by-label/root: No such file or directory^M            
[   81.805947] systemd[1]: ignition-ostree-transposefs-restore.service: Main process exited, code=exited, status=1/FAILURE^M
[   81.806792] systemd[1]: ignition-ostree-transposefs-restore.service: Failed with result 'exit-code'.^M                   
[^[[0;1;31mFAILED^[[0m] Failed to start ^[[0;1;39mignition-o…nition OSTree: Restore Partitions.^M                           

So.. indeed looks like your failure is different because mine found those devices:

[   39.664588] ignition[855]: disks: createRaids: op(a): [finished] waiting for devices [/dev/disk/by-partlabel/boot-1 /dev/disk/by-partlabel/boot-2 /dev/disk/by-partlabel/root-1 /dev/disk/by-partlabel/root-2]
HuijingHei commented 2 years ago

I can reproduce with fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2 1 time ( totally 20), the error logs are the same as the description. But I have no idea how to debug this, any suggestions? Thanks!

dustymabe commented 2 years ago

The snooze for coreos.boot-mirror.luks was extended in https://github.com/coreos/fedora-coreos-config/pull/1543.

jlebon commented 2 years ago

@HuijingHei One thing I find helpful when debugging udev-related things is to boot with rd.udev.log_level=debug. That will give you debug output for systemd-udevd and you should be able to see for example at what point the root symlink goes away.

HuijingHei commented 2 years ago

Thanks @jlebon for your suggestion. Append rd.udev.log_level=debug to kernel argument and run coreos.boot-mirror.luks, here is console log, hope it will be helpful console.txt

jlebon commented 2 years ago

@HuijingHei The debug logs are in the journal. If you want it on the console, you can add systemd.log_target=console but because it's a lot of data, the printing may influence timing and make the bug not appear.

Hmm but actually, you're triggering this by running the test and not manually doing cosa run with a similar Butane config, right? In that case, you should have the initrd journal dump in the test output directory. The file is called ignition-virtio-dump.txt.

HuijingHei commented 2 years ago

Thanks @jlebon , I will try tomorrow

HuijingHei commented 2 years ago

Hmm but actually, you're triggering this by running the test and not manually doing cosa run with a similar Butane config, right? In that case, you should have the initrd journal dump in the test output directory. The file is called ignition-virtio-dump.txt.

I append AppendKernelArgs: "rd.udev.log_level=debug", to https://github.com/coreos/coreos-assembler/blob/d5f1623aad6d133b2c7c00e784c04ab6828450c1/mantle/kola/tests/misc/boot-mirror.go#L144, rebuild kola and exec cosa kola run --qemu-image fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2 coreos.boot-mirror.luks --multiply=10, is the step correct? (upload the ignition-virtio-dump.txt)

ignition-virtio-dump.txt

dustymabe commented 2 years ago

just noting here that I see a similar error in the ignition-ostree-transposefs service in the ext.config.root-reprovision.filesystem-only failure we just saw over in branched (build#201).

ext-config-root-reprovision-filesystem-only-console.txt

dustymabe commented 2 years ago

The snooze for coreos.boot-mirror.luks was extended in https://github.com/coreos/fedora-coreos-config/pull/1576.

dustymabe commented 2 years ago

I went back to see when this problem was introduced.

Added:

    openldap-compat-2.6.1-2.fc36.x86_64 

Upgraded:

    NetworkManager 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
    NetworkManager-cloud-setup 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
    NetworkManager-libnm 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
    NetworkManager-team 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
    NetworkManager-tui 1:1.36.0-0.8.fc36.x86_64 → 1:1.36.0-0.9.fc36.x86_64
    btrfs-progs 5.16-2.fc36.x86_64 → 5.16.1-1.fc36.x86_64
    containerd 1.5.8-1.fc36.x86_64 → 1.6.0~rc.2-2.fc36.x86_64
    containers-common 4:1-47.fc36.noarch → 4:1-50.fc36.noarch
    crypto-policies 20220119-1.git50109e7.fc36.noarch → 20220203-1.git112f859.fc36.noarch
    kernel 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
    kernel-core 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
    kernel-modules 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
    krb5-libs 1.19.2-4.fc36.1.x86_64 → 1.19.2-5.fc36.x86_64
    libgcc 12.0.1-0.4.fc36.x86_64 → 12.0.1-0.6.fc36.x86_64
    libgomp 12.0.1-0.4.fc36.x86_64 → 12.0.1-0.6.fc36.x86_64
    libibverbs 38.1-3.fc36.x86_64 → 39.0-1.fc36.x86_64
    libicu 69.1-3.fc36.x86_64 → 69.1-5.fc36.x86_64
    libstdc++ 12.0.1-0.4.fc36.x86_64 → 12.0.1-0.6.fc36.x86_64
    nftables 1:1.0.1-2.fc36.x86_64 → 1:1.0.1-3.fc36.x86_64
    openldap 2.6.1-1.fc36.x86_64 → 2.6.1-2.fc36.x86_64
    rpm-ostree 2022.1-3.fc36.x86_64 → 2022.2-1.fc36.x86_64
    rpm-ostree-libs 2022.1-3.fc36.x86_64 → 2022.2-1.fc36.x86_64
    squashfs-tools 4.5-6.20220202git11c9591.fc36.x86_64 → 4.5-7.20220204git8a9d02e.fc36.x86_64 

Here are the fedora-coreos-config changes between those two: f4cf5f9..b1fa0e2

dustymabe commented 2 years ago

Doing a local rawhide build with all current (as of 20220309) packages except for reverting the kernel to kernel-5.17.0-0.rc2.83.fc36 makes it so that I can't reproduce the issue. The root cause of this looks to be something in the kernel transition:

kernel 5.17.0-0.rc2.83.fc36.x86_64 → 5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
dustymabe commented 2 years ago

IIUC the set of changes between those two can be seen by comparing v5.17-rc2..dcb85f85fa6f in the kernel-ark repo.

miabbott commented 2 years ago

Skimmed that diff, but nothing jumped out related to storage devices, mounts, udev, etc. I think we'd need to bisect.

$ git log --pretty=oneline --no-merges v5.17-rc2..dcb85f85fa6f | wc -l
143
``` dcb85f85fa6f gcc-plugins/stackleak: Use noinstr in favor of notrace 87563a043cef ax25: fix reference count leaks of ax25_dev 80d4609008e6 net: stmmac: ensure PTP time register reads are consistent 1f2cfdd349b7 printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin() 67d6212afda2 Revert "module, async: async_synchronize_full() on module init iff async is used" 34a081761e4e net: ipa: request IPA register values be retained ac62a0174d62 dt-bindings: net: qcom,ipa: add optional qcom,qmp property 2bdfd2825c96 cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning 7f3bdbc3f131 tools/resolve_btfids: Do not print any commands when building silently c36c04c2e132 Revert "mm/gup: small refactoring: simplify try_grab_page()" b293dcc473d2 bpf: Use VM_MAP instead of VM_ALLOC for ringbuf 4a81f6da9cb2 net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work b67985be4009 tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data() 81eb8b0b1878 net: sparx5: do not refer to skb after passing it on 186edf7e368c selinux: fix double free of cond_list on error paths c86d86131ab7 Partially revert "net/smc: Add netlink net namespace support" ad5185735f7d net/mlx5e: Avoid field-overflowing memcpy() 6d5c900eb641 net/mlx5e: Use struct_group() for memcpy() region 5b209d1a22af net/mlx5e: Avoid implicit modify hdr for decap drop rule de47db0cf7f4 net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic 5352859b3bfa net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic 736dfe4e68b8 net/mlx5e: Don't treat small ceil values as unlimited in HTB offload d8e5883d694b net/mlx5: E-Switch, Fix uninitialized variable modact ec41332e02bd net/mlx5e: Fix handling of wrong devices during bond netevent 7957837b816f net/mlx5e: Fix broken SKB allocation in HW-GRO b8d91145ed7c net/mlx5e: Fix wrong calculation of header index in HW_GRO 880b51769190 net/mlx5: Bridge, Fix devlink deadlock on net namespace deletion 55b2ca702cfa net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE 5623ef8a1188 net/mlx5e: TC, Reject rules with forward and drop actions 3c5193a87b0f net/mlx5: Use del_timer_sync in fw reset flow of halting poll 4a08a131351e net/mlx5e: Fix module EEPROM query a2446bc77a16 net/mlx5e: TC, Reject rules with drop and modify hdr action 350d9a823734 net/mlx5: Bridge, ensure dev_name is null-terminated 04f8c12f031f net/mlx5: Bridge, take rtnl lock in init error handler d0cfa548dbde net: macsec: Verify that send_sci is on when setting Tx sci explicitly 63e4b45c82ed ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback 479f5547239d tcp: fix mem under-charging with zerocopy sendmsg() e42e70ad6ae2 af_packet: fix data-race in packet_setsockopt / packet_setsockopt c6f6f2444bdb rtnetlink: make sure to refresh master_dev/m_ops in __rtnl_newlink() 04c2a47ffb13 net: sched: fix use-after-free in tc_new_tfilter() 6dde7acdb3dc ethernet: smc911x: fix indentation in get/set EEPROM b7892f7d5cb2 tools: Ignore errors from `which' when searching a GCC toolchain 1a2beb3d5a0b mailmap: update Christian Brauner's email address 24f600856418 cgroup-v1: Require capabilities to set release_agent cad014b7b5a6 e1000e: Handshake with CSME starts from ADL platforms 68defd528f94 e1000e: Separate ADP board type from TGP 836f35f79153 platform/x86: thinkpad_acpi: Fix incorrect use of platform profile on AMD platforms 881cc731df6a net: phy: Fix qca8081 with speeds lower than 2.5Gb/s ee12595147ac fanotify: Fix stale file descriptor in copy_event_to_user() 2161ba070999 MIPS: KVM: fix vz.c kernel-doc notation 50317b636e71 MIPS: octeon: Fix missed PTR->PTR_WD conversion 9cef24c8b76c net: macsec: Fix offload support for NETDEV_UNREGISTER event 7af037c39b60 net: stmmac: dump gmac4 DMA registers correctly 4223f8651287 net: dsa: mt7530: make NET_DSA_MT7530 select MEDIATEK_GE_PHY 6533e558c650 i40e: Fix reset path while removing the driver 3d2504663c41 i40e: Fix reset bw limit when DCB enabled with 1 TC ec049891b2dc kselftest: Fix vdso_test_abi return status f83a96e5f033 spi: mediatek: Avoid NULL pointer crash in interrupt baf927a833ca pinctrl: microchip-sgpio: Fix support for regmap 341adeec9ada net/smc: Forward wakeup to smc socket waitqueue after fallback 3a5286955bf5 pinctrl: bcm63xx: fix unmet dependency on REGMAP for GPIO_REGMAP 5297c693d8c8 pinctrl: bcm2835: Fix a few error paths 6449520391df net: stmmac: properly handle with runtime pm in stmmac_dvr_remove() 1f84a9450d75 gve: fix the wrong AdminQ buffer queue index check d01ffb9eee4a ax25: add refcount in ax25_dev to avoid UAF bugs 4e0f718daf97 ax25: improve the incomplete fix to avoid UAF and NPD bugs 500c77eed0fe pinctrl: zynqmp: Revert "Unify pin naming" 928d6fe996f6 net: stmmac: dwmac-visconti: No change to ETHER_CLOCK_SEL for unexpected speed request. ab451ea952fe nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client. 60b1e97140a4 spi: dt-bindings: Fix 'reg' child node schema 2cbd27267ffe spi: bcm-qspi: check for valid cs before applying chip select 7674b7b559b6 net: amd-xgbe: ensure to reset the tx_timer_active flag 5aac9108a180 net: amd-xgbe: Fix skb data length underflow dae1d8ac3189 selftests: skip mincore.check_file_mmap when fs lacks needed support ac9e0a250bb1 selftests: openat2: Skip testcases that fail with EOPNOTSUPP ea3396725aa1 selftests: openat2: Add missing dependency in Makefile e051cdf655fa selftests: openat2: Print also errno in failure messages b9199181a9ef selftests: futex: Use variable MAKE instead of make 908a26e139e8 selftests/exec: Remove pipe from TEST_GEN_FILES 01dabed20573 selftests/zram: Adapt the situation that /dev/zram0 is being used d18da7ec3719 selftests/zram01.sh: Fix compression ratio calculation fc4eb486a59d selftests/zram: Skip max_comp_streams interface on newer kernel dede34b2c1a8 docs/kselftest: clarify running mainline tests on stables b07f41373254 netfilter: nf_tables: remove assignment with no effect in chain blob builder d19a7af73b5e lockd: fix failure to cleanup client locks 5d8a8b324ff4 MAINTAINERS: Remove Harry Morris bouncing address 79c37ca73a6e net: ieee802154: Return meaningful error codes from the netlink helpers 621b24b09eb6 net: ieee802154: ca8210: Stop leaking skb's e5ce576d45bf net: ieee802154: at86rf230: Stop leaking skb's d753c4004820 net: ieee802154: mcr20a: Fix lifs/sifs periods 1293fccc9e89 net: ieee802154: hwsim: Ensure proper channel selection at probe time eda0cf1202ac selftests: nft_concat_range: add test for reload with no element add/del f459bfd4b979 netfilter: nft_byteorder: track register operations aeac4554eb54 netfilter: nft_reject_bridge: Fix for missing reply from prerouting aad51ca71ad8 selftests: netfilter: check stateless nat udp checksum fixup c858620d2ae3 selftests: netfilter: reduce zone stress test running time 34243b9ec856 netfilter: nft_ct: fix use after free when attaching zone template 7355bfe0e0cc netfilter: Remove flowtable relics c80d401c52a2 cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask() 37c2c83ca4f1 spi: uniphier: fix reference count leak in uniphier_spi_probe() e937440f7fc4 spi: meson-spicc: add IRQ check in meson_spicc_probe 23e3404de1ae spi: uniphier: Fix a bug that doesn't point to private data correctly 90cafce461de spi: change clk_disable_unprepare to clk_unprepare e2bcbd7769ee tools headers UAPI: remove stale lirc.h 0e3135d3bfa5 bpf: Fix possible race in inc_misses_counter 92d25637a3a4 kselftest: signal all child processes 40d70d4d6097 selftests: cpufreq: Write test output to stdout as well f034cc1301e7 selftests: rtc: Increase test timeout so that all tests run 235528072f28 kunit: tool: Import missing importlib.abc f26d04331360 audit: improve audit queue handling when "audit=1" on cmdline 63ee956f69d8 bpf: Fix renaming task_getsecid_subj->current_getsecid_subj. e12963c45326 pinctrl: intel: Fix a glitch when updating IRQ flags on a preconfigured line e986f0e602f1 pinctrl: intel: fix unexpected interrupt 77311237eaff pinctrl: Place correctly CONFIG_PINCTRL_ST in the Makefile bdac3bbd0dc6 spi: spi-rockchip: Add rk3568-spi compatible b8fb0d9b4766 platform/x86: amd-pmc: Correct usage of SMU version f8c28b93d262 platform/x86: asus-tf103c-dock: Make 2 global structs static f7086daab3b5 platform/x86: amd-pmc: Make amd_pmc_stb_debugfs_fops static 17da2d5f9369 platform/x86: ISST: Fix possible circular locking dependency detected a29012ab2316 platform/x86: intel_crystal_cove_charger: Fix IRQ masking / unmasking b288420e773f platform/x86: thinkpad_acpi: Add quirk for ThinkPads without a fan 512eb73cfd12 platform/x86: touchscreen_dmi: Add info for the RWC NANOTE P8 AY07J 2-in-1 c197e969e308 platform/surface: Reinstate platform dependency 17f6736a020e platform/x86: x86-android-tablets: Trivial typo fix for MODULE_AUTHOR 5de2ffd5acd3 platform/x86: x86-android-tablets: Fix the buttons on CZC P10T tablet 442bf564eb0c platform/x86: x86-android-tablets: Constify the gpiod_lookup_tables arrays 84c2dcdd475f platform/x86: x86-android-tablets: Add an init() callback to struct x86_dev_info 4ce2a32d4026 platform/x86: x86-android-tablets: Add support for disabling ACPI _AEI handlers ddec7abd4d93 platform/x86: x86-android-tablets: Correct crystal_cove_charger module name 1fd6bb5b47a6 pinctrl: sunxi: Fix H616 I2S3 pin data aa28514592d5 pinctrl: cherryview: Trigger hwirq0 for interrupt-lines without a mapping 25d2e41cf59b pinctrl: thunderbay: rework loops looking for groups names 30cc53897470 pinctrl: thunderbay: comment process of building functions a bit 5298d4bfe80f unicode: clean up the Kconfig symbol confusion 9df15d842a0f spi: stm32: make SPI_MASTER_MUST_TX flags only specific to STM32F4 3cefddb72f80 spi: stm32: remove inexistant variables in struct stm32_spi_cfg comment e4d63473d311 spi: stm32-qspi: Update spi registering b4c18c18ebf7 regulator: MAX20086: add gpio/consumer.h 6e7f90d163af lockd: fix server crash on reboot of client holding lock 94fd19752b28 ovl: don't fail copy up if no fileattr support on upper 4ee7e4a6c9b2 ovl: fix NULL pointer dereference in copy up warning d068eebbd482 cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy 879cf8006475 regulator: max20086: fix error code in max20086_parse_regulators_dt() ```
HuijingHei commented 2 years ago

I do not know why I can not reproduce locally, not sure my steps are correct.

cosa init -b rawhide https://github.com/coreos/fedora-coreos-config
cosa fetch
cosa build
kola run coreos.boot-mirror.luks --multiply=10
dustymabe commented 2 years ago

I do not know why I can not reproduce locally, not sure my steps are correct.

Most likely it's a race condition and my system is either faster or slower than yours :)

dustymabe commented 2 years ago

I'm no longer seeing this issue with latest kernels in rawhide. It appears it stopped showing up in the following kernel transition:

kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
dustymabe commented 2 years ago

ok more information here.. I was mislead in my previous understanding that the issue was fixed. It turns out that kernel NVR versions that include ${date}git${hash} have debug turned on, whereas the kernels that don't do not. For example:

have debug turned ON, where:

do NOT.

This problem actually only shows itself when debug is turned on. I was able to reproduce this issue on testing-devel based on Fedora Linux 35 when I used the debug kernel:

$ git diff
diff --git a/manifests/bootable-rpm-ostree.yaml b/manifests/bootable-rpm-ostree.yaml
index 784acd4..d23d3a6 100644
--- a/manifests/bootable-rpm-ostree.yaml
+++ b/manifests/bootable-rpm-ostree.yaml
@@ -7,7 +7,7 @@
 packages:
  # Kernel + systemd.  Note we explicitly specify kernel-{core,modules}
  # because otherwise depsolving could bring in kernel-debug.
- - kernel kernel-core kernel-modules systemd
+ - kernel-debug kernel-debug-core kernel-debug-modules systemd
  # linux-firmware now a recommends so let's explicitly include it
  # https://gitlab.com/cki-project/kernel-ark/-/commit/32271d0cd9bd52d386eb35497c4876a8f041f70b
  # https://src.fedoraproject.org/rpms/kernel/c/f55c3e9ed8605ff28cb9a922efbab1055947e213?branch=rawhide

NOTE: For now I needed this patch to rpm-ostree to do the build.

The problem actually goes back much farther. I chased it for a while but then gave up. Here are the debug kernels I tried and confirmed the issue occurs on (all built on top of testing-devel as of today):

So this problem has been around over a year in the kernel. I imagine we only started hitting it this year because some of our code introduced some path which made the race easier to hit.

Rather than waste much more time on this issue (since it only happens with debug kernels) we should consider ways to make the test artificially not fail if we're on a debug kernel.

dustymabe commented 2 years ago

I decided for now to punt on this issue with a workaround since I've spent way too much time on it and the fact that it only presents itself with a debug kernel, which we'll never ship to our prod streams.

Someone else is welcome to pick up and investigate further and open bugs if they find enough information to open a BZ, but I need to move on.