coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

[rawhide]: kola-openstack jobs timeout with kernel errors on aarch64 #1612

Open marmijo opened 7 months ago

marmijo commented 7 months ago

Kola tests on [rawhide][aarch64] are timing out in the kola-openstack job with one of the following two errors. Eventually, the entire job times out and fails.

harness.go:106: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for 677b6b2e-668a-485f-a9d3-7f4017d074b5.
[2023-11-08T16:16:19.406Z]         harness.go:1737: Found kernel panic (stack-protector: Kernel stack is corrupted in: pte_offset_map_nolock+0x9c/0xa8) on machine 677b6b2e-668a-485f-a9d3-7f4017d074b5 console
[2023-11-08T16:16:19.406Z]         harness.go:1737: Found kernel oops on machine 677b6b2e-668a-485f-a9d3-7f4017d074b5 console
[2023-11-08T16:16:19.406Z]         harness.go:1737: Found systemd generator failure (/usr/lib/systemd/system-generators/coreos-diskful-generator) on machine 677b6b2e-668a-485f-a9d3-7f4017d074b5 console

or

[2023-11-14T17:15:57.017Z] --- FAIL: podman.network-single (674.13s)
[2023-11-14T17:15:57.017Z]         harness.go:1819: mach.Start() failed: machine "4312f03e-3a88-4d91-8bdf-e2922914c5c9" failed to start: ssh journalctl failed: time limit exceeded
[2023-11-14T17:15:57.017Z]         harness.go:1737: Found kernel oops on machine 4312f03e-3a88-4d91-8bdf-e2922914c5c9 console
[2023-11-14T17:15:57.017Z]         harness.go:1737: Found segfault on machine 4312f03e-3a88-4d91-8bdf-e2922914c5c9 console
[2023-11-14T17:16:53.165Z] 2023-11-14T17:16:44Z kola: Test timed out. Adding as candidate for rerun success: ostree.remote

This first started happening with FCOS version 40.20231103.91.0, which included the following package upgrades:

- NetworkManager 1:1.44.2-1.fc40.aarch64 → 1:1.44.2-2.fc40.aarch64
- NetworkManager-cloud-setup 1:1.44.2-1.fc40.aarch64 → 1:1.44.2-2.fc40.aarch64
- NetworkManager-libnm 1:1.44.2-1.fc40.aarch64 → 1:1.44.2-2.fc40.aarch64
- NetworkManager-team 1:1.44.2-1.fc40.aarch64 → 1:1.44.2-2.fc40.aarch64
- NetworkManager-tui 1:1.44.2-1.fc40.aarch64 → 1:1.44.2-2.fc40.aarch64
- bootupd 0.2.12-4.fc40.aarch64 → 0.2.13-2.fc40.aarch64
- kernel 6.7.0-0.rc0.20231101git8bc9e6515183.3.fc40.aarch64 → 6.7.0-0.rc0.20231102git21e80f3841c0.4.fc40.aarch64
- kernel-core 6.7.0-0.rc0.20231101git8bc9e6515183.3.fc40.aarch64 → 6.7.0-0.rc0.20231102git21e80f3841c0.4.fc40.aarch64
- kernel-modules 6.7.0-0.rc0.20231101git8bc9e6515183.3.fc40.aarch64 → 6.7.0-0.rc0.20231102git21e80f3841c0.4.fc40.aarch64
- kernel-modules-core 6.7.0-0.rc0.20231101git8bc9e6515183.3.fc40.aarch64 → 6.7.0-0.rc0.20231102git21e80f3841c0.4.fc40.aarch64
- libuv 1:1.46.0-4.fc40.aarch64 → 1:1.46.0-5.fc40.aarch64
- sqlite-libs 3.43.2-1.fc40.aarch64 → 3.44.0-1.fc40.aarch64
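
To narrow down the suspects in a package diff like the one above, the list can be parsed and filtered for kernel packages. This is only a minimal sketch: the names `UPGRADE_RE`, `parse_upgrades`, and `SAMPLE` are hypothetical, and the line format is assumed to match the listing as pasted here (`- name old → new`), not any official tool output.

```python
import re

# Assumed line format, taken from the listing above:
#   "- <name> <old-version> → <new-version>"
UPGRADE_RE = re.compile(r"^-\s+(?P<name>\S+)\s+(?P<old>\S+)\s+→\s+(?P<new>\S+)$")


def parse_upgrades(text):
    """Return (name, old, new) tuples for every upgrade line found in text."""
    upgrades = []
    for line in text.splitlines():
        match = UPGRADE_RE.match(line.strip())
        if match:
            upgrades.append((match["name"], match["old"], match["new"]))
    return upgrades


# A few lines copied from the diff above, used as sample input.
SAMPLE = """\
- bootupd 0.2.12-4.fc40.aarch64 → 0.2.13-2.fc40.aarch64
- kernel 6.7.0-0.rc0.20231101git8bc9e6515183.3.fc40.aarch64 → 6.7.0-0.rc0.20231102git21e80f3841c0.4.fc40.aarch64
- sqlite-libs 3.43.2-1.fc40.aarch64 → 3.44.0-1.fc40.aarch64
"""

# Keep only packages whose name starts with "kernel".
kernel_changes = [u for u in parse_upgrades(SAMPLE) if u[0].startswith("kernel")]

for name, old, new in kernel_changes:
    print(f"{name}: {old} -> {new}")
```

Given the panic in the traces, the kernel snapshot bump from 20231101git8bc9e6515183 to 20231102git21e80f3841c0 is the change this filter surfaces.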

This was seen in the console log of a failing test:

[    8.646803] note: ignition-genera[234] exited with irqs disabled
[    8.646811] Internal error: Oops: 0000000096000004 [#5] SMP
[    8.646822] Modules linked in:
[    8.653367] /usr/lib/dracut-lib.sh: line 20:   233 Segmentation fault      mkdir -p -m 0755 /run/log
[    8.653435] BUG: Bad rss-counter state mm:000000006b1073d6 type:MM_ANONPAGES val:1
[    8.653630] /usr/lib/systemd/system-generators/ignition-generator: line 32:   234 Segmentation fault      mkdir -p "${requires_dir}"
[    8.658142]  qemu_fw_cfg
[    8.658157] CPU: 2 PID: 253 Comm: coreos-multipat Tainted: G      D           -------  ---  6.7.0-0.rc0.20231106gitd2f51b3516da.9.fc40.aarch64 #1
[    8.658166] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[    8.658171] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    8.670712] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pte_offset_map_nolock+0x9c/0xa8
[    8.670720] SMP: stopping secondary CPUs
[    8.677869] pc : handle_mm_fault+0xc0/0x2c8
[    8.677878] lr : handle_mm_fault+0xc0/0x2c8
[    8.677882] sp : ffff80008060bc70
[    8.677885] x29: ffff80008060bc70 x28: 0000fffffffff000 x27: ffff0000c9a78000
[    8.677893] x26: 0000fffffffff000 x25: 0000000000000000 x24: fffffc0007e04e80
[    8.677900] x23: ffff0001f813a000 x22: 0000fffffffffe83 x21: 0000aaaafd44aa90
[    8.677907] x20: 00000000000000a7 x19: ffff0001f813ae83 x18: ffff80008060b878
[    8.677913] x17: 0000000000000000 x16: ffffcb574bc45650 x15: 0000aaaafd435ac0
[    8.677921] x14: 2e3063722e302d30 x13: 0000000093e5a000 x12: 0000000000000000
[    8.677928] x11: 0000000000000000 x10: 0000000093e5b000 x9 : ffffcb574aa63340
[    8.677934] x8 : fefefefefefefeff x7 : 0000000000000000 x6 : ffff0000c9a78000
[    8.677941] x5 : 0000000000000008 x4 : 000000000001fff8 x3 : 0000aaaafd44aa98
[    8.677948] x2 : 0000000000000000 x1 : ffff0000c9a78000 x0 : 0000000000000001
[    8.677956] Call trace:
[    8.677959]  handle_mm_fault+0xc0/0x2c8
[    8.677965]  do_execveat_common.isra.0+0x148/0x240
[    8.677974]  __arm64_sys_execve+0x48/0x68
[    8.677978]  invoke_syscall+0x78/0x100
[    8.677983]  el0_svc_common.constprop.0+0x48/0xf0
[    8.677987]  do_el0_svc+0x24/0x38
[    8.677990]  el0_svc+0x3c/0x138
[    8.677996]  el0t_64_sync_handler+0x120/0x130
[    8.678001]  el0t_64_sync+0x194/0x198
[    8.678007] Code: f9400820 b4000bc0 d503201f 97f7e8a1 (f94202c0) 
[    8.678019] ---[ end trace 0000000000000000 ]---
[    8.678021] note: coreos-multipat[253] exited with irqs disabled
[    8.678077] Kernel Offset: 0x4b56ca6f0000 from 0xffff800080000000
[    8.870784] PHYS_OFFSET: 0x40000000
[    8.874258] CPU features: 0x0,01c00000,14020020,21005203
[    8.879436] Memory Limit: none
[    8.882513] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pte_offset_map_nolock+0x9c/0xa8 ]---

full console.txt for coreos.ignition.instantiated.enable-unit
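
For triaging a downloaded console.txt, a rough scan for the failure signatures seen in the log above might look like the following. This is a sketch under stated assumptions: the patterns in `SIGNATURES` are hypothetical and modelled only on the messages in this log, they are not taken from kola's own console checks, and `scan_console` is an illustrative helper name.

```python
import re

# Hypothetical patterns modelled on the console messages above.
# These are NOT kola's actual checks.
SIGNATURES = {
    "kernel panic": re.compile(r"Kernel panic - not syncing: (.*)"),
    "kernel oops": re.compile(r"Internal error: Oops: (.*)"),
    "segfault": re.compile(r"Segmentation fault\s+(.*)"),
}


def scan_console(text):
    """Return (label, detail) pairs for each signature found, in log order."""
    findings = []
    for line in text.splitlines():
        for label, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                findings.append((label, match.group(1).strip()))
    return findings


# Three lines copied from the console log above, used as sample input.
SAMPLE_LOG = """\
[    8.646811] Internal error: Oops: 0000000096000004 [#5] SMP
[    8.653367] /usr/lib/dracut-lib.sh: line 20:   233 Segmentation fault      mkdir -p -m 0755 /run/log
[    8.670712] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pte_offset_map_nolock+0x9c/0xa8
"""

for label, detail in scan_console(SAMPLE_LOG):
    print(f"{label}: {detail}")
```

To run it against a real file, read the file contents and pass them to `scan_console`. Note that the closing `---[ end Kernel panic ... ]---` line would also match the panic pattern, so a full log reports that panic twice.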

dustymabe commented 7 months ago

Did any OpenStack aarch64 tests pass in the 40.20231108.91.0 run?

marmijo commented 7 months ago

No, everything either failed or timed out.