Open koo5 opened 2 years ago
Linux r6 5.13.0-21-generic #21-Ubuntu SMP Tue Oct 19 08:59:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Could you please validate latest 5.13.x ? there was a similar problem which was fixed/backported in there.
thanks, but not sure where to find a 5.13 newer than 5.13.21? latest 5.13.19 here: https://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D i tried v5.16-rc2 but there's a different crash right after boot: https://gist.github.com/koo5/4c6558a860475f8c4a1cac2f77a5a570
So you mean 5.13.19 also has this problem??
yes, the same as 5.13.21.
I seem to have the same issue:
$ uname -a Linux eitchtower 5.16.4-051604-generic #202201310852-Ubuntu SMP PREEMPT Mon Jan 31 09:02:19 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Feb 1 23:13:13 mypc kernel: [ 8.091191] RIP: 0010:rcu_note_context_switch+0x462/0x4a0
Feb 1 23:13:13 mypc kernel: [ 8.091191] Code: 08 48 89 86 40 08 00 00 4c 89 86 48 08 00 00 49 89 10 e9 84 fe ff ff 48 c7 c7 e0 29 3c 9b c6 05 de 73 fe 01 01 e8 81 0e ab 00 <0f> 0b e9 ec fb ff ff 0f 0b e9 7e fd ff ff 0f 0b e9 60 fd ff ff 0f
Feb 1 23:13:13 mypc kernel: [ 8.091192] RSP: 0018:ffffad63c25bee18 EFLAGS: 00010086
Feb 1 23:13:13 mypc kernel: [ 8.091193] RAX: 0000000000000000 RBX: ffff97006ea31f00 RCX: 0000000000000027
Feb 1 23:13:13 mypc kernel: [ 8.091194] RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff97006ea20988
Feb 1 23:13:13 mypc kernel: [ 8.091195] RBP: ffffad63c25bee38 R08: ffff97006ea20980 R09: ffffad63c25bec00
Feb 1 23:13:13 mypc kernel: [ 8.091196] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091196] R13: ffff96f18ce74c80 R14: ffff96f18ce74c80 R15: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091197] FS: 00007f8425114140(0000) GS:ffff97006ea00000(0000) knlGS:0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091198] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 1 23:13:13 mypc kernel: [ 8.091199] CR2: 00007fa0f0df9710 CR3: 0000000109a5c000 CR4: 0000000000750ef0
Feb 1 23:13:13 mypc kernel: [ 8.091199] PKRU: 55555554
Feb 1 23:13:13 mypc kernel: [ 8.091200] Call Trace:
Feb 1 23:13:13 mypc kernel: [ 8.091200] <TASK>
Feb 1 23:13:13 mypc kernel: [ 8.091201] __schedule+0xbb/0x7f0
Feb 1 23:13:13 mypc kernel: [ 8.091203] ? ttwu_do_activate+0xb6/0x510
Feb 1 23:13:13 mypc kernel: [ 8.091206] schedule+0x55/0xd0
Feb 1 23:13:13 mypc kernel: [ 8.091207] schedule_timeout+0x200/0x290
Feb 1 23:13:13 mypc kernel: [ 8.091208] wait_for_completion+0x8b/0xf0
Feb 1 23:13:13 mypc kernel: [ 8.091209] virt_efi_query_variable_info+0x146/0x160
Feb 1 23:13:13 mypc kernel: [ 8.091212] efi_query_variable_store+0x53/0x190
Feb 1 23:13:13 mypc kernel: [ 8.091214] ? vsnprintf+0x71/0x4e0
Feb 1 23:13:13 mypc kernel: [ 8.091216] efivar_entry_set_safe+0xbf/0x210
Feb 1 23:13:13 mypc kernel: [ 8.091219] efi_pstore_write+0x125/0x1a0 [efi_pstore]
Feb 1 23:13:13 mypc kernel: [ 8.091221] pstore_dump+0x11c/0x300
Feb 1 23:13:13 mypc kernel: [ 8.091223] ? vprintk_emit+0xf9/0x230
Feb 1 23:13:13 mypc kernel: [ 8.091225] kmsg_dump+0x4e/0x70
Feb 1 23:13:13 mypc kernel: [ 8.091227] oops_exit+0x27/0x30
Feb 1 23:13:13 mypc kernel: [ 8.091228] oops_end+0x68/0xd0
Feb 1 23:13:13 mypc kernel: [ 8.091230] die+0x43/0x50
Feb 1 23:13:13 mypc kernel: [ 8.091231] do_trap+0xd7/0x100
Feb 1 23:13:13 mypc kernel: [ 8.091232] do_error_trap+0x77/0xa0
Feb 1 23:13:13 mypc kernel: [ 8.091233] ? aq_nic_start+0x363/0x370 [atlantic]
Feb 1 23:13:13 mypc kernel: [ 8.091237] exc_invalid_op+0x53/0x70
Feb 1 23:13:13 mypc kernel: [ 8.091238] ? aq_nic_start+0x363/0x370 [atlantic]
Feb 1 23:13:13 mypc kernel: [ 8.091241] asm_exc_invalid_op+0x12/0x20
Feb 1 23:13:13 mypc kernel: [ 8.091243] RIP: 0010:aq_nic_start+0x363/0x370 [atlantic]
Feb 1 23:13:13 mypc kernel: [ 8.091246] Code: cc 50 c0 e8 4f fa a1 d9 41 89 c4 85 c0 0f 88 57 fd ff ff 41 8b 8d 94 01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5
Feb 1 23:13:13 mypc kernel: [ 8.091247] RSP: 0018:ffffad63c25bf468 EFLAGS: 00010202
Feb 1 23:13:13 mypc kernel: [ 8.091247] RAX: 0000000000000008 RBX: 0000000000000008 RCX: ffff96f1a15b6000
Feb 1 23:13:13 mypc kernel: [ 8.091248] RDX: 0000000000000010 RSI: ffffad63c18e5be8 RDI: ffff96f19fbbe428
Feb 1 23:13:13 mypc kernel: [ 8.091249] RBP: ffffad63c25bf480 R08: 0000000020000800 R09: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091249] R10: 0000000000000000 R11: 000000000000003f R12: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091250] R13: ffff96f1a15b6980 R14: 0000000000000001 R15: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091251] ? aq_nic_start+0x9c/0x370 [atlantic]
Feb 1 23:13:13 mypc kernel: [ 8.091254] aq_ndev_open+0x49/0x70 [atlantic]
Feb 1 23:13:13 mypc kernel: [ 8.091257] __dev_open+0xef/0x180
Feb 1 23:13:13 mypc kernel: [ 8.091259] __dev_change_flags+0x188/0x1e0
Feb 1 23:13:13 mypc kernel: [ 8.091260] dev_change_flags+0x26/0x60
Feb 1 23:13:13 mypc kernel: [ 8.091261] do_setlink+0x908/0xc40
Feb 1 23:13:13 mypc kernel: [ 8.091262] ? __nla_validate_parse+0x4c/0x1a0
Feb 1 23:13:13 mypc kernel: [ 8.091264] __rtnl_newlink+0x597/0xa10
Feb 1 23:13:13 mypc kernel: [ 8.091265] ? __nla_put+0x20/0x30
Feb 1 23:13:13 mypc kernel: [ 8.091266] ? pskb_expand_head+0x75/0x310
Feb 1 23:13:13 mypc kernel: [ 8.091268] ? security_sock_rcv_skb+0x2f/0x50
Feb 1 23:13:13 mypc kernel: [ 8.091270] ? netlink_deliver_tap+0x49/0x240
Feb 1 23:13:13 mypc kernel: [ 8.091271] ? sk_filter_trim_cap+0xbf/0x220
Feb 1 23:13:13 mypc kernel: [ 8.091272] ? skb_queue_tail+0x48/0x50
Feb 1 23:13:13 mypc kernel: [ 8.091273] ? sock_def_readable+0x50/0x80
Feb 1 23:13:13 mypc kernel: [ 8.091275] ? __netlink_sendskb+0x42/0x50
Feb 1 23:13:13 mypc kernel: [ 8.091276] ? netlink_unicast+0x213/0x250
Feb 1 23:13:13 mypc kernel: [ 8.091277] ? rtnl_getlink+0x376/0x400
Feb 1 23:13:13 mypc kernel: [ 8.091280] ? security_capable+0x3d/0x60
Feb 1 23:13:13 mypc kernel: [ 8.091281] ? kmem_cache_alloc_trace+0x193/0x470
Feb 1 23:13:13 mypc kernel: [ 8.091283] rtnl_newlink+0x49/0x70
Feb 1 23:13:13 mypc kernel: [ 8.091284] rtnetlink_rcv_msg+0x160/0x3b0
Feb 1 23:13:13 mypc kernel: [ 8.091286] ? __mark_inode_dirty+0x105/0x350
Feb 1 23:13:13 mypc kernel: [ 8.091288] ? rtnl_calcit.isra.0+0x130/0x130
Feb 1 23:13:13 mypc kernel: [ 8.091290] netlink_rcv_skb+0x55/0x100
Feb 1 23:13:13 mypc kernel: [ 8.091291] rtnetlink_rcv+0x15/0x20
Feb 1 23:13:13 mypc kernel: [ 8.091292] netlink_unicast+0x1a8/0x250
Feb 1 23:13:13 mypc kernel: [ 8.091294] netlink_sendmsg+0x245/0x490
Feb 1 23:13:13 mypc kernel: [ 8.091295] ? _copy_from_user+0x2b/0x60
Feb 1 23:13:13 mypc kernel: [ 8.091296] sock_sendmsg+0x65/0x70
Feb 1 23:13:13 mypc kernel: [ 8.091297] ____sys_sendmsg+0x218/0x290
Feb 1 23:13:13 mypc kernel: [ 8.091298] ? copy_msghdr_from_user+0x5c/0x90
Feb 1 23:13:13 mypc kernel: [ 8.091299] ? addrconf_sysctl_forward+0x111/0x280
Feb 1 23:13:13 mypc kernel: [ 8.091301] ? _copy_from_iter+0x92/0x590
Feb 1 23:13:13 mypc kernel: [ 8.091303] ___sys_sendmsg+0x81/0xc0
Feb 1 23:13:13 mypc kernel: [ 8.091304] ? mntput_no_expire+0x53/0x260
Feb 1 23:13:13 mypc kernel: [ 8.091305] ? __call_rcu+0xac/0x270
Feb 1 23:13:13 mypc kernel: [ 8.091306] ? call_rcu+0xe/0x10
Feb 1 23:13:13 mypc kernel: [ 8.091308] ? __fget_light+0xc5/0x100
Feb 1 23:13:13 mypc kernel: [ 8.091310] __sys_sendmsg+0x62/0xb0
Feb 1 23:13:13 mypc kernel: [ 8.091311] __x64_sys_sendmsg+0x1f/0x30
Feb 1 23:13:13 mypc kernel: [ 8.091312] do_syscall_64+0x5c/0xc0
Feb 1 23:13:13 mypc kernel: [ 8.091313] ? do_syscall_64+0x69/0xc0
Feb 1 23:13:13 mypc kernel: [ 8.091314] ? asm_exc_page_fault+0x8/0x30
Feb 1 23:13:13 mypc kernel: [ 8.091315] entry_SYSCALL_64_after_hwframe+0x44/0xae
Feb 1 23:13:13 mypc kernel: [ 8.091317] RIP: 0033:0x7f842612a3fd
Feb 1 23:13:13 mypc kernel: [ 8.091318] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 fa a4 f6 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 3e a5 f6 ff 48
Feb 1 23:13:13 mypc kernel: [ 8.091318] RSP: 002b:00007ffe82ca9510 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
Feb 1 23:13:13 mypc kernel: [ 8.091319] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f842612a3fd
Feb 1 23:13:13 mypc kernel: [ 8.091320] RDX: 0000000000000000 RSI: 00007ffe82ca9550 RDI: 000000000000000c
Feb 1 23:13:13 mypc kernel: [ 8.091320] RBP: 000055bf626d4030 R08: 0000000000000000 R09: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091321] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091322] R13: 00007ffe82ca96a0 R14: 00007ffe82ca969c R15: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.091323] </TASK>
Feb 1 23:13:13 mypc kernel: [ 8.091323] ---[ end trace 96a2807216b1f271 ]---
Feb 1 23:13:13 mypc kernel: [ 8.239615] RIP: 0010:aq_nic_start+0x363/0x370 [atlantic]
Feb 1 23:13:13 mypc kernel: [ 8.239622] Code: cc 50 c0 e8 4f fa a1 d9 41 89 c4 85 c0 0f 88 57 fd ff ff 41 8b 8d 94 01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5
Feb 1 23:13:13 mypc kernel: [ 8.239624] RSP: 0018:ffffad63c25bf468 EFLAGS: 00010202
Feb 1 23:13:13 mypc kernel: [ 8.239625] RAX: 0000000000000008 RBX: 0000000000000008 RCX: ffff96f1a15b6000
Feb 1 23:13:13 mypc kernel: [ 8.239626] RDX: 0000000000000010 RSI: ffffad63c18e5be8 RDI: ffff96f19fbbe428
Feb 1 23:13:13 mypc kernel: [ 8.239627] RBP: ffffad63c25bf480 R08: 0000000020000800 R09: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.239627] R10: 0000000000000000 R11: 000000000000003f R12: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.239628] R13: ffff96f1a15b6980 R14: 0000000000000001 R15: 0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.239629] FS: 00007f8425114140(0000) GS:ffff97006eec0000(0000) knlGS:0000000000000000
Feb 1 23:13:13 mypc kernel: [ 8.239630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 1 23:13:13 mypc kernel: [ 8.239631] CR2: 00007f97ec008008 CR3: 0000000109a5c000 CR4: 0000000000750ee0
Feb 1 23:13:13 mypc kernel: [ 8.239631] PKRU: 55555554
The problem is floating between the kernels, still no root cause found.
The problem is floating between the kernels, still no root cause found.
Are you sure, it’s inconsistently present? I get consistently a null deref in aq_ring_rx_fill for the latest 3 stable/longterm kernel releases and mainline¹ when hibernating with the atlantic network interface up. I’ve found a line to be consistently referenced in vmcore dumps of null pointer dereferences.
@koo5: I see resume log messages in your log. Was that resume from hibernation and was the device turned off by that hibernation or stayed it on?
For me the hibernation immediately resumes without switching the PC off. After that “resume” the null pointer deref occurs. There is no indication from kernel messages, why the system resumes immediately after writing the image. I captured vmcores for the latter three of ¹5.10.109, 5.15.32, 5.17.1, 5.18-rc1 with crashkernel and panic_on_oops and tracked it down to this line in 5.18-rc1:
buff->flags = 0U;
aq_ring.c#L545
dmesg-5.18-rc1.log crash-5.18-rc1.log aq_ring-5.18-rc1_objectDumped.txt kernel-config-5.18-rc1-x86_64.txt
The other logs from Redhat’s crash utility look similar, even with the same offset. I could not retrieve the vmcore dump from 5.10.109 because crashkernel could not reinitialize USB and SATA when booting. Note, kernel is tainted by an amdgpu warning. Probably some module/builtin incompatibility. Needed it builtin, because crashkernel would leave me blind otherwise.
In contrary to that, here are dmesg logs of one hibernation with the oldest AQC107 supporting longterm kernels: dmesg-4.19.237-x86_64.log dmesg-5.4.188-x86_64.log
Bisecting is not easily possible unfortunately, because all the latest, but unsupported minor versions between 5.4 and 5.10 do not compile with my toolchain. Tried also older gcc versions and -Wno-error, but to no avail.
I hope you can just build the kernel and reproduce the issue, because I can’t provide the core dumps. They contain my LUKS key in kmem and since it is distributed all over it 1, I’d rather not try to find and scramble it. If you need any more info like other gdb/crash outputs or the other used kernel configs (basically an olddefconfig of the one provided), leave me a note.
OS and build environment
Portage 3.0.30 (python 3.9.9-final-0, default/linux/amd64/17.1/hardened, gcc-11.2.1, glibc-2.34-r10, 5.17.1-gentoo-r1-x86_64 x86_64)
System uname: Linux-5.17.1-gentoo-r1-x86_64-x86_64-AMD_Ryzen_Threadripper_3990X_64-Core_Processor-with-glibc2.34 KiB Mem: 131811644 total, 123581920 free KiB Swap: 268435452 total, 268435452 free Timestamp of repository gentoo: Tue, 05 Apr 2022 20:30:01 +0000 ← last update Head commit of repository gentoo: f3363e64a5c00c663f0935feff6b7f10e8f1d5f9 Timestamp of repository xwing: Tue, 05 Apr 2022 20:00:05 +0000 sh bash 5.1_p16 ld GNU ld (Gentoo 2.37_p1 p2) 2.37 distcc 3.4 x86_64-pc-linux-gnu [disabled] app-misc/pax-utils: 1.3.3::gentoo app-shells/bash: 5.1_p16::gentoo dev-java/java-config: 2.3.1::gentoo dev-lang/perl: 5.34.0-r6::gentoo dev-lang/python: 3.8.13::gentoo, 3.9.9-r1::gentoo, 3.10.2_p1::gentoo dev-lang/rust: 1.58.1::gentoo dev-lang/rust-bin: 1.58.1::gentoo dev-util/cmake: 3.22.2::gentoo dev-util/meson: 0.61.4-r2::gentoo sys-apps/baselayout: 2.7-r3::gentoo sys-apps/openrc: 0.44.10::gentoo sys-apps/sandbox: 2.29::gentoo sys-devel/autoconf: 2.13-r1::gentoo, 2.71-r1::gentoo sys-devel/automake: 1.16.4::gentoo sys-devel/binutils: 2.37_p1-r2::gentoo sys-devel/binutils-config: 5.4.1::gentoo sys-devel/clang: 13.0.1::gentoo sys-devel/gcc: 9.4.0::gentoo, 10.3.1_p20211126::gentoo, 11.2.1_p20220115::gentoo sys-devel/gcc-config: 2.5-r1::gentoo sys-devel/libtool: 2.4.6-r6::gentoo sys-devel/lld: 13.0.1::gentoo sys-devel/llvm: 13.0.1::gentoo sys-devel/make: 4.3::gentoo sys-kernel/linux-headers: 5.15-r3::gentoo (virtual/os-headers) sys-libs/glibc: 2.34-r10::gentoo
I’ve posted a patch for this, since I found the root cause. The deep parameter for pm functions in aq_pci_func.c was inverted for all calls, so suspend/resume did a deep reset, while freeze/thaw did a normal reset.
This was present since the introduction of the new pm API usage in 5.4-rc5 with 8aaa112a57c1 ("net: atlantic: refactoring pm logic"), however it had no effect until aqvec{free,alloc} were split in 8ce84271697a("net: atlantic: changes for multi-TC support"), leading to ring buffers not being reinitialized on non-deep pm calls. The latter commit is found by bisect.
@koo5: Thank you for reporting this; I mentioned you in the Reported-by tag.
https://lore.kernel.org/netdev/87sfqaa8n5.fsf@posteo.de/T/#u
Hm, taking a closer look at koo5’s reported trace, this is a null dereference on resume, so it could not be related, unless the deep reset on suspend/resume causes null derefs for other devices. Getting some retesting from the reporters might help to clear this up.
@koo5, @eitch: Could you please retest any of the patches, applied against the kernel version, indicated by filename? Thanks in advance.
5.10.111-invert-deep-par.patch.txt 5.15.34-invert-deep-par.patch.txt 5.17.3-invert-deep-par.patch.txt 5.18-rc2_invert-deep-par.patch.txt
Much appreciated. Sorry i did not get to testing your patches yet. But:
Was that resume from hibernation and was the device turned off by that hibernation or stayed it on?
it was from an attempted suspend, the suspend never works, my machine resumes immediately.
Much appreciated. Sorry i did not get to testing your patches yet. But:
Never mind. This is patched now in stable as of 5.17.5, 5.15.36 and 5.10.113. Could you try any of those kernel versions?
Was that resume from hibernation and was the device turned off by that hibernation or stayed it on?
it was from an attempted suspend, the suspend never works, my machine resumes immediately.
Oh, I see. I’ve also seen that behaviour with hibernation shifting between null deref during hibernation (causing immediate thaw) and after hibernation while bisecting, so Igor’s impression might be valid. I could not reproduce the kernel oops before hibernation with the latest stable patches before my patch though, e.g. 5.10.110. Which kernel version do you use?
I never had issues with suspend in 5.15.32 (dist-kernel), any 5.16 version (can’t recall them) and 5.17. Since I was investigating hibernation, I did not try suspend in older kernel releases, but I can give them a try. If you can still reproduce this, knowing the reproducing kernel version would be much appreciated. :)
Side note: I could not bisect with the whole tree, so other bug sources were not excluded. Any version between 5.4 and 5.10 booted into a panic before the console was initialized. Neither have an USB debug port nor a serial port on my mainboard, so I just applied the commit state of atlantic on top of v5.10.110 and limited bisect to commits changing it.
@labre Hi, the change associated with this appears to have caused a regression. See https://bugzilla.kernel.org/show_bug.cgi?id=215949
@amarshall Thanks for your bug report and I’m sorry for the regression. I screwed up suspend testing by kexec-rebooting in an unpatched kernel. This is handled in https://patchwork.kernel.org/project/netdevbpf/patch/8735hniqcm.fsf@posteo.de/
yup, not getting any errors in 5.15.0-75