Closed msmith626 closed 5 years ago
Very odd.
I don't fully understand that stack trace, but it's unusual that it's RIPing in an ioread. But there have definitely been some bug fixes in the master branch around switchtec_ntb_check_link().
Wesley, can you please look at updating the 4.14 branch with the changes in master to see if that fixes the issue?
Thanks,
Logan
@lsgunth Yes, the first place, [exception RIP: ioread32+11], is weird. An ioread32 from a register in the NTB control register region should not be a source of exceptions.
For the second place, [exception RIP: switchtec_ntb_check_link+131]: maybe it is triggered by the following statement in switchtec_ntb_check_link(): u64 peer = ioread64(&sndev->peer_shared->magic); This could fault if peer_shared had already been unmapped (in switchtec_ntb_deinit_shared_mw()) but not yet remapped (in switchtec_ntb_init_shared_mw()).
The root cause is that link_reinit_work (run on receiving "MSG_LINK_FORCE_DOWN") and switchtec_ntb_check_link are not serialized. If they run concurrently, this corner case can occur.
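To illustrate the serialization point, here is a minimal userspace sketch in C, not the driver's code and not the fix that went into master. It assumes a pthread mutex stands in for whatever primitive the kernel context would allow, and every name in it (model_ndev, model_link_reinit, model_check_link, MODEL_MAGIC) is hypothetical. The idea is only that the unmap/remap sequence and the magic check take the same lock, so the check can never observe a half-torn-down mapping.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical magic value for this model only; not the driver's constant. */
#define MODEL_MAGIC UINT64_C(0x1234ABCD)

/* Stand-in for the peer's shared memory window. */
struct model_shared {
	uint64_t magic;
};

/* Stand-in for the driver state; every name here is illustrative. */
struct model_ndev {
	pthread_mutex_t lock;             /* serializes remap and link check */
	struct model_shared *peer_shared; /* NULL while unmapped */
};

void model_init(struct model_ndev *nd)
{
	pthread_mutex_init(&nd->lock, NULL);
	nd->peer_shared = NULL;
}

/* Models the unmap step: drop the mapping while holding the lock. */
void model_deinit_shared(struct model_ndev *nd)
{
	pthread_mutex_lock(&nd->lock);
	free(nd->peer_shared);
	nd->peer_shared = NULL;
	pthread_mutex_unlock(&nd->lock);
}

/*
 * Models link reinit: unmap and remap inside one critical section, so no
 * other lock holder can run between the two steps.
 * Returns 0 on success, -1 if the allocation fails.
 */
int model_link_reinit(struct model_ndev *nd)
{
	int ret = 0;

	pthread_mutex_lock(&nd->lock);
	free(nd->peer_shared);
	nd->peer_shared = calloc(1, sizeof(*nd->peer_shared));
	if (nd->peer_shared)
		nd->peer_shared->magic = MODEL_MAGIC;
	else
		ret = -1;
	pthread_mutex_unlock(&nd->lock);
	return ret;
}

/*
 * Models the link check: read the magic only while holding the lock and
 * only if the mapping exists. Returns 1 if the peer looks up, else 0.
 */
int model_check_link(struct model_ndev *nd)
{
	int up = 0;

	pthread_mutex_lock(&nd->lock);
	if (nd->peer_shared)
		up = (nd->peer_shared->magic == MODEL_MAGIC);
	pthread_mutex_unlock(&nd->lock);
	return up;
}

void model_destroy(struct model_ndev *nd)
{
	model_deinit_shared(nd);
	pthread_mutex_destroy(&nd->lock);
}
```

In this model, a caller on one thread can run model_link_reinit() while another calls model_check_link(); the check either sees the old mapping, the new one, or none at all, and reports the link as down in the last case instead of dereferencing a stale pointer. In real kernel code the choice of lock would depend on the calling context (a mutex cannot be taken from an interrupt handler, for example).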
@msmith626 We have enhanced switchtec_ntb_check_link(); #48 (#37) is for this purpose. But currently we only maintain the following branches (apart from devel and master):
You can merge the following two commits from master to see if that fixes the issue: "ntb_hw_switchtec: Optimize function switchtec_ntb_reinit_peer()" (68cee85aef5f1857db10280b826e3320e7b09d1d) and "ntb_hw_switchtec: Fix unable to set mw translation bug" (815977dd7eba80403f950f1b764ef43367b8f628).
Or contact our Apps for official support.
Regards, Wesley
Thank you @lsgunth and @wesleywesley for the very quick response. I apologize for the delay in my response -- I wanted to test this on several of these systems to confirm merging in the two suggested commits resolved the kernel panic issue... and I'm happy to report it appears these do fix the issue! Thanks so much.
I noticed another issue where sometimes (rare) I'm unable to bring the link up on the NTB virtual Ethernet device with this driver... it doesn't happen a lot so far, so I will continue debugging and open another issue for this when/if I can reliably reproduce it.
--Marc
Hi,
We have been successfully using this Switchtec kernel driver (release_4.13_to_4.14 branch, latest commit in that branch). We are using vanilla Linux 4.14.91. We used this driver for many weeks with the NTB Virtual Ethernet device. It worked flawlessly across reboots, and we did not experience any issues. The hardware is a 2U NVMe CiB.
Yesterday, we rebooted into the BIOS/setup on one of the nodes and looked at some parameters... we did NOT make or save any changes. Then, upon rebooting back into the OS, the kernel panic'd on the opposing node (which was already running). The node that panic'd captured the dump file in the crash kernel and rebooted normally into the OS... when it was booting up, it loaded the Switchtec NTB driver and brought the 'eth0' interface up, and then the standing node panic'd! They continued in this kernel panic cycle until we stopped them by disabling the Switchtec NTB module (one would boot up and do an 'ifconfig eth0 up' and then the other node would panic).
We obtained a back trace using the 'crash' utility on the dump file:
SYSTEM MAP: System.map-esos.prod
DEBUG KERNEL: vmlinux-esos.prod (4.14.91-esos.prod)
DUMPFILE: dumpfile-1549951046
CPUS: 72
DATE: Tue Feb 12 00:10:42 2019
UPTIME: 00:08:21
LOAD AVERAGE: 1.01, 1.00, 0.59
TASKS: 1299
NODENAME: nvmenode2
RELEASE: 4.14.91-esos.prod
VERSION: #1 SMP Mon Feb 11 18:55:28 UTC 2019
MACHINE: x86_64 (2300 Mhz)
MEMORY: 127.7 GB
PANIC: "BUG: unable to handle kernel paging request at ffffc9000a1f0000"
PID: 46
COMMAND: "kworker/7:0"
TASK: ffff88903bcca780 [THREAD_INFO: ffff88903bcca780]
CPU: 7
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 46 TASK: ffff88903bcca780 CPU: 7 COMMAND: "kworker/7:0"
0 [ffff88903f9c3c28] machine_kexec at ffffffff810362e7
1 [ffff88903f9c3c70] __crash_kexec at ffffffff810d3900
2 [ffff88903f9c3d30] crash_kexec at ffffffff810d43f1
3 [ffff88903f9c3d48] oops_end at ffffffff81016ddc
4 [ffff88903f9c3d68] no_context at ffffffff8103f53c
5 [ffff88903f9c3db8] __do_page_fault at ffffffff8103fadb
6 [ffff88903f9c3e20] page_fault at ffffffff81e01385
7 [ffff88903f9c3f00] switchtec_ntb_message_isr at ffffffffa01208d9 [ntb_hw_switchtec]
8 [ffff88903f9c3f30] __handle_irq_event_percpu at ffffffff810ada21
9 [ffff88903f9c3f70] handle_irq_event_percpu at ffffffff810adb1b
10 [ffff88903f9c3f90] handle_irq_event at ffffffff810adb67
11 [ffff88903f9c3fa8] handle_edge_irq at ffffffff810b0b53
12 [ffff88903f9c3fb8] handle_irq at ffffffff810163f6
13 [ffff88903f9c3fc0] do_IRQ at ffffffff81e0207d
--- <IRQ stack> ---
14 [ffffc900066e7d88] ret_from_intr at ffffffff81e0093d
15 [ffffc900066e7e38] map_bars at ffffffffa011f4d6 [ntb_hw_switchtec]
16 [ffffc900066e7e60] switchtec_ntb_init_mw at ffffffffa011f5fe [ntb_hw_switchtec]
17 [ffffc900066e7e80] link_reinit_work at ffffffffa011fc0c [ntb_hw_switchtec]
18 [ffffc900066e7ea0] process_one_work at ffffffff8108333b
19 [ffffc900066e7ee0] worker_thread at ffffffff81083e69
20 [ffffc900066e7f10] kthread at ffffffff81087dd8
21 [ffffc900066e7f50] ret_from_fork at ffffffff81e001ef
crash>
Any help would be greatly appreciated. Again, we're using the "release_4.13_to_4.14" branch from this GitHub repo -- we believe that to be the most current/correct branch to use; please let me know if that's not the case.
Thanks,
Marc