AmpereComputing / ampere-lts-kernel---DEPRECATED

Linux 5.4 and 5.10 Longterm kernel (LTS) with Ampere patches

AGDI cannot trigger kdump #160

Open adamliyi opened 2 years ago

adamliyi commented 2 years ago

Tested on latest linus_master (5.18-rc1):

On Altra, reserve crashkernel memory by adding this kernel option:

crashkernel=768M@0x400100000000
# dmesg | grep -i crash
crashkernel reserved: 0x0000400100000000 - 0x0000400130000000 (768 MB)
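
As a sanity check on the reservation above, the end address reported by dmesg can be recomputed from the base and size given in the `crashkernel=` option. This is only a sketch; `crash_region_end` is a hypothetical helper name, not part of any kdump tooling.

```sh
# Hypothetical helper: print the end address of a crashkernel=<size>M@<base> region.
crash_region_end() {
    base=$1      # start address, e.g. 0x400100000000
    size_mb=$2   # size in MiB, e.g. 768
    printf '0x%016x\n' $(( base + size_mb * 1024 * 1024 ))
}

# Values taken from the kernel option above.
crash_region_end 0x400100000000 768   # prints 0x0000400130000000
```

The result matches the upper bound in the "crashkernel reserved" line, so the full 768 MB was reserved at the requested address.
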
  1. sysrq can trigger kdump, and a vmcore file is generated:

    echo 1 > /proc/sys/kernel/sysrq; echo c > /proc/sysrq-trigger
  2. AGDI can trigger a kernel panic, but kdump does not work:

ipmitool raw 0x3c 0x16

<system reboot>
[    0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]
[    0.000000] Linux version 5.17.0+ (root@adam_mj_cent83) (gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), GNU ld version 2.30-93.el8) #1 SMP Sun Apr 10 23:30:58 CST 2022
... ...
[   17.739947] input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input0
[   17.756846] ACPI: button: Power Button [PWRB]

<hang here>

Note: the following errors appear when rebooting (these errors can be ignored):

[   16.817546] sdei: Failed to create event 1073741825: -5
[   16.828086] agdi agdi.0: Failed to register for SDEI event 1073741825
[   16.841065] agdi: probe of agdi.0 failed with error -5

Since the AGDI nmi_panic() is invoked from SDEI context, some drivers in the crash kernel (secondary kernel) cannot be initialized correctly. We can blacklist their initcalls, as below:

#edit /etc/sysconfig/kdump, add these options to "KDUMP_COMMANDLINE_APPEND"

ignore_loglevel initcall_debug initcall_blacklist=acpi_processor_driver_init initcall_blacklist=dma_atomic_pool_init
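
The manual edit above could also be scripted. The sketch below assumes the config file already contains a double-quoted `KDUMP_COMMANDLINE_APPEND="..."` line and that GNU sed is available; `append_kdump_opts` is a hypothetical helper name.

```sh
# Hypothetical helper: append extra options to KDUMP_COMMANDLINE_APPEND
# in a kdump config file (assumes the value is double-quoted on one line).
append_kdump_opts() {
    conf=$1    # path to the config file, e.g. /etc/sysconfig/kdump
    extra=$2   # options to append; must not contain '|', '&' or backslashes
    sed -i "s|^KDUMP_COMMANDLINE_APPEND=\"\(.*\)\"|KDUMP_COMMANDLINE_APPEND=\"\1 ${extra}\"|" "$conf"
}

# Usage (as root), followed by a kdump service restart so the crash kernel is reloaded:
#   append_kdump_opts /etc/sysconfig/kdump \
#       "ignore_loglevel initcall_debug initcall_blacklist=acpi_processor_driver_init initcall_blacklist=dma_atomic_pool_init"
#   systemctl restart kdump
```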

With these options, the secondary kernel hangs at arm_smmu_driver_init:

[   29.399831] calling  arm_smmu_driver_init+0x0/0x30 @ 1
[   29.410635] initcall arm_smmu_driver_init+0x0/0x30 returned 0 after 457 usecs
[   29.425017] calling  arm_smmu_driver_init+0x0/0x34 @ 1
[   29.435485] arm-smmu-v3 arm-smmu-v3.0.auto: option mask 0x0
[   29.446772] arm-smmu-v3 arm-smmu-v3.0.auto: ias 48-bit, oas 48-bit (features 0x00041fff)
[   29.464209] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 524288 entries for cmdq
[   29.480841] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 524288 entries for evtq
[   29.496342] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 524288 entries for priq
[   29.511404] arm-smmu-v3 arm-smmu-v3.0.auto: SMMU currently enabled! Resetting...

<Hang here>
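
With initcall_debug enabled, the hanging initcall is the last "calling" line that has no matching "initcall ... returned" line. A rough way to pull that out of a saved console log is sketched below; `find_hung_initcall` is a hypothetical helper name, and the sketch assumes initcalls run sequentially, as in the log above.

```sh
# Hypothetical helper: read an initcall_debug console log on stdin and print
# the last initcall that was started but never reported as returned.
find_hung_initcall() {
    awk '
        / calling / {
            # Remember the function named right after the "calling" keyword.
            for (i = 1; i <= NF; i++)
                if ($i == "calling") { pending = $(i + 1); break }
        }
        / initcall .* returned / {
            # Clear the pending entry once the same initcall reports a return.
            for (i = 1; i <= NF; i++)
                if ($i == "initcall" && $(i + 1) == pending) { pending = ""; break }
        }
        END { if (pending != "") print pending }
    '
}

# Usage: find_hung_initcall < ipmi.log
```

On the log excerpt above this prints arm_smmu_driver_init+0x0/0x34, the second initcall of that name (the arm-smmu-v3 driver), which never returns.
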

adamliyi commented 2 years ago

Full crash kernel boot log (not host kernel): ipmi.log

adamliyi commented 2 years ago

This patch helps to show the full nmi_panic log:

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index e16b248699d5..e961464a881e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -6,6 +6,7 @@
  * Copyright (C) Huawei Futurewei Technologies.
  */

+#include <linux/console.h>
 #include <linux/interrupt.h>
 #include <linux/irq.h>
 #include <linux/kernel.h>
@@ -188,6 +189,8 @@ void machine_kexec(struct kimage *kimage)
        "Some CPUs may be stale, kdump will be unreliable.\n");

    pr_info("Bye!\n");
+   if (in_nmi())
+       console_flush_on_panic(CONSOLE_FLUSH_PENDING);

    local_daif_mask();

adamliyi commented 2 years ago

A workaround patch was added to the linux-5.4.y tree: https://github.com/AmpereComputing/ampere-lts-kernel/commit/e26e6ac22363fcb24f31bb72dad4aaba4a76e8c7 "ACPI: AGDI: Complete sdei handler with SDEI_EVENT_COMPLETE_AND_RESUME before nmi_panic". This patch fixes the SMMU hang during kdump.

According to the upstream community discussions below, a proper fix requires a change in ATF.

For details, refer to:

[1] TF-A community discussion: https://lists.trustedfirmware.org/archives/list/tf-a@lists.trustedfirmware.org/thread/NB7PH7C32LQ5PRCCMISZ7EOVI3XFBI3X/#GPGW66B5MKUZFCAS2EJLBBZIZNSCMAA4

[2] Kernel discussion: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20211012142910.9688-1-zhangliguang@linux.alibaba.com/