ARM-software / tf-issues

Issue tracking for the ARM Trusted Firmware project

While bringing up the secondary cores with ARM IP CCN-508, Linux hangs. #622

Closed pangupta closed 6 years ago

pangupta commented 6 years ago

Two Observations:

  1. When all the RN-F IDs are included in the snoop/DVM domain in one go, Linux works fine.
  2. When only the RN-F mapped to the cluster whose core is coming up is added to the snoop/DVM domain, one cluster at a time, Linux hangs.

Has this issue been reported before? Am I placing the call that updates the snoop/DVM domain correctly?

Code-level details and the Linux call trace for observation 2:

Function used:

void plat_arm_interconnect_enter_coherency()
{
    ccn_enter_snoop_dvm_domain(1 << MPIDR_AFFLVL1_VAL(read_mpidr_el1()));
}
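To make the mask arithmetic in that function concrete, here is a minimal standalone sketch in C. It assumes Aff1 occupies bits [15:8] of MPIDR_EL1 and is used as the cluster index. The helper names (`cluster_of`, `master_map_for`, `rn_id_map_for`) and the table `rnf_node_id` are hypothetical and are not part of the CCN driver; the two table entries (cluster 0 to node 0, cluster 1 to node 11) are taken from the firmware debug output quoted later in this comment, and entries for the other clusters are not known from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Aff1 (cluster) field of MPIDR_EL1: bits [15:8]. */
#define AFF1_SHIFT 8u
#define AFF1_MASK  0xffu

/*
 * Hypothetical cluster -> RN-F node ID table. Only these two entries are
 * visible in the debug output of this thread; the rest is unknown here.
 */
static const unsigned int rnf_node_id[] = { 0u, 11u };
#define NUM_KNOWN_CLUSTERS (sizeof(rnf_node_id) / sizeof(rnf_node_id[0]))

/* Extract the cluster index (Aff1) from an MPIDR value. */
static unsigned int cluster_of(uint64_t mpidr)
{
    return (unsigned int)((mpidr >> AFF1_SHIFT) & AFF1_MASK);
}

/* One bit per cluster: same arithmetic as 1 << MPIDR_AFFLVL1_VAL(mpidr). */
static uint64_t master_map_for(uint64_t mpidr)
{
    unsigned int cluster = cluster_of(mpidr);

    assert(cluster < 64u); /* keep the shift well defined */
    return UINT64_C(1) << cluster;
}

/* Translate a master (cluster) bitmap into an RN-F node ID bitmap. */
static uint64_t rn_id_map_for(uint64_t master_map)
{
    uint64_t rn_id_map = 0;

    /* Reject clusters for which this sketch has no table entry. */
    assert((master_map >> NUM_KNOWN_CLUSTERS) == 0);

    for (unsigned int i = 0; i < NUM_KNOWN_CLUSTERS; i++) {
        if (master_map & (UINT64_C(1) << i))
            rn_id_map |= UINT64_C(1) << rnf_node_id[i];
    }
    return rn_id_map;
}
```

With an illustrative MPIDR value of 0x100 (Aff1 = 1), `master_map_for` gives 0x2 and `rn_id_map_for(0x2)` gives 0x800, which matches the `master_map =2` and `rn_id_map=800` debug lines. Passing 0x3 adds both RN-Fs in one call and gives 0x801, which corresponds to the "in one go" case of observation 1 restricted to these two clusters.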

Linux issues an SMC to ask the secure runtime firmware (BL31) to bring up a new core.

In BL31, the above function is executed as part of the platform hook:

static plat_psci_ops_t _psci_pm_ops = {
    ...
    .pwr_domain_on_finish = _pwr_domain_wakeup
    ...
};

static void _pwr_domain_wakeup(const psci_power_state_t *target_state)
{
    u_register_t core_mask = plat_my_core_mask();
    u_register_t core_state = _getCoreState(core_mask);

    switch (core_state) {
    case CORE_PENDING: /* this core is coming out of reset */

        /* soc per cpu setup */
        soc_init_percpu();

        /* gic per cpu setup */
        plat_ls_gic_pcpu_init();

        plat_arm_interconnect_enter_coherency();

        /* set core state in internal data */
        core_state = CORE_RELEASED;
        _setCoreState(core_mask, core_state);
        break;

    ... ...
}

Linux hangs just after executing plat_arm_interconnect_enter_coherency().

Linux traces:

[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 4.9.62-gfa56a24 (meenakshi@uefi-OptiPlex-790) (gcc version 7.3.0 (GCC) ) #6 SMP PREEMPT Mon Sep 3 14:45:53 IST 2018
[ 0.000000] Boot CPU: AArch64 Processor [410fd083]
[ 0.000000] earlycon: pl11 at MMIO32 0x00000000021c0000 (options '')
[ 0.000000] bootconsole [pl11] enabled
[ 0.000000] Memory limited to 2048MB
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] efi: EFI v2.70 by EDK II
[ 0.000000] efi: MEMATTR=0xf037d018
[ 0.000000] cma: Reserved 16 MiB at 0x00000000f9c00000
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x00000020841fffff]
[ 0.000000] NUMA: Adding memblock [0x80000000 - 0xefc6ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xefc70000 - 0xefcdffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xefce0000 - 0xf0313fff] on node 0
[ 0.000000] NUMA: Adding memblock [0xf0314000 - 0xf0317fff] on node 0
[ 0.000000] NUMA: Adding memblock [0xf0318000 - 0xf033ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xf0340000 - 0xf034ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xf0350000 - 0xf0b2ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xf0b30000 - 0xf0eeffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xf0ef0000 - 0xfad1ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xfad20000 - 0xfad6ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xfad70000 - 0xfbdfffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x2080000000 - 0x20841fffff] on node 0
[ 0.000000] NUMA: Initmem setup node 0 [mem 0x80000000-0x20841fffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x20841d0e40-0x20841d25ff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000080000000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x00000020841fffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000080000000-0x00000000efc6ffff]
[ 0.000000] node 0: [mem 0x00000000efc70000-0x00000000efcdffff]
[ 0.000000] node 0: [mem 0x00000000efce0000-0x00000000f0313fff]
[ 0.000000] node 0: [mem 0x00000000f0314000-0x00000000f0317fff]
[ 0.000000] node 0: [mem 0x00000000f0318000-0x00000000f033ffff]
[ 0.000000] node 0: [mem 0x00000000f0340000-0x00000000f034ffff]
[ 0.000000] node 0: [mem 0x00000000f0350000-0x00000000f0b2ffff]
[ 0.000000] node 0: [mem 0x00000000f0b30000-0x00000000f0eeffff]
[ 0.000000] node 0: [mem 0x00000000f0ef0000-0x00000000fad1ffff]
[ 0.000000] node 0: [mem 0x00000000fad20000-0x00000000fad6ffff]
[ 0.000000] node 0: [mem 0x00000000fad70000-0x00000000fbdfffff]
[ 0.000000] node 0: [mem 0x0000002080000000-0x00000020841fffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000020841fffff]
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv1.1 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: MIGRATE_INFO_TYPE not supported.
[ 0.000000] percpu: Embedded 23 pages/cpu @ffff80200405d000 s54680 r8192 d31336 u94208
[ 0.000000] Detected PIPT I-cache on CPU0
[ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 515832
[ 0.000000] Policy zone: Normal
[ 0.000000] Kernel command line: Image initrd=\fsl-image-core-lx2160ardb.ext2.gz root=/dev/ram0 rw console=ttyAMA0,115200 ramdisk_size=2000000 no_console_suspend,ignore_loglevel earlycon=pl011,mmio32,0x21c0000 mem=2048M default_hugepagesz=2MB hugepagesz=2MB hugepages=44 pci=pcie_bus_perf noefi
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] software IO TLB [mem 0xf5c00000-0xf9c00000] (64MB) mapped at [ffff800075c00000-ffff800079bfffff]
[ 0.000000] Memory: 1895996K/2097152K available (11452K kernel code, 1028K rwdata, 4584K rodata, 1088K init, 949K bss, 184772K reserved, 16384K cma-reserved)
[ 0.000000] Virtual kernel memory layout:
[ 0.000000] modules : 0xffff000000000000 - 0xffff000008000000 ( 128 MB)
[ 0.000000] vmalloc : 0xffff000008000000 - 0xffff7dffbfff0000 (129022 GB)
[ 0.000000] .text : 0xffff000008080000 - 0xffff000008bb0000 ( 11456 KB)
[ 0.000000] .rodata : 0xffff000008bb0000 - 0xffff000009030000 ( 4608 KB)
[ 0.000000] .init : 0xffff000009030000 - 0xffff000009140000 ( 1088 KB)
[ 0.000000] .data : 0xffff000009140000 - 0xffff000009241200 ( 1029 KB)
[ 0.000000] .bss : 0xffff000009241200 - 0xffff00000932e9cc ( 950 KB)
[ 0.000000] fixed : 0xffff7dfffe7fd000 - 0xffff7dfffec00000 ( 4108 KB)
[ 0.000000] PCI I/O : 0xffff7dfffee00000 - 0xffff7dffffe00000 ( 16 MB)
[ 0.000000] vmemmap : 0xffff7e0000000000 - 0xffff800000000000 ( 2048 GB maximum)
[ 0.000000] 0xffff7e0000000000 - 0xffff7e0080108000 ( 2049 MB actual)
[ 0.000000] memory : 0xffff800000000000 - 0xffff802004200000 (131138 MB)
[ 0.000000] Preemptible hierarchical RCU implementation.
[ 0.000000] Build-time adjustment of leaf fanout to 64.
[ 0.000000] RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=16.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=16
[ 0.000000] NR_IRQS:64 nr_irqs:64 0
[ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[ 0.000000] ITS [mem 0x06020000-0x0603ffff]
[ 0.000000] ITS@0x0000000006020000: allocated 65536 Devices @2080880000 (flat, esz 8, psz 64K, shr 0)
[ 0.000000] ITS: using cache flushing for cmd queue
[ 0.000000] GIC: using LPI property table @0x0000002080820000
[ 0.000000] ITS: Allocated 1792 chunks for LPIs
[ 0.000000] GICv3: CPU0: found redistributor 0 region 0:0x0000000006200000
[ 0.000000] CPU0: using LPI pending table @0x0000002080830000
[ 0.000000] GIC: using cache flushing for LPI property table
[ 0.000000] arm_arch_timer: Architected cp15 timer(s) running at 25.00MHz (phys).
[ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x5c409fb33, max_idle_ns: 440795203156 ns
[ 0.000001] sched_clock: 56 bits at 25MHz, resolution 39ns, wraps every 4398046511103ns
[ 0.008339] Console: colour dummy device 80x25
[ 0.012852] Calibrating delay loop (skipped), value calculated using timer frequency.. 50.00 BogoMIPS (lpj=100000)
[ 0.023318] pid_max: default: 32768 minimum: 301
[ 0.028069] Security Framework initialized
[ 0.032443] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[ 0.040082] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[ 0.047367] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.054144] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.061764] ASID allocator initialised with 65536 entries
[ 0.099260] PCI/MSI: /interrupt-controller@6000000/gic-its@6020000 domain created
[ 0.106852] Platform MSI: /interrupt-controller@6000000/gic-its@6020000 domain created
[ 0.115010] EFI runtime services will be disabled.
ccn_enter_snoop_dvm_domain Enter
ccn_master_to_rn_id_map enter master_map =1
Node Id = 0
ccn_master_to_rn_id_map exit rn_id_map=1
Updating HNF
ccn_snoop_dvm_do_op Enter
ccn_snoop_dvm_do_op Exit
Updating MN
ccn_snoop_dvm_do_op Enter
ccn_snoop_dvm_do_op Exit
ccn_enter_snoop_dvm_domain Exit
ccn_enter_snoop_dvm_domain Enter
ccn_master_to_rn_id_map enter master_map =2
Node Id = 11
ccn_master_to_rn_id_map exit rn_id_map=800
Updating HNF
ccn_snoop_dvm_do_op Enter
ccn_snoop_dvm_do_op Exit
Updating MN
ccn_snoop_dvm_do_op Enter
ccn_snoop_dvm_do_op Exit
ccn_enter_snoop_dvm_domain Exit
[ 0.251923] Unexpected interrupt received!
[ 0.255146] ------------[ cut here ]------------
[ 0.255147] kernel BUG at kernel/cpu.c:867!
[ 0.255149] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 0.255151] Modules linked in:
[ 0.255154] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.62-gfa56a24 #6
[ 0.255155] Hardware name: NXP Layerscape LX2160ARDB (DT)
[ 0.255156] task: ffff8020000a4e40 task.stack: ffff8020000a8000
[ 0.255165] PC is at cpuhp_report_idle_dead+0x80/0x88
[ 0.255168] LR is at cpu_startup_entry+0x17c/0x200
[ 0.255170] pc : [] lr : [] pstate: 20000045
[ 0.255170] sp : ffff8020000abf30
[ 0.255173] x29: ffff8020000abf30 x28: 0000000000000001
[ 0.255174] x27: ffff000009148000 x26: ffff000009107c70
[ 0.255176] x25: ffff000009148164 x24: ffff8020000a4e40
[ 0.255178] x23: 0000000000000000 x22: 0000000000000000
[ 0.255179] x21: ffff8020040789c8 x20: 0000801ffaf76000
[ 0.255181] x19: ffff0000091029c8 x18: 0000000000000030
[ 0.255182] x17: ffff8020000abf10 x16: ffff00000914b000
[ 0.255184] x15: ffff00000915f000 x14: 00000000fffffff0
[ 0.255185] x13: 00000000026cd7d3 x12: ffff802000124440
[ 0.255187] x11: 0000000000000000 x10: 00000000000008a0
[ 0.255189] x9 : ffff8020000abea0 x8 : ffff8020000a5740
[ 0.255190] x7 : 00000000ffffffff x6 : 00000000fffedb2f
[ 0.255192] x5 : 00000000000000c0 x4 : 0000801ffaf76000
[ 0.255193] x3 : 0000000000000040 x2 : 0000000000000000
[ 0.255195] x1 : 000000000000003f x0 : 0000000000000096
[ 0.255195]
[ 0.255196] Process swapper/1 (pid: 0, stack limit = 0xffff8020000a8000)
[ 0.255198] Stack: (0xffff8020000abf30 to 0xffff8020000ac000)
[ 0.255199] bf20: ffff8020000abf60 ffff0000080feeec
[ 0.255201] bf40: ffff0000091480c8 0000000000000002 ffff000009148158 ffff0000080feea8
[ 0.255202] bf60: ffff8020000abfd0 ffff00000808e6ec 0000000000000001 0000000000000001
[ 0.255203] bf80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.255205] bfa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.255206] bfc0: 0000000000000000 ffff00000923e0ed 0000000000000000 0000000080ba41a4
[ 0.255207] bfe0: 0000000000000000 0000000000000000 deadbeefdeadbeef deadbeefdeadbeef
[ 0.255208] Call trace:
[ 0.255210] Exception stack(0xffff8020000abd60 to 0xffff8020000abe90)
[ 0.255211] bd60: ffff0000091029c8 0001000000000000 ffff8020000abf30 ffff0000080c1828
[ 0.255213] bd80: ffff8020000abd90 ffff000008132f38 ffff8020000abda0 ffff000008100e9c
[ 0.255214] bda0: ffff8020000abdd0 ffff00000810dd6c ffff00000924c958 0000000000000000
[ 0.255215] bdc0: ffff8020000abdd0 ffff000008477fc4 ffff8020000abe30 ffff00000808ed54
[ 0.255216] bde0: 0000000000000000 ffff802000040cc0 0000000000000000 0000000000000001
[ 0.255218] be00: 0000000000000096 000000000000003f 0000000000000000 0000000000000040
[ 0.255219] be20: 0000801ffaf76000 00000000000000c0 00000000fffedb2f 00000000ffffffff
[ 0.255220] be40: ffff8020000a5740 ffff8020000abea0 00000000000008a0 0000000000000000
[ 0.255222] be60: ffff802000124440 00000000026cd7d3 00000000fffffff0 ffff00000915f000
[ 0.255223] be80: ffff00000914b000 ffff8020000abf10
[ 0.255226] [] cpuhp_report_idle_dead+0x80/0x88
[ 0.255227] [] cpu_startup_entry+0x17c/0x200
[ 0.255230] [] secondary_start_kernel+0x14c/0x188
[ 0.255231] [<0000000080ba41a4>] 0x80ba41a4
[ 0.255234] Code: a94153f3 a8c37bfd d65f03c0 d503201f (d4210000)
[ 0.255241] ---[ end trace 5ef4971781faca1d ]---
[ 0.255245] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.255247] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.597618] ------------[ cut here ]------------
[ 0.602281] WARNING: CPU: 0 PID: 0 at drivers/irqchip/irq-gic-v3.c:360 gic_handle_irq+0x154/0x15c
[ 0.611247] Modules linked in:
[ 0.614326]
[ 0.615823] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 4.9.62-gfa56a24 #6
[ 0.623822] Hardware name: NXP Layerscape LX2160ARDB (DT)
[ 0.629274] task: ffff000009151b40 task.stack: ffff000009140000
[ 0.635253] PC is at gic_handle_irq+0x154/0x15c
[ 0.639825] LR is at gic_handle_irq+0x154/0x15c
[ 0.644397] pc : [] lr : [] pstate: 600001c5
[ 0.651869] sp : ffff802004061000
[ 0.655210] x29: ffff802004061000 x28: ffff000009151b40
[ 0.660575] x27: ffff000009148000 x26: ffff802004061050
[ 0.665939] x25: ffff80200405d060 x24: ffff00000923e2ca
[ 0.671304] x23: ffff000009143dd0 x22: ffff00000914b050
[ 0.676668] x21: 0000000000001fff x20: 000000000000001e
[ 0.682031] x19: 000000000000001e x18: 0000000000000010
[ 0.687395] x17: 00000000c3821ed9 x16: 0000000089fe07d7
[ 0.692759] x15: ffffffffffffffff x14: ffff00008924cec7
[ 0.698122] x13: ffff00000924ced5 x12: 000000000000000f
[ 0.703486] x11: 0000000005f5e0ff x10: 0000000000000079
[ 0.708850] x9 : 00000000ffffffd0 x8 : 74206c6c696b206f
[ 0.714214] x7 : 7420646574706d65 x6 : ffff0000085db7e8
[ 0.719578] x5 : 000000000000000d x4 : 0000000000000000
[ 0.724942] x3 : 0000000000000000 x2 : ffff00000915f5c8
[ 0.730306] x1 : ffff000009151b40 x0 : 000000000000001e
[ 0.735669]
[ 0.737165] ---[ end trace 5ef4971781faca1e ]---

soby-mathew commented 6 years ago

Hi @pangupta, my first impression from looking at pwr_domain_on_finish is that you are programming the interconnect on every core wakeup, which seems wrong. Usually the interconnect is programmed only for the first core to wake up in a cluster.

static void css_pwr_domain_on_finisher_common(
        const psci_power_state_t *target_state)
{
    assert(CSS_CORE_PWR_STATE(target_state) == ARM_LOCAL_STATE_OFF);

    /* Enable the gic cpu interface */
    plat_arm_gic_cpuif_enable();

    /*
     * Perform the common cluster specific operations i.e enable coherency
     * if this cluster was off.
     */
    if (CSS_CLUSTER_PWR_STATE(target_state) == ARM_LOCAL_STATE_OFF)
        plat_arm_interconnect_enter_coherency();
}

As seen from the code above, the interconnect is programmed only if the cluster state was marked as OFF. The cluster state will be OFF only for the first core powering on in the cluster. The remaining cores of the cluster, when powered on later, will find that the cluster state is already RUNNING and hence will skip the interconnect programming.
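The once-per-cluster behaviour can be illustrated with a small standalone C model. This is only a sketch under stated assumptions: in the real firmware the cluster state arrives through `target_state` from the PSCI framework, whereas here a per-cluster counter of powered-on cores stands in for it. All names (`cores_on`, `coherency_calls`, `enter_coherency`, `core_on_finish`) and the sizes are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative topology; the sizes are assumptions made for this sketch. */
#define NUM_CLUSTERS      2u
#define CORES_PER_CLUSTER 2u

/* Stand-in for the cluster power state: number of cores already on. */
static unsigned int cores_on[NUM_CLUSTERS];

/* Counts how often the interconnect was programmed for each cluster. */
static unsigned int coherency_calls[NUM_CLUSTERS];

/* Stand-in for plat_arm_interconnect_enter_coherency(). */
static void enter_coherency(unsigned int cluster)
{
    coherency_calls[cluster]++;
}

/*
 * Model of a pwr_domain_on_finish hook: only the first core of a cluster
 * finds the cluster OFF and programs the interconnect.
 * Returns true when this call programmed the interconnect.
 */
static bool core_on_finish(unsigned int cluster)
{
    bool cluster_was_off;

    assert(cluster < NUM_CLUSTERS);
    assert(cores_on[cluster] < CORES_PER_CLUSTER);

    cluster_was_off = (cores_on[cluster] == 0u);
    if (cluster_was_off)
        enter_coherency(cluster);

    cores_on[cluster]++;
    return cluster_was_off;
}
```

Bringing up both cores of cluster 0 and then one core of cluster 1 with this model calls `enter_coherency` once for each cluster, not once per core.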

Hope that helps.

pangupta commented 6 years ago

I tried the suggested approach of calling plat_arm_interconnect_enter_coherency() once per cluster, but the same problem still occurs.

soby-mathew commented 6 years ago

OK, the reported kernel panic seems to be unrelated to the CCN programming, as it says "Unexpected interrupt received!". CPU1 has booted and crashed in the kernel, whereas CPU0 has received an unexpected interrupt and crashed. Are CPU1 and CPU0 in the same cluster?

That's all I can glean at the moment from the information you have provided. I would suspect the interrupt configuration done by Linux, as CPU0 has received an unexpected interrupt.

pangupta commented 6 years ago

This issue is still unresolved for me. For now, I have included all the RN-Fs in the snoop domain in one go.