foss-for-synopsys-dwc-arc-processors / linux

Helpful resources for users & developers of Linux kernel for ARC
22 stars 13 forks source link

Synopsys ARC HSDK (4xHS38) hangs at boot with 5.10-rc1 kernel #32

Closed jzbydniewski closed 3 years ago

jzbydniewski commented 3 years ago

I observe this with buildroot 2020.08.1, after kernel changed to 5.10-rc1 (using menuconfig 'kernel' and 'toolchain' sections) BR2_LINUX_KERNEL_CUSTOM_VERSION_VALUE="5.10-rc1" BR2_LINUX_KERNEL_VERSION="5.10-rc1" BR2_PACKAGE_HOST_LINUX_HEADERS_CUSTOM_5_7=y BR2_TOOLCHAIN_HEADERS_AT_LEAST_5_7=y BR2_TOOLCHAIN_HEADERS_AT_LEAST="5.7"

Linux doesn't complete boot but hangs instead when trying to start all cores:

Linux version 5.10.0-rc1 (arc-buildroot-linux-uclib c-gcc.br_real (Buildroot 2020.08.1) 9.3.1 20200315, GNU ld (GNU Binutils) 2.34.5 0.20200307) #1 SMP PREEMPT Thu Oct 29 11:52:51 CET 2020
Memory @ 80000000 [1024M]
OF: fdt: Machine model: snps,hsdk
earlycon: uart8250 at MMIO32 0xf0005000 (options '115200n8')
printk: bootconsole [uart8250] enabled
Failed to get possible-cpus from dtb, pretending all 4 cpus exist
archs-intc      : 2 priority levels (default 1) FIRQ (not used)

IDENTITY        : ARCVER [0x52] ARCNUM [0x0] CHIPID [ 0x0]
processor [0]   : HS38 R2.1 (ARCv2 ISA)
Timers          : Timer0 Timer1 RTC [UP 64-bit] GFRC [SMP 64-bit]
ISA Extn        : atomic ll64 unalign mpy[opt 9] div_rem
BPU             : full match, cache:2048, Predict Table:16384 Return stk: 8
MMU [v4]        : 8k PAGE, 2M Super Page (not used) JTLB 1024 (256x4), uDTLB 8,                                   uITLB 4, PAE40 (not used)
I-Cache         : 64K, 4way/set, 64B Line, VIPT aliasing
D-Cache         : 64K, 2way/set, 64B Line, PIPT
SLC             : 512K, 128B Line
Peripherals     : 0xf0000000, IO-Coherency (per-device)
Vector Table    : 0x90000000
FPU             : SP DP
DEBUG           : smaRT RTT ActionPoint 8/min
Extn [SMP]      : ARConnect (v2): 4 cores with IPI IDU DEBUG GFRC
Zone ranges:
  Normal   [mem 0x0000000080000000-0x00000000bfffffff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000080000000-0x00000000bfffffff]
Initmem setup node 0 [mem 0x0000000080000000-0x00000000bfffffff]
On node 0 totalpages: 131072
  Normal zone: 576 pages used for memmap
  Normal zone: 0 pages reserved
  Normal zone: 131072 pages, LIFO batch:31
percpu: Embedded 7 pages/cpu s23168 r8192 d25984 u57344
pcpu-alloc: s23168 r8192 d25984 u57344 alloc=7*8192
pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
Built 1 zonelists, mobility grouping on.  Total pages: 130496
Kernel command line: earlycon=uart8250,mmio32,0xf0005000,115200n8 console=ttyS0,                                  115200n8 debug print-fatal-signals=1 root=/dev/mmcblk0p2 rootwait
Dentry cache hash table entries: 131072 (order: 6, 524288 bytes, linear)
Inode-cache hash table entries: 65536 (order: 5, 262144 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 1033472K/1048576K available (5693K kernel code, 642K rwdata, 1312K rodat                                  a, 168K init, 302K bss, 15104K reserved, 0K cma-reserved)
rcu: Preemptible hierarchical RCU implementation.
rcu:    RCU event tracing is enabled.
        Trampoline variant of Tasks RCU enabled.
        Rude variant of Tasks RCU enabled.
        Tracing variant of Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
NR_IRQS: 512
MCIP: IDU supports 64 common irqs
sched_clock: 64 bits at 999MHz, resolution 1ns, wraps every 4398046511103ns
clocksource: ARConnect GFRC: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb,                                   max_idle_ns: 881590591483 ns
Console: colour dummy device 80x25
Calibrating delay loop... 997.78 BogoMIPS (lpj=4988928)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 2048 (order: 0, 8192 bytes, linear)
Mountpoint-cache hash table entries: 2048 (order: 0, 8192 bytes, linear)
rcu: Hierarchical SRCU implementation.
smp: Bringing up secondary CPUs ...
Idle Task [1] (ptrval)
Trying to bring up CPU1 ...
Timeout: CPU1 FAILED to come up !!!
Idle Task [2] (ptrval)
Trying to bring up CPU2 ...
archs-intc      : 2 priority levels (default 1) FIRQ (not used)

IDENTITY        : ARCVER [0x52] ARCNUM [0x2] CHIPID [ 0x0]
processor [2]   : HS38 R2.1 (ARCv2 ISA)
Timers          : Timer0 Timer1 RTC [UP 64-bit]
ISA Extn        : atomic ll64 unalign mpy[opt 9] div_rem
BPU             : full match, cache:2048, Predict Table:16384 Return stk: 8
MMU [v4]        : 8k PAGE, 2M Super Page (not used) JTLB 1024 (256x4), uDTLB 8,                                   uITLB 4, PAE40 (not used)
I-Cache         : 64K, 4way/set, 64B Line, VIPT aliasing
D-Cache         : 64K, 2way/set, 64B Line, PIPT
SLC             : 512K, 128B Line
Peripherals     : 0xf0000000, IO-Coherency (per-device)
Vector Table    : 0x90000000
FPU             : SP DP
DEBUG           : smaRT RTT ActionPoint 8/min
Extn [SMP]      : ARConnect (v2): 4 cores with IPI IDU DEBUG GFRC
## CPU2 LIVE ##: Executing Code...
Idle Task [3] (ptrval)
Trying to bring up CPU3 ...
random: fast init done

I use default configuration "snps_archs38_hsdk_defconfig", with 'rcu torture' tests enabled as a module.

vineetgarc commented 3 years ago

I can reproduce this issue on my end on original HSDK (v1 with HS38x4) cores. The problem only happens on 5.10-rc1 and not on last release 5.9.

vineetgarc commented 3 years ago

Debugged this with Metaware debugger and I know what's going on.

From boot logs we can see that cores 0 (Master) and 2 are fine but 1 and 3 are not and hitting memory error exception in early boot asm trampoline when accessing @secondary_idle_task->thread_info (used to setup stack base)

9002c000 <first_lines_of_secondary>: 
9002c000:   ld  r0,[0x907961f4]  <-- read @secondary_idle_task (returns 0x8006_f2a0)
...
9002c024:   ld  sp,[r0,4]        <---- read @secondary_idle_task->thread_info
                                                  triggers memory error exception

The key things to note are

On HSDK, two of the four cores (1 and 3) have private ICCM and DCCM (snippet from boot log of a working system)

processor [0]   : HS38 R2.1 (ARCv2 ISA) 
processor [1]   : HS38 R2.1 (ARCv2 ISA) 
Extn [CCM]      : DCCM @ 80000000, 256 KB / ICCM: @ 70000000, 256 KB
processor [2]   : HS38 R2.1 (ARCv2 ISA) 
processor [3]   : HS38 R2.1 (ARCv2 ISA) 
Extn [CCM]      : DCCM @ 80000000, 256 KB / ICCM: @ 70000000, 256 KB

On these cores, DCCM is wired @ 0x8000_0000 to 0x8004_0000. Since memory targets (ccm, cache..) occupy the full 256M aperture, access at 0x8006_f2a0 still go to DCCM which correctly responds with unpopulated DCCM memory error.

On HS3x memory targets are allowed to be double mapped (with a priority order) and on HSDK 0x8000_0000 is also mapped to D-cache and further along to first 256MB of DDR. Linux needs to use the 0x8000_0000 region for following reasons

ARC SMP Linux however can't use CCMs (for reasons too detailed to explain here). So the kernel dynamically remaps DCCM to 0x6000_0000, an unused region on Linux process memory map (which is a pretty cool hardware feature BTW). From then on, accesses at 0x8000_0000 are served by 2nd mapping to caches (backed by external DDR). Kernel adds 0x8000_0000 (256MB) to pool of free memory, which is used by page/slab allocators for dynamic allocations (To avoid chicken-n-egg problem kernel binary for HSDK is linked at 0x9000_0000 (and not at canonical 0x8000_0000 as on other ARC platforms) so all the build-time kernel code/data are off that address (and what you see if you objdump kernel binary)).

The remapping of DCCM is required for each core (at least 1 and 3) and is currently done in platform boot hook hsdk_init_per_cpu() called for each cpu.

The offending access itself is for @secondary_idle_task which is actually dynamically allocated on Master/boot cpu (from its pool of free memory) and passed to non-master cpus as part of kick starting them. However the remapping on DCCM on those cores, despite being in early platform boot hook is too late given the access is in first asm trampoline.

core0

bringup_cpu(int cpu)
    struct task_struct *idle = idle_thread_get(cpu);
    __cpu_up(cpu, idle);
        \  secondary_idle_tsk = idle;
        \  arc_default_smp_cpu_kick(cpu, first_lines_of_secondary)

core1

ENTRY(first_lines_of_secondary)
    ld  r0, [@secondary_idle_tsk]               <-- failing access
...
        start_kernel_secondary
         \  plat_smp_ops.init_per_cpu(cpu);
                 \ hsdk_init_per_cpu                         <--- remaps DCCM

To prove this, the exact failing kernel binary can be made to work by loading the kernel binary via Metaware Debugger and manually programming Auxiliary Registers 0x208 (ICCM) and 0x18 (DCCM) to 0x6000_0000 before hitting run.

vineetgarc commented 3 years ago

As to the mystery why this was happening only on v5.10-rc1 and not prior release v5.9 the issue was bisected to a subtle side-effect of generic upstream change merged in v5.10-rc1

Commit 7fef431be9c9ac25583 mm/page_alloc: place pages to tail in __free_pages_core() happens to trigger this issue. That change is legit, but it semantically changes how free memory pages are managed and handed out (in-order vs. tail order). For ARC it started using 0x8000_0000 based region much earlier, before that region was remapped from CCM to caches on other cores. exposing a deficiency or more accurately an implicit assumption in ARC platform code.

To prove, this even 5.10-rc1 can be made to work by just hacking a single line of generic mm code

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
@@ -1553,7 +1553,7 @@ void __free_pages_core(struct page *page, unsigned int order)
         * Bypass PCP and place fresh pages right to the tail, primarily
         * relevant for memory onlining.
         */
-       __free_pages_ok(page, order, FPI_TO_TAIL);
+       __free_pages_ok(page, order, FPI_NONE);
 }
vineetgarc commented 3 years ago

The fix is to remove the implicit ordering in ARC platform code.

Patch posted to lkml ARC: [plat-hsdk] Remap CCMs super early in asm boot trampoline

This will make it to upstream v5.10-rc2 but can be tested now.

vineetgarc commented 3 years ago

Oh before we move on, another mystery was this not showing up on my HSDK-4xD development board which is essentially 2nd generation of same board, with a new HS release HS48x4. Turns out that on on this version, CCMs are already wired at 0x6000_0000 (at hardware build time) so CCM remapping in Linux is not even needed (but doing it doesn't really hurt).

jzbydniewski commented 3 years ago

Thanks, I applied this patch to 5.10-rc2 and can see that HSDK now boots properly.