foss-for-synopsys-dwc-arc-processors / linux

Helpful resources for users & developers of Linux kernel for ARC

[Linux] Init fails with 16KiB MMU page #17

Closed abrodkin closed 4 years ago

abrodkin commented 4 years ago

In Buildroot 2019.11-374-g62a7e61df9: BR2_ARC_PAGE_SIZE_16K=y & BR2_TOOLCHAIN_USES_UCLIBC=y. In Linux kernel 5.4.2: CONFIG_ARC_PAGE_SIZE_16K=y.

Linux version 5.4.2 (abrodkin@ru20arcgnu1) (gcc version 9.2.1 20191002 (Buildroot 2020.02-git-00374-g62a7e61df9)) #2 PREEMPT Tue Dec 17 12:08:00 MSK 2019
Memory @ 80000000 [512M]
Memory @ 100000000 [1024M] Not used
OF: fdt: Machine model: snps,nsim_hs
earlycon: arc_uart0 at MMIO32 0xc0fc1000 (options '115200n8')
printk: bootconsole [arc_uart0] enabled
archs-intc      : 15 priority levels (default 1)

IDENTITY        : ARCVER [0x53] ARCNUM [0x0] CHIPID [0xffff]
processor [0]   : HS38 R3.0 (ARCv2 ISA)
Timers          : Timer0 Timer1 RTC [UP 64-bit]
ISA Extn        : atomic ll64 unalign mpy[opt 9] div_rem
BPU             : partial match, cache:2048, Predict Table:16384 Return stk: 8
MMU [v4]        : 16k PAGE, 2M Super Page (not used) JTLB 512 (128x4), uDTLB 8, uITLB 4
I-Cache         : 32K, 4way/set, 64B Line, VIPT
D-Cache         : 16K, 2way/set, 64B Line, PIPT
Peripherals     : 0xc0000000
Vector Table    : 0x80000000
DEBUG           : ActionPoint 4/full
Built 1 zonelists, mobility grouping on.  Total pages: 32696
Kernel command line: earlycon=arc_uart,mmio32,0xc0fc1000,115200n8 console=ttyARC0,115200n8 print-fatal-signals=1
Dentry cache hash table entries: 65536 (order: 4, 262144 bytes, linear)
Inode-cache hash table entries: 32768 (order: 3, 131072 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 516384K/524288K available (3303K kernel code, 194K rwdata, 912K rodata, 752K init, 292K bss, 7904K reserved, 0K cma-reserved)
SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
rcu: Preemptible hierarchical RCU implementation.
        Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
NR_IRQS: 512
sched_clock: 64 bits at 80MHz, resolution 12ns, wraps every 4398046511100ns
clocksource: ARCv2 RTC: mask: 0xffffffffffffffff max_cycles: 0x127350b881, max_idle_ns: 440795202125 ns
sched_clock: 32 bits at 80MHz, resolution 12ns, wraps every 26843545593ns
clocksource: ARC Timer1: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 23890755578 ns
Console: colour dummy device 80x25
Calibrating delay loop... 159.12 BogoMIPS (lpj=795648)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 4096 (order: 0, 16384 bytes, linear)
Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes, linear)
rcu: Hierarchical SRCU implementation.
devtmpfs: initialized
random: get_random_u32 called from bucket_table_alloc.isra.0+0x4c/0x194 with crng_init=0
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
futex hash table entries: 256 (order: -3, 3072 bytes, linear)
NET: Registered protocol family 16
DMA: preallocated 256 KiB pool for atomic allocations
clocksource: Switched to clocksource ARCv2 RTC
NET: Registered protocol family 2
tcp_listen_portaddr_hash hash table entries: 2048 (order: 0, 16384 bytes, linear)
TCP established hash table entries: 4096 (order: 0, 16384 bytes, linear)
TCP bind hash table entries: 4096 (order: 0, 16384 bytes, linear)
TCP: Hash tables configured (established 4096 bind 4096)
UDP hash table entries: 1024 (order: 0, 16384 bytes, linear)
UDP-Lite hash table entries: 1024 (order: 0, 16384 bytes, linear)
NET: Registered protocol family 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
arc-pct fpga:pct: use noncoherent DMA ops
ARC perf        : 8 counters (32 bits), 40 conditions, [overflow IRQ support]
workingset: timestamp_bits=30 max_order=15 bucket_order=0
io scheduler mq-deadline registered
io scheduler kyber registered
arc-uart c0fc1000.serial: use noncoherent DMA ops
c0fc1000.serial: ttyARC0 at MMIO 0x0 (irq = 24, base_baud = 5000000) is a arc-uart
printk: console [ttyARC0] enabled
printk: console [ttyARC0] enabled
printk: bootconsole [arc_uart0] disabled
printk: bootconsole [arc_uart0] disabled
NET: Registered protocol family 17
NET: Registered protocol family 15
Freeing unused kernel memory: 752K
This architecture does not have kernel memory protection.
Run /init as init process
Failed to execute /init (error -22)
Run /sbin/init as init process
Starting init: /sbin/init exists but couldn't execute it (error -14)
Run /etc/init as init process
Run /bin/init as init process
Run /bin/sh as init process
Starting init: /bin/sh exists but couldn't execute it (error -14)
Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---

abrodkin commented 4 years ago

Apparently with a 4 KiB MMU page the issue goes away, i.e. in Buildroot: BR2_ARC_PAGE_SIZE_4K=y & BR2_TOOLCHAIN_USES_UCLIBC=y. In Linux kernel 5.4.2: CONFIG_ARC_PAGE_SIZE_4K=y.

Linux version 5.4.2 (abrodkin@ru20arcgnu1) (gcc version 9.2.1 20191002 (Buildroot 2020.02-git-00374-g62a7e61df9)) #4 PREEMPT Tue Dec 17 14:40:15 MSK 2019
Memory @ 80000000 [512M]
Memory @ 100000000 [1024M] Not used
OF: fdt: Machine model: snps,nsim_hs
earlycon: arc_uart0 at MMIO32 0xc0fc1000 (options '115200n8')
printk: bootconsole [arc_uart0] enabled
archs-intc      : 15 priority levels (default 1)

IDENTITY        : ARCVER [0x53] ARCNUM [0x0] CHIPID [0xffff]
processor [0]   : HS38 R3.0 (ARCv2 ISA)
Timers          : Timer0 Timer1 RTC [UP 64-bit]
ISA Extn        : atomic ll64 unalign mpy[opt 9] div_rem
BPU             : partial match, cache:2048, Predict Table:16384 Return stk: 8
MMU [v4]        : 4k PAGE, 2M Super Page (not used) JTLB 512 (128x4), uDTLB 8, uITLB 4
I-Cache         : 32K, 4way/set, 64B Line, VIPT aliasing
D-Cache         : 16K, 2way/set, 64B Line, PIPT
Peripherals     : 0xc0000000
Vector Table    : 0x80000000
DEBUG           : ActionPoint 4/full
Built 1 zonelists, mobility grouping on.  Total pages: 129920
Kernel command line: earlycon=arc_uart,mmio32,0xc0fc1000,115200n8 console=ttyARC0,115200n8 print-fatal-signals=1
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 513072K/524288K available (3296K kernel code, 169K rwdata, 912K rodata, 732K init, 222K bss, 11216K reserved, 0K cma-reserved)
SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
rcu: Preemptible hierarchical RCU implementation.
        Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
NR_IRQS: 512
sched_clock: 64 bits at 80MHz, resolution 12ns, wraps every 4398046511100ns
clocksource: ARCv2 RTC: mask: 0xffffffffffffffff max_cycles: 0x127350b881, max_idle_ns: 440795202125 ns
sched_clock: 32 bits at 80MHz, resolution 12ns, wraps every 26843545593ns
clocksource: ARC Timer1: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 23890755578 ns
Console: colour dummy device 80x25
Calibrating delay loop... 159.12 BogoMIPS (lpj=795648)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
rcu: Hierarchical SRCU implementation.
devtmpfs: initialized
random: get_random_u32 called from bucket_table_alloc.isra.0+0x4c/0x194 with crng_init=0
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
futex hash table entries: 256 (order: -1, 3072 bytes, linear)
NET: Registered protocol family 16
DMA: preallocated 256 KiB pool for atomic allocations
clocksource: Switched to clocksource ARCv2 RTC
NET: Registered protocol family 2
tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
TCP bind hash table entries: 4096 (order: 2, 16384 bytes, linear)
TCP: Hash tables configured (established 4096 bind 4096)
UDP hash table entries: 256 (order: 0, 4096 bytes, linear)
UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear)
NET: Registered protocol family 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
arc-pct fpga:pct: use noncoherent DMA ops
ARC perf        : 8 counters (32 bits), 40 conditions, [overflow IRQ support]
workingset: timestamp_bits=30 max_order=17 bucket_order=0
io scheduler mq-deadline registered
io scheduler kyber registered
arc-uart c0fc1000.serial: use noncoherent DMA ops
c0fc1000.serial: ttyARC0 at MMIO 0x0 (irq = 24, base_baud = 5000000) is a arc-uart
printk: console [ttyARC0] enabled
printk: console [ttyARC0] enabled
printk: bootconsole [arc_uart0] disabled
printk: bootconsole [arc_uart0] disabled
NET: Registered protocol family 17
NET: Registered protocol family 15
Freeing unused kernel memory: 732K
This architecture does not have kernel memory protection.
Run /init as init process
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Saving random seed: random: dd: uninitialized urandom read (512 bytes read)
OK
Starting network: OK

Welcome to Buildroot

shahab-vahedi commented 4 years ago

The page size for ARC is hardcoded/defined to 8K in QEMU. I am even surprised that 4k version works. Maybe it works because any entry added to QEMU's internal TLB that is smaller than the page size (hardcoded 8k) can be considered a 1-byte entry. Then I expect the booting process to be slower than the normal scenarios.

abrodkin commented 4 years ago

@shahab-vahedi this problem happens on both nSIM & real HW (HSDK) so has nothing to do with QEMU [yet] :)

abrodkin commented 4 years ago

@claziss @cupertinomiranda I guess that might happen due to some alignment things. So to me this MAXPAGESIZE=0x2000 looks suspicious as it is exactly 8KiB! See https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2019.03/ld/emulparams/arclinux_prof.sh#L11

I remember back in the day I fixed that for ARC700 but now I cannot find that change in our repos :(

claziss commented 4 years ago

@abrodkin arclinux_prof.sh is unused :) But indeed the MAXPAGESIZE is set to 8k here: https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/d276fed546051ac2d0f127613c2bb0e00d02ccb6/bfd/elf32-arc.c#L3426

vineetgarc commented 4 years ago

> The page size for ARC is hardcoded/defined to 8K in QEMU. I am even surprised that 4k version works. Maybe it works because any entry added to QEMU's internal TLB that is smaller than the page size (hardcoded 8k) can be considered a 1-byte entry. Then I expect the booting process to be slower than the normal scenarios.

A 4k-page software build on an 8k-page hardware/simulator will not be slow, but will be busted in random ways. Imagine the vaddr faulted on was 0x1001. An 8k-page simulator will look up a TLB entry with vaddr 0x0 to do the mapping, while a 4k-page simulator will look up 0x1000. Since the entry for the latter likely won't exist, the lookup will fail. This is unless you have "creative" masking in the lookup process :-)
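
The mismatch in lookup keys can be sketched in a few lines. This is an illustration only: `tlb_lookup_tag` is a hypothetical helper, not code from nSIM, QEMU, or the kernel, and it assumes the lookup simply masks off the in-page offset bits of the faulting address.

```python
def tlb_lookup_tag(vaddr: int, page_size: int) -> int:
    """Return the page-aligned address used as the lookup key (hypothetical helper)."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a power of two")
    # Clear the low bits that address bytes inside the page.
    return vaddr & ~(page_size - 1)


fault_addr = 0x1001
print(hex(tlb_lookup_tag(fault_addr, 8 * 1024)))  # key seen with 8k pages
print(hex(tlb_lookup_tag(fault_addr, 4 * 1024)))  # key seen with 4k pages
```

The two calls yield different keys for the same faulting address, which is why software and hardware/simulator must agree on the page size.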

vineetgarc commented 4 years ago

> @claziss @cupertinomiranda I guess that might happen due to some alignment things. So to me this MAXPAGESIZE=0x2000 looks suspicious as it is exactly 8KiB! See https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2019.03/ld/emulparams/arclinux_prof.sh#L11
>
> I remember back in the day I fixed that for ARC700 but now I cannot find that change in our repos :(

There's a long history to how all the 16k and 4k sizes were brought up. The gist of the issue at hand is in https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/commit/50c303c7e921cdeb0344e782397f8585dfae000d

But then, as Claudiu says, this still needs MAXPAGESIZE to be 16k, whereas it is currently 8k for ARC. I would propose to NOT change the default, as that causes a lot of mmap space to be wasted in slack (issue too detailed to explain here).

One way to override is the linker toggle -Wl,-z,max-page-size=16384. This was initially broken as well (STAR ARS0102823: linker coalescing elf segments with -z max-page-size=16384) but has since been fixed in GNU 2016.09.
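
As a rough sketch of why the linker's max-page-size matters here: ELF requires each loadable segment's `p_vaddr` and `p_offset` to be congruent modulo `p_align`, and GNU ld lays out segments according to max-page-size. A segment laid out for 8 KiB alignment need not satisfy the same congruence modulo 16 KiB, in which case the file offset the loader would hand to mmap is not page-aligned. That would be consistent with the `error -22` (EINVAL) in the log above, though this thread does not confirm the exact failure path. The addresses below are made up for illustration, and `segment_mappable` is a hypothetical helper.

```python
def segment_mappable(p_vaddr: int, p_offset: int, page_size: int) -> bool:
    """True if the segment's address and file offset agree modulo page_size."""
    return (p_vaddr - p_offset) % page_size == 0


# Hypothetical segment produced with max-page-size=0x2000 (8 KiB).
p_vaddr, p_offset = 0x14000, 0x2000

for page_size in (0x2000, 0x4000):
    print(hex(page_size), segment_mappable(p_vaddr, p_offset, page_size))
```

With these made-up values the segment is fine for 8 KiB pages but not for 16 KiB pages; linking with max-page-size=16384 makes the layout satisfy the stricter congruence.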

But in the end, 16k page size is a useless config IMO. It reduces the TLB pressure a lot but potentially wastes a bunch of memory for small/medium allocations. Do you have a customer using or wanting to use it?
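
The memory side of that trade-off comes down to rounding: a mapping occupies a whole number of pages, so the unused tail tends to grow with page size. A small illustration with an arbitrary 5000-byte request (the size is an assumption picked for the example, not a measurement from this setup; `mapping_slack` is a hypothetical helper):

```python
def mapping_slack(size: int, page_size: int) -> int:
    """Bytes left unused when `size` bytes are rounded up to whole pages."""
    pages = -(-size // page_size)  # ceiling division
    return pages * page_size - size


for page_size in (4096, 8192, 16384):
    print(page_size, mapping_slack(5000, page_size))
```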

abrodkin commented 4 years ago

OK, the solution (-Wl,-z,max-page-size=16384) proposed by Vineet really makes user space work on 16 KiB MMU pages. In Buildroot parlance this translates to BR2_TARGET_OPTIMIZATION="-Wl,-z,max-page-size=16384". I guess we may just add this linker option by default whenever the 16 KiB page size gets selected in Buildroot, i.e. make it dependent on BR2_ARC_PAGE_SIZE_16K.
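
One way to check that a rebuilt binary actually picked up the option is to look at the `p_align` field of its `PT_LOAD` program headers (`readelf -l` shows the same information in its Align column). The sketch below parses a 32-bit ELF image with only the standard library; `load_alignments` is a hypothetical helper name, not part of Buildroot or binutils.

```python
import struct

PT_LOAD = 1


def load_alignments(data: bytes) -> list:
    """Return p_align of every PT_LOAD segment in a 32-bit ELF image."""
    if data[:4] != b"\x7fELF" or data[4] != 1:
        raise ValueError("not a 32-bit ELF file")
    # EI_DATA: 1 = little-endian, 2 = big-endian.
    endian = "<" if data[5] == 1 else ">"
    # In Elf32_Ehdr, e_phoff is at byte 28; e_phentsize and e_phnum at 42 and 44.
    (e_phoff,) = struct.unpack_from(endian + "I", data, 28)
    e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 42)
    aligns = []
    for i in range(e_phnum):
        # Elf32_Phdr: p_type, p_offset, p_vaddr, p_paddr,
        #             p_filesz, p_memsz, p_flags, p_align
        fields = struct.unpack_from(endian + "8I", data, e_phoff + i * e_phentsize)
        if fields[0] == PT_LOAD:
            aligns.append(fields[7])
    return aligns
```

Reading a target binary with `open(path, "rb").read()` and passing the bytes to `load_alignments` should report 16384 (0x4000) for every loadable segment when the option took effect, and 8192 (0x2000) with the default discussed above.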

FWIW:

# cat /proc/cpuinfo

IDENTITY        : ARCVER [0x53] ARCNUM [0x0] CHIPID [0xffff]
processor [0]   : HS38 R3.0 (ARCv2 ISA)
Timers          : Timer0 Timer1 RTC [UP 64-bit]
ISA Extn        : atomic ll64 unalign mpy[opt 9] div_rem
BPU             : partial match, cache:2048, Predict Table:16384 Return stk: 8
CPU speed       : 80.00 Mhz
Bogo MIPS       : 159.12
MMU [v4]        : 16k PAGE, 2M Super Page (not used) JTLB 512 (128x4), uDTLB 8, uITLB 4
I-Cache         : 32K, 4way/set, 64B Line, VIPT
D-Cache         : 16K, 2way/set, 64B Line, PIPT
Peripherals     : 0xc0000000
Vector Table    : 0x80000000
DEBUG           : ActionPoint 4/full

abrodkin commented 4 years ago

OK, this is now covered in Buildroot with https://patchwork.ozlabs.org/patch/1212544/.