angry-goose-initiative / wiki

AGI Wiki

Fix Broken Paging With S-Mode kernel #15

Closed nickchan2 closed 5 months ago

JZJisawesome commented 5 months ago

I've been writing a bunch of my progress on the #debugging channel, let me copy-paste what I've written (apologies for the poor formatting):

JZJ: Alrighty so it's about time I get back to debugging the smode kernel
[2:21 PM]JZJ: Being able to cancel a GDB continue command is really nice, but the issue from the summer still remains:
IRVE> RVDEBUGADDR: "[    0.000000] printk: debug: skip boot console de-registration.\r\n"
^C
Program received signal SIGINT, Interrupt.
Cannot remove breakpoints because program is no longer writable.
Further execution is probably impossible.
handle_exception () at arch/riscv/kernel/entry.S:23
23              csrrw tp, CSR_SCRATCH, tp
(gdb) list
18              /*
19               * If coming from userspace, preserve the user thread pointer and load
20               * the kernel thread pointer.  If we came from the kernel, the scratch
21               * register will contain 0, and we should continue on the current TP.
22               */
23              csrrw tp, CSR_SCRATCH, tp
24              bnez tp, _save_context
25
26      _restore_kernel_tpsp:
27              csrr tp, CSR_SCRATCH
(gdb) bt
#0  handle_exception () at arch/riscv/kernel/entry.S:23
#1  0xc021bc84 in setup_vm_final () at arch/riscv/mm/init.c:1316
#2  paging_init () at arch/riscv/mm/init.c:1486
Backtrace stopped: Cannot access memory at address 0xc029defc
[2:21 PM]JZJ: The game begins 
[7:53 PM]JZJ: Alrighty so after removing most SBI calls it's very unlikely the issue is firmware related
[7:53 PM]JZJ: I've tried playing with the kernel config, no dice
[7:54 PM]JZJ: I've messed with the device tree to include itself in the memory accessible to linux, no dice
[7:54 PM]JZJ: I've manually stepped through page table translation and can't find any issues. The page tables just seem messed up for whatever reason
[8:01 PM]JZJ: WHYYYYYYYY
[8:10 PM]JZJ: Honestly at this point I think my next step is to try out qemu and see if the page tables are sane there
[8:10 PM]JZJ: I really really hope some instruction isn't subtly implemented wrong and causing corruption all of this time
[5:38 PM]nickchan: Yeah trying in a different emulator is a good idea
[10:24 PM]JZJ: Trying out the m-mode kernel with a modified device tree (only 32 megs of RAM given to the kernel, from 0xC2000000 onward, and not inclusive of the memory the kernel itself is contained in):
IRVE> RVSEMIHOSTING: "[    0.000000] Oops - store (or AMO) access fault [#1]\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0+ #1\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] epc : __memset+0xd0/0x110\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  ra : memmap_init_range+0x18c/0x214\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] epc : c01eb704 ra : c0211784 sp : c027be10\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  gp : c02f72d8 tp : c027d340 t0 : c3ffffa0\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  t1 : 00000001 t2 : ffffffff s0 : 000c2068\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  s1 : c4000000 a0 : c4000004 a1 : 00000000\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  a2 : 0000001c a3 : c4000020 a4 : 00000064\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  a5 : c01eb704 a6 : c02144e4 a7 : 0000000c\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  s2 : 000c4000 s3 : 00000000 s4 : 00000000\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  s5 : 00000001 s6 : 00004000 s7 : 00000000\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  s8 : 00000001 s9 : ffffffff s10: 07f40000\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  s11: c02f844c t3 : c02144dc t4 : 00000001\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]  t5 : 00000000 t6 : 02000000\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] status: 00001800 badaddr: 00000000 cause: 00000007\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] [<c01eb704>] __memset+0xd0/0x110\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] [<c0203838>] free_area_init+0xbcc/0xc9c\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] [<c01fdc70>] misc_mem_init+0x28/0x54\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] [<c01fd548>] setup_arch+0xd4/0x518\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] Code: a823 04b2 aa23 04b2 ac23 04b2 ae23 04b2 a023 06b2 (a223) 06b2 \r\n"
[10:25 PM]JZJ: Weird because even though the device tree is wrong (intentionally), I'm not lying to the kernel and telling it there's memory where there isn't any
[10:25 PM]JZJ: You can see the argument a0 to the memset call is: c4000004. Why the hell would the kernel be writing to addresses past the end of memory?
[10:26 PM]JZJ: It clearly knows where memory stops and starts:
IRVE> RVSEMIHOSTING: "[    0.000000] Zone ranges:\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]   Normal   [mem 0x00000000c2000000-0x00000000c3ffffff]\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] Movable zone start for each node\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] Early memory node ranges\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000]   node   0: [mem 0x00000000c2000000-0x00000000c3ffffff]\r\n"
IRVE> RVSEMIHOSTING: "[    0.000000] Initmem setup node 0 [mem 0x00000000c2000000-0x00000000c3ffffff]\r\n"

[10:26 PM]JZJ: I wonder if this is the same issue that the s-mode kernel was facing, but we just didn't notice until now?
[10:27 PM]JZJ: Maybe this truly is an instruction being implemented incorrectly somewhere...
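
As a quick sanity check on the oops above, here is a minimal sketch that compares a few of the dumped register values against the memory node the kernel reported. The bounds and register values are copied from the logs; the helper name `describe` is made up for illustration.

```python
# Memory node reported by the kernel log (inclusive bounds).
MEM_START = 0xC2000000
MEM_END = 0xC3FFFFFF

def describe(addr):
    """Say where addr falls relative to the reported memory node."""
    if addr < MEM_START:
        return f"0x{addr:08x}: {MEM_START - addr} bytes below the node"
    if addr > MEM_END:
        return f"0x{addr:08x}: {addr - MEM_END} bytes past the last valid byte"
    return f"0x{addr:08x}: inside the node"

# Register values taken from the oops above.
for name, value in [("t0", 0xC3FFFFA0), ("s1", 0xC4000000), ("a0", 0xC4000004)]:
    print(name, describe(value))
```

The node spans exactly 32 MiB, so anything at or above 0xC4000000 is outside what the device tree handed to the kernel.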
JZJisawesome commented 5 months ago

Next step is trying out QEMU, actually this time.

Failing that, my plan is to get the RISCV arch tests up and running again, then carefully compare the spec against our instructions to try to find bugs / ways to break things

JZJisawesome commented 5 months ago

Here's the output of qemu-system-riscv32 -nographic -machine virt -kernel arch/riscv/boot/Image -bios /home/jzj/Downloads/opensbi-1.4-rv-bin/share/opensbi/ilp32/generic/firmware/fw_dynamic.bin -m 500M -s -S using our kernel fork for IRVE:

OpenSBI v1.4
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name             : riscv-virtio,qemu
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : aclint-mswi
Platform Timer Device     : aclint-mtimer @ 10000000Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : syscon-reboot
Platform Shutdown Device  : syscon-poweroff
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 319 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 63 KB
Firmware Heap Offset      : 0x47000
Firmware Heap Size        : 35 KB (total), 2 KB (reserved), 9 KB (used), 24 KB (free)
Firmware Scratch Size     : 4096 B (total), 184 B (used), 3912 B (free)
Runtime SBI Version       : 2.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x00100000-0x00100fff M: (I,R,W) S/U: (R,W)
Domain0 Region01          : 0x10000000-0x10000fff M: (I,R,W) S/U: (R,W)
Domain0 Region02          : 0x02000000-0x0200ffff M: (I,R,W) S/U: ()
Domain0 Region03          : 0x80040000-0x8004ffff M: (R,W) S/U: ()
Domain0 Region04          : 0x80000000-0x8003ffff M: (R,X) S/U: ()
Domain0 Region05          : 0x0c400000-0x0c5fffff M: (I,R,W) S/U: (R,W)
Domain0 Region06          : 0x0c000000-0x0c3fffff M: (I,R,W) S/U: (R,W)
Domain0 Region07          : 0x00000000-0xffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x80400000
Domain0 Next Arg1         : 0x9f200000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

Boot HART ID              : 0
Boot HART Domain          : root
Boot HART Priv Version    : v1.12
Boot HART Base ISA        : rv32imafdch
Boot HART ISA Extensions  : sstc,zicntr,zihpm
Boot HART PMP Count       : 16
Boot HART PMP Granularity : 2 bits
Boot HART PMP Address Bits: 32
Boot HART MHPM Info       : 16 (0x0007fff8)
Boot HART MIDELEG         : 0x00001666
Boot HART MEDELEG         : 0x00f0b509
[    0.000000] Linux version 6.6.0+ (jzj@aurora) (riscv32-unknown-linux-gnu-gcc () 13.2.0, GNU ld (GNU Binutils) 2.42) #3 Mon Feb  5 21:53:18 EST 2024
[    0.000000] With tweaks for the Angry Goose Initiative
[    0.000000] I'm running on IRVE, how do you do?
[    0.000000] random: crng init done
[    0.000000] OF: fdt: Ignoring memory range 0x80000000 - 0x80400000
[    0.000000] Machine model: riscv-virtio,qemu
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x1 Version=0x10004
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[    0.000000] printk: bootconsole [sbi0] enabled
[    0.000000] printk: debug: skip boot console de-registration.
[    0.000000] OF: reserved mem: 0x80000000..0x8003ffff (256 KiB) nomap non-reusable mmode_resv1@80000000
[    0.000000] OF: reserved mem: 0x80040000..0x8004ffff (64 KiB) nomap non-reusable mmode_resv0@80040000
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000080400000-0x000000009f3fffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080400000-0x000000009f3fffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080400000-0x000000009f3fffff]
[    0.000000] riscv: base ISA extensions 
[    0.000000] riscv: ELF capabilities 
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0 
[    0.000000] Kernel command line: earlycon=sbi keep_bootcon
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 125984
[    0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[    0.000000] Memory: 500880K/507904K available (1418K kernel code, 501K rwdata, 362K rodata, 101K init, 220K bss, 7024K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: plic@c000000: mapped 96 interrupts with 1 handlers for 2 contexts.
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
[    0.000118] sched_clock: 64 bits at 10MHz, resolution 100ns, wraps every 4398046511100ns
[    0.003652] Console: colour dummy device 80x25
[    0.004217] printk: console [tty0] enabled
[    0.006743] Calibrating delay loop (skipped), value calculated using timer frequency.. 20.00 BogoMIPS (lpj=40000)
[    0.007683] pid_max: default: 32768 minimum: 301
[    0.008560] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.008963] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.024364] ASID allocator using 9 bits (512 entries)
[    0.029860] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.030646] futex hash table entries: 256 (order: -1, 3072 bytes, linear)
[    0.045012] clocksource: Switched to clocksource riscv_clocksource
[    0.055143] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.063272] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.067487] syscon-poweroff poweroff: pm_power_off already claimed for sbi_srst_power_off
[    0.068354] syscon-poweroff: probe of poweroff failed with error -16
[    0.077613] clk: Disabling unused clocks
[    0.082039] List of all partitions:
[    0.082387] No filesystem could mount root, tried: 
[    0.082405] 
[    0.082834] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    0.083360] CPU: 0 PID: 1 Comm: swapper Not tainted 6.6.0+ #3
[    0.083826] Hardware name: riscv-virtio,qemu (DT)
[    0.084135] Call Trace:
[    0.084362] [<c00028f0>] walk_stackframe+0x0/0x7a
[    0.084734] [<c015eb9e>] dump_stack_lvl+0x18/0x2c
[    0.085059] [<c015aaf4>] panic+0xd6/0x250
[    0.085698] [<c0164946>] mount_root_generic+0x1ea/0x1ee
[    0.086089] [<c0164c5a>] prepare_namespace+0x1b2/0x1ea
[    0.086453] [<c015f822>] rest_init+0x96/0x9a
[    0.086783] [<c015f838>] kernel_init+0x12/0xd8
[    0.087098] [<c015f822>] rest_init+0x96/0x9a
[    0.087442] [<c0001ad2>] ret_from_fork+0x6/0x1c
[    0.088101] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---

Adding -s -S to the command lets us connect to qemu using GDB, so I'll do that next and compare page tables

JZJisawesome commented 5 months ago

Qemu page table:

$2 = {0x0 <repeats 630 times>, 0x201c2c01, 0x27c000e7, 0x0 <repeats 136 times>, 0x201000ef, 0x202000ef, 
  0x203000ef, 0x204000ef, 0x205000ef, 0x206000ef, 0x207000ef, 0x208000ef, 0x209000ef, 0x20a000ef, 0x20b000ef, 
  0x20c000ef, 0x20d000ef, 0x20e000ef, 0x20f000ef, 0x210000ef, 0x211000ef, 0x212000ef, 0x213000ef, 0x214000ef, 
  0x215000ef, 0x216000ef, 0x217000ef, 0x218000ef, 0x219000ef, 0x21a000ef, 0x21b000ef, 0x21c000ef, 0x21d000ef, 
  0x21e000ef, 0x21f000ef, 0x220000ef, 0x221000ef, 0x222000ef, 0x223000ef, 0x224000ef, 0x225000ef, 0x226000ef, 
  0x227000ef, 0x228000ef, 0x229000ef, 0x22a000ef, 0x22b000ef, 0x22c000ef, 0x22d000ef, 0x22e000ef, 0x22f000ef, 
  0x230000ef, 0x231000ef, 0x232000ef, 0x233000ef, 0x234000ef, 0x235000ef, 0x236000ef, 0x237000ef, 0x238000ef, 
  0x239000ef, 0x23a000ef, 0x23b000ef, 0x23c000ef, 0x23d000ef, 0x23e000ef, 0x23f000ef, 0x240000ef, 0x241000ef, 
  0x242000ef, 0x243000ef, 0x244000ef, 0x245000ef, 0x246000ef, 0x247000ef, 0x248000ef, 0x249000ef, 0x24a000ef, 
  0x24b000ef, 0x24c000ef, 0x24d000ef, 0x24e000ef, 0x24f000ef, 0x250000ef, 0x251000ef, 0x252000ef, 0x253000ef, 
  0x254000ef, 0x255000ef, 0x256000ef, 0x257000ef, 0x258000ef, 0x259000ef, 0x25a000ef, 0x25b000ef, 0x25c000ef, 
  0x25d000ef, 0x25e000ef, 0x25f000ef, 0x260000ef, 0x261000ef, 0x262000ef, 0x263000ef, 0x264000ef, 0x265000ef, 
  0x266000ef, 0x267000ef, 0x268000ef, 0x269000ef, 0x26a000ef, 0x26b000ef, 0x26c000ef, 0x26d000ef, 0x26e000ef, 
  0x26f000ef, 0x270000ef, 0x271000ef, 0x272000ef, 0x273000ef, 0x274000ef, 0x275000ef, 0x276000ef, 0x277000ef, 
  0x278000ef, 0x279000ef, 0x27a000ef, 0x27b000ef, 0x27c000ef, 0x0 <repeats 132 times>}

Note that we do expect some difference in the higher order bits compared with IRVE because in QEMU the kernel is physically located around 0x80000000 in memory whereas for us it is around 0xC0000000.
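
To make the dump easier to read, here is a rough sketch that decodes a few of the entries, assuming the array is the 1024-entry Sv32 root page table and using the PTE layout from the RISC-V privileged spec (PPN in bits 31:10, flags V/R/W/X/U/G/A/D in bits 0 to 7). The helper names are made up for illustration, and the indices are counted from the `<repeats N times>` runs in the dump.

```python
# Sv32 PTE layout per the RISC-V privileged spec: PPN in bits 31:10, flags in bits 7:0.
FLAG_NAMES = "VRWXUGAD"  # bit 0 (V) through bit 7 (D)

def decode_sv32_pte(pte):
    """Split a 32-bit Sv32 PTE into its physical base address and flag letters."""
    flags = "".join(name for bit, name in enumerate(FLAG_NAMES) if (pte >> bit) & 1)
    phys = (pte >> 10) << 12  # PPN * 4 KiB (can be up to 34 bits wide)
    # V set with R, W and X all clear means "pointer to the next-level table".
    is_leaf = bool(pte & 0b1110)
    return phys, flags, is_leaf

def root_index_to_va(index):
    """Each root-table entry covers a 4 MiB slice of the virtual address space."""
    return index << 22

# (root index, PTE) pairs read off the QEMU dump above.
samples = [(630, 0x201C2C01), (631, 0x27C000E7), (768, 0x201000EF), (891, 0x27C000EF)]
for index, pte in samples:
    phys, flags, is_leaf = decode_sv32_pte(pte)
    kind = "megapage" if is_leaf else "next-level table"
    print(f"[{index}] va=0x{root_index_to_va(index):08x} -> pa=0x{phys:08x} {flags:8} {kind}")
```

If this reading is right, index 768 maps virtual 0xC0000000 to physical 0x80400000 as a 4 MiB megapage, which lines up with the start of the Normal zone in the QEMU boot log.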

JZJisawesome commented 5 months ago

Wow, IRVE is a lot different: $2 = {0x0 <repeats 630 times>, 0x300c2c01, 0xe7, 0x0 <repeats 392 times>}

Note that unlike QEMU there is no option to print physical addresses, so it's possible that this location in physical memory is different? I should look into this...

The fact though that it is also <repeats 630 times> is suspicious, implying that this is in fact IRVE's page table, but just not at all what we would expect...
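
For comparing dumps like these by index rather than by eye, here is a small sketch that expands GDB's `<repeats N times>` notation into a flat list and prints the non-zero entries. It assumes the array is the 1024-entry Sv32 root table, so that each index covers 4 MiB of virtual address space; `expand_gdb_array` is a made-up helper, not a GDB feature.

```python
import re

def expand_gdb_array(text):
    """Expand GDB's '{a, b <repeats N times>, ...}' array printout into a list of ints."""
    body = text[text.index("{") + 1 : text.rindex("}")]
    values = []
    for item in body.split(","):
        match = re.fullmatch(r"\s*(0x[0-9a-fA-F]+)(?:\s*<repeats (\d+) times>)?\s*", item)
        if match is None:
            raise ValueError(f"unrecognised element: {item!r}")
        values.extend([int(match.group(1), 16)] * int(match.group(2) or 1))
    return values

# The IRVE dump quoted above.
irve_dump = "$2 = {0x0 <repeats 630 times>, 0x300c2c01, 0xe7, 0x0 <repeats 392 times>}"
table = expand_gdb_array(irve_dump)

print(len(table))  # number of root-table entries in the dump
for index, pte in enumerate(table):
    if pte:
        print(f"[{index}] va=0x{index << 22:08x} pte=0x{pte:08x}")
print(any(table[768:]))  # is anything mapped from virtual 0xC0000000 upward?
```

Run on the IRVE dump, this shows entries only at indices 630 and 631 and nothing from index 768 upward, which is where the run of `...ef` entries begins in the QEMU dump.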

Should we try moving memory around to load the kernel at 0x80000000 as well for better comparison?

JZJisawesome commented 5 months ago

Moving the kernel to 0x80000000 in physical memory actually fixed the issue! FINALLY after literally months I've found a fix!

No idea why the hell the kernel would care about where it is located in physical memory, given it will remap itself to the top of the virtual address space in normal operation.

PR #51 https://github.com/angry-goose-initiative/irve/pull/51 implements this change!