cnlohr / mini-rv32ima

A tiny C header-only risc-v emulator.
MIT License
1.68k stars 137 forks source link

vi with no filename faults with access fault #17

Open ipacman opened 1 year ago

ipacman commented 1 year ago

After $ make testdlimage I login to Linux (amazing!) and then attempt to run vi: ~ # vi 11.662352] Oops - load access fault [#1] [ 11.662595] CPU: 0 PID: 29 Comm: vi Not tainted 5.18.0 #7 [ 11.662919] Hardware name: riscv-minimal-nommu,qemu (DT) [ 11.663180] epc : 0x800d9088 [ 11.663341] ra : 0x80081f20 [ 11.663496] epc : 800d9088 ra : 80081f20 sp : 807a1e90 [ 11.663860] gp : 802cf420 tp : 807aa000 t0 : 807a1dfc [ 11.664181] t1 : 809af010 t2 : 80998c24 s0 : 807a1ea0 [ 11.664505] s1 : 809af000 a0 : 00000ff0 a1 : 00000000 [ 11.664808] a2 : 00000ff0 a3 : 00000000 a4 : 00000122 [ 11.665091] a5 : 00000ff0 a6 : 00000ff0 a7 : 00000000 [ 11.665374] s2 : 00000000 s3 : 807a1f38 s4 : 00000000 [ 11.665656] s5 : 00000000 s6 : 802d0000 s7 : 00001000 [ 11.665939] s8 : 00000000 s9 : 809aee78 s10: 0000000e [ 11.666284] s11: 00000000 t3 : fefefeff t4 : 80808080 [ 11.666588] t5 : 00000800 t6 : 00000025 [ 11.666825] status: 00001880 badaddr: 00000001 cause: 00000005 [ 11.667207] ---[ end trace 0000000000000000 ]--- Segmentation fault ~ # This does not happen if a filename is specified, e.g.

vi x.x

cnlohr commented 1 year ago

Notes for debugging this:

Weird the crash appears to be in csum_and_copy_from_iter but, called from __register_chrdev maybe? can you check to see if it's the same for you? The process to debug this is:

~/git/mini-rv32ima$ buildroot/output/host/bin/riscv32-buildroot-linux-uclibc-objdump -S buildroot/output/build/linux-5.19/vmlinux.o > kl.txt

convert the address of epc from 0x800d9088 to 0x000d9088 by subtracting 0x80000000. Do the same for ra.

Side-note: I can repro, but vi still generally works.

Hmm

ipacman commented 1 year ago

In my case, there is no buildroot directory since I cheated and used your DownloadedImage. Happy to $ make everything ..which I assume generates the tools and vmlinux.o file. But in that case, possibly a different address if not the exact same version as your DownloadedImage.

cnlohr commented 1 year ago

Correct. I guess I'm just saying I don't know if there's a quick avenue to solving this and it may be a bug in busybox or the kernel?

ipacman commented 1 year ago

The default buildroot configs do not appear to select vi. The configs/buildroot_config points at busybox-minimal.config. Wondering what the 'DownloadedImage' used for busybox config? There are several busybox config files in the source tree but guessing its the one in config/busybox_config. If so, the next step is to reconfigure buildroot with it to include vi, build it and see if it faults.. At least then all the addresses should match the source..

regymm commented 1 year ago

I'm also kinda interested in this problem... As vi has been crashing all the time in my attempts(no matter on early version kernel with manually patched flat binary loader, QEMU, or real hardware).

ipacman commented 1 year ago

Ok, in order to restore vi into the built image, I changed line 498 of configs/buildroot_config from: BR2_PACKAGE_BUSYBOX_CONFIG="package/busybox/busybox-minimal.config" to BR2_PACKAGE_BUSYBOX_CONFIG="busybox_config" Then did a "make clean" inside buildroot (and removed the copied config files from there for good measure) then did a "build all" from your root directory.

I then confirmed vi is now in the image and I get the same error (at a different address).

133.742286] Oops - load access fault [#1] [ 133.742477] CPU: 0 PID: 28 Comm: vi Not tainted 5.19.0 #3 [ 133.742711] Hardware name: riscv-minimal-nommu,qemu (DT) [ 133.742870] epc : 0x800e7cd8 [ 133.743006] ra : 0x8008d4ec [ 133.743127] epc : 800e7cd8 ra : 8008d4ec sp : 80cbbe90 [ 133.743341] gp : 8033c9d0 tp : 805b0b00 t0 : 000000e0 [ 133.743539] t1 : 80440010 t2 : 00000037 s0 : 80cbbea0 [ 133.743756] s1 : 80440000 a0 : 80440010 a1 : 00000000 [ 133.743962] a2 : 00000ff0 a3 : 00000000 a4 : 00000ff0 [ 133.744170] a5 : 00000ff0 a6 : 00000000 a7 : 00000000 [ 133.744378] s2 : 00000000 s3 : 80cbbf30 s4 : 00000000 [ 133.744586] s5 : 00000000 s6 : 8033d000 s7 : 00001000 [ 133.744808] s8 : 00000000 s9 : 80dabe78 s10: 0000000e [ 133.745030] s11: 80cb6008 t3 : fefefeff t4 : 80808080 [ 133.745252] t5 : 00000001 t6 : 00000800 [ 133.745429] status: 00001880 badaddr: 00000001 cause: 00000005 [ 133.745681] ---[ end trace 0000000000000000 ]--- Then I did the objdump command, which regretfully does not include source code despite -S option.

At epc we see:

000e7ccc <.L754>: e7ccc: 00400713 li a4,4 e7cd0: e6e790e3 bne a5,a4,e7b30 <.L750> e7cd4: 00492703 lw a4,4(s2) e7cd8: 01092b83 lw s7,16(s2) e7cdc: 01492783 lw a5,20(s2) e7ce0: 00001ab7 lui s5,0x1 which is indeed deep within "csum_and_copy_from_iter", but with suspicious value for s2 (0)

At ra, we see: 0008d4c0 <__register_chrdev>: 8d4c0: fd010113 addi sp,sp,-48 8d4c4: 02812423 sw s0,40(sp) 8d4c8: 02912223 sw s1,36(sp) 8d4cc: 01312e23 sw s3,28(sp) 8d4d0: 01412c23 sw s4,24(sp) 8d4d4: 01512a23 sw s5,20(sp) 8d4d8: 01612823 sw s6,16(sp) 8d4dc: 01712623 sw s7,12(sp) 8d4e0: 02112623 sw ra,44(sp) 8d4e4: 03212023 sw s2,32(sp) 8d4e8: 03010413 addi s0,sp,48 8d4ec: 00050b13 mv s6,a0 8d4f0: 00058a13 mv s4,a1 8d4f4: 00060a93 mv s5,a2 8d4f8: 00068b93 mv s7,a3 8d4fc: 00070993 mv s3,a4 8d500: 00000097 auipc ra,0x0 8d504: 000080e7 jalr ra # 8d500 <__register_chrdev+0x40> 8d508: fffff7b7 lui a5,0xfffff 8d50c: 00050493 mv s1,a0 8d510: 02a7fc63 bgeu a5,a0,8d548 <.L149> 8d514: 00050993 mv s3,a0

which makes no sense to me since there is no jal near there. Does sp look ok? Unless this is obvious, a trace or breakpoint capability would be handy.

The kernel dump is here

cnlohr commented 1 year ago

I'm glad that we are at least seeing consistent faults.

cnlohr commented 1 year ago

I am wondering if we should make the kernel configurations non-relocatable, so that all the addresses will line up perfectly, like so no address translation has to be done between the pc and the dump files.

ipacman commented 1 year ago

I verified that vi initializes without a filename with -H option. That means it loads ok and thinking something with the filename parameter.

So, did a quick look in buildroot/output/busybox-1.35.0/editors/vi.c to see whats different about initializing with a filename vs without one and thought there might be some tricky stuff associated with the option ENABLE_FEATURE_VI_ASK_TERMINAL owing to the comment

if ENABLE_FEATURE_VI_ASK_TERMINAL

if (G.get_rowcol_error /* **TODO? && no input on stdin** */) {
    uint64_t k;
    write1(ESC"[999;999H" ESC"[6n");
    fflush_all();
    k = read_key(STDIN_FILENO, readbuffer, /*timeout_ms:*/ 100);
    if ((int32_t)k == KEYCODE_CURSOR_POS) {
        uint32_t rc = (k >> 32);
        columns = (rc & 0x7fff);
        if (columns > MAX_SCR_COLS)
            columns = MAX_SCR_COLS;
        rows = ((rc >> 16) & 0x7fff);
        if (rows > MAX_SCR_ROWS)
            rows = MAX_SCR_ROWS;
    }
}

endif

So, I disabled the corresponding FEATURE in busybox_config (which turns off this ENABLE) and recompiled but with no improvement. Sad face.

A few lines later, a call is made to init_text_buffer(fn) where fn is presumably NULL. That function does a system call to "open" with first parameter null, but I think thats ok. It should just return with an error, which is tested.

So nothing obvious by inspection.

Otherwise, I'm trying to get my bearings on how to debug something like this. I'm guessing that busybox invocation is relocated into available memory (given no mmu) and that executable periodically calls into kernel functions, one of which is failing. So this fault is happening within a kernel call after busybox is loaded. Is there any way to predict where busybox is loading to so that the simulator can be edited to single step once it reaches that address? Maybe there is a known address when the shell finishes loading some executable (like another copy of busybox). From there, its fairly straightforward to get a busybox compile time assembly listing, but org'd at 0 of course.

The alternative is to start dbg-printing within busybox to better isolate which function is causing the problem. Basically, do a better job than visual inspection.

I personally dont think relocating the kernel helps so much since its not too hard to predict the address in either domain.

ipacman commented 1 year ago

I've located the location of the trap within the kernel function "strncpy_from_user" with the following call stack:

do_se_sys_openat() do_sys_open() do_sys_openat2() getname() getname_flags() strncpy_from_user():

000f4e4c : f4e4c: ff010113 addi sp,sp,-16 f4e50: 00812623 sw s0,12(sp) f4e54: 01010413 addi s0,sp,16 f4e58: 16c05663 blez a2,f4fc4 <.L20> f4e5c: fff00793 li a5,-1 f4e60: 00f59a63 bne a1,a5,f4e74 <.L3>

000f4e64 <.L19>: f4e64: ff200513 li a0,-14

000f4e68 <.L1>: f4e68: 00c12403 lw s0,12(sp) f4e6c: 01010113 addi sp,sp,16 f4e70: 00008067 ret

000f4e74 <.L3>: f4e74: fff5c793 not a5,a1 f4e78: 00050313 mv t1,a0 f4e7c: 00f67463 bgeu a2,a5,f4e84 <.L4> f4e80: 00060793 mv a5,a2

000f4e84 <.L4>: f4e84: 0065e833 or a6,a1,t1 f4e88: 00387813 andi a6,a6,3 f4e8c: 10081663 bnez a6,f4f98 <.L21> f4e90: feff0e37 lui t3,0xfeff0 f4e94: 80808eb7 lui t4,0x80808 f4e98: ffc7f713 andi a4,a5,-4 f4e9c: effe0e13 addi t3,t3,-257 # fefefeff <.L6^B1+0xfee7bdcb> f4ea0: 080e8e93 addi t4,t4,128 # 80808080 <.L6^B1+0x80693f4c>

000f4ea4 <.L6>: f4ea4: 01071c63 bne a4,a6,f4ebc <.L15> f4ea8: 40e787b3 sub a5,a5,a4

000f4eac <.L5>: f4eac: 00f70533 add a0,a4,a5

000f4eb0 <.L16>: f4eb0: 0ee51863 bne a0,a4,f4fa0 <.L18> f4eb4: fac57ae3 bgeu a0,a2,f4e68 <.L1> f4eb8: fadff06f j f4e64 <.L19>

000f4ebc <.L15>: f4ebc: 010588b3 add a7,a1,a6 f4ec0: 0018c683 lbu a3,1(a7) <=================================== HERE =============== f4ec4: 0008c503 lbu a0,0(a7)

The technique for tracing this was: 1 -- note the epc in the trap ~ # vi[ 4371.212189] Oops - load access fault [#2] [ 4371.212560] CPU: 0 PID: 35 Comm: vi Tainted: G D 5.19.0 #5 [ 4371.213121] Hardware name: riscv-minimal-nommu,qemu (DT) [ 4371.213524] epc : 0x800e7cd8 2 -- find the function containing this address from System.map (Linux kernel map file) 3 -- search for this function in the objdump file, whose addresses are not absolute 4 -- modify the MiniRV32IMAStep() function to stop at the desired address, e.g. if ( ofs_pc == 0xe7cd8 ) { // fault addr, lets see who calls it in ra DumpState(state, image); } ir = MINIRV32_LOAD4( ofs_pc ); 5 -- examine ra, and repeat the process to go up the call stack, using ms code debugger to double check the best stop address based on source code (want to look near the top of calling functions prior to overwriting ra). As an example, we see at the faulting address: (uint32_t) (image+0xe7cd8),h 0x18c683 This matches the objdump file so we are probably in the right place.. f4ec0: 0018c683 lbu a3,1(a7) <=================================== HERE =============== Going back to an original post, this most likely is caused by a call to open() within busybox with null file name,

busybox function: file_insert() which calls c lib function open with fn=NULL fd = open(fn, O_RDONLY);

It should be legal to do that but maybe some RISCV adaption of the open system call has a problem with this?

But its hard to confirm since busybox is in userspace and we dont have gdb there.

ipacman commented 1 year ago

Added some traceback code to mini-rv32 and traced back from the fault into userspace.

The fault is triggered starting inside busybox vi.c file_insert() function at the call to fopen: fd = open(fn, O_RDONLY); Since there is no filename when opening vi in this way, fn is set to NULL, but this should not result in a fault (or it doesnt for regular linux).

The problem is that the current nommu configuration of linux (with CONFIG_UACCESS_MEMCPY) does not check for invalid pointers.

In this case, __get_user_fn() in linux-5.19/include/asm-generic/uaccess.h happily fetches from 0 which is causing the fault.

This can easily be fixed by adding a line to the top of the function (after BUILD_BUG_ON line):

if ((from == NULL) || (to == NULL) { return 1; }

which is not a good solution for several reasons: (1) its not specific to this config, (2) it doesnt check any other invalid pointer range, (3) since its inline this adds for every instance of it, and I count 48 in the kernel and (4) there is a put version of this function that is unresolved. So its just a test hack.

I reason that the 2nd point is maybe ok because one would need to purposefully override the char * type to send in a filename of value integer "1", for example, whereas setting a string to NULL is common practice.

Anyway, noticed other problems while working thru this. For example, when I used vi to actually write a file x.x with a couple of test characters, it saved it with a file size of 34359738368 and hangs busybox if you attempt to 'cat' the file (for example). I'm double checking this happens with the unmodified version..

cnlohr commented 1 year ago

Do you know if there is a way of addressing this in the emulator? I.e. a different way to throw an access violation?