Meziu / rust-horizon

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
3 stars 1 forks source link

Crash on panic! #2

Closed Meziu closed 2 years ago

Meziu commented 2 years ago

From testing the network example, i noticed that panic!, unwraps and anything that calls a panic in the Rust runtime crashes the program as a translation - error data abort in Luma3DS. This is unintended behaviour, and my previous work on the std should have fixed it. It came back only now, possibly because of changes in the std code. @AzureMarker, if you want to help, this is pretty much the priority right now.

Meziu commented 2 years ago

@AzureMarker just for info, are you able to run GDB with the Luma3DS stub? I'm trying in every way to run the hello-world-3ds example with gdb but I can't get it to work anymore... If you do, please tell me your exact procedure :smile:

AzureMarker commented 2 years ago

I did notice the panic data abort thing in debug builds, but release builds don't seem to have the problem. The release build behavior is that the error message doesn't stay up and the program immediately exits. Is the message supposed to stay visible, perhaps until a button is pressed?

GDB:

  1. Run homebrew launcher
  2. Open Luma3DS menu with L + Down + Select
  3. Debugger options -> Enable debugger + Force-debug next application at launch
  4. Close Luma3DS menu
  5. Run application either through SD card or 3dslink
  6. Open gdb with gdb -tui <elf file path>. Example: gdb -tui target/armv6k-nintendo-3ds/debug/n3ds-controller.elf
  7. Run command target remote <3DS ip address>:4003 to connect to the 3DS gdb.
  8. Continue debugging like normal.

The GDB stub is pretty slow and sometimes/eventually crashes for me. It crashes sooner when I try to use CLion's remote debug feature, so I've just been using command line for more stability.

AzureMarker commented 2 years ago

Right when it calls panic_impl it gets a segfault: image image

Meziu commented 2 years ago

I did notice the panic data abort thing in debug builds, but release builds don't seem to have the problem. The release build behavior is that the error message doesn't stay up and the program immediately exits. Is the message supposed to stay visible, perhaps until a button is pressed?

Sadly we can't put ctru functions in the std, so buttons aren't a possibility. Maybe just a std::thread::sleep(Duration::from_secs(5)) to read the message or something.

Meziu commented 2 years ago
6. Open gdb with `gdb -tui <elf file path>`. Example: `gdb -tui target/armv6k-nintendo-3ds/debug/n3ds-controller.elf`

Wait, is that a standard gdb or just an alias? arm-none-eabi-gdb doesn't support tui afaik.

Edit: wow, gdb actually supports the arch. Didn't know.

AzureMarker commented 2 years ago

Yeah, just normal gdb works since it's using a remote server for the heavy lifting.

Also, set substitute-path is handy for debugging libctru, since you probably didn't compile it locally: https://sourceware.org/gdb/onlinedocs/gdb/Source-Path.html#set-substitute_002dpath

Meziu commented 2 years ago
#0  0x0012012c in newRow ()
    at /home/fincs/pacman-packages/libctru/src/libctru-2.0.1/libctru/source/console.c:627
#1  0x0011f488 in consolePrintChar (c=10)
    at /home/fincs/pacman-packages/libctru/src/libctru-2.0.1/libctru/source/console.c:743
#2  consolePrintChar (c=10)
    at /home/fincs/pacman-packages/libctru/src/libctru-2.0.1/libctru/source/console.c:700
#3  0x0011f58c in con_write (r=<optimized out>, fd=<optimized out>, len=<optimized out>, ptr=<optimized out>)
    at /home/fincs/pacman-packages/libctru/src/libctru-2.0.1/libctru/source/console.c:476
#4  con_write (r=<optimized out>, fd=<optimized out>, ptr=<optimized out>, len=<optimized out>)
    at /home/fincs/pacman-packages/libctru/src/libctru-2.0.1/libctru/source/console.c:206
#5  0x0013a1e0 in <std::io::Write::write_fmt::Adapter<T> as core::fmt::Write>::write_str ()
#6  0x00135a54 in std::backtrace_rs::print::BacktraceFrameFmt::print_raw_with_column ()
#7  0x0013c518 in std::sys_common::backtrace::_print_fmt::_$u7b$$u7b$closure$u7d$$u7d$::h11c5d2ac13daa081 ()
#8  0x00137e68 in std::backtrace_rs::backtrace::libunwind::trace::trace_fn ()
#9  0x001271f0 in __gnu_Unwind_Backtrace ()
#10 0x001006d4 in ___Unwind_Backtrace ()
#11 0x00131880 in <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt ()
#12 0x0014834c in core::fmt::write ()
#13 0x00158520 in std::io::Write::write_fmt ()
#14 0x00162a54 in std::panicking::default_hook::{{closure}} ()
#15 0x0016267c in std::panicking::default_hook ()
#16 0x00163020 in std::panicking::rust_panic_with_hook ()

By looking at this backtrace, the obvious guess is that the Console gets dropped after the panic, yet it is used correctly before that line...

AzureMarker commented 2 years ago

If I force the Console to not get dropped via std::mem::forget, it panics with an unknown instruction error: image

Meziu commented 2 years ago
619                     u16 *dst = &currentConsole->frameBuffer[(currentConsole->windowX * 8 * 240) + (239 - (currentConsole->windowY * 8))];
(gdb) p currentConsole->cursorY
$7 = 29
(gdb) p *currentConsole
$8 = {font = {gfx = 0x0, asciiOffset = 0, numChars = 0}, frameBuffer = 0x30000000, cursorX = 28, 
  cursorY = 29, prevCursorX = 17, prevCursorY = 1, consoleWidth = 0, consoleHeight = 1908700, windowX = 41, 
  windowY = 1907224, windowWidth = 1907224, windowHeight = 30, tabSize = 3, fg = 7, bg = 0, flags = 0, 
  PrintChar = 0x0, consoleInitialised = 40}

Alright, looks like the windowY variable goes flying through the roof and the dst calculation goes bad. Those pointers aren't valid and thus the sigsegv.

AzureMarker commented 2 years ago

The current implementation of intrinsics::abort is to invoke an invalid instruction, on most platforms.

The current console pointer points to the box in Console, which might have gotten deallocated by that point.

AzureMarker commented 2 years ago

The backtrace from the abort: image image

Meziu commented 2 years ago

It's weird though, this all seems like it should happen in a release version too. If the problem is really the order of drop and print, then it should print either no message, or all of it, and it should be the same in either debug or release. This I cannot understand

AzureMarker commented 2 years ago

Is there a difference between if it unwinds or aborts in debug vs release? I'm trying to step through the debug build's panic to see if it's triggering a second panic. Otherwise, maybe we need to customize the abort code for the 3DS to call some libctru function instead of executing an undefined instruction?

Meziu commented 2 years ago

I don't think there is a difference. It may even have to do with the threading speed and stuff. Directly using ctru functions is a no go. Maybe the console object doesn't get corrupted in release and thus survives enough to not sigsegv?

Thread 1 hit Hardware watchpoint 2: currentConsole->windowY

Old value = 0
New value = 1907192
0x00128ec0 in _malloc_r ()
(gdb) bt
#0  0x00128ec0 in _malloc_r ()
#1  0x0017d8ec in alloc::alloc::alloc (layout=...)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/alloc/src/alloc.rs:87
#2  0x0017da18 in alloc::alloc::Global::alloc_impl (self=0x1c11bc, layout=..., zeroed=false)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/alloc/src/alloc.rs:169
#3  0x0017db28 in alloc::alloc::{impl#1}::allocate (self=0x1c11bc, layout=...)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/alloc/src/alloc.rs:229
#4  0x0017dbec in alloc::alloc::exchange_malloc (size=12, align=4)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/alloc/src/alloc.rs:318
#5  0x0017e710 in std::thread::local::os::Key<core::cell::Cell<usize>>::try_initialize<core::cell::Cell<usize>> (self=0x1d1fdc <std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::__KEY>, 
    init=0x1780ec <std::panicking::panic_count::LOCAL_PANIC_COUNT::__init>)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/std/src/thread/local.rs:739
#6  0x0017e5b0 in std::thread::local::os::Key<core::cell::Cell<usize>>::get<core::cell::Cell<usize>> (
    self=0x1d1fdc <std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit::__KEY>, 
    init=0x1780ec <std::panicking::panic_count::LOCAL_PANIC_COUNT::__init>)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/std/src/thread/local.rs:721
#7  0x00178124 in std::panicking::panic_count::LOCAL_PANIC_COUNT::__getit ()
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/std/src/thread/local.rs:324
#8  0x0017895c in std::thread::local::LocalKey<core::cell::Cell<usize>>::try_with<core::cell::Cell<usize>, std::panicking::panic_count::decrease::{closure#0}, usize> (self=0x1c0414, f=...)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/std/src/thread/local.rs:411
#9  0x00178498 in std::thread::local::LocalKey<core::cell::Cell<usize>>::with<core::cell::Cell<usize>, std::panicking::panic_count::decrease::{closure#0}, usize> (self=0x1c0414, f=...)
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/std/src/thread/local.rs:388
#10 0x00177cc8 in std::panicking::panic_count::decrease ()
    at /home/andrea/Desktop/ProgrammingStuff/Rust/rust-3ds/rust-horizon/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/src/rust/library/std/src/panicking.rs:276
#11 0x00189de0 in std::panicking::try::cleanup (payload=0x800ccc8)

The trigger for the corruption is the generation of a thread key requested by the panic. Maybe this is the issue...

Meziu commented 2 years ago
    // We consider unwinding to be rare, so mark this function as cold. However,
    // do not mark it no-inline -- that decision is best to leave to the
    // optimizer (in most cases this function is not inlined even as a normal,
    // non-cold function, though, as of the writing of this comment).
    #[cold]
    unsafe fn cleanup(payload: *mut u8) -> Box<dyn Any + Send + 'static> {
        // SAFETY: The whole unsafe block hinges on a correct implementation of
        // the panic handler `__rust_panic_cleanup`. As such we can only
        // assume it returns the correct thing for `Box::from_raw` to work
        // without undefined behavior.
        let obj = unsafe { Box::from_raw(__rust_panic_cleanup(payload)) };
        panic_count::decrease();
        obj
    }

Alright, i read "decision left to the optimizer"...

Meziu commented 2 years ago

Yes, not calling the panic_count::decrease() fixed the bug, though it isn't the best way to fix this...

AzureMarker commented 2 years ago

Have you tried debugging after mem::forgetting the console so the allocation doesn't get dropped? (if this is the issue, we could mem forget the box in the Console impl)

I did that and got this far (see screenshot). I think the panic handler thinks there was more than one panic and therefore aborts (see library/std/src/panicking.rs around line 624). However, I put a breakpoint on rust_panic and only saw it fire once... image

Meziu commented 2 years ago

I don't understand what you are trying to do. Why should we mem::forget the Console? The issue is the deallocation of the console itself (which makes sense, since it goes out of scope). We should make the Console static, so that it may print the panic information even after exiting the main function.

AzureMarker commented 2 years ago

If we mem::forget the Console then the Box<PrintConsole> doesn't get deallocated/dropped. libctru holds a reference to this struct, so deallocating it will cause the corruption during printing that you are seeing. Making it static is another option.

After solving that problem I'm now working on fixing the panic/abort issue.

Meziu commented 2 years ago

But the panic issue is caused by the Console being dropped. I fixed it by not reallocating its memory, and it doesn't panic. I don't know where your "other" panic may be coming from.

AzureMarker commented 2 years ago

There are two "panic" issues. One is caused by the console getting deallocated (segfault), the other is an abort (undefined instruction).

What happens after you stop the console from being deallocated? On my 3DS with a debug build it triggers another ARM 11 exception with type undefined instruction. The previous screenshot shows the undefined instruction in the assembly code. I think this relates to this line in panicking.rs: https://github.com/AzureMarker/rust-horizon/blob/26b3bcb2787681c126a4b21bcdefb4338001f603/library/std/src/panicking.rs#L621

When I debug via Citra I see the panic message printed to the screen before it reaches this undefined instruction.

Meziu commented 2 years ago

I don't "stop the console from being deallocated". I think we are trying 2 different ways. Instead of not deallocating the Console, I stopped the data pointed to by the currentConsole pointer from being changed, basically an invalid pointer to valid memory. This is obviously temporary, and not deallocating/making the Console static would be preferable, but to me, this doesn't result in a panic, and the program exits normally.

AzureMarker commented 2 years ago

What was your fix, not calling panic_count::decrease? I do see that function in the stack trace, so maybe it's triggering a second panic which causes the abort.

Meziu commented 2 years ago

Nope, it doesn't trigger a panic. By "watch"ing the pointer to the currentConsole->windowY, I noticed that call re-allocated the memory previously held by the Console, thus changing the value inside, and screwing up the pointer math in the libctru code (the second screenshot you sent me). By not calling that function, the memory stays intact, and the pointer math is correct, thus making the Console work as intended.

Meziu commented 2 years ago

My guess is that (as written in the comment above that function i sent you earlier) the compiler optimizes away the call, and thus the release version works anyways.

AzureMarker commented 2 years ago

I found the cause of my second panic: attempt to subtract with overflow during panic_count::decrease.

Why would panic_count::decrease touch the Console memory? That seems like the root issue.

Meziu commented 2 years ago

It doesn't touch it, not because it wants to at least. Since the Console is one of the last deallocated objects, my guess is that it is chosen by the malloc as a good spot to allocate (keeping track of previously deallocated sections is an obvious optimisation for memory allocators on the heap). This way, the malloc allocates the memory previously held by the Console and changes it. Still, you seem to have solved it anyways. Since to you it seems the decrease is actually working as intended. Now we should just understand why is it kept at zero even though a panic happens already...

AzureMarker commented 2 years ago

Yeah, using mem::forget should prevent malloc from reusing the Console memory since it isn't ever freed. Using a static variable for the PrintConsole data would result in a similar fix. I'm currently debugging if the release build correctly handles the panic count, and why the debug build has the incorrect panic count.

AzureMarker commented 2 years ago

Did some debugging. For some reason line debuginfo is pretty sparse in std, even in debug mode, which made it hard to debug.

I think debug and release mode have the same general behavior with the panic count, but debug mode panics on underflow/overflow (this is a Rust feature). That causes the crash in debug mode and allows release mode to exit normally.

I'm pretty sure panic_count::increase is called, but it's inlined and hard to verify that it went through without debug info. More debugging is necessary here.

AzureMarker commented 2 years ago

I added a hardware write watchpoint on GLOBAL_PANIC_COUNT and it only gets written two during decrease.

Meziu commented 2 years ago

Yeah, I know. rust_panic_with_hook, while clearly called (by putting a breakpoint it's clear it's called) doesn't have any debug information, and the call to increase() inside seems nonexistent...

Meziu commented 2 years ago

Ok, as a last hope I tried putting an increase into the decrease body. It wasn't inlined and gdb found it among its symbols (and it worked too, since the vars were set to 1 right before the subtration). This means the actual function to have optimized-away the increase call is rust_panic_with_hook. I'll look into a Linux debug example to see if this problem is of our config or what.

Meziu commented 2 years ago

(Using the same source code) Unix build works fine, and the debuginfo on those std functions is complete. There is something wrong with our configuration of the compiler, possibly the optimization level or some parameter.

AzureMarker commented 2 years ago

Ran gdb <elf file> and disassemble/s std::panicking::rust_panic_with_hook. This shows that, at least in the ELF, the increase function is called. image

Meziu commented 2 years ago

Yes, I checked the disassembly too. Still, even if inlined, this doesn't explain why it doesn't work...

AzureMarker commented 2 years ago

Since it's in the disassembly of the ELF, there must be something going wrong in the ELF -> 3DSX conversion.

Meziu commented 2 years ago

Hmmm, is there no way to check in the 3dsx? Debug info is still in there.

Though honestly I don't believe that's the case, if the function didn't run there would be big problems with the validity of the pointers in there. That's why I believe it's the compiler's fault...

AzureMarker commented 2 years ago

That's the weird part. The wiki says there aren't any debug sections (https://www.3dbrew.org/wiki/3DSX_Format) and OpenCTR says there are "No debugging support features" (https://openctr.github.io/3DSX.html) but I get some line debug info when using gdb.

I want to find a second way to verify that the ELF file emitted by the compiler is correct (calls the increase function). Also a second way to verify that it doesn't get translated over correctly to the 3dsx...

I tried running the ELF file directly in Citra since it can load them (and has a GDB stub), but it seems like it ends up with the same issue (including same assembly code?). Maybe I'll try again with a release build + line debug info.

Meziu commented 2 years ago

Yeah i guessed so, the ELF is correct after all. Also, Rust does hold some debug info even in release builds, so checking those is possible, but I believe the result will be the same, with the only difference being not crashing for the subtraction overflow error. This is so insanely weird. As a member of the Rust Zulip chat told me, the call to panic_count::increase shouldn't be inlined anyways, which is correct, thinking at how the function isn't inlined if used from other sections of the code (try adding the call right before the decrease...). It feels like rust_panic_with_hook is some sort of black hole that shows no debug info and has weird properties. Another suggestion I got was to use the rustc_codegen_gcc. While it's not out of the question it may work, it's unstable, requires recompiling gcc (and thus makes it hell to use arm-none-eabi-gcc directly) and it's way out of the objective of this project to make compiling code for the Nintendo 3DS easy and accessible. Besides, something is weird with the compiler, but I couldn't find even one compilation flag, Rust feature, or function configuration to not inline panic_count::increase. Even #[inline(never)] is ignored...

AzureMarker commented 2 years ago

Since the ELF doesn't have the call inlined, I don't think rustc is doing anything there. Maybe there's an issue in the metadata which is confusing 3dsxtool. I also just found the Zulip thread. Looks like we're getting some good help there at least.

Meziu commented 2 years ago

Since the ELF doesn't have the call inlined, I don't think rustc is doing anything there. Maybe there's an issue in the metadata which is confusing 3dsxtool. I also just found the Zulip thread. Looks like we're getting some good help there at least.

Welp, they did clear some paths I had in mind, but to be honest I can't ask if I don't even know where to search. Still, to me gdb doesn't show the symbol, and until the function is inline, it doesn't work. That's as far as I know...

AzureMarker commented 2 years ago

What binary/environment were you in when you checked info functions? Mine shows increase and decrease in a release build with debug = 1 (line info) when looking at the ELF.

I also tried a bunch of stuff to ensure the function isn't inlined, like inline(never) and cold. Though it seems to not be inlined in the ELF.

Another thought I had was if there are two compiled functions with the same name. Because the assembly is so different (stack pointer is adjusted differently, call is inlined in one and not in the other, etc), maybe we're comparing apples to oranges when we look at it in the ELF and in 3dsx.

Meziu commented 2 years ago

What binary/environment were you in when you checked info functions? Mine shows increase and decrease in a release build with debug = 1 (line info) when looking at the ELF.

I've been looking at the elf for a while, wdym debug = 1? Also, in my release build, not even the symbol for decrease is shown. Are we looking at the same thing?

AzureMarker commented 2 years ago

In Cargo.toml:

[profile.release]
debug = 1

1 is just line tables: https://doc.rust-lang.org/cargo/reference/profiles.html#debug

AzureMarker commented 2 years ago

I think the main difference between our setups should be that I'm using my ctru-rs library instead of yours, but they're functionally identical: https://github.com/AzureMarker/ctru-rs

I'm also using stage1 to compile: cargo +stage1 3ds build --release (+/- the release flag)

Meziu commented 2 years ago

May I ask, why stage 1?

AzureMarker commented 2 years ago

I'm on my laptop, which has 28 less cores than my desktop (away from home), so it was annoying to wait so long for stage2 when testing out rustc changes.

AzureMarker commented 2 years ago

So I was experimenting and got pthread-3ds working as a cargo dependency, though I'm trying to make it work nicer. This fixed the issue we saw with the debug info and missing call to increase, but it also revealed that a single panic will result in successive internal panics because the panic mechanism uses a RwLock, which is unimplemented in pthread-3ds.

Meziu commented 2 years ago

I want to cry

Edit: let's work on it I guess. Rust3DS team and vivlim got the std to work, we can too

Meziu commented 2 years ago

PR those changes for the time being, those fixes are for a different issue