bloombloombloom / Bloom

A debug interface for AVR-based embedded systems development on GNU/Linux.
https://bloom.oscillate.io/

Issues when stepping into a static library #110

Open sven-hoek opened 1 month ago

sven-hoek commented 1 month ago

Not sure if stepping into the static library is the issue but it seems to happen whenever I try to step into or break in a statically-linked library function.

When debugging within VSCode, the debugging session stops, but when I run avr-gdb in the terminal and connect to Bloom's GDB server, I can continue, though I am never able to step into a library function. Bloom itself doesn't always crash, and so far I haven't been able to pin down when it does. It happens when I am a few lines above a library function call and then try to step over a line (not even the library function call itself). I'll also report the bug on the avr-gdb side.

If there's anything else I could try or any info I could provide, let me know.

navnavnav commented 1 month ago

Hey @sven-hoek

Thanks for reporting this.

When range stepping is enabled, Bloom attempts to analyze all instructions within the given range, and intercept those that may take the target outside of the range. The error message suggests that Bloom was unable to decode a particular opcode, and so it was forced to intercept that instruction, as we don't know what it will do. However, that error is not considered to be fatal and should not result in Bloom crashing or shutting down abruptly.

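A few things I need from you, please:

- What version of GDB are you currently using?
- Could you enable debug logging, reproduce that error, and then send me the full debug log?
- Can you provide a dump of program memory, around the address at which Bloom failed to decode the opcode (0x000001BE)?
- In addition to a program memory dump, it will help to know if GDB has similar issues decoding that opcode. Could you try running x/10bfi 0x000001BA in GDB? It will attempt to decode the opcodes around that address and output them.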

As for the fatal error in GDB, I'm not sure if that's even related to the range step, as Bloom simply intercepts any instructions that it could not decode, so it shouldn't affect GDB at all. Have you tried disabling range stepping? Does GDB still crash? You can disable range stepping by setting rangeStepping to false in your server config, in bloom.yaml:

server:
  rangeStepping: false

If the issue in GDB is related to range stepping, you can just leave range stepping disabled, for the time being. It will result in degraded stepping performance but at least it won't crash.

navnavnav commented 1 month ago

Sorry, the GDB commands I provided in the previous comment, for dumping program memory, were incorrect as the address 0x000001BB is an invalid program memory address (it needs to be word-aligned, as opcodes take the form of 16-bit words). You'll want to use 0x000001BA instead. So x/10b 0x000001BA to dump program memory, and x/10bfi 0x000001BA to dump decoded instructions.
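
For anyone following along, the alignment rule described above can be checked in a couple of lines. This is only an illustrative Python sketch; the helper names are made up for this example and are not part of Bloom or GDB.

```python
def is_word_aligned(byte_address: int) -> bool:
    """AVR opcodes are 16-bit words, so instruction addresses must be even."""
    return byte_address % 2 == 0


def align_down_to_word(byte_address: int) -> int:
    """Clear the lowest bit to reach the start of the containing 16-bit word."""
    return byte_address & ~1


print(is_word_aligned(0x1BB))          # False
print(hex(align_down_to_word(0x1BB)))  # 0x1ba
```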

I have also revised the previous comment.

sven-hoek commented 1 month ago

Hey @navnavnav, thanks for the quick reply and the clear instructions. Thanks also for creating this great tool, including the good documentation.

What version of GDB are you currently using?

> avr-gdb --version
GNU gdb (GDB) 10.1.90.20210103-git

Could you enable debug logging, reproduce that error, and then send me the full debug log?

I haven't managed to reproduce the Bloom crash while capturing that log, but here is a log with just GDB crashing. I'll upload another log if I reach the point where Bloom crashes again. https://gist.github.com/sven-hoek/4dee86bf9faccce8e4e4981c93a9c6c4

Can you provide a dump of program memory, around the address at which Bloom failed to decode the opcode (0x000001BE)?

x/10b 0x000001BA
0x1ba <CH9120Initialisation>:   -49 -109    14  -108    125 0   -120    -31
0x1c2 <CH9120Initialisation+8>: -118    -107
{"token":18,"outOfBandRecord":[],"resultRecords":{"resultClass":"done","results":[]}}
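
To read this dump as opcodes, the signed bytes have to be reinterpreted as unsigned values and paired into little-endian 16-bit words. A minimal Python sketch, assuming the values are exactly those printed above (the helper name `to_words` is hypothetical):

```python
# Signed byte values as printed by `x/10b 0x000001BA` in the dump above.
dump = [-49, -109, 14, -108, 125, 0, -120, -31, -118, -107]


def to_words(signed_bytes):
    """Convert signed bytes into little-endian 16-bit words."""
    raw = bytes(b & 0xFF for b in signed_bytes)  # -49 becomes 0xCF, and so on
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw) - 1, 2)]


for offset, word in enumerate(to_words(dump)):
    print(f"0x{0x1BA + 2 * offset:03x}: 0x{word:04x}")
```

The second and third words (0x940e and 0x007d) sit at byte addresses 0x1bc and 0x1be.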

In addition to a program memory dump, it will help to know if GDB has similar issues decoding that opcode. Could you try running x/10bfi 0x000001BA in GDB? It will attempt to decode the opcodes around that address and output them.

x/10bfi 0x000001BA
   0x1ba <CH9120Initialisation>:    push    r28
   0x1bc <CH9120Initialisation+2>:  call    0xfa    ;  0xfa <UART1ActiveState>
   0x1c0 <CH9120Initialisation+6>:  ldi r24, 0x18   ; 24
   0x1c2 <CH9120Initialisation+8>:  dec r24
   0x1c4 <CH9120Initialisation+10>: brne    .-4         ;  0x1c2 <CH9120Initialisation+8>
   0x1c6 <CH9120Initialisation+12>: rjmp    .+0         ;  0x1c8 <CH9120Initialisation+14>
=> 0x1c8 <CH9120Initialisation+14>: break
   0x1ca <CH9120Initialisation+16>: .word   0x0079  ; ????
   0x1cc <CH9120Initialisation+18>: ldi r24, 0x18   ; 24
   0x1ce <CH9120Initialisation+20>: dec r24

sven-hoek commented 1 month ago

I haven't had a chance to try it yet, but turning off range stepping is a good idea. That said, I stepped through the same code with AVARICE instead of Bloom and GDB crashed there too. So the problem seems to lie on the GDB side, or in the code I'm debugging (though the latter hopefully shouldn't be an issue).

I will create a very simple project with a simple library to make experimenting easier. It may take a while before I get to that, but I'll update you with my findings.

navnavnav commented 1 month ago

Thanks for this @sven-hoek

So I can see that you have a CALL instruction at byte address 0x1bc, which is made up of two words (spanning byte addresses 0x1bc -> 0x1bf). But Bloom was attempting to decode the instruction at 0x1be, which is an invalid address as it points to the second word of that CALL instruction.
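
As an illustration of why the second word cannot be decoded on its own, here is a small Python sketch that decodes a CALL from its two words. The bit layout follows the AVR instruction set encoding of CALL; the function name is hypothetical and this is not Bloom's decoder.

```python
def decode_call(first_word: int, second_word: int):
    """Return the byte address targeted by a CALL, or None if this is not a CALL.

    CALL is encoded as 1001 010k kkkk 111k, followed by a second 16-bit word
    holding the low 16 bits of k. k is a word address, so it is doubled here.
    """
    if (first_word & 0xFE0E) != 0x940E:
        return None
    high_bits = ((first_word & 0x01F0) >> 3) | (first_word & 0x0001)
    word_address = (high_bits << 16) | second_word
    return word_address * 2


# Words at byte addresses 0x1bc and 0x1be (bytes 0E 94 7D 00 in memory).
print(hex(decode_call(0x940E, 0x007D)))  # 0xfa

# Starting one word too late, at 0x1be, the operand word is treated as an opcode.
print(decode_call(0x007D, 0xE188))       # None
```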

I was worried that Bloom may be incorrectly decoding the first word (0x1bc -> 0x1bd) as some other, single-word instruction, which would explain why it was attempting to decode the second word separately. But I've just attempted to replicate this, and it seems to be working fine for me. I used the exact same opcode as the one in your program: 0x0E947D00, which translates to CALL 0xFA, and then performed a range step to step over that instruction. Bloom correctly decoded the instruction and intercepted the destination address (0xFA), as it was outside of the requested range:

2024-07-22 01:58:50.259 BST [DS]: [DEBUG] Read GDB packet: $vCont;rfc,100:-1;c#73
2024-07-22 01:58:50.260 BST [DS]: [INFO] Handling VContRangeStep packet
2024-07-22 01:58:50.260 BST [DS]: [DEBUG] Requested stepping range start address: 0x000000fc
2024-07-22 01:58:50.260 BST [DS]: [DEBUG] Requested stepping range end address (exclusive): 0x00000100
2024-07-22 01:58:50.260 BST [DS]: [DEBUG] Issuing ReadTargetMemory command (ID: 430) to TargetController
2024-07-22 01:58:50.260 BST [DS]: [DEBUG] Delivering response for ReadTargetMemory command (ID: 430)
2024-07-22 01:58:50.260 BST [DS]: [DEBUG] Inspecting 1 instructions within stepping range (byte addresses) 0x000000fc -> 0x00000100, in preparation for new range stepping session
2024-07-22 01:58:50.260 BST [DS]: [DEBUG] Intercepting destination byte address 0x000000fa of CCPF instruction ("CALL") at byte address 0x000000fc
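
For reference, the range in that first log line comes straight from the `r<start>,<end>` action of the vCont packet, with both addresses in hex and the end address exclusive. A rough Python sketch of pulling it out and checking the packet checksum (the function names are hypothetical, and this is not how Bloom parses packets):

```python
import re


def checksum(payload: str) -> int:
    """Remote protocol checksum: sum of the payload bytes modulo 256."""
    return sum(payload.encode("ascii")) % 256


def parse_range_step(packet: str):
    """Return (start, end) from a `vCont;r<start>,<end>` packet, or None."""
    match = re.fullmatch(r"\$(.*)#([0-9a-fA-F]{2})", packet)
    if match is None or checksum(match.group(1)) != int(match.group(2), 16):
        raise ValueError("malformed packet or bad checksum")
    action = re.search(r";r([0-9a-fA-F]+),([0-9a-fA-F]+)", match.group(1))
    if action is None:
        return None  # no range-step action in this packet
    return int(action.group(1), 16), int(action.group(2), 16)


start, end = parse_range_step("$vCont;rfc,100:-1;c#73")
print(hex(start), hex(end))  # 0xfc 0x100
```

Running the same check on the packet from a failing session would show which start address GDB actually requested.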

So, this leads me to believe that, when you attempted to step over that CALL instruction, GDB sent an invalid address range to Bloom, with a start address of 0x1be, which does not point to the beginning of any valid instruction. What's even worse: Once Bloom failed to decode the instruction at that address, it would have attempted to intercept it by placing a breakpoint there. That newly inserted breakpoint may have corrupted the CALL instruction (as it was placed in the middle of it), resulting in a corrupted program.
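
To make the "does not point to the beginning of any valid instruction" part concrete, here is a hedged Python sketch that walks the instruction boundaries from the start of the function, using the words from the x/10b dump earlier in this thread. It assumes the only two-word AVR instructions are CALL, JMP, LDS and STS; the helper names are hypothetical and this is not Bloom's implementation.

```python
def is_two_word(opcode: int) -> bool:
    """True for the 32-bit AVR instructions: CALL, JMP, LDS and STS."""
    return (opcode & 0xFE0C) == 0x940C or (opcode & 0xFC0F) == 0x9000


def instruction_starts(base_address: int, words):
    """Walk from a known instruction start and list each instruction's byte address."""
    starts, index = [], 0
    while index < len(words):
        starts.append(base_address + 2 * index)
        index += 2 if is_two_word(words[index]) else 1
    return starts


# Little-endian words starting at 0x1ba, taken from the memory dump.
words = [0x93CF, 0x940E, 0x007D, 0xE188, 0x958A]
starts = instruction_starts(0x1BA, words)
print([hex(a) for a in starts])  # ['0x1ba', '0x1bc', '0x1c0', '0x1c2']
print(0x1BE in starts)           # False
```

A range starting at 0x1be therefore begins inside the CALL rather than at an instruction boundary.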

So I think the issue here is with GDB. But before you report this to the GDB devs, can you try reproducing the error with a newer version? Version 10 is a little old. I'm on 12.2, which works great for AVR, IMO. But I understand this may be a headache, as you may have to build it from source (unless you're willing to upgrade to a newer Ubuntu version - the newer repositories seem to host newer versions of the gdb-avr package).

navnavnav commented 1 month ago

It seems to be rather the GDB side or something about the code I'm debugging (the latter of which hopefully shouldn't be an issue though).

Yeah, I agree. That fatal error in GDB doesn't seem to be caused by the opcode decoding error in Bloom. But whatever is causing the fatal error in GDB may also be the cause of the invalid address range that GDB is sending to Bloom. Worth keeping in mind 👍🏽

sven-hoek commented 1 month ago

Thanks for that detailed explanation and great support.

can you try reproducing the error with a newer version? Version 10 is a little old

True, I hadn't thought of that. I tried GDB 13.2, and stepping through the problematic code still seems to produce errors in GDB.

I created a very simple app that also uses a static library, to see whether any static library would cause issues. I kept everything very simple, and I can step into the library code without any issues using the same toolchain as before (GDB 10). I assume the other project has some odd configuration that messes up the addressing. I will try to cleanly recreate the project and see if the error persists.