Closed makermelissa closed 5 years ago
What were you doing (or what was happening) at the time? If I can reproduce this I can catch the hard fault and debug.
I was working on writing the displayio driver for the Mini 160x80 driver. I think it happened immediately after I saved changes to a CP script, but most of the time it was fine.
I’ll let you know if I find something more specific that causes it.
I haven't had an issue since I reported this. It's possible that it was corrupted from Beta 4. I'll be doing some more stuff tonight and will see if it happens again then.
@makermelissa do you have a debugger available to you? It might be worth having it breaking on the HardFault_Handler while you do other things just in case it is a random occurrence.
Yeah. Is there a guide to using it for debugging?
I’ve only used it for bootloader flashing.
@makermelissa https://learn.adafruit.com/debugging-the-samd21-with-gdb/overview should get you debugging.
Thanks @sommersoft. I'll take a look at that guide.
Very brief instructions:
make -j4 BOARD=feather_nrf52840_express SD=s140 DEBUG=1
JLinkGDBServer -device nRF52840_xxAA -if SWD
arm-none-eabi-gdb build-feather_nrf52840_express-s140/firmware.elf
target extended-remote :2331
mon reset
load
mon reset
break HardFault_Handler
continue
backtrace
Thanks @dhalbert
Ok, I got it to happen twice, and the LED is still red at the moment. The last time, it happened when I was saving a library file inside of a folder.
Since this is happening on a Feather M4 Express and I'm working on programming feathers, I can't plug in a JLink without soldering. So I'm going to see if I can get this to happen on a Metro M4 Express using Siddacious's MetroWing.
I got it to happen on the MetroM4 as well. I now know what I did. I made a change to the lib. Saved. Made another change, and then saved again within a very short amount of time. What probably happened is it tried to save the file while it was soft rebooting. Unfortunately I'm just now realizing I didn't compile a version with debug symbols before this happened.
Just FYI, a backtrace even without debugging symbols will have some information that gdb can recover from the .elf file. Below is a backtrace from a non-debug elf. It has routine names but not arg names or values. Obviously the debug build is better, but this can still help. If you can repeat with a DEBUG build, that would be great.
I forget what OS and editor you're using. Which are they? Are you editing locally and copying, or editing directly on CIRCUITPY?
(gdb) bt
#0 0x00006c60 in displayio_bitmap_make_new.lto_priv ()
#1 0x0000fb50 in type_call.lto_priv ()
#2 0x00027726 in mp_call_function_n_kw ()
#3 0x0000526e in mp_execute_bytecode ()
#4 0x00021180 in fun_bc_call.lto_priv ()
#5 0x00027726 in mp_call_function_n_kw ()
#6 0x00027946 in mp_call_function_0 ()
#7 0x0001ecb0 in parse_compile_execute.lto_priv ()
#8 0x00026f36 in maybe_run_list ()
#9 0x00004350 in main ()
@makermelissa One thing I find extremely useful while debugging is turning off size optimization by removing the -Os flag here: https://github.com/adafruit/circuitpython/blob/master/ports/atmel-samd/Makefile#L95
This will prevent annoying things like variables being optimized out when you'd really like to know their values.
Obviously this makes the build larger so will probably only work with M4s and I've only ever tried it with a Grand Central. As the build grows this might stop working as well.
Thanks @siddacious. I missed your comment before, but I'm playing around with this now. I soldered a 5-pin header onto the proto area of my Feather M4 with wires so I can connect it to a debugger easily.
Ok, I decided to set up the debugger as described above (except for the Metro M4) and really go to town. Here's the Backtrace @dhalbert:
(gdb) backtrace
#0 HardFault_Handler () at supervisor/port.c:283
#1 <signal handler called>
#2 mp_decode_uint_skip (ptr=0x89000005 <error: Cannot access memory at address 0x89000005>) at ../../py/bc.c:70
#3 mp_execute_bytecode (code_state=0x20002c80, inject_exc=0x67001800) at ../../py/vm.c:1385
#4 0x20029734 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
It probably doesn't matter, but the compiled code was what I had in this PR: https://github.com/adafruit/circuitpython/pull/1708
It didn't corrupt the file this time (maybe because I typed continue), but I hope this helps you trace it.
@dhalbert, if you want, I can also try on my feather M4 express. I modified it using the protoboard area so I can hook a debugger up easily now. :)
Ok, this time I got the Red LED and it corrupted the filesystem. Here's back trace and what happened when I tried to continue:
(gdb) backtrace
#0 HardFault_Handler () at supervisor/port.c:283
#1 <signal handler called>
#2 memcpy (dst=<optimized out>, src=<optimized out>, n=<optimized out>, n=<optimized out>, src=<optimized out>, dst=<optimized out>)
at ../../lib/libc/string0.c:61
#3 0x2002ff28 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) continue
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
write_flash (address=<optimized out>, data=<optimized out>, data_length=<optimized out>, data_length=<optimized out>, data=<optimized out>,
address=<optimized out>) at ../../supervisor/shared/external_flash/external_flash.c:96
96 if (data[i] != 0xff) {
@makermelissa What code was running when this happened? The varying stack trace seems to indicate memory corruption. @dhalbert just submitted a fix for memory issues when TileGrid is used with a ColorConverter. Any chance you were using it?
I was running the example code from the ST7735 library and rapidly making changes and saving the file. Yes, I believe the sample code might have been using TileGrid.
Yep, verified. It was using TileGrid, except there was no color converter. I'll try it with the fix.
Ok, I got it to HardFault twice with this:
Breakpoint 1, HardFault_Handler () at supervisor/port.c:283
283 {
(gdb) backtrace
#0 HardFault_Handler () at supervisor/port.c:283
#1 <signal handler called>
#2 0x000064de in mp_execute_bytecode (code_state=0x20002c70, inject_exc=<optimized out>) at ../../py/vm.c:1384
#3 0x20029724 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
However, it did not corrupt the filesystem like before, so definite improvement.
@makermelissa When you have a chance please describe your whole setup. I'll try to replicate it tomorrow.
Ok. I have an M4 Metro with Siddacious' MetroWing adapter on it (it was either that or my modified Feather M4 with debug header), an Adafruit Mini Color TFT with Joystick FeatherWing, and a JLink Edu connected to the Metro.
For software, I had compiled the latest with this https://github.com/adafruit/circuitpython/pull/1708. For CP code and libraries I had the latest ST7735 PR that I had submitted inside the lib folder and was running the example code as code.py. I had a few other misc files and folders in the lib folder as well: adafruit_bus_device adafruit_ili9341.py adafruit_ra8875 adafruit_ssd1351 adafruit_st7789.py
What I was doing was rapidly modifying and saving the library file since that was down a couple folders.
Oh and I was using Mu Alpha 1.1 on Mac OS X Mojave.
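For anyone trying to reproduce the rapid edit-and-save pattern described above without an editor, here is a minimal host-side sketch. It only writes a file twice with a short pause in between; the function names and the gap value are made up for illustration, and nothing in this thread confirms that this exact timing triggers the fault. The demo writes to a temporary directory so it runs anywhere. To try it against a board, point `path` at a file on the CIRCUITPY drive (for example something under `/Volumes/CIRCUITPY/lib/` on macOS, a hypothetical path).

```python
import os
import tempfile
import time


def write_and_sync(path, content):
    """Write content to path and ask the OS to push it to the device."""
    with open(path, "w") as handle:
        handle.write(content)
        handle.flush()
        # fsync so the write is not left sitting in the host's cache.
        os.fsync(handle.fileno())


def save_twice(path, first, second, gap_seconds=0.05):
    """Save two versions of a file in quick succession.

    gap_seconds is a guess at "a very short amount of time"; adjust as needed.
    """
    write_and_sync(path, first)
    time.sleep(gap_seconds)
    write_and_sync(path, second)


if __name__ == "__main__":
    # Demo against a temporary directory instead of a real CIRCUITPY drive.
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "library_file.py")
        save_twice(target, "VALUE = 1\n", "VALUE = 2\n")
        with open(target) as handle:
            print(handle.read())
```

Running `save_twice` in a loop while a debugger is attached and stopped at a HardFault_Handler breakpoint may make the failure easier to catch than saving by hand.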
I just had my lib directory get wiped out on Beta 6 with this same pesky error. Unfortunately I didn't have my debugger running at the time. Now that I know the bug still exists in Beta 6, I'll hook my debugger up.
Ugh, it wiped it again, but this time I got a backtrace:
(gdb) backtrace
#0 HardFault_Handler () at supervisor/port.c:283
#1 <signal handler called>
#2 memcpy (dst=<optimized out>, src=<optimized out>, n=<optimized out>, n=<optimized out>, src=<optimized out>, dst=<optimized out>)
at ../../lib/libc/string0.c:61
#3 <signal handler called>
#4 0x00000004 in ?? ()
(gdb) continue
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
write_flash (address=<optimized out>, data=<optimized out>, data_length=<optimized out>, data_length=<optimized out>, data=<optimized out>,
address=<optimized out>) at ../../supervisor/shared/external_flash/external_flash.c:96
96 if (data[i] != 0xff) {
With the ST7735 example still? This smells of memory corruption to me. Were you interacting with the drive a bunch also? If not, it may reproduce ok.
funny i was adding ST7735 to the CPX and getting a lot of hardfaults too, just running the simpletest thought it was me :/
@ladyada What code were you running? I'll try and track it down tomorrow.
Actually, this time I was working on the SSD1331 library. Possibly displayio init code? It happens when I rapidly edit and save.
Has anyone seen it with a built-in display? Or is it only when code.py is initing?
I haven’t tried on a device with a built in display.
I managed to reproduce a HardFault on an M4 (pybadge) but it is very touchy. If I insert assembly the problem goes away.
I think the best approach will be reproing on the M0 and then using the micro trace buffer to see what was done recently. Will try that after I take a break.
I was thinking about this. I have code.py loading the library that I was editing. Perhaps it had something to do with attempting to execute the file at the same time it was writing to it?
pardon the delay - i was using a CPX, not using builtin display (initing via code.py) - here's my CPX UF2, code.py and the mpconfig for building
I think this may be related to exceptions because if I break the example and then fix it, it crashes on the second reload. I'm still chasing it down though.
Could it be that the exception object survives the reload (to be displayed) but refers to line numbers in the original file, which might have been just overwritten?
It turned out that it was a double free of supervisor managed memory for the TileGrid that is used for the terminal. The exception causes a run without the first allocation and the second free then frees the heap. This makes way for the flash cache to end up on top of the heap and thus corrupt it. Should be fixed now.
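To illustrate the sequence described above, here is a toy model in Python. It is not CircuitPython's supervisor allocator: `ToyRegionAllocator`, its methods, and the owner names are all hypothetical, and the real code is C. The sketch only shows how freeing a stale handle a second time can release a region that now belongs to someone else, so that the next allocation lands on top of it.

```python
class ToyRegionAllocator:
    """Toy stand-in for a supervisor allocator (hypothetical, not CircuitPython's code)."""

    def __init__(self, region_count):
        # owners[i] is the name of whoever holds region i, or None if it is free.
        self.owners = [None] * region_count

    def allocate(self, owner):
        # Hand out the lowest-numbered free region and return its index as the handle.
        for index, current in enumerate(self.owners):
            if current is None:
                self.owners[index] = owner
                return index
        raise MemoryError("no free region")

    def free(self, handle):
        # Like a bare C free(): no check that the caller still owns the region.
        self.owners[handle] = None


def simulate_double_free():
    allocator = ToyRegionAllocator(region_count=2)

    # Normal run: the terminal's TileGrid memory is allocated, then freed.
    tilegrid_handle = allocator.allocate("terminal_tilegrid")
    allocator.free(tilegrid_handle)

    # Run after the exception: the TileGrid allocation is skipped, so the
    # heap receives the region the stale handle still points at.
    heap_handle = allocator.allocate("heap")

    # The stale handle is freed a second time, which releases the heap's region.
    allocator.free(tilegrid_handle)

    # The next allocation (the flash cache in the thread) lands on the same region.
    flash_cache_handle = allocator.allocate("flash_cache")
    return heap_handle, flash_cache_handle, allocator.owners


if __name__ == "__main__":
    heap_handle, flash_cache_handle, owners = simulate_double_free()
    print("heap region:", heap_handle, "flash cache region:", flash_cache_handle)
    print("overlap:", heap_handle == flash_cache_handle)
```

In the model, the heap and the flash cache end up with the same region, which matches the corruption described. A common guard against this class of bug is to clear the stored handle after the first free so a second free becomes a no-op; whether the actual fix took that form is not stated in this thread.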
Cool. Hopefully this was the last of the memory issues.
Although the latest Beta 5 is more stable than Beta 4, I've had 2-3 instances where the NeoPixel went red and the file system locked up. Resetting it brought it to Safe Mode and it notified me that a Hard Fault had occurred. In one instance, I had several folders replaced with zero-byte files.