adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org

FS Issues on the Feather M4 Express #1667

Closed makermelissa closed 5 years ago

makermelissa commented 5 years ago

Although the latest Beta 5 is more stable than Beta 4, I've had 2-3 instances where the NeoPixel went red and the file system locked up. Resetting the board brought it into Safe Mode, which notified me that a Hard Fault had occurred. In one instance, several folders were replaced with zero-byte files.

dhalbert commented 5 years ago

What were you doing (or what was happening) at the time? If I can reproduce this I can catch the hard fault and debug.

makermelissa commented 5 years ago

I was working on writing the displayio driver for the Mini 160x80 display. I think it happened immediately after I saved changes to a CP script, but most of the time it was fine.

makermelissa commented 5 years ago

I’ll let you know if I find something more specific that causes it.

makermelissa commented 5 years ago

I haven't had an issue since I reported this. It's possible that it was corrupted from Beta 4. I'll be doing some more stuff tonight and will see if it happens again then.

tannewt commented 5 years ago

@makermelissa do you have a debugger available to you? It might be worth having it breaking on the HardFault_Handler while you do other things just in case it is a random occurrence.

makermelissa commented 5 years ago

Yeah. Is there a guide to using it for debugging?

makermelissa commented 5 years ago

I’ve only used it for bootloader flashing.

sommersoft commented 5 years ago

@makermelissa https://learn.adafruit.com/debugging-the-samd21-with-gdb/overview should get you debugging.

makermelissa commented 5 years ago

Thanks @sommersoft. I'll take a look at that guide.

dhalbert commented 5 years ago

Very brief instructions:

  1. Building with debug symbols is helpful: make -j4 BOARD=feather_nrf52840_express SD=s140 DEBUG=1
  2. Connect the jlink and start the JLinkGDBServer in another terminal window: JLinkGDBServer -device nRF52840_xxAA -if SWD
  3. Start debugging with the .elf file: arm-none-eabi-gdb build-feather_nrf52840_express-s140/firmware.elf
  4. Connect to jlink and load elf
    target extended-remote :2331
    mon reset
    load
    mon reset
  5. Set a breakpoint: break HardFault_Handler
  6. Start CircuitPython: continue
  7. If you hit the breakpoint, get a backtrace: backtrace
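
The gdb steps above can also be passed on the command line. This is a minimal sketch, assuming gdb's standard `-ex` option, which runs one command per flag in the order given. The helper name `gdb_argv` is hypothetical, and the default ELF path and port are simply the ones quoted in the list; adjust them for your board and build.

```python
import shlex


def gdb_argv(elf_path="build-feather_nrf52840_express-s140/firmware.elf",
             port=2331, gdb="arm-none-eabi-gdb"):
    """Build an argv that replays steps 4-7 of the instructions above.

    gdb_argv is a hypothetical helper, not part of CircuitPython.
    """
    steps = [
        f"target extended-remote :{port}",  # connect to JLinkGDBServer
        "mon reset",
        "load",                             # flash the ELF
        "mon reset",
        "break HardFault_Handler",
        "continue",                         # blocks until the breakpoint hits
        "backtrace",
    ]
    argv = [gdb]
    for step in steps:
        argv += ["-ex", step]
    argv.append(elf_path)
    return argv


if __name__ == "__main__":
    # Print a shell-ready command line instead of launching gdb directly.
    print(shlex.join(gdb_argv()))
```

JLinkGDBServer still has to be running in another terminal (step 2) before the printed command is run.
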
makermelissa commented 5 years ago

Thanks @dhalbert

makermelissa commented 5 years ago

Ok, I got it to do it twice, and it is still red at the moment. The last time, it happened when I was saving a library file inside of a folder.

makermelissa commented 5 years ago

Since this is happening on a Feather M4 Express and I'm working on programming feathers, I can't plug in a JLink without soldering. So I'm going to see if I can get this to happen on a Metro M4 Express using Siddacious's MetroWing.

makermelissa commented 5 years ago

I got it to happen on the Metro M4 as well. I now know what I did: I made a change to the lib, saved, made another change, and then saved again within a very short amount of time. What probably happened is it tried to save the file while it was soft rebooting. Unfortunately I'm just now realizing I didn't compile a version with debug symbols before this happened.
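
For anyone trying to reproduce the rapid double save without an editor, a host-side script along these lines might help. It is only a sketch: `rapid_save` is a hypothetical helper, the example path is an assumption about where CIRCUITPY is mounted on macOS, and whether scripted writes trigger the fault the same way as saving from Mu is untested.

```python
import os
import time
from pathlib import Path


def rapid_save(path, saves=5, gap_seconds=0.2):
    """Rewrite `path` several times in quick succession.

    Each save appends a different trailing comment to the original text,
    so every write changes the file and should trigger an auto-reload.
    """
    path = Path(path)
    original = path.read_text()
    for count in range(1, saves + 1):
        with open(path, "w") as handle:
            handle.write(original + f"\n# save {count}\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the write out to the drive now
        time.sleep(gap_seconds)
    return saves


# Example (hypothetical path; back up the file first):
# rapid_save("/Volumes/CIRCUITPY/lib/my_library.py", saves=2, gap_seconds=0.1)
```

Note that the script leaves the last `# save N` comment in the file, so restore the original afterwards.
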

dhalbert commented 5 years ago

Just FYI, backtrace even without debugging symbols will have some information that gdb can recover from the .elf file. Below is a backtrace from a non-debug elf. It has routine names but not arg names or values. A debug build is obviously better, but this can still help. If you can repeat it with a DEBUG build, that would be great.

I forget which OS and editor you're using. Which are they? Are you editing locally and copying, or editing directly on CIRCUITPY?

(gdb) bt
#0  0x00006c60 in displayio_bitmap_make_new.lto_priv ()
#1  0x0000fb50 in type_call.lto_priv ()
#2  0x00027726 in mp_call_function_n_kw ()
#3  0x0000526e in mp_execute_bytecode ()
#4  0x00021180 in fun_bc_call.lto_priv ()
#5  0x00027726 in mp_call_function_n_kw ()
#6  0x00027946 in mp_call_function_0 ()
#7  0x0001ecb0 in parse_compile_execute.lto_priv ()
#8  0x00026f36 in maybe_run_list ()
#9  0x00004350 in main ()

siddacious commented 5 years ago

@makermelissa One thing I find extremely useful while debugging is turning off size optimization by removing the -Os here https://github.com/adafruit/circuitpython/blob/master/ports/atmel-samd/Makefile#L95

This will prevent annoying things like variables being optimized out when you'd really like to know their values.

Obviously this makes the build larger so will probably only work with M4s and I've only ever tried it with a Grand Central. As the build grows this might stop working as well.

makermelissa commented 5 years ago

Thanks @siddacious. I missed your comment before, but I'm playing around with this now. I soldered a 5-pin header onto the proto area of my Feather M4 with wires so I can connect it to a debugger easily.

makermelissa commented 5 years ago

Ok, I decided to set up the debugger as described above (except for the Metro M4) and really go to town. Here's the backtrace, @dhalbert:

(gdb) backtrace
#0  HardFault_Handler () at supervisor/port.c:283
#1  <signal handler called>
#2  mp_decode_uint_skip (ptr=0x89000005 <error: Cannot access memory at address 0x89000005>) at ../../py/bc.c:70
#3  mp_execute_bytecode (code_state=0x20002c80, inject_exc=0x67001800) at ../../py/vm.c:1385
#4  0x20029734 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

makermelissa commented 5 years ago

It probably doesn't matter, but the compiled code was what I had in this PR: https://github.com/adafruit/circuitpython/pull/1708

makermelissa commented 5 years ago

It didn't corrupt the file this time (maybe because I typed continue), but I hope this helps you trace it.

makermelissa commented 5 years ago

@dhalbert, if you want, I can also try on my feather M4 express. I modified it using the protoboard area so I can hook a debugger up easily now. :)

makermelissa commented 5 years ago

Ok, this time I got the red LED and it corrupted the filesystem. Here's the backtrace and what happened when I tried to continue:

(gdb) backtrace
#0  HardFault_Handler () at supervisor/port.c:283
#1  <signal handler called>
#2  memcpy (dst=<optimized out>, src=<optimized out>, n=<optimized out>, n=<optimized out>, src=<optimized out>, dst=<optimized out>)
    at ../../lib/libc/string0.c:61
#3  0x2002ff28 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
write_flash (address=<optimized out>, data=<optimized out>, data_length=<optimized out>, data_length=<optimized out>, data=<optimized out>,
    address=<optimized out>) at ../../supervisor/shared/external_flash/external_flash.c:96
96          if (data[i] != 0xff) {

tannewt commented 5 years ago

@makermelissa What code was running when this happened? The varying stack trace seems to indicate memory corruption. @dhalbert just submitted a fix for memory issues when TileGrid is used with a ColorConverter. Any chance you were using it?

makermelissa commented 5 years ago

I was running the example code from the ST7735 library and rapidly making changes and saving the file. Yes, I believe the sample code might have been using TileGrid.

makermelissa commented 5 years ago

Yep, verified. It was using TileGrid, though there was no color converter. I'll try it with the fix.

makermelissa commented 5 years ago

Ok, I got it to HardFault twice with this:

Breakpoint 1, HardFault_Handler () at supervisor/port.c:283
283 {
(gdb) backtrace
#0  HardFault_Handler () at supervisor/port.c:283
#1  <signal handler called>
#2  0x000064de in mp_execute_bytecode (code_state=0x20002c70, inject_exc=<optimized out>) at ../../py/vm.c:1384
#3  0x20029724 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

However, it did not corrupt the filesystem like before, so definite improvement.

tannewt commented 5 years ago

@makermelissa When you have a chance please describe your whole setup. I'll try to replicate it tomorrow.

makermelissa commented 5 years ago

Ok. I have an M4 Metro with Siddacious' MetroWing adapter on it (it was either that or my modified Feather M4 with debug header), an Adafruit Mini Color TFT with Joystick FeatherWing, and a JLink Edu connected to the Metro.

For software, I had compiled the latest with this https://github.com/adafruit/circuitpython/pull/1708. For CP code and libraries I had the latest ST7735 PR that I had submitted inside the lib folder and was running the example code as code.py. I had a few other misc files and folders in the lib folder as well: adafruit_bus_device, adafruit_ili9341.py, adafruit_ra8875, adafruit_ssd1351, adafruit_st7789.py

What I was doing was rapidly modifying and saving the library file since that was down a couple folders.

Oh, and I was using Mu Alpha 1.1 on Mac OS X Mojave.

makermelissa commented 5 years ago

I just had my lib directory get wiped out on Beta 6 with this same pesky error. Unfortunately I didn't have my debugger running at the time. Now that I know the bug still exists in Beta 6, I'll hook my debugger up.

makermelissa commented 5 years ago

Ugh, it wiped it again, but this time I got a backtrace:

(gdb) backtrace
#0  HardFault_Handler () at supervisor/port.c:283
#1  <signal handler called>
#2  memcpy (dst=<optimized out>, src=<optimized out>, n=<optimized out>, n=<optimized out>, src=<optimized out>, dst=<optimized out>)
    at ../../lib/libc/string0.c:61
#3  <signal handler called>
#4  0x00000004 in ?? ()
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
write_flash (address=<optimized out>, data=<optimized out>, data_length=<optimized out>, data_length=<optimized out>, data=<optimized out>,
    address=<optimized out>) at ../../supervisor/shared/external_flash/external_flash.c:96
96          if (data[i] != 0xff) {

tannewt commented 5 years ago

With the ST7735 example still? This smells of memory corruption to me. Were you interacting with the drive a bunch also? If not, it may reproduce ok.

ladyada commented 5 years ago

Funny, I was adding ST7735 to the CPX and getting a lot of hard faults too, just running the simpletest. I thought it was me :/

tannewt commented 5 years ago

@ladyada What code were you running? I'll try and track it down tomorrow.

makermelissa commented 5 years ago

Actually, this time I was working on the SSD1331 library. Possibly displayio init code? It happens when I rapidly edit and save.

tannewt commented 5 years ago

Has anyone seen it with a built-in display? Or is it only when code.py is initing?

makermelissa commented 5 years ago

I haven’t tried on a device with a built in display.

tannewt commented 5 years ago

I managed to reproduce a HardFault on an M4 (pybadge) but it is very touchy. If I insert assembly the problem goes away.

I think the best approach will be reproing on the M0 and then using the micro trace buffer to see what was done recently. Will try that after I take a break.

makermelissa commented 5 years ago

I was thinking about this. I have code.py loading the library that I was editing. Perhaps it had something to do with attempting to execute the file at the same time it was writing to it?

ladyada commented 5 years ago

Pardon the delay. I was using a CPX without a built-in display (initing via code.py). Here's my CPX UF2, code.py and the mpconfig for building:

code.txt

firm.zip

mpconfigboard.txt

tannewt commented 5 years ago

I think this may be related to exceptions because if I break the example and then fix it, it crashes on the second reload. I'm still chasing it down though.

deshipu commented 5 years ago

Could it be that the exception object survives the reload (to be displayed) but refers to line numbers in the original file, which might have been just overwritten?

tannewt commented 5 years ago

It turned out that it was a double free of supervisor managed memory for the TileGrid that is used for the terminal. The exception causes a run without the first allocation and the second free then frees the heap. This makes way for the flash cache to end up on top of the heap and thus corrupt it. Should be fixed now.
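
The failure mode described above can be illustrated with a toy model. This is a sketch only: `ToyAllocator`, its slot scheme, and all sizes are invented for illustration and are not CircuitPython's actual supervisor allocator. It assumes that a stale handle for the terminal TileGrid survives a run in which that allocation was skipped, so the second free releases whatever now occupies the slot, here the heap.

```python
class ToyAllocator:
    """Toy slot-based allocator; NOT CircuitPython's real supervisor code."""

    def __init__(self, pool_size, slot_count=4):
        self.pool_size = pool_size
        self.slots = [None] * slot_count  # slot -> (owner, start, length)

    def allocate(self, owner, length):
        # First fit: walk live regions in address order, take the first gap.
        live = sorted((r for r in self.slots if r is not None),
                      key=lambda r: r[1])
        start = 0
        for _, used_start, used_length in live:
            if used_start - start >= length:
                break
            start = used_start + used_length
        if start + length > self.pool_size:
            raise MemoryError("pool exhausted")
        slot = self.slots.index(None)  # lowest free slot is reused
        self.slots[slot] = (owner, start, length)
        return slot

    def free(self, slot):
        # No ownership check: a stale handle frees the slot's current owner.
        self.slots[slot] = None

    def region(self, slot):
        _, start, length = self.slots[slot]
        return range(start, start + length)


def simulate():
    pool = ToyAllocator(pool_size=100)

    # Run 1: both allocations happen, and both are freed on reload.
    terminal = pool.allocate("terminal TileGrid", 20)
    heap = pool.allocate("heap", 60)
    pool.free(terminal)
    pool.free(heap)

    # Run 2: the terminal allocation is skipped, but its handle is stale.
    heap = pool.allocate("heap", 60)      # reuses the terminal's old slot
    heap_region = pool.region(heap)
    pool.free(terminal)                   # second free releases the heap

    # The flash cache now lands on memory still being used as heap.
    cache = pool.allocate("flash cache", 40)
    overlap = set(heap_region) & set(pool.region(cache))
    return terminal == heap, len(overlap)
```

In this model, `simulate()` reports that the stale terminal handle and the heap handle are the same slot, and that the flash cache region overlaps the heap region. A common guard against this class of bug is to clear the handle immediately after freeing it, so a second free has nothing to release.
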

makermelissa commented 5 years ago

Cool. Hopefully this was the last of the memory issues.