esp8266 / Arduino

ESP8266 core for Arduino
GNU Lesser General Public License v2.1

Crash regression 3.0.0 vs 2.7.3 - unknown cause. #7649

Closed CRCinAU closed 4 years ago

CRCinAU commented 4 years ago

The full code in question is here: https://git.crc.id.au/netwiz/ESP8266_Code/src/branch/master/RGB%20Lights

When I build and compile the code in this project with PlatformIO, I use the following platformio.ini values and the code builds and works perfectly:

[env:d1_mini]
board = d1_mini
board_build.ldscript = eagle.flash.4m.ld
board_build.f_cpu = 160000000L
board_build.f_flash = 80000000L
board_build.flash_mode = qio
framework = arduino
platform = espressif8266

The above gives me a version string of:

SDK:2.2.2-dev(38a443e)/Core:2.7.3-3-g2843a5ac=20703003/lwIP:IPv6+STABLE-2_1_2_RELEASE/glue:1.2-30-g92add50/BearSSL:5c771be

If I modify the platformio.ini as per the following, the code always crashes after ~10 seconds:

[env:d1_mini]
board = d1_mini
board_build.ldscript = eagle.flash.4m.ld
board_build.f_cpu = 160000000L
board_build.f_flash = 80000000L
board_build.flash_mode = qio
framework = arduino
platform = espressif8266
platform_packages =
    framework-arduinoespressif8266 @ https://github.com/esp8266/Arduino.git
    toolchain-xtensa @ ~2.100100.0

The above gives me a version string of:

SDK:2.2.2-dev(38a443e)/Core:3.0.0-dev=30000000/lwIP:IPv6+STABLE-2_1_2_RELEASE/glue:1.2-34-gf56e795/BearSSL:149e503
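When comparing two builds like this, it can help to pull individual fields out of the version string instead of comparing by eye. The sketch below assumes the string has the slash-separated `key:value` layout shown above; `versionField` is a hypothetical helper name, not part of the core.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the value of one "key:value" field from a
// slash-separated version string, or an empty string if the key is absent.
std::string versionField(const std::string& full, const std::string& key) {
  const std::string prefix = key + ":";
  std::size_t start = 0;
  while (start <= full.size()) {
    std::size_t end = full.find('/', start);
    if (end == std::string::npos) end = full.size();
    // Does this segment begin with "key:"?
    if (full.compare(start, prefix.size(), prefix) == 0) {
      return full.substr(start + prefix.size(), end - start - prefix.size());
    }
    start = end + 1;  // move past the '/' to the next segment
  }
  return "";
}
```

On the device, the input would typically be the string the sketch already prints at boot; on a host it can be pasted in from the serial log.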

I tried to decode the crash dump here: https://github.com/FastLED/FastLED/issues/1080

However, the crash seems to decode to a different set of instructions every time...

Happy to do some debugging - but I'm not sure where else to go...

earlephilhower commented 4 years ago

There's not really anything to go on with this report, @CRCinAU. Without an MCVE and a dump that we can reproduce, it's really just shooting in the dark. (Also, when decoding exceptions, make sure you get the entire range. The PC and EXCVADDR decode tell you where the code is, while the stack trace tells you where the code was.)

My first guesses would be IRQ related given the Exception 0. You need to ensure all IRQs, and all functions called by those IRQs, are in IRAM. If the CPU tries to fetch an instruction from flash while the flash is doing something else, you'll get a random crash (and it won't be very repeatable since it's caching-related).
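As a minimal sketch of the IRAM rule above (not taken from the project in question): both the handler and the helper it calls carry the IRAM attribute. `ISR_ATTR`, `recordPulse`, `onPulse`, and the choice of pin `D5` are assumptions made for illustration. The macro falls back across `IRAM_ATTR` and `ICACHE_RAM_ATTR` because the preferred attribute name may differ between core versions, and expands to nothing on a host build.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <Arduino.h>
#endif

// Pick whichever IRAM attribute this core provides; no-op on a host build.
#if defined(IRAM_ATTR)
#define ISR_ATTR IRAM_ATTR
#elif defined(ICACHE_RAM_ATTR)
#define ISR_ATTR ICACHE_RAM_ATTR
#else
#define ISR_ATTR
#endif

// Shared with the main loop, so it must be volatile.
static volatile uint32_t pulseCount = 0;

// Helper called from the ISR: it needs the attribute too, otherwise the
// CPU may have to fetch it from flash while servicing the interrupt.
static void ISR_ATTR recordPulse() {
  pulseCount = pulseCount + 1;
}

// The interrupt handler itself.
void ISR_ATTR onPulse() {
  recordPulse();
}

#ifdef ARDUINO
const uint8_t kPulsePin = D5;  // assumed pin, for illustration only

void setup() {
  pinMode(kPulsePin, INPUT_PULLUP);
  attachInterrupt(digitalPinToInterrupt(kPulsePin), onPulse, RISING);
}

void loop() {}
#endif
```

The same check applies to library code: any function reachable from a handler, including ones inside third-party libraries, needs to live in IRAM.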

Otherwise, crank up debug to full and check the logs as you go along. Out-of-memory errors can occur because of heap fragmentation, even if the total free space is large.
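To illustrate the fragmentation point: an allocation needs one contiguous block, so it can fail even when the total free heap is larger than the request. The sketch below is an assumption-laden illustration. `HeapSnapshot`, `fitsContiguously`, `blockedByFragmentation`, and `logHeap` are hypothetical names; the device-only part uses `ESP.getFreeHeap()`, `ESP.getMaxFreeBlockSize()`, and `ESP.getHeapFragmentation()` from the ESP8266 core.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <Arduino.h>
#endif

// Two numbers that matter for an allocation: total free bytes and the
// largest single contiguous free block.
struct HeapSnapshot {
  uint32_t freeBytes;
  uint32_t largestBlock;
};

// An allocation can only succeed if one contiguous block is big enough.
bool fitsContiguously(const HeapSnapshot& h, uint32_t request) {
  return request <= h.largestBlock;
}

// True when there is enough free memory in total, but no single block
// large enough: the signature of a fragmentation failure.
bool blockedByFragmentation(const HeapSnapshot& h, uint32_t request) {
  return request <= h.freeBytes && request > h.largestBlock;
}

#ifdef ARDUINO
// Device-only: print the heap state with a tag so logs can be compared
// at different points in the sketch.
void logHeap(const char* tag) {
  HeapSnapshot h{static_cast<uint32_t>(ESP.getFreeHeap()),
                 static_cast<uint32_t>(ESP.getMaxFreeBlockSize())};
  Serial.printf("[%s] free=%u maxBlock=%u frag=%u%%\n", tag,
                static_cast<unsigned>(h.freeBytes),
                static_cast<unsigned>(h.largestBlock),
                static_cast<unsigned>(ESP.getHeapFragmentation()));
}
#endif
```

Calling `logHeap` before and after each major setup step would show whether the largest block shrinks well before the free total does.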

If you get an MCVE that works in 2.7.3 but not in master, we can reopen this, but for now given the little info presented I'm closing.

CRCinAU commented 4 years ago

Are there any clues about what changed between 2.7.3 and 3.0.0?

I'm pretty sure I've had this running on an earlier version of 3.0.0, which (correct me if I'm remembering wrong) is basically 'master'. As such, it would likely be a change within the last 3-6 months that has caused it to crash instead of working OK...

I know it's still fish in a barrel, but... given it's an established project, I can only think of starting from scratch to see what happens bit by bit...

earlephilhower commented 4 years ago

I don't think there are any major changes other than the new toolchain and some minor bugfixes.

The changelogs in the releases do list every commit. GCC4->GCC10 may have exposed some memory ordering issues in some code, especially if you're writing to HW registers, so that would be where I'd look first, assuming none of the PRs listed in the changelogs seem to apply.
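One common shape of that problem, sketched under assumptions: ordinary stores to a buffer followed by a register write that tells hardware to consume the buffer. A newer optimizer is free to move or drop non-volatile accesses, so the register access goes through a `volatile` pointer and a compiler barrier keeps the buffer stores ahead of it. All names here (`regWrite`, `regRead`, `compilerBarrier`, `startTransfer`) are hypothetical, the barrier syntax is GCC/Clang-specific, and the "register" is just an address passed in, so the example can run against a plain variable.

```cpp
#include <stddef.h>
#include <stdint.h>

// Access a memory-mapped register through a volatile pointer so the
// compiler cannot cache, merge, or drop the access.
static inline void regWrite(uintptr_t addr, uint32_t value) {
  *reinterpret_cast<volatile uint32_t*>(addr) = value;
}

static inline uint32_t regRead(uintptr_t addr) {
  return *reinterpret_cast<volatile uint32_t*>(addr);
}

// Compiler-only barrier (GCC/Clang): stops the optimizer from moving
// memory accesses across this point. It is not a hardware fence.
static inline void compilerBarrier() {
  __asm__ __volatile__("" ::: "memory");
}

// Fill a buffer with ordinary stores, then signal "go" via a register.
void startTransfer(uint8_t* buffer, size_t len, uint8_t fill,
                   uintptr_t startReg) {
  for (size_t i = 0; i < len; ++i) {
    buffer[i] = fill;       // non-volatile stores
  }
  compilerBarrier();        // keep the stores above ahead of the write below
  regWrite(startReg, 1);    // volatile store: tells the peripheral to start
}
```

Code that writes registers through plain (non-volatile) pointers, or relies on statement order without a barrier, is the kind of thing worth auditing first when a toolchain upgrade changes behaviour.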