espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

Crash and hang on coredump save (IDFGH-4710) #6519

Open vtunr opened 3 years ago

vtunr commented 3 years ago

Environment

Problem Description

ESP can crash and hang forever.

Expected Behavior

The ESP crashes, but recovers and reboots by itself.

Actual Behavior

The ESP crashes and doesn't recover; it just hangs.

Steps to reproduce

I couldn't reproduce this in a simple project. But when I call restart with a stack that is too small for the LWIP thread, it crashes before restarting, and about half the time it hangs and never recovers until a power cycle.

Here's the log:

I (22487) wifi:state: run -> init (0)
I (22489) wifi:pm stop, total sleep time: 13865608 us / 18122915 us

I (22491) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
W (22
***ERROR*** A stack overflow in task tiT has been detected.

Backtrace:0x40090b72:0x3ffe4430 0x40091125:0x3ffe4450 0x40091312:0x3ffe4470 0x4009205d:0x3ffe44f0 0x40091408:0x3ffe4530 0x400913be:0xa5a5a5a5 |<-CORRUPTED

ELF file SHA256: ca747698a1c60b13

I (21263) esp_core_dump_flash: Save core dump to flash...
I (21269) esp_core_dump_elf: Found tasks: 27
I (21275) esp_core_dump_flash: Erase flash 49152 bytes @ 0x210000

Here's the SDK config : sdkconfig_debug.txt

When I debug, I stop at the first stack overflow and can't see the actual problem that causes the hang afterwards, so I modified the SDK slightly so that it doesn't trigger a breakpoint when a crash happens.

It seems that I hit a double exception :

Thread #1 (Suspended : Signal : SIGTRAP:Trace/breakpoint trap)  
    _DoubleExceptionVector() at xtensa_vectors.S:455 0x400803c0 

I'd like to know what to do so it doesn't hang forever in case of a crash. Let me know if you need more information.

gerekon commented 3 years ago

Hi @vtunr Can you enable coredump verbose logging by inserting

#define LOG_LOCAL_LEVEL ESP_LOG_VERBOSE

before this line?

vtunr commented 3 years ago

Hi @gerekon,

Thanks for your answer. Please find attached the logs coredump_crash.log

Here's my partition.csv :

# Name,   Type, SubType, Offset,   Size
# Note: if you change the phy_init or app partition offset, make sure to change the offset in Kconfig.projbuild
nvs,      data, nvs,     ,         16K
otadata,  data, ota,     ,         8K
phy_init, data, phy,     ,         4K
factory,  0,    0,       ,         2M
coredump, data, coredump,,         512K
ota_0,    0,    ota_0,   ,         2M
ota_1,    0,    ota_1,   ,         2M
nvs_factory, data, nvs,  ,         16K
sensordata, data, nvs,   ,         1456K

gerekon commented 3 years ago

@vtunr

It seems that I hit a double exception :

Without a debugger, in case of an exception the panic handler should be re-entered and you would see a special message. BTW, can you retrieve the backtrace from the point where you hit the DoubleException?

Please find attached the logs coredump_crash.log

Hmm, looks strange... In any case if core dump was stuck at some point the board should be reset by RTC watchdog.

I couldn't reproduce on a simple project, but when I call restart, with a too small stack for LWIP thread, i

The coredump code runs on the task's stack and needs some extra stack space. Saving data in ELF format requires more stack (~800 bytes) than the binary format, so one option is to switch to the binary coredump format. What LWIP stack size do you use when the problem happens? Can you add code (somewhere in the panic handler) to print the high water mark (uxTaskGetStackHighWaterMark) for the task before dumping the data to flash?

KaeLL commented 3 years ago

@gerekon

In any case if core dump was stuck at some point the board should be reset by RTC watchdog.

Without wanting to hijack the thread but doing so anyway, I've had issues with DoubleException and the board not resetting itself at all, so much so that I had to develop a way to kind of reboot the board externally.

vtunr commented 3 years ago

@gerekon @KaeLL Actually, that is my biggest issue. I can prevent the crash by extending the stack, and even if it happens, I know it should recover. But it doesn't. Now I'm worried it'll crash when deployed and somehow get stuck, so that's what I want to understand.

I'll try to get more info about the double exception and let you know.

KaeLL commented 3 years ago

@gerekon Good luck. I gave up on trying to find out what was happening and went for the radical solution.