espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

Processor hang during crash (IDFGH-6374) #8033

Open cjstott94 opened 2 years ago

cjstott94 commented 2 years ago

Environment

Problem Description

Processor hang during crash

Expected Behavior

Any crashes should result in CPU restarting

Actual Behavior

Processor Halts

Steps to reproduce

  1. Firmware built with CONFIG_ESP32S2_PANIC_PRINT_REBOOT and CONFIG_ESP_COREDUMP_ENABLE_TO_FLASH enabled
  2. Download and apply new firmware update over a ppp connection
  3. Initiate restart via esp_restart()
  4. After esp_restart() is called, a UART task that was erroneously still active calls esp_netif_receive here
  5. An InstrFetchProhibited exception occurs, probably because an esp_netif handle had already been freed (PC is 0xfefefefe)
  6. Processor halts after printing ELF sha256 (somewhere between here)

Capture

I'm guessing that something goes wrong while saving a coredump in ELF format to flash, similar to issue #6519. The LWIP stack has 4K of memory allocated. However, I don't get any indication of a double exception over UART (we don't have an easy way of hooking up JTAG at the moment).

I can probably fix this specific crash by ensuring the modem/PPP is cleaned up properly before a restart and/or by increasing the stack size to give extra headroom for the coredump, as suggested by @gerekon in https://github.com/espressif/esp-idf/issues/6519#issuecomment-779968116
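The first workaround (stop the receive path before the interface handle is released, and only then restart) can be sketched as a small host-side model. Every name below (fake_netif_t, rx_attach, rx_deliver, teardown_before_restart) is hypothetical and merely stands in for the UART task, esp_netif_receive, and the PPP cleanup; this is not ESP-IDF code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a network interface handle. */
typedef struct {
    size_t bytes_received;
} fake_netif_t;

static fake_netif_t *s_netif = NULL;  /* handle shared with the RX path */
static bool s_rx_enabled = false;     /* cleared before the handle is released */

/* Attach an interface and allow the RX path to use it. */
void rx_attach(fake_netif_t *netif)
{
    assert(netif != NULL);
    s_netif = netif;
    s_rx_enabled = true;
}

/* Models the UART task delivering one received chunk.
 * Returns false when the chunk was dropped because teardown has started. */
bool rx_deliver(const void *data, size_t len)
{
    (void)data;
    if (!s_rx_enabled || s_netif == NULL) {
        return false;                 /* never touch a released handle */
    }
    s_netif->bytes_received += len;
    return true;
}

/* Step 1: stop the RX path. Step 2: drop the handle.
 * On target, the PPP/netif cleanup and esp_restart() would follow. */
void teardown_before_restart(void)
{
    s_rx_enabled = false;
    s_netif = NULL;
}
```

This model is single-threaded. On a real target the flag alone is not enough: the receiving task would also have to be stopped or synchronized with the teardown so that it cannot be in the middle of a delivery when the handle goes away.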

Though my main concern is if a problem is occurring during the coredump, why is it not doing some sort of hard reset? It's critical that the application reboot on any sort of exception.

Is there a way we could adjust the _DoubleExceptionVector so that it reboots straight away instead of invoking the panic handler?

Or is there any other known workaround? Maybe skipping core dump if there isn't enough stack space?
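To make the "skip the core dump if there isn't enough stack space" idea concrete, here is a rough host-side sketch. The function names and the required-bytes figure are hypothetical, the stack is assumed to grow toward lower addresses (as it does on Xtensa), and nothing in this thread indicates that the panic handler offers such a hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes left between the current stack pointer and the lowest valid
 * stack address, for a stack that grows toward lower addresses.
 * Returns 0 if the stack pointer is already at or below the limit. */
size_t stack_headroom(uintptr_t stack_limit, uintptr_t sp)
{
    return (sp > stack_limit) ? (size_t)(sp - stack_limit) : 0u;
}

/* Decide whether a dump that needs `needed` bytes of stack should run.
 * `needed` is an assumption the caller must supply; it is not a
 * measured property of any real dump routine. */
bool should_write_core_dump(uintptr_t stack_limit, uintptr_t sp, size_t needed)
{
    assert(needed > 0);  /* a zero requirement would make the check meaningless */
    return stack_headroom(stack_limit, sp) >= needed;
}
```

With a stack limit of 0x1000 and a stack pointer of 0x1400 there are 1024 bytes of headroom, so a dump assumed to need 1024 bytes would be allowed, while one byte less of headroom would cause it to be skipped.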

Other items if possible

KaeLL commented 2 years ago

Reminds me of this. @igrr Thoughts? Also, in case this was implemented or is on a roadmap somewhere (hopefully), was it, or will it be, backported to release/v3.3?

cjstott94 commented 2 years ago

I've so far narrowed down where it halts to this call chain:

esp_panic_handler()
=> esp_core_dump_to_flash()
==> esp_core_dump_write()
===> esp_core_dump_write_elf()
====> esp_core_dump_do_write_elf_pass()
=====> elf_write_core_dump_user_data()
======> elf_add_segment()?

or maybe even esp_core_dump_flash_write_data?

Curiously, though, if I force a crash by inserting this line just before the call to esp_core_dump_do_write_elf_pass:

*(int *)0 = 0;

a StoreProhibited exception is triggered, and the second panic handler correctly displays "Re-entered core dump! Exception happened during core dump!" (Though isn't a crash during the panic handler meant to trigger a DoubleException?)
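The "Re-entered core dump!" message is consistent with a re-entry guard: a flag is set on first entry, so a second fault during the dump skips the dump instead of attempting it again. Below is a minimal host-side model of that pattern. All names are hypothetical and this is not the actual ESP-IDF source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    DUMP_WRITTEN,  /* first entry: the dump routine was run */
    DUMP_SKIPPED   /* re-entered: a fault occurred during the dump */
} dump_result_t;

static bool s_dump_in_progress = false;

/* write_fn models the dump routine. It may call panic_path() again
 * to simulate a crash in the middle of the dump. */
dump_result_t panic_path(void (*write_fn)(void))
{
    assert(write_fn != NULL);
    if (s_dump_in_progress) {
        return DUMP_SKIPPED;       /* do not dump again; caller would reset */
    }
    s_dump_in_progress = true;
    write_fn();
    return DUMP_WRITTEN;
}

/* Result seen by the nested (simulated) panic. */
static dump_result_t s_nested_result = DUMP_WRITTEN;

/* A dump routine that "crashes" by re-entering the panic path. */
static void faulting_write(void)
{
    s_nested_result = panic_path(faulting_write);
}
```

A guard like this only helps when the second failure actually re-enters the handler. A dump routine that hangs without faulting never reaches the guard, which matches the concern raised in this thread.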

Intentional crash before esp_core_dump_do_write_elf_pass

The app still halts if that crashing code is inserted after the call to esp_core_dump_do_write_elf_pass.

I also tried updating to ESP-IDF 4.3.1, and this fixed the problem. However, I can't tell whether it fixed the underlying problem in the core dump, as I can no longer trigger the initial exception.

I couldn't find any changes to the coredump panic handler in 4.3.1 here

If it helps, I'm also using encrypted flash and making a release build.

david-cermak commented 2 years ago

Hi @cjstott94

Just a note about the former exception. Have you correctly deinitialized the PPP component as indicated here:

https://github.com/espressif/esp-idf/blob/233dc30fb1a376d7ca0c5d74bdd410ca368f6bf7/examples/protocols/pppos_client/main/pppos_client_main.c#L324-L326

after the OTA update?

Just assume that the receive callback can still be called normally before the restart is actually performed (possibly even after entering the esp_restart() function).

cjstott94 commented 2 years ago

Yes, that was the problem: not de-initializing properly. I've fixed that issue already; I'm just concerned that a similar error could cause the processor to lock up again. If it locks up, there's no chance we'll be able to do any OTA bug fixes.