andylinpersonal commented 9 months ago

Title

Core dump failed on overflowed tasks due to excessive stack usage of panic handler

IDF version.

release/v5.1 7380f96 release/v5.2 93ea06f master c460e1c

Espressif SoC revision.

esp32s3: v0.1 esp32c3: v0.4 esp32c6: v0.0

Operating System used.

Linux

How did you build your project?

Command line with idf.py

Development Kit.

esp32c3: LuatOS CORE-ESP32-C3 (custom esp32-c3 with DIO mode flash) esp32c6: esp32-c6-devkitc-1-n8 esp32s3: esp32-s3-devkitc-1-n32r8v

Power Supply used.

USB

What is the expected behavior?

Core dump should be reliably performed, even for the overflowed tasks.

Following is the normal core dump flow:

[Exception occurred]
Run the exception handler on the current stack.
Fatal exception -> invoke panic handler.
If core dump is enabled, invoke the core dump function.
If core dump stack is enabled, swap the stack to the pre-allocated one.
Perform real core dump stuff.
Swap back the stack to the panic handler's one.
Reset.

What is the actual behavior?

espcoredump crashed randomly when the original task was overflowed.

After reserving a large space before the stack, most configurations can perform core dump successfully.

To proof and check the memory region polluted by the overflowed panic handler and the core dump function, I have placed a second canary watchpoint below the first one. It seems that reserving some space under the end of stack could be a possible workaround as the second watchpoint will not be triggered duing core dump if the space is sufficiently large.

If the reserved space is way too small, we'll see this:

Guru Meditation Error: Core  0 panic'ed (Double exception).

If the reserved space is slightly smaller than required space, we'll see this:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).
Debug exception reason: Watchpoint 0 triggered

Original stack:

|  ---  stack top   ---
|  ====================
|  <-   stack body   ->
|
|    [1st canary WP]
|  ====================
V  ---  stack end   ---

Reserve some space under the stack end as workaround:

G  ---  stack top   ---
R  ====================
O  <-   stack body   ->
W
T    [1st canary WP]
H  ====================
|  ---  stack end   ---
|  ====================
|  <- Reserved space ->
|
|    [2nd canary WP]
|   (only work on s3)
|  ====================
V  --- end of rsvd  ---

After loading the core dump data by idf.py coredump-debug and examine with gdb, we can find the reserved space was modified by the panic handler and core dump code.
Possible fix:
- Ensure _xt_panic and _panic_handler only do the minimum works and run the real panic handler on a dedicated emergency stack, just like the core dump do.
Additional note for S3, -O0:
- Core dump cannot even continue with enlarged stack. What's happened here?

Steps to reproduce.

Reproduce:
1. Enable some options in the sdkconfig.defaults.
  1. Enable core dump and save to flash.
    - CONFIG_ESP32_ENABLE_COREDUMP_TO_FLASH=y
  2. Enable canary watchpoint at the end of stack to catch the overflow event.
    - CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y
  3. Enable core dump stack for core dump code.
    - CONFIG_ESP_COREDUMP_STACK_SIZE=1536
    - CONFIG_ESP32_CORE_DUMP_STACK_SIZE=1536
  4. Adjust the optimization level.
2. Input crash and carefully overflow the victim task by recursively calling a small function.
3. We can see the espcoredump crashed again :(
Examine:
1. Reserve extra space before the stack by adding #define ENABLE_RESERVED_SPACE 1 to main/main.c
2. Crash the task by input crash.
3. Check the overflowed stask
  - Using gdb to show the overflowed task stack:
    1. Load the dumped image by ESP_ROM_ELF_DIR=~/.espressif/tools/esp-rom-elfs/20230320/ idf.py coredump-debug.
    2. Enter p task_stack and see the content of the stack in gdb.
    3. We can see the reserved space was modified by the core dump process.
    4. For S3: if the reserved size is too small, the 2nd WP will be triggered during the core dump process.
  - Hash the reserved area and compare before/after the panic handler:
    1. Hash and save the reserved space.
    2. Store the hash value to the RTC memory or global variable above the task_stack.
    3. Hash the reserved space again after the panic handler.
      - For IDF v5.3, user can put the checking code into esp_core_dump_write_internal()
    4. The hash value will be mismatched if the panic handler invaded the reserved space.

Debug Logs.

Log messages without workaround: (Excerpted from *without-workaround.log)

v5.1, S3, -Og:

ELF file SHA256: b47fc05821d133ea

I (6381) esp_core_dump_flash: Save core dump to flash...
I (6381) esp_core_dump_common: Backing up stack @ 0x3fc959f0 and use core dump stack @ 0x3fc98780
I (6382) esp_core_dump_flash: Erase flash 16384 bytes @ 0x110000
Guru Meditation Error: Core  0 panic'ed (InstructionFetchError). Exception was unhandled.

Core  0 register dump:
PC      : 0x3fcef74c  PS      : 0x00060734  A0      : 0x82007e12  A1      : 0x3fc986f0
A2      : 0x3fcef74c  A3      : 0x00110000  A4      : 0x00004000  A5      : 0x00004000
A6      : 0x00110000  A7      : 0x00000002  A8      : 0x80377fc9  A9      : 0x3fc98660
A10     : 0x3fc98700  A11     : 0x3fc948cf  A12     : 0xffffffff  A13     : 0x3fc98666
A14     : 0x00000014  A15     : 0xfffffff9  SAR     : 0x00000005  EXCCAUSE: 0x00000002
EXCVADDR: 0x3fcef74c  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000

Backtrace: 0x3fcef749:0x3fc986f0 |<-CORRUPTED

v5.2, C6, -Og:

ELF file SHA256: c4031c967

I (319) esp_core_dump_flash: Save core dump to flash...
I (320) esp_core_dump_common: Backing up stack @ 0x4080d6b0 and use core dump stack @ 0x408100d0
I (324) esp_core_dump_flash: Erase flash 12288 bytes @ 0x110000
Guru Meditation Error: Core  0 panic'ed (Instruction access fault). Exception was unhandled.

Core  0 register dump:
MEPC    : 0x4080d570  RA      : 0x4080d570  SP      : 0x40810070  GP      : 0x4080c100
TP      : 0x40805848  T0      : 0x4002289a  T1      : 0xffffffe0  T2      : 0x00000001
S0/FP   : 0x00000000  S1      : 0x00000000  A0      : 0x4080d61c  A1      : 0x00110000
A2      : 0x00003000  A3      : 0x4080ae8c  A4      : 0x40018ebe  A5      : 0x4080d570
A6      : 0x80000000  A7      : 0x00000010  S2      : 0x00000000  S3      : 0x00003000
S4      : 0x00000000  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000
MSTATUS : 0x00001801  MTVEC   : 0x40800001  MCAUSE  : 0x00000001  MTVAL   : 0x4080d570
MHARTID : 0x00000000

v5.2, S3, -Os:

ELF file SHA256: 4aa6107ff

I (8005) esp_core_dump_flash: Save core dump to flash...

// Stucked here ...

ESP-ROM:esp32s3-20210327
Build:Mar 27 2021

// And reset by RTCWDT
rst:0x9 (RTCWDT_SYS_RST),boot:0x18 (SPI_FAST_FLASH_BOOT)

v5.3, C6, -O0:

ELF file SHA256: 9f80fd8fa

I (1589) esp_core_dump_flash: Save core dump to flash...
I (1589) esp_core_dump_common: Backing up stack @ 0x40817560 and use core dump stack @ 0x4081a2f0
I (1593) esp_core_dump_flash: Erase flash 12288 bytes @ 0x110000
Guru Meditation Error: Core  0 panic'ed (Load access fault). Exception was unhandled.

Core  0 register dump:
MEPC    : 0x408048a6  RA      : 0x40804890  SP      : 0x4081a220  GP      : 0x40817500
TP      : 0x4080c1cc  T0      : 0x4002289a  T1      : 0xffffffe0  T2      : 0x00000000
S0/FP   : 0x4081a280  S1      : 0x4080d3b4  A0      : 0x00000000  A1      : 0x00110000
A2      : 0x00003000  A3      : 0x00110000  A4      : 0x408174dc  A5      : 0xffffffff
A6      : 0x80000000  A7      : 0x00000010  S2      : 0x00021340  S3      : 0x00000000
S4      : 0x38be5e00  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000
MSTATUS : 0x00001801  MTVEC   : 0x40800001  MCAUSE  : 0x00000005  MTVAL   : 0x00000017
MHARTID : 0x00000000

v5.3, C6, -Og:

ELF file SHA256: 8f8b05fe3

I (10180) esp_core_dump_flash: Save core dump to flash...
I (10181) esp_core_dump_common: Backing up stack @ 0x4080c330 and use core dump stack @ 0x4080ee00
I (10185) esp_core_dump_flash: Erase flash 12288 bytes @ 0x110000
Guru Meditation Error: Core  0 panic'ed (Illegal instruction). Exception was unhandled.

Core  0 register dump:
MEPC    : 0xfffffffe  RA      : 0x408028a8  SP      : 0x4080ed90  GP      : 0x4080c300
TP      : 0x408031b4  T0      : 0x4002289a  T1      : 0xffffffe0  T2      : 0x00000001
S0/FP   : 0x00000000  S1      : 0x4080ee44  A0      : 0x4080c290  A1      : 0x00110000
A2      : 0x00003000  A3      : 0x4080b094  A4      : 0xffffffff  A5      : 0x4080c1e0
A6      : 0x80000000  A7      : 0x00000010  S2      : 0x00110000  S3      : 0x00003000
S4      : 0x00000000  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000
MSTATUS : 0x00001801  MTVEC   : 0x40800001  MCAUSE  : 0x00000002  MTVAL   : 0x00000000
MHARTID : 0x00000000

v5.3, S3, -O0:

ELF file SHA256: 6343cb008

I (5490) esp_core_dump_flash: Save core dump to flash...
I (5490) esp_core_dump_common: Backing up stack @ 0x3fc9f1c0 and use core dump stack @ 0x3fca20a0
I (5491) esp_core_dump_flash: Erase flash 20480 bytes @ 0x110000
Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.

Core  0 register dump:
PC      : 0x6d75645f  PS      : 0x00060134  A0      : 0x8200c80c  A1      : 0x3fca1fa0
A2      : 0x3fc9f31c  A3      : 0x00110000  A4      : 0x00005000  A5      : 0x00005000
A6      : 0x00110000  A7      : 0x3fca1fa0  A8      : 0x80379796  A9      : 0x3fca1fd0
A10     : 0x3fca1fd0  A11     : 0x3c037c7f  A12     : 0xffffffff  A13     : 0x3fca1f46
A14     : 0x00000014  A15     : 0xfffffff9  SAR     : 0x00000005  EXCCAUSE: 0x00000000
EXCVADDR: 0x00000000  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000

Backtrace: 0x6d75645c:0x3fca1fa0 |<-CORRUPTED

Log messages with workaround:
- See attachment file with-workaround/${IDF_VER}-${TARGET}-${OPT}-log-with-workaround.log.

Stack content of overflowed task with workaround applied:

v5.3, C6, -O0

[Current thread is 1 (process 1082239012)]
add symbol table from file "~/.espressif/tools/esp-rom-elfs/20230320/esp32c6_rev0_rom.elf" at
      .text_addr = 0x400017ec
Reading symbols from ~/.espressif/tools/esp-rom-elfs/20230320/esp32c6_rev0_rom.elf...
(No debugging symbols found in ~/.espressif/tools/esp-rom-elfs/20230320/esp32c6_rev0_rom.elf)
(gdb) p task_stack

// Overflowed for (RESERVED_BYTES - 104) bytes...
// In this case 896 - 104 = 792 bytes

$1 = '\000' <repeats 104 times>, "\001\000\000\000\237w\201@", '\000' <repeats 12 times>, "\001", '\000' <repeats 15 times>, "\260w\201@\000\000\000\000\000\000\000\000\001\000\000\000\317w\201 \000\000\000\000\000\000\000\000\320w\201@\220\017\000B", '\000' <repeats 12 times>, "\340w\201 \000\000\000\000\000\000\000\000\000"...

Other dumped stacks have been excerpted to with-workaround/${IDF_VER}-${TARGET}-${OPT}-corredump-brief.log

Missing logs of C6, release/v5.2 due to https://github.com/espressif/esp-idf/issues/13197.

More Information.

Related bug:
- ESP32-S3 may stuck during swapping the stack in espcordump regardless of ESP-IDF version when processing the overflowed task. Not yet found the possible cause.
Followings are the test code.

main/main.c

#include <driver/usb_serial_jtag.h>
#include <esp_vfs_dev.h>
#include <esp_idf_version.h>

#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

#include <stdio.h>
#include <string.h>

#define ENABLE_RESERVED_SPACE 1

#if ENABLE_RESERVED_SPACE
#if ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 1

#ifdef CONFIG_IDF_TARGET_ESP32C3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEFAULT
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#ifdef CONFIG_IDF_TARGET_ESP32S3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
// Stucked during core dump
#define RESERVED_BYTES 1536
#elif CONFIG_COMPILER_OPTIMIZATION_DEFAULT
#define RESERVED_BYTES 1024
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 1024
#endif
#endif

#elif ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 2

#if CONFIG_IDF_TARGET_ESP32C6
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#ifdef CONFIG_IDF_TARGET_ESP32S3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
// Stucked during core dump
#define RESERVED_BYTES 1536
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
// Stucked during core dump
#define RESERVED_BYTES 1024
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
// Stucked during core dump
#define RESERVED_BYTES 1024
#endif
#endif

#elif ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 3

#if CONFIG_IDF_TARGET_ESP32C3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#if CONFIG_IDF_TARGET_ESP32C6
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#ifdef CONFIG_IDF_TARGET_ESP32S3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
// Stucked during core dump
#define RESERVED_BYTES 1536
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
// Stucked during core dump
#define RESERVED_BYTES 1024
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
// Stucked during core dump
#define RESERVED_BYTES 1024
#endif
#endif

#endif // ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 3
#else  // !ENABLE_RESERVED_SPACE
#define RESERVED_BYTES 0
#endif // ENABLE_RESERVED_SPACE

#define CANARY_WP_SIZE 64
#define STACK_BYTES 3072

StaticTask_t task_tcb;
TaskHandle_t task_handle = NULL;
/**
 * Reserve RESERVED_BYTES bytes of extra space for panic handler and espcoredump before replacing the stack
 * Add COREDUMP_DRAM_ATTR to save the whole stack to coredump.
 */
COREDUMP_DRAM_ATTR __attribute__((aligned(CANARY_WP_SIZE))) StackType_t task_stack[STACK_BYTES + RESERVED_BYTES];

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Winfinite-recursion"
__attribute__((noreturn)) NOINLINE_ATTR void crashme(void) { crashme(); }
#pragma GCC diagnostic pop

void task_func(void *) {
  fprintf(stderr, "Previous frame: %p:%p\n", __builtin_extract_return_addr(__builtin_return_address(0)), esp_cpu_get_sp());

  char buf[32] = {};
  fputs("$ \n", stderr);
  while (true) {
    if (fgets(buf, sizeof(buf), stdin)) {
      fprintf(stderr, "$ %s", buf);
      if (strnstr(buf, "crash", sizeof(buf))) {
        break;
      }
    }

    memset(buf, 0, sizeof(buf));
    vTaskDelay(pdMS_TO_TICKS(1000));
  }

  // Should be triggered on S3 when the reserved space exhausted
#if RESERVED_BYTES >= CANARY_WP_SIZE && ENABLE_RESERVED_SPACE
  assert(esp_cpu_set_watchpoint(0, &task_stack, CANARY_WP_SIZE,  ESP_CPU_WATCHPOINT_STORE) == ESP_OK);
#endif
  crashme();
}

void app_main(void) {
  // Disable buffering on stdin
  setvbuf(stdin, NULL, _IONBF, 0);

  // Install USB-SERIAL-JTAG driver for interrupt-driven reads and writes
  usb_serial_jtag_driver_config_t usb_serial_jtag_config =
      USB_SERIAL_JTAG_DRIVER_CONFIG_DEFAULT();
  assert(usb_serial_jtag_driver_install(&usb_serial_jtag_config) == ESP_OK);

  // Tell vfs to use usb-serial-jtag driver
  esp_vfs_usb_serial_jtag_use_driver();

  fprintf(stderr, "stack@%p-%p\n", &task_stack[0], (const char*)&task_stack[0] + sizeof(task_stack));

  task_handle = xTaskCreateStaticPinnedToCore(
      task_func, "outer", STACK_BYTES, NULL, CONFIG_PTHREAD_TASK_PRIO_DEFAULT, &task_stack[RESERVED_BYTES], &task_tcb, 0);
}

main/CMakeLists.txt

idf_component_register(SRCS "main.c")

CMakeLists.txt

cmake_minimum_required(VERSION 3.16)
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(tester)

sdkconfig.defaults

CONFIG_ESP_CONSOLE_USB_SERIAL_JTAG=y
CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y

CONFIG_ESP32_ENABLE_COREDUMP_TO_FLASH=y

# Optionally enable this to print more details.
# CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y

# -O0
CONFIG_COMPILER_OPTIMIZATION_NONE=y
# -Og (Default)
# CONFIG_COMPILER_OPTIMIZATION_DEBUG=y
# CONFIG_COMPILER_OPTIMIZATION_DEFAULT=y
# -Os
# CONFIG_COMPILER_OPTIMIZATION_SIZE=y

# CONFIG_COMPILER_OPTIMIZATION_ASSERTIONS_ENABLE=y

# Enable and allocate a dedicated stack for coredump code to handle the overflowed stacks safely.
CONFIG_ESP_COREDUMP_STACK_SIZE=1536
CONFIG_ESP32_CORE_DUMP_STACK_SIZE=1536

andylinpersonal commented 9 months ago

Extra log files: logs.tar.gz

andylinpersonal commented 9 months ago

Additional note for ESP32-S3 with -O0:

Coredump cannot even continue with enlarged stack
- With a bit of esp_rom_printf, it seems like S3 hanged near esp_core_dump_replace_sp() and xthal_window_spill(). At least esp_rom_printf cannot print anything anymore after xthal_window_spill().

andylinpersonal commented 8 months ago

Update some log files of v5.3 to c460e1cd7e20063ef600107fb58555f597e61b7e.
Add another examination method.

test.tar.gz

andylinpersonal commented 8 months ago

Any update? Thanks.

erhankur commented 8 months ago

@andylinpersonal Thanks for reporting the issue. We have some findings regarding stack replacement on both xtensa and riscv chips. I will wrap up what we did and explain here next week.

Regarding overflow scenarios, I am sorry, I couldn't find a time to investigate yet. Will check it soon.

AxelLin commented 5 months ago

@erhankur Any follow up of https://github.com/espressif/esp-idf/issues/13219#issuecomment-2027991578 ?

espressif / esp-idf

Coredump failed on overflowed tasks due to excessive stack usage of panic handler (IDFGH-12163) #13219

Title

IDF version.

Espressif SoC revision.

Operating System used.

How did you build your project?

Development Kit.

Power Supply used.

What is the expected behavior?

What is the actual behavior?

Steps to reproduce.

Debug Logs.

More Information.

Additional note for ESP32-S3 with -O0: