espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

ESP32-C3 Stack Protection Debug Assist module triggering on SP load (IDFGH-13568) #14456

Open projectgus opened 2 weeks ago

projectgus commented 2 weeks ago

Answers checklist.

IDF version.

v5.2.2, also v5.2.2-639-g43098fc4de

Espressif SoC revision.

ESP32-C3 (QFN32) (revision v0.4)

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

SEEED XIAO ESP32-C3

Power Supply used.

USB

What is the expected behavior?

Load SP register with a valid address (inside the current task's stack region) without Debug Assist hardware Stack Protection triggering.

What is the actual behavior?

Loading the SP register seems to intermittently trigger a hardware stack protector interrupt. All of the reported addresses look valid for the running task, i.e. there was no stack overflow or SP corruption.

Steps to reproduce.

Reproduction currently requires the MicroPython master branch and some Python code that sends a lot of data over Wi-Fi. (The original bug is https://github.com/micropython/micropython/issues/15667)

It is probably possible to make a simpler reproducer. My best guess is that the key features are:

Note that all of the jumps are happening within the same task, and the stack pointer is saved and restored each time to/from a valid value for the current executing task.

Debug Logs.

Here's a sample crash:

MPY version : v1.24.0-preview.201.g24aa8ed762.dirty on 2024-08-28
IDF version : v5.2.2
Machine     : ESP32C3 module with ESP32C3

Guru Meditation Error: Core  0 panic'ed (Stack protection fault). 

Detected in task "mp_task" at 0x4200b1ee
0x4200b1ee: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55

Stack pointer: 0x3fca7ff0
Stack bounds: 0x3fca43a4 - 0x3fca83a0

Core  0 register dump:
Stack dump detected
MEPC    : 0x4200b200  RA      : 0x403829fa  SP      : 0x3fca7ff0  GP      : 0x3fc96e00  
0x4200b200: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55
0x403829fa: mp_execute_bytecode at /home/gus/ry/george/micropython/py/vm.c:285

TP      : 0x3fc6b838  T0      : 0x3fca7fa0  T1      : 0x40390f52  T2      : 0x0000003f  
0x40390f52: vTaskSuspend at /home/gus/ry/george/esp-idf-v5/components/freertos/FreeRTOS-Kernel/tasks.c:1960 (discriminator 1)

S0/FP   : 0x3fcabbe0  S1      : 0x3fcabc30  A0      : 0x3fca8010  A1      : 0x00000054  
A2      : 0x00000000  A3      : 0x3fcc99c0  A4      : 0x3fcc99c0  A5      : 0x3fca80e0  
A6      : 0x00000002  A7      : 0x21400000  S2      : 0x3c17034c  S3      : 0x3fcc9950  
S4      : 0x00000001  S5      : 0x00000062  S6      : 0x00000068  S7      : 0x3c16dc1c  
S8      : 0x0000001b  S9      : 0x3c16e000  S10     : 0x3c178419  S11     : 0x3c1781b6  
T3      : 0x00000000  T4      : 0x0003877f  T5      : 0x00000003  T6      : 0x00000001  
MSTATUS : 0x00001881  MTVEC   : 0x40380001  MCAUSE  : 0x0000001b  MTVAL   : 0x00004505  
0x40380001: _vector_table at ??:?

MHARTID : 0x00000000  

Backtrace:

0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
55          __asm volatile (
#0  0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
#1  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
ELF file SHA256: 7e6b188d6

Note that the Stack pointer address in the dump is valid for the bounds of the task.
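
As a quick sanity check of that claim, the bounds comparison can be written out explicitly. This is only a sketch: the helper names below are hypothetical (they are not ESP-IDF APIs), and the constants in the usage note are copied from the panic output above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an ESP-IDF API): returns true when sp lies inside
 * the inclusive range [stack_low, stack_high] printed as "Stack bounds". */
static inline bool sp_within_bounds(uint32_t sp, uint32_t stack_low, uint32_t stack_high)
{
    return sp >= stack_low && sp <= stack_high;
}

/* Hypothetical helper: distance in bytes from sp up to the upper bound. */
static inline uint32_t bytes_below_top(uint32_t sp, uint32_t stack_high)
{
    return stack_high - sp;
}

/* Hypothetical helper: distance in bytes from the lower bound up to sp. */
static inline uint32_t bytes_above_bottom(uint32_t sp, uint32_t stack_low)
{
    return sp - stack_low;
}
```

With the values from the dump, `sp_within_bounds(0x3fca7ff0, 0x3fca43a4, 0x3fca83a0)` is true: the reported SP sits 944 bytes below the upper bound and 15436 bytes above the lower bound. The pre-restore SP captured in T0 (0x3fca7fa0) is inside the same range.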

This crash dump was created with a couple of additions in the nlr_jump function to try and get extra debug info:

4200b184 <nlr_jump>:
        "sw   x2, 60(x10)       \n" // Store SP.
        "jal  x0, nlr_push_tail \n" // Jump to the C part.
        );
}

NORETURN void nlr_jump(void *val) {
4200b184:       1141                    addi    sp,sp,-16
4200b186:       c226                    sw      s1,4(sp)
4200b188:       c04a                    sw      s2,0(sp)
4200b18a:       c606                    sw      ra,12(sp)
4200b18c:       84aa                    mv      s1,a0
    MP_NLR_JUMP_HEAD(val, top)
4200b18e:       dc1fc0ef                jal     ra,42007f4e <mp_thread_get_state>
4200b192:       01452903                lw      s2,20(a0)
4200b196:       c422                    sw      s0,8(sp)
4200b198:       00091563                bnez    s2,4200b1a2 <nlr_jump+0x1e>
4200b19c:       8526                    mv      a0,s1
4200b19e:       ec2fa0ef                jal     ra,42005860 <nlr_jump_fail>
4200b1a2:       842a                    mv      s0,a0
4200b1a4:       00992223                sw      s1,4(s2)
4200b1a8:       854a                    mv      a0,s2
4200b1aa:       71a420ef                jal     ra,4204d8c4 <nlr_call_jump_callbacks>
4200b1ae:       00092783                lw      a5,0(s2)
4200b1b2:       c85c                    sw      a5,20(s0)
    __asm volatile (
4200b1b4:       854a                    mv      a0,s2
4200b1b6:       000102b3                add     t0,sp,zero  // Note: stored pre-restore SP to t0
4200b1ba:       00852083                lw      ra,8(a0)
4200b1be:       4540                    lw      s0,12(a0)
4200b1c0:       4904                    lw      s1,16(a0)
4200b1c2:       01452903                lw      s2,20(a0)
4200b1c6:       01852983                lw      s3,24(a0)
4200b1ca:       01c52a03                lw      s4,28(a0)
4200b1ce:       02052a83                lw      s5,32(a0)
4200b1d2:       02452b03                lw      s6,36(a0)
4200b1d6:       02852b83                lw      s7,40(a0)
4200b1da:       02c52c03                lw      s8,44(a0)
4200b1de:       03052c83                lw      s9,48(a0)
4200b1e2:       03452d03                lw      s10,52(a0)
4200b1e6:       03852d83                lw      s11,56(a0)
4200b1ea:       03c52103                lw      sp,60(a0)
4200b1ee:       0001                    nop  // <-- address the Debug Assist reports
4200b1f0:       0001                    nop
4200b1f2:       0001                    nop
4200b1f4:       0001                    nop
4200b1f6:       0001                    nop
4200b1f8:       0001                    nop
4200b1fa:       0001                    nop
4200b1fc:       0001                    nop
4200b1fe:       0001                    nop
4200b200:       4505                    li      a0,1  // <-- MEPC when the protection actually triggers
4200b202:       00008067                ret
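
The distance between the two annotated addresses in the listing above can be checked with a little arithmetic. This is a sketch only: the macro and function names are made up for illustration, and the addresses are copied from the listing.

```c
#include <stdint.h>

/* Addresses copied from the annotated disassembly. */
#define LW_SP_ADDR    0x4200b1eaU  /* lw sp,60(a0): 4-byte instruction that loads SP */
#define REPORTED_ADDR 0x4200b1eeU  /* address the Debug Assist reports */
#define MEPC_ADDR     0x4200b200U  /* MEPC when the protection actually triggers */

/* Hypothetical helper: byte distance between two code addresses. */
static inline uint32_t bytes_between(uint32_t from, uint32_t to)
{
    return to - from;
}

/* Hypothetical helper: how many 2-byte compressed instructions fit in the gap. */
static inline uint32_t compressed_insns_between(uint32_t from, uint32_t to)
{
    return (to - from) / 2u;
}
```

Here `bytes_between(LW_SP_ADDR, REPORTED_ADDR)` is 4, so the reported address is the instruction immediately after the SP load, and `compressed_insns_between(REPORTED_ADDR, MEPC_ADDR)` is 9, which matches the nine `nop` instructions inserted between the load and the `li a0,1` where MEPC points.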

More Information.

Happy to try anything you recommend, might even be able to provide a C reproducer that uses setjmp/longjmp.
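
A possible starting point for such a reproducer is sketched below. It only exercises the same pattern (save SP, go deeper into the stack, restore SP by a non-local jump within one task) and has not been confirmed to trigger the fault. `run_jumps` and `deep_jump` are hypothetical names, and the Wi-Fi traffic that seems to be involved is not part of the sketch.

```c
#include <setjmp.h>

static jmp_buf jump_env;  /* Saved context, including SP, for the jump target. */

/* Recurse `depth` frames so SP moves away from the saved value, then jump back. */
static void deep_jump(int depth)
{
    volatile char pad[32];  /* Forces a real stack frame at every level. */
    pad[0] = (char)depth;
    if (depth == 0) {
        longjmp(jump_env, 1);  /* Restores the SP saved by setjmp. */
    }
    deep_jump(depth - 1);
    pad[1] = pad[0];  /* Use after the call so it cannot become a tail call. */
}

/* Hypothetical driver: performs `iterations` setjmp/longjmp round trips and
 * returns how many of them landed back at the setjmp site. */
unsigned run_jumps(unsigned iterations, int depth)
{
    volatile unsigned landed = 0;
    for (volatile unsigned i = 0; i < iterations; i++) {
        if (setjmp(jump_env) == 0) {
            deep_jump(depth);  /* Does not return normally. */
        } else {
            landed++;  /* Reached via longjmp with SP restored. */
        }
    }
    return landed;
}
```

On target this would presumably be called in a loop from the task under test, for example `run_jumps(1000, 8)`, while interrupts and network traffic are active. Every jump stays inside the calling task's stack, as in the MicroPython case.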

Lapshin commented 2 weeks ago

Hi @projectgus , thank you for reporting!

That's a bizarre bug you caught. Could you please provide a reproducer?

I tried to reproduce it with this code while holding a key pressed :D

```C
#include "sdkconfig.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_chip_info.h"
#include "esp_flash.h"
#include "esp_system.h"
#include "esp_intr_alloc.h"
#include "soc/periph_defs.h"
#include "hal/uart_ll.h"

/* Drain the UART0 RX FIFO and clear the RX interrupt flags. */
void interrupt_handler(__attribute__((unused)) void *)
{
    int fifolen = uart_ll_get_rxfifo_len(&UART0);
    while (fifolen != 0) {
        unsigned char data;
        uart_ll_read_rxfifo(&UART0, &data, 1);
        fifolen--;
    }
    uart_ll_clr_intsts_mask(&UART0, UART_INTR_RXFIFO_FULL | UART_INTR_RXFIFO_TOUT);
}

void app_main(void)
{
    /* Key presses on the console generate UART0 RX interrupts. */
    esp_intr_alloc(ETS_UART0_INTR_SOURCE, 0, interrupt_handler, NULL, NULL);
    while (1) {
        /* Save SP to memory, move SP, then reload it with lw. */
        asm volatile("add t0,sp,zero");
        asm volatile("sw sp,16(t0)");
        asm volatile("addi sp,sp,-100");
        asm volatile("nop");
        asm volatile("nop");
        asm volatile("lw sp,16(t0)");
        vTaskDelay(50 / portTICK_PERIOD_MS);
    }
}
```

But I could not reproduce it with v5.2.2 (3b8741b172).

projectgus commented 1 week ago

@Lapshin I haven't had any luck yet either; maybe it actually requires high Wi-Fi traffic. Will keep at it and let you know.