espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
12.89k stars 7.08k forks source link

[MESH] [v4.4] MTX stack overflow (IDFGH-12923) #13882

Open KonssnoK opened 1 month ago

KonssnoK commented 1 month ago

Answers checklist.

General issue report

hello @zhangyanjiaoesp As per title we have some devices crashing for stack oveflow in the MTX task.

7a00499cc9ae66a81ab1720a1ca1c50f2b1a04b7

is the sha of what this code is based on.

What can consume stack in the MTX task? Is the default value of 2448 supposed to be enough for any scenario? Should we increase it?

**==================== CURRENT THREAD STACK =====================
#0  0x403764c9 in panic_abort (details=0x3fcd8dbc "***ERROR*** A stack overflow in task MTX has been detected.") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x403846b8 in esp_system_abort (details=0x3fcd8dbc "***ERROR*** A stack overflow in task MTX has been detected.") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x40388cc5 in vApplicationStackOverflowHook (xTask=<optimized out>, pcTaskName=0x3fcd9820 "MTX") at /opt/esp-idf/components/freertos/port/xtensa/port.c:407
#3  0x4038693a in vTaskSwitchContext () at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
#4  0x4038476f in _frxt_dispatch () at /opt/esp-idf/components/freertos/port/xtensa/portasm.S:436
Backtrace stopped: Cannot access memory at address 0xa5a5a599

======================== THREADS INFO =========================
  Id   Target Id          Frame
* 1    process 1070438376 0x403764c9 in panic_abort (details=0x3fcd8dbc "***ERROR*** A stack overflow in task MTX has been detected.") at /opt/esp-idf/components/esp_system/panic.c:408
  2    process 1070551028 0x4217898e in esp_pm_impl_waiti () at /opt/esp-idf/components/esp_pm/pm_impl.c:855
  3    process 1070549128 0x4217898e in esp_pm_impl_waiti () at /opt/esp-idf/components/esp_pm/pm_impl.c:855
  4    process 1070444044 xQueueReceive (xQueue=0x3d8269e8, pvBuffer=0x3fcdac50, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1400
  5    process 1070504828 xQueueReceive (xQueue=0x3d80626c, pvBuffer=0x3fcebe20, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1400
  6    process 1070433400 0x40385a00 in xQueueSemaphoreTake (xQueue=0x3d806834, xTicksToWait=<optimized out>) at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
  7    process 1070419336 0x400559e0 in ?? ()
  8    process 1070432968 0x400559e0 in ?? ()
  9    process 1070428328 0x400559e0 in ?? ()
  10   process 1070522816 xQueueReceive (xQueue=0x3d830aec, pvBuffer=0x3fcedf60, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1400
  11   process 1070542428 0x400559e0 in ?? ()
  12   process 1070524280 xQueueReceive (xQueue=0x3fce9e4c, pvBuffer=0x3fcf7360, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1400
  13   process 1070362852 0x400559e0 in ?? ()
  14   process 1070505188 0x40385871 in xQueueReceive (xQueue=0x3d8063c8, pvBuffer=0x3fcecff0, xTicksToWait=<optimized out>) at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
  15   process 1070423796 xEventGroupWaitBits (xEventGroup=0x3d8252c0, uxBitsToWaitFor=6, xClearOnExit=0, xWaitForAllBits=1, xTicksToWait=4294967295) at /opt/esp-idf/components/freertos/event_groups.c:475
  16   process 1070438736 0x40385a00 in xQueueSemaphoreTake (xQueue=0x3fcf0778, xTicksToWait=<optimized out>) at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
  17   process 1070363212 0x40385a00 in xQueueSemaphoreTake (xQueue=0x3fcf0778, xTicksToWait=<optimized out>) at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
  18   process 1070439456 xQueueReceive (xQueue=0x3d8191bc, pvBuffer=0x3fcd0898, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1400
  19   process 1070553440 0x4038859b in prvProcessTimerOrBlockTask (xNextExpireTime=<optimized out>, xListWasEmpty=1) at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
  20   process 1070537968 xQueueSemaphoreTake (xQueue=0x3fcf168c, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1563
  21   process 1070535780 0x40385a00 in xQueueSemaphoreTake (xQueue=0x3fcf0e00, xTicksToWait=<optimized out>) at /opt/esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:38
  22   process 1070412792 0x400559e0 in ?? ()
  23   process 1070562952 0x400559e0 in ?? ()

       TCB             NAME PRIO C/B  STACK USED/FREE
---------- ---------------- -------- ----------------
0x3fcd97e8              MTX    13/13         2448/108
0x3fcf4ff4            IDLE1      0/0          640/880
0x3fcf4888            IDLE0      0/0          640/892
0x3fcdae0c        uart_task      5/5         688/3400
0x3fce9b7c              tiT    18/18         736/3868
0x3fcd8478        mqtt_task    10/10        1072/3008
0x3fcd4d88        processor      5/5         816/5324
0x3fcd82c8            modem      5/5         912/3180
0x3fcd70a8       supervisor    15/15         736/3356
0x3fcee1c0     task_mesh_rx    15/15         896/3188
0x3fcf2e5c        esp_timer    22/22         656/3432
0x3fcee778             wifi    23/23         688/5960
0x3fcc70e4         mqtt_app      5/5         752/3340
0x3fce9ce4          sys_evt    20/20         720/3880
0x3fcd5ef4          network    15/10        1120/2960
0x3fcd9950            MTXON    13/13        2000/1060
0x3fcc724c        mesh_task    10/10        2048/2040
0x3fcd9c20             MNWK    15/15        1456/3148
0x3fcf5960          Tmr Svc      1/1         640/1396
0x3fcf1cf0             ipc1    24/24          640/884
0x3fcf1464             ipc0    24/24          640/880
0x3fcd33f8          offline      3/3         736/3356
0x3fcf7e88          updater    15/15         736/7440

Thanks

KonssnoK commented 1 month ago

mh, i fear the stack size is not configurable for these internal MESH tasks...

zhangyanjiaoesp commented 1 month ago

@KonssnoK The MTX task stack size is 2660, what did you do when the stack overflow occur?

KonssnoK commented 1 month ago

@KonssnoK The MTX task stack size is 2660, what did you do when the stack overflow occur?

the device was offline for 20 minutes. In these 20 minutes the device rebooted once because it was not able to find a mesh to connect to in 15 min. In these 15 min the device also had to reset the mesh stack multiple times because it was receiving events of disconnections from children that were not connected (this is some kind of bug that i worked around like this). At some point the device crashes and after reboots decides to move to LTE because the WIFI is apparently not working correctly.

These are some logs that were collected while the device was offline. Please note that they are not in order and that the number between [] is the number of repetitions for that log during the online phase.

[8] Perform NO PARENT FIXED ROOT handover 0 0
[10] <MESH_EVENT_TODS_REACHABLE>reachable:1
[5] Disconnection from non registered child! AC1754000590
[5] Skipping wifi connectivity (count 1)
[24] Disconnection from non registered child! AC17540005AC
[8] Negative children count! Reset mesh...
[8] Mesh reset requested.
[4] Disconnection from non registered child! AC175400064C
[7] Disconnection from non registered child! AC17540002FC
[10] Disconnection from non registered child! AC1754000578
[7] Disconnection from non registered child! AC17540000D0
[3] MQTT_EVENT_ERROR 32769 0 0 1 119
[27] MQTT_EVENT_ERROR 32774 0 0 1 0
[1] Perform NO PARENT FIXED ROOT handover 0 1
[1] Child connection with no free slots! AC175400064C
[1] 540005AC -2
[4] 54000574 1
[1] 540000D0 -1
[1] 54000578 -1
[1] 540002FC -1
[1] 54000590 -1
[4] Disconnection from non registered child! AC1754000574
[7] MQTT_EVENT_ERROR 32769 0 0 1 11
[3] Skipping wifi connectivity (count 2)
[1] Couldn't connect for 600s. Restart.
[2] Sensor stored version 2
[2] No PRAM heat data to restore
[2] PDP context definitions: +CGDCONT: 1,"IP","orange.m2m.spec",,0,0
[2] Operator configuration code: 18
[1] LTE signal quality: -108.0,121.0,23.0,3.0,-7.0
[3] AT+CREG?: 0,1
[3] AT+CEREG?: 0,1
[3] AT+CGREG?: 0,0
[1] LTE signal quality: -108.0,121.0,23.0,3.0,-9.0
[3] Selected operator: 0,0,"Orange F",7
[3] Active LTE band(s): 0,0000000000000000080000
[3] GOT IP from ppp_sta
[3] Disconnection from non registered child! AC17540009F4
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 100 MESH_REASON_CYCLIC
[1] LTE signal quality: -105.0,123.0,23.0,3.0,-6.0
[1] LTE signal quality: -107.0,121.0,23.0,3.0,-7.0
[2] Child connection with no free slots! AC1754000578
[3] 540005AC 1
[3] 540009F4 -1
[3] 5400064C 1
[3] 54000590 1
[3] 540000D0 1
[1] Child connection with no free slots! AC17540002FC
[1] Triggering DYNAMIC MESH handover
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 101 MESH_REASON_PARENT_I
[1] <MESH_EVENT_PARENT_CONNECTED>layer:0-->4, parent:AC17540002FD, I
[1] MQTT_EVENT_ERROR 32794 76 0 1 0
[1] MQTT task stopped after 0ms
[4] STA: Send err 0x400A ESP_ERR_MESH_TIMEOUT
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 102 MESH_REASON_LEAF
[1] LTE signal quality: -105.0,123.0,23.0,23.0,-8.0
[1] <MESH_EVENT_PARENT_CONNECTED>layer:4-->1, parent:2266CF7C27B0<RO
[1] LTE signal quality: -140.0,123.0,23.0,23.0,-9.0
[1] GOT IP from sta

The mesh reset is performed as follows:

        if (esp_mesh_is_root()) {
            ESP_LOGW(TAG, "(L%d " MACSTR ") Restart mesh!", esp_mesh_get_layer(), MAC2STR(device_own_mac));
            mesh_stop();
            mesh_init();
            mesh_start();
        }

But i doubt this can trigger a stack issue on the MTX task... (heap i could understand)

zhangyanjiaoesp commented 1 month ago

Local variables, field protection and return addresses for function calls, function parameters, before entering interrupt functions, and interrupt nesting all require stack space.

In these 15 min the device also had to reset the mesh stack multiple times because it was receiving events of disconnections from children that were not connected (this is some kind of bug that i worked around like this).

And the handling of the disconnect event is in the MNWK task.

KonssnoK commented 1 month ago

Local variables, field protection and return addresses for function calls, function parameters, before entering interrupt functions, and interrupt nesting all require stack space.

In these 15 min the device also had to reset the mesh stack multiple times because it was receiving events of disconnections from children that were not connected (this is some kind of bug that i worked around like this).

And the handling of the disconnect event is in the MNWK task.

do you mean that multiple interrupts are all stored on the same task stack? Isn't it there a dedicated interrupt stack for nesting?

This would mean that if i have any task that has a very small stack, it would have to be large enough to hold the contextes for all the nested interrupts? doesn't seem right.

Dazza0 commented 1 month ago

@KonssnoK

do you mean that multiple interrupts are all stored on the same task stack? Isn't it there a dedicated interrupt stack for nesting?

ISRs are executed on a dedicated interrupt stack. On multi-core ESP targets, there is a dedicated interrupt stack for each core.

This would mean that if i have any task that has a very small stack, it would have to be large enough to hold the contextes for all the nested interrupts? doesn't seem right.

Tasks only need to be large enough to contain their own callstacks plus a context frame. Whenever a task is preempted (either due to an interrupt, or by another task), the CPU context of the task that was interrupt is saved on that task's own stack.

When executing an ISR on the interrupt stack, if that ISR gets preempted by a higher priority interrupt (i.e., nested interrupts), the preempted ISR's CPU context is saved directly on the interrupt stack, then then interrupting ISR is then executed on the same interrupt stack. Thus, the interrupt stack must be large enough to support all nested interrupts.

The following diagram illustrates how task stacks and interrupt stacks are used.

       Some Task                   Interrupt Stack
   +--------------+                +--------------+
   |    Func A    |                | ISR 1 Func J |
   +--------------+                +--------------+
   |    Func B    |                | ISR 1 Func K |
   +--------------+                +--------------+
|  |    Func C    |             |  |              |
|  +--------------+             |  | Level 1 ISR  |
|  |              |             |  |   Context    |
v  |    Task      |             v  |              |
   |   Context    |                +--------------+
   |              |                | ISR 2 Func X |
   +--------------+ <--Task SP     +--------------+
   |              |                | ISR 2 Func Y |
   |    Free      |                +--------------+
   |              |
   +--------------+                                
KonssnoK commented 1 month ago

ok @Dazza0 , this is exactly what i would expect.

I still don't understand how MTX could finish the stack tho 🤔

zhangyanjiaoesp commented 4 weeks ago

@KonssnoK Can you provide a demo for mes to reproduce the issue?

KonssnoK commented 4 weeks ago

@zhangyanjiaoesp not really, i found it while looking at crashes on the field, i have no idea how to reproduce.

what should i do to increase or reduce the stack of such task? Is the value exposed to the public code?

zhangyanjiaoesp commented 4 weeks ago

@KonssnoK The user can't change the stack size of the task. Can you provide the complete log of the device when the stack overflow happen?

KonssnoK commented 4 weeks ago

@zhangyanjiaoesp only of the coredump, the devices are in the field and not attached to any terminal monitor

zhangyanjiaoesp commented 4 weeks ago

@KonssnoK What is the length of the sent packet?

KonssnoK commented 4 weeks ago

@zhangyanjiaoesp we have multiple lengths, considering the device was offline when the issue occurred, we can assume all packets to be less than 200B in size, because MQTT is disconnected

zhangyanjiaoesp commented 4 weeks ago

@KonssnoK You can change the stack size of the MTX task in the following way. Use the stack_depth you except when the name is MTX. image