Open KonssnoK opened 1 month ago
mh, i fear the stack size is not configurable for these internal MESH tasks...
@KonssnoK The MTX task stack size is 2660. What was the device doing when the stack overflow occurred?
The device was offline for 20 minutes. In these 20 minutes the device rebooted once because it was not able to find a mesh to connect to in 15 min. In these 15 min the device also had to reset the mesh stack multiple times because it was receiving events of disconnections from children that were not connected (this is some kind of bug that i worked around like this). At some point the device crashes and, after rebooting, decides to move to LTE because the WIFI is apparently not working correctly.
These are some logs that were collected while the device was offline. Please note that they are not in order and that the number between [] is the number of repetitions for that log during the online phase.
[8] Perform NO PARENT FIXED ROOT handover 0 0
[10] <MESH_EVENT_TODS_REACHABLE>reachable:1
[5] Disconnection from non registered child! AC1754000590
[5] Skipping wifi connectivity (count 1)
[24] Disconnection from non registered child! AC17540005AC
[8] Negative children count! Reset mesh...
[8] Mesh reset requested.
[4] Disconnection from non registered child! AC175400064C
[7] Disconnection from non registered child! AC17540002FC
[10] Disconnection from non registered child! AC1754000578
[7] Disconnection from non registered child! AC17540000D0
[3] MQTT_EVENT_ERROR 32769 0 0 1 119
[27] MQTT_EVENT_ERROR 32774 0 0 1 0
[1] Perform NO PARENT FIXED ROOT handover 0 1
[1] Child connection with no free slots! AC175400064C
[1] 540005AC -2
[4] 54000574 1
[1] 540000D0 -1
[1] 54000578 -1
[1] 540002FC -1
[1] 54000590 -1
[4] Disconnection from non registered child! AC1754000574
[7] MQTT_EVENT_ERROR 32769 0 0 1 11
[3] Skipping wifi connectivity (count 2)
[1] Couldn't connect for 600s. Restart.
[2] Sensor stored version 2
[2] No PRAM heat data to restore
[2] PDP context definitions: +CGDCONT: 1,"IP","orange.m2m.spec",,0,0
[2] Operator configuration code: 18
[1] LTE signal quality: -108.0,121.0,23.0,3.0,-7.0
[3] AT+CREG?: 0,1
[3] AT+CEREG?: 0,1
[3] AT+CGREG?: 0,0
[1] LTE signal quality: -108.0,121.0,23.0,3.0,-9.0
[3] Selected operator: 0,0,"Orange F",7
[3] Active LTE band(s): 0,0000000000000000080000
[3] GOT IP from ppp_sta
[3] Disconnection from non registered child! AC17540009F4
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 100 MESH_REASON_CYCLIC
[1] LTE signal quality: -105.0,123.0,23.0,3.0,-6.0
[1] LTE signal quality: -107.0,121.0,23.0,3.0,-7.0
[2] Child connection with no free slots! AC1754000578
[3] 540005AC 1
[3] 540009F4 -1
[3] 5400064C 1
[3] 54000590 1
[3] 540000D0 1
[1] Child connection with no free slots! AC17540002FC
[1] Triggering DYNAMIC MESH handover
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 101 MESH_REASON_PARENT_I
[1] <MESH_EVENT_PARENT_CONNECTED>layer:0-->4, parent:AC17540002FD, I
[1] MQTT_EVENT_ERROR 32794 76 0 1 0
[1] MQTT task stopped after 0ms
[4] STA: Send err 0x400A ESP_ERR_MESH_TIMEOUT
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 102 MESH_REASON_LEAF
[1] LTE signal quality: -105.0,123.0,23.0,23.0,-8.0
[1] <MESH_EVENT_PARENT_CONNECTED>layer:4-->1, parent:2266CF7C27B0<RO
[1] LTE signal quality: -140.0,123.0,23.0,23.0,-9.0
[1] GOT IP from sta
The mesh reset is performed as follows:
if (esp_mesh_is_root()) {
    ESP_LOGW(TAG, "(L%d " MACSTR ") Restart mesh!", esp_mesh_get_layer(), MAC2STR(device_own_mac));
    mesh_stop();
    mesh_init();
    mesh_start();
}
But i doubt this can trigger a stack issue on the MTX task... (heap i could understand)
Local variables, the saved context and return addresses of function calls, function parameters, the context saved before entering an interrupt handler, and interrupt nesting all require stack space.
In these 15 min the device also had to reset the mesh stack multiple times because it was receiving events of disconnections from children that were not connected (this is some kind of bug that i worked around like this).
And the handling of the disconnect event is in the MNWK task.
Do you mean that multiple interrupts are all stored on the same task stack? Isn't there a dedicated interrupt stack for nesting?
This would mean that if i have any task that has a very small stack, it would have to be large enough to hold the contexts for all the nested interrupts? Doesn't seem right.
@KonssnoK
Do you mean that multiple interrupts are all stored on the same task stack? Isn't there a dedicated interrupt stack for nesting?
ISRs are executed on a dedicated interrupt stack. On multi-core ESP targets, there is a dedicated interrupt stack for each core.
This would mean that if i have any task that has a very small stack, it would have to be large enough to hold the contexts for all the nested interrupts? Doesn't seem right.
Tasks only need to be large enough to contain their own callstacks plus a context frame. Whenever a task is preempted (either due to an interrupt, or by another task), the CPU context of the task that was interrupted is saved on that task's own stack.
When executing an ISR on the interrupt stack, if that ISR gets preempted by a higher priority interrupt (i.e., nested interrupts), the preempted ISR's CPU context is saved directly on the interrupt stack, and the interrupting ISR is then executed on the same interrupt stack. Thus, the interrupt stack must be large enough to support all nested interrupts.
The following diagram illustrates how task stacks and interrupt stacks are used.
     Some Task                         Interrupt Stack
  +--------------+                    +--------------+
  |    Func A    |                    | ISR 1 Func J |
  +--------------+                    +--------------+
  |    Func B    |                    | ISR 1 Func K |
  +--------------+                    +--------------+
| |    Func C    |                  | |              |
| +--------------+                  | | Level 1 ISR  |
| |              |                  | |   Context    |
v |     Task     |                  v |              |
  |    Context   |                    +--------------+
  |              |                    | ISR 2 Func X |
  +--------------+ <--Task SP         +--------------+
  |              |                    | ISR 2 Func Y |
  |     Free     |                    +--------------+
  |              |
  +--------------+
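The accounting described above can be sketched as a small back-of-the-envelope model. This is only an illustration: CONTEXT_FRAME, the function names, and any frame sizes passed in are hypothetical, and real context frame sizes depend on the target architecture.

```c
#include <stddef.h>

/* Hypothetical size of one saved CPU context, in bytes (illustrative only). */
#define CONTEXT_FRAME 128u

/* A task stack must hold the task's own call frames plus one context frame,
 * which is saved there whenever the task is preempted. */
static size_t task_stack_needed(const size_t *call_frames, size_t count)
{
    size_t total = CONTEXT_FRAME;
    for (size_t i = 0; i < count; i++) {
        total += call_frames[i];
    }
    return total;
}

/* The interrupt stack must hold the call frames of every nesting level, plus
 * one saved context for each ISR that can be preempted by a higher level
 * (every level except the innermost one). */
static size_t isr_stack_needed(const size_t *level_frames, size_t levels)
{
    size_t total = 0;
    for (size_t i = 0; i < levels; i++) {
        total += level_frames[i];
    }
    if (levels > 1) {
        total += (levels - 1) * CONTEXT_FRAME;
    }
    return total;
}
```

With made-up frames of 64, 96 and 48 bytes for Func A, B and C, task_stack_needed returns 336. With 80 and 112 bytes for the two ISR levels, isr_stack_needed returns 320. Neither result depends on how deep the interrupts nest from the task's point of view, which is the point made above: nesting costs interrupt stack, not task stack.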
ok @Dazza0 , this is exactly what i would expect.
I still don't understand how MTX could finish the stack tho 🤔
@KonssnoK Can you provide a demo for me to reproduce the issue?
@zhangyanjiaoesp not really, i found it while looking at crashes on the field, i have no idea how to reproduce.
what should i do to increase or reduce the stack of such task? Is the value exposed to the public code?
@KonssnoK The user can't change the stack size of the task. Can you provide the complete log of the device when the stack overflow happens?
@zhangyanjiaoesp only of the coredump, the devices are in the field and not attached to any terminal monitor
@KonssnoK What is the length of the sent packet?
@zhangyanjiaoesp we have multiple lengths, considering the device was offline when the issue occurred, we can assume all packets to be less than 200B in size, because MQTT is disconnected
@KonssnoK You can change the stack size of the MTX task in the following way.
Use the stack_depth you expect when the name is MTX.
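A minimal sketch of just the name check described here, assuming the task-creation call is intercepted somewhere so that the name and requested depth are visible (that hook is an assumption and is not shown). choose_stack_depth and MTX_STACK_DEPTH_OVERRIDE are hypothetical names, and 4096 is an arbitrary illustrative value, not a recommended size.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical override; choose a real value from measured stack usage. */
#define MTX_STACK_DEPTH_OVERRIDE 4096u

/* Return the stack depth to use for a task that is about to be created:
 * the override when the task is named "MTX", otherwise the depth that
 * the caller originally requested. */
static uint32_t choose_stack_depth(const char *task_name, uint32_t requested_depth)
{
    if (task_name != NULL && strcmp(task_name, "MTX") == 0) {
        return MTX_STACK_DEPTH_OVERRIDE;
    }
    return requested_depth;
}
```

Matching on the exact task name keeps every other task, including the other internal mesh tasks, at its original size.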
Hello @zhangyanjiaoesp. As per the title, we have some devices crashing due to a stack overflow in the MTX task.
7a00499cc9ae66a81ab1720a1ca1c50f2b1a04b7
is the sha of what this code is based on.
What can consume stack in the MTX task? Is the default value of 2448 supposed to be enough for any scenario? Should we increase it?
Thanks