Open KonssnoK opened 4 months ago
@KonssnoK Please use this wifi lib to see whether the 2/205 error loop can be solved, thanks. wifi_lib_s3_0627.zip (wifi firmware version: ace69bc)
@zhangyanjiaoesp i confirm i'm currently not able to reproduce the 2/205 infinite loop on the pixel phone :) Did you find another error ?
I will continue to try to reproduce
@zhangyanjiaoesp currently the only bug I am able to reproduce is a child device becoming stuck on
W (155190) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:123
On the root node i cannot trigger much with the new library
240627dev2_3.txt 240627dev1_3.txt 240627dev3_3.txt
for reference, it seems that it's not that easy to trigger this error on the child as it was before the 0625 library, so it could be that the library fix of 0625 is not complete
@zhangyanjiaoesp i confirm i'm currently not able to reproduce the 2/205 infinite loop on the pixel phone :) Did you find another error ?
yes, I change the wifi code. For some routers, if the STA is powered off and restarted, the router still records the connection status of the STA, so when the STA is connected again to send the auth request, the router will not reply the auth response. In this case, we first send a deauth packet before the STA sends the auth request, so that the router can clear the previous connection state, and then the STA can send the auth request to the normal connection.
@zhangyanjiaoesp currently the only bug I am able to reproduce is a child device becoming stuck on
W (155190) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:123
On the root node i cannot trigger much with the new library
- dev3 root
- dev1 stuck
240627dev2_3.txt 240627dev1_3.txt 240627dev3_3.txt
for reference, it seems that it's not that easy to trigger this error on the child as it was before the 0625 library, so it could be that the library fix of 0625 is not complete
OK, I will follow up to solve this problem.
ok thanks, meanwhile i will continue some testing to see if something else appears.
apparently i cannot continue on the other issues until this is fixed.
@zhangyanjiaoesp do you have any way to check when this issue was introduced?
I would like to understand if this was always there (and means slow rate is a trigger we didn't check) or if instead it was introduced by a commit, and we could rollback devices in the field
@KonssnoK I have added some debug log for the stuck issue, please use the new wifi lib to test, thanks. wifi_lib_s3_debug_0701.zip wifi firmware version: fbf47d3
@zhangyanjiaoesp dev1 is stuck here without printing anything but
>>>mesh_nwk_task_main,5572, nwk.state = 21
240701dev1_1.txt 240701dev2_1.txt 240701dev3_1.txt
more logs of dev1 once stuck
I (157009) wifi:station: 7c:df:a1:e2:7b:84 leave, AID = 1, bss_flags is 658531, bss:0x3fcb8f24
I (157009) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,2>, prof:11
I (186309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>mesh_nwk_task_main,5572, nwk.state = 21
I (216309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>mesh_nwk_task_main,5572, nwk.state = 21
I (246309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>mesh_nwk_task_main,5572, nwk.state = 21
I (276309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>mesh_nwk_task_main,5572, nwk.state = 21
one thing i don't understand is that when i press reset on dev2 or dev3 (with dev1 root), also the other 2 devices stop sending data completely for some time. Is this expected? It seems like the timeout of the device i'm resetting is blocking the parent, which for 3 devices isn't much, but if i think at 6 devices per node, means a huge lag for no reason...
@zhangyanjiaoesp dev1 is stuck here without printing anything but
>>>mesh_nwk_task_main,5572, nwk.state = 21
240701dev1_1.txt 240701dev2_1.txt 240701dev3_1.txt
more logs of dev1 once stuck
I (157009) wifi:station: 7c:df:a1:e2:7b:84 leave, AID = 1, bss_flags is 658531, bss:0x3fcb8f24 I (157009) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,2>, prof:11 I (186309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER >>>mesh_nwk_task_main,3768, nwk.state = 21 >>>mesh_nwk_task_main,5572, nwk.state = 21 I (216309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER >>>mesh_nwk_task_main,3768, nwk.state = 21 >>>mesh_nwk_task_main,5572, nwk.state = 21 I (246309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER >>>mesh_nwk_task_main,3768, nwk.state = 21 >>>mesh_nwk_task_main,5572, nwk.state = 21 I (276309) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER >>>mesh_nwk_task_main,3768, nwk.state = 21 >>>mesh_nwk_task_main,5572, nwk.state = 21
You didn't reproduce the stuck issue on W (155190) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0
, but stuck on a new case, right? It seems that the wifi task stop running.
for reference, it seems that it's not that easy to trigger this error on the child as it was before the 0625 library, so it could be that the library fix of 0625 is not complete
Can you confirm that this problem would not occurred without the change in 0625 library?
Since we have more logs in this lib I tried it also with our code that uses both fixed root and dynamic one. There seem to be serious issues. 240701dev1_3_alt.txt 240701dev2_3_alt.txt
yes,
i did not reproduce the W (155190) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0
yet, i'll resume it now
no, also before 0625 i had devices getting stuck without printing anything. The way to recover was rebooting them, but since there was no direct way to trigger it, i didn't write about it
@zhangyanjiaoesp here is the infinite loop W (155190) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0
with library 0701
dev1 stuck dev2 stuck dev3 root
240701dev1_4.txt 240701dev3_4.txt 240701dev2_4.txt
>>>esp_mesh_push_to_tx_queue,741
>>>mesh_tx_task_main,252>>esp_mesh_push_to_tx_queue,745
W (211131) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:111
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
esp_mesh_push_to_tx_queue,745
I (211611) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>esp_mesh_forward_packet,241
I (211621) mesh: >>>[TO]dst:00:00:00:00:00:00, parent:7c:df:a1:e1:6a:d9
>>>esp_mesh_process_ucast,1102
>>>esp_mesh_push_to_tx_queue,695, tid = 5, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,733
>>>esp_mesh_push_to_tx_queue,741
>>>esp_mts_tpss_mti_,2_que
e,>4me
h_>xeta_kwm_ta,k_m4i
,5572, nwk.state = 21
W (212351) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:112
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
esp_mesh_push_to_tx_queue,745
W (213571) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:113
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
>>>esp_mesh_push_to_tx_q>e>me7h5tx_task_main,2524
I (214641) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>esp_mesh_forward_packet,241
I (214641) mesh: >>>[TO]dst:00:00:00:00:00:00, parent:7c:df:a1:e1:6a:d9
>>>esp_mesh_process_ucast,1102
>>>esp_mesh_push_to_tx_queue,695, tid = 5, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,733
>>>esp_mesh_push_to_tx_queue,741
>>>>se_hesx_pask_ta_nx2q4eu
,5572, nwk.state = 21k_ma4
W (214791) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:114
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
>>>mesh_tx_task_main,252>>esp_mesh_push_to_tx_queue,745
W (216011) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:115
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
esp_mesh_push_to_tx_queue,745
W (217231) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:116
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
esp_mesh_push_to_tx_queue,745
I (217661) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>esp_mesh_forward_packet,241
I (217661) mesh: >>>[TO]dst:00:00:00:00:00:00, parent:7c:df:a1:e1:6a:d9
>>>esp_mesh_process_ucast,1102
>>>esp_mesh_push_to_tx_queue,695, tid = 5, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,733
>>>esp_mesh_push_to_tx_queue,741
,5572, nwk.state = 21x234e
W (218451) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:117
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
>>>mesh_tx_task_main,252>>esp_mesh_push_to_tx_queue,745
W (219671) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:56, no_wnd_count:0, timeout_count:118
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>esp_mesh_push_to_tx_queue,730
>>>esp_mesh_push_to_tx_queue,741
>>>mesh_tx_task_main,25>>
sp_mesh_push_to_tx_queue,745
I (220691) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>esp_mesh_forward_packet,241
I (220691) mesh: >>>[TO]dst:00:00:00:00:00:00, parent:7c:df:a1:e1:6a:d9
@KonssnoK Can you capture the wifi packets via the Wireshark?
@KonssnoK add more debug log in this wifi lib: wifi_lib_s3_0701_2.zip wifi firmware version: a93bcab
@zhangyanjiaoesp
extension .pcapng
test1.txt
this is while dev1 dev2 are stuck
@zhangyanjiaoesp once again both dev1 and dev2 are stuck with new library 0701 02
W (459370) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:163, no_wnd_count:0, timeout_count:228
240701dev1_5.txt 240701dev3_5.txt 240701dev2_5.txt
Wireshark file
.pcapng
test2.txt
>>>esp_mesh_process_ucast,1102
>>>esp_mesh_push_to_tx_queue,695, tid = 5, tx_task_id = 0
es>_tx_h_swk_maik_2a2n,5572, nwk.state = 21
W (453270) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:163, no_wnd_count:0, timeout_count:223
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>mesh_tx_task_main,2524
>>>esp_mesh_push_to_tx_queue,745
W (454490) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:163, no_wnd_count:0, timeout_count:224
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>mesh_tx_task_main,2524
>>>esp_mesh_push_to_tx_queue,745
W (455710) mesh: [mesh_schedule.c,3146] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:163, no_wnd_count:0, timeout_count:225
>>>mesh_xon_process_disconnected,2649, (1, 1, 0)
>>>mesh_xon_task_main,3084
>>>esp_mesh_send_xon,1034
>>>esp_mesh_send_xon,1228
>>>esp_mesh_push_to_tx_queue,695, tid = 6, tx_task_id = 0
>>>mesh_tx_task_main,2524
>>>esp_mesh_push_to_tx_queue,745
I (456130) mesh: >>>8415[nwk]pending[0], state:MESH_NWK_TIMER
>>>mesh_nwk_task_main,3768, nwk.state = 21
>>>esp_mesh_forward_packet,241
I (456130) mesh: >>>[TO]dst:00:00:00:00:00:00, parent:7c:df:a1:e1:6a:d9
@KonssnoK
Can you print the tasks information when the stuck issue happen? you can refer the following code.
To include the core ID info, you should enable the FREERTOS_VTASKLIST_INCLUDE_COREID
option in menuconfig.
/* 'tasks' command prints the list of tasks and related information /
if WITH_TASKS_INFO
static int tasks_info(int argc, char *argv) { const size_t bytes_per_task = 40; / see vTaskList description / char task_list_buffer = malloc(uxTaskGetNumberOfTasks() * bytes_per_task); if (task_list_buffer == NULL) { ESP_LOGE(TAG, "failed to allocate buffer for vTaskList output"); return 1; }
fputs("Task Name\tStatus\tPrio\tHWM\tTask#", stdout);
ifdef CONFIG_FREERTOS_VTASKLIST_INCLUDE_COREID
fputs("\tAffinity", stdout);
endif
fputs("\n", stdout); vTaskList(task_list_buffer); fputs(task_list_buffer, stdout); free(task_list_buffer); return 0;
}
static void register_tasks(void) { const esp_console_cmd_t cmd = { .command = "tasks", .help = "Get information about running tasks", .hint = NULL, .func = &tasks_info, }; ESP_ERROR_CHECK( esp_console_cmd_register(&cmd) ); }
endif // WITH_TASKS_INFO
Wireshark file
.pcapng
test2.txt
By the way, the capture is on channel 10, but the mesh network is on channel 1,( (10669) mesh: [DONE]connect to router:KI, channel:1, rssi:-37, d2:0f:7d:8a:8b:29[layer:0, assoc:0],
) so there's no valid information in the capture.
i will print it periodically because i have no idea on how to know when the issue occurs.
Yeah, I will redo the capturing on the right channel.
Here are the logs with task status
240702dev1_1.txt 240702dev2_1.txt 240702dev3_1.txt
Currently I can-t use Wireshark dev1 stuck dev2 stuck
W (121404) ping: Task Name Status Prio HWM Task# Affinity
W (121414) ping: main X 1 1292 4 0
IDLE1 R 0 788 6 1
IDLE0 R 0 784 5 0
MTXON B 13 1072 13 -1
netif rx task B 5 916 17 -1
check button ta B 5 956 19 -1
ipc0 B 24 832 1 0
MRX B 13 1972 14 -1
ipc1 B 24 836 2 1
sys_evt B 20 4208 9 0
tiT B 18 376 8 -1
ping B 2 564 16 -1
mqtt_task B 5 3840 20 -1
mqtt task B 5 1088 18 -1
MNWK B 15 968 15 1
MTX B 13 1340 12 -1
esp_timer S 22 4216 3 0
wifi B 23 3852 10 0
Tmr Svc B 1 1400 7 0
MTXBLK B 7 1312 11 -1
@KonssnoK please use the new wifi lib to test, and add the patch. wifi_lib_s3_0702.zip wifi firmware version: 43cf051 0001-add-debug-log-in-event-loop.zip
dev2 stuck with lib 0702
240702dev3_2.txt 240702dev2_2.txt 240702dev1_2.txt
W (274340) ping: From 8.8.8.8 icmp_seq=68 timeout
W (275030) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:253, no_wnd_count:0, timeout_count:0
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (275240) mesh_main: Tried to publish layer:2 IP:10.0.0.3
W (276230) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:253, no_wnd_count:0, timeout_count:1
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (277430) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:253, no_wnd_count:0, timeout_count:2
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (277680) mesh: 5162<active>parent layer:1(node), channel:4, rssi:-3, assoc:0(cnx rssi threshold:-120)my_assoc:0
W (278630) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:253, no_wnd_count:0, timeout_count:3
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>wifi_event_post,2241
I (279180) wifi:state: run -> init (6c0)
>>>ieee80211_sta_new_state,253
>>>ieee80211_sta_new_state,371
>>>cnx_sta_leave,2006
I (279190) wifi:pm stop, total sleep time: 0 us / 8461895 us
I (279190) wifi:<ba-del>idx
I (279190) wifi:new:<4,0>, old:<4,1>, ap:<4,1>, sta:<4,1>, prof:4
>>>ieee80211_sta_new_state,559, post disconnect evt
>>>wifi_event_post,2241
W (279830) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:253, no_wnd_count:0, timeout_count:4
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
@KonssnoK please use this new patch to test, thanks. 0002-add-debug-log-in-event-loop.zip
i flashed the 3 devices aand started them, they are currently in a strange state where nobody can connect..
240703dev3_1.txt 240703dev1_1.txt 240703dev2_1.txt
dev2 reports
d,2638, (1, 1, 0)
I (156290) mesh: 1316[recv]invalid child 7c:df:a1:e1:6a:d8
I (156290) mesh: 1316[recv]invalid child 7c:df:a1:e1:6a:d8
I (156580) mesh: 1316[recv]invalid child 7c:df:a1:e1:6a:d8
I (156580) mesh: 1316[recv]invalid child 7c:df:a1:e1:6a:d8
I (156580) mesh: 1813[XON]async, from child 7c:df:a1:e1:6a:d8
W (157070) mesh: [mesh_schedule.c
dev3 is in an infinite loop
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
dev1 fails to send any data
E (192294) wifi_netif: TXW
W (192894) ping: From 8.8.8.8 icmp_seq=48 timeout
E (193894) ping_sock: send error=0
I try to recover them..
i think you added a log somewhere where not supposed to, maybe an ISR?
@zhangyanjiaoesp I'm not sure if the issue triggered with this code, i do have an infinite loop of
W (100970) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:3, no_wnd_count:43, timeout_count:0
but the counter resets, and no device is able to connect ever, limiter or no limiter
i'm trying to remove the patches and go back tot a functional environment, something is off
ok i had to reboot my phone somehow to recover connectivity.
@zhangyanjiaoesp here are the logs of a stuck device
more logs, dev1 without eventloop patch (i'm recompiling for this device)
dev2 in deadlock (you can see multiple task prints without any log between)
dev3 in infinite >>>mesh_xon_process_disconnected,2638, (1, 1, 0)
loop
240703dev3_4.txt 240703dev2_4.txt 240703dev1_4.txt
dev1 ok and now in same FW as the other 2 dev2 stuck again on the timoeut loop dev3 ok
i see that the eventloop is not running anymore and therefore not handling events, interesting..
more logs
dev1 stuck on
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
dev2 in deadlock
dev3 ok
ok i had to reboot my phone somehow to recover connectivity.
@zhangyanjiaoesp here are the logs of a stuck device
@KonssnoK In this log, it seems that the esp_event_loop_run()
didn't return when it handles the id=7 mesh event (MESH_EVENT_PARENT_CONNECTED). Are you using the mesh example or your own code? If it's your own code, please check why the mesh_event_handler()
didn't return when handling the MESH_EVENT_PARENT_CONNECTED
event.
@zhangyanjiaoesp i'm using the espressif code with the patches i posted above, nothing else
ip_internal_3.patch ip_internal_2.patch ip_internal_1.patch ip_internal_4.patch
here are the changes to the original example (plus the library 0702)
i added a couple of prints following your example, the system gets block between these lines:
ESP_LOGI(TAG, ">>>%s,%d\n",__func__,__LINE__);
netif_sta = create_mesh_link_sta();
// now we create a mesh driver and attach it to the existing netif
mesh_netif_driver_t driver = mesh_create_if_driver(false, false);
if (driver == NULL) {
ESP_LOGE(TAG, "Failed to create wifi interface handle");
return ESP_FAIL;
}
esp_netif_attach(netif_sta, driver);
start_mesh_link_sta();
// If we have a AP on NODE -> stop and remove it!
if (netif_ap) {
esp_netif_action_disconnected(netif_ap, NULL, 0, NULL);
mesh_delete_if_driver(esp_netif_get_io_driver(netif_ap));
esp_netif_destroy(netif_ap);
netif_ap = NULL;
}
ESP_LOGI(TAG, ">>>%s,%d\n",__func__,__LINE__);
added more logs, the block is coming from
esp_netif_action_connected(netif_sta, NULL, 0, NULL);
which i see calls esp_event_send_internal
which calls esp_event_post_to
with IP_EVENT
if the DHCP is not started
which i don-t see in the log.
We do have
dhcp_discover(): dhcp state is DISCOVER
tho,
which means we reach esp_netif_dhcpc_start(esp_netif);
which calls dhcp_start(struct netif *netif)
which calls dhcp_discover
more digging, adding more logs
dhcp_discover(): dhcp state is DISCOVER
dhcp_discover: making request
dhcp_discover: sendto(DISCOVER, IP_ADDR_BROADCAST, LWIP_IANA_PORT_DHCP_SERVER)
I (230951) esp_netif_lwip: SENDW mesh_link_sta 350 FFFFFFFFFFFF7CDFA1E27B84080045000150011200004011788C00000000FFFFFFFF00440043013C283A010106004F43B9EF00000000000000000000000000000000000000007CDFA1E27B840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000063825363350101390205DC0C096573707265737369663D07017CDFA1E27B84370401031C06FF00000000000000000000000000000000000000000000000000000000000000000000
the block is actually on
esp_err_t esp_netif_transmit_wrap(esp_netif_t *esp_netif, void *data, size_t len, void *pbuf)
{
...
ESP_LOGI(TAG,"SENDW %10s %4d %s", esp_netif->if_desc, len, buf);
...
return (esp_netif->driver_transmit_wrap)(esp_netif->driver_handle, data, len, pbuf);
}
which calls
esp_err_t err = esp_mesh_send(NULL, &data, MESH_DATA_TODS, NULL, 0);
which is never returning
@zhangyanjiaoesp we are back to the Mesh library as source of blocking
@KonssnoK Does calling the esp_mesh_send_block_time()
API solve the problem?
@zhangyanjiaoesp not sure how this would help..
does esp_mesh_send_block_time have a default value when not called? why would this occur only on a corner case?
I will try to put it with 5seconds timeout
i think the issue might have happened and the timeout recovered the situation. Not sure if this is how the system is supposed to work tho....
I-ll try to reproduce the error again
I (139081) mesh_mqtt: Other event id:7
I (139091) event: >>>esp_event_loop_run,630
W (139941) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:141, no_wnd_count:0, timeout_count:0
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (140891) mesh_main: Tried to publish layer:2 IP:10.0.0.3
W (141141) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:141, no_wnd_count:0, timeout_count:1
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (142341) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:141, no_wnd_count:0, timeout_count:2
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (143541) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:141, no_wnd_count:0, timeout_count:3
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (143901) event: >>>esp_event_post_to,848, evt base = WIFI_MESH_EVENT, evt id = 36
I (143911) event: >>>esp_event_post_to,898, send to q
I (143911) event: >>>esp_event_post_to,919
W (144741) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:141, no_wnd_count:0, timeout_count:4
E (144741) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
I (144741) esp_netif_lwip: SENDW mesh_link_sta 42 FFFFFFFFFFFF7CDFA1E27B84080600010800060400017CDFA1E27B840A0000030000000000000A000001
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (145571) wifi:state: run -> init (6c0)
>>>ieee80211_sta_new_state,253
>>>ieee80211_sta_new_state,371
>>>cnx_sta_leave,2006
I (145571) wifi:pm stop, total sleep time: 0 us / 9075657 us
I (145571) wifi:<ba-del>idx
I (145571) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,2>,prof:11
>>>ieee80211_sta_new_state,559, post disconnect evt
>>>wifi_event_post,2241
I (145591) event: >>>esp_event_post_to,848, evt base = WIFI_EVENT, evt id = 5
I (145591) event: >>>esp_event_post_to,898, send to q
I (145601) event: >>>esp_event_post_to,919
W (145961) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:142, no_wnd_count:0, timeout_count:5
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (147161) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:142, no_wnd_count:0, timeout_count:6
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (147971) mesh: 5162<active>parent layer:1(node), channel:11,rssi:-8, assoc:0(cnx rssi threshold:-120)my_assoc:0
W (148361) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:142, no_wnd_count:0, timeout_count:7
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>wifi_event_post,2241
I (149481) event: >>>esp_event_post_to,848, evt base = WIFI_EVENT, evt id = 1
I (149481) event: >>>esp_event_post_to,898, send to q
I (149481) event: >>>esp_event_post_to,919
W (149561) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:142, no_wnd_count:0, timeout_count:8
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (150761) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:142, no_wnd_count:0, timeout_count:9
E (150761) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
I (150761) esp_netif_lwip: SENDW mesh_link_sta 42 FFFFFFFFFFFF7CDFA1E27B84080600010800060400017CDFA1E27B840A0000030000000000000A000003
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (151981) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:143, no_wnd_count:0, timeout_count:10
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (153181) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:143, no_wnd_count:0, timeout_count:11
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (153761) ping: From 8.8.8.8 icmp_seq=55 timeout
W (154381) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:143, no_wnd_count:0, timeout_count:12
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (155581) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:143, no_wnd_count:0, timeout_count:13
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (156781) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:143, no_wnd_count:0, timeout_count:14
E (156781) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
I (156781) esp_netif_lwip: SENDW mesh_link_sta 42 FFFFFFFFFFFF7CDFA1E27B84080600010800060400017CDFA1E27B840A0000030000000000000A000001
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (158001) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:144, no_wnd_count:0, timeout_count:15
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (159201) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:144, no_wnd_count:0, timeout_count:16
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (160401) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:144, no_wnd_count:0, timeout_count:17
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (161601) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:144, no_wnd_count:0, timeout_count:18
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (162801) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:144, no_wnd_count:0, timeout_count:19
E (162801) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
I (162801) event: >>>esp_event_loop_run,640
I (162811) event: >>>esp_event_loop_run,575,base = WIFI_MESH_EVENT, id = 36
I (162821) event: >>>esp_event_loop_run,599
>>>mesh_wifi_event_cb,8045
I (162821) mesh: WIFI_EVENT_MESH_ROUTER, non-root
I (162831) mesh: WIFI_EVENT_MESH_ROUTER, last:WIFI_EVENT_STA_CONNECTED[STOP], reason:2
>>>mesh_wifi_event_cb,8063
>>>mesh_wifi_event_cb,8387, new_state = 44
I (162841) event: >>>esp_event_loop_run,640
i increased the timeout to 30sec
dev2 stuck in deadlock 240703dev2_11.txt
more logs, with 30seconds it-s more clear what is happening:
I (1096820) mesh_netif: >>>start_mesh_link_sta,328
dhcp_discover(): dhcp state is DISCOVER
I (1096840) esp_netif_lwip: SENDW mesh_link_sta 350 FFFFFFFFFFFF7CDFA1E16AD808004500015003A70000401175F700000000FFFFFFFF00440043013C725701010600550027D300000000000000000000000000000000000000007CDFA1E16AD80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000063825363350101390205DC0C096573707265737369663D07017CDFA1E16AD8370401031C06FF00000000000000000000000000000000000000000000000000000000000000000000
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1098100) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:6
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (1098110) mesh_main: Tried to publish layer:2 IP:10.0.0.2
I (1098110) mesh_mqtt: sent publish returned msg_id=38621
I (1098210) event: >>>esp_event_post_to,848, evt base = MQTT_EVENTS, evt id = 7
I (1098210) event: >>>esp_event_post_to,881
I (1098210) event: >>>esp_event_post_to,891, send to q
I (1098210) event: >>>esp_event_post_to,919
I (1098220) event: >>>esp_event_loop_run,575,base = MQTT_EVENTS, id = 7
I (1098230) event: >>>esp_event_loop_run,599
I (1098230) mesh_mqtt: Other event id:7
I (1098240) event: >>>esp_event_loop_run,630
W (1098990) ping: From 8.8.8.8 icmp_seq=382 timeout
W (1099300) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:7
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (1100110) mesh_main: Tried to publish layer:2 IP:10.0.0.2
W (1100500) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:8
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1101700) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:9
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (1102570) wifi:bcn_timeout,ap_probe_send_start
>>>wifi_event_post,2241
I (1102570) event: >>>esp_event_post_to,848, evt base = WIFI_EVENT, evt id = 21
I (1102580) event: >>>esp_event_post_to,898, send to q
I (1102580) event: >>>esp_event_post_to,919
I (1102860) wifi:AP's beacon interval = 102400 us, DTIM period= 1
I (1102860) event: >>>esp_event_post_to,848, evt base = WIFI_MESH_EVENT, evt id = 36
I (1102860) event: >>>esp_event_post_to,898, send to q
I (1102870) event: >>>esp_event_post_to,919
W (1102900) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:10
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1104100) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:11
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1105300) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:12
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (1106490) mesh: 5162<active>parent layer:1(node), channel:11, rssi:1, assoc:0(cnx rssi threshold:-120)my_assoc:0
W (1106500) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:13
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1107710) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:14
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>wifi_event_post,2241
I (1108000) event: >>>esp_event_post_to,848, evt base = WIFI_EVENT, evt id = 1
I (1108000) event: >>>esp_event_post_to,898, send to q
I (1108000) event: >>>esp_event_post_to,919
W (1108910) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:15
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1110110) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:16
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (1111120) wifi:state: run -> init (6c0)
>>>ieee80211_sta_new_state,253
>>>ieee80211_sta_new_state,371
>>>cnx_sta_leave,2006
I (1111130) wifi:pm stop, total sleep time: 0 us / 14581097 us
I (1111130) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,2>, prof:11
>>>ieee80211_sta_new_state,559, post disconnect evt
>>>wifi_event_post,2241
I (1111140) event: >>>esp_event_post_to,848, evt base = WIFI_EVENT, evt id = 5
I (1111150) event: >>>esp_event_post_to,898, send to q
I (1111160) event: >>>esp_event_post_to,919
W (1111310) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:17
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1112510) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:18
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1113710) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:19
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1114910) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:20
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1116110) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:21
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1117310) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:22
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1118510) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:23
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1119710) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:24
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1120910) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:25
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1122110) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:26
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1123310) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:27
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1124510) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:28
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1125710) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:29
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1126910) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:30
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (1128110) mesh: [mesh_schedule.c,3132] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1046, no_wnd_count:0, timeout_count:31
E (1128110) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
I (1128110) mesh_netif: >>>start_mesh_link_sta,330
I (1128120) mesh_netif: >>>mesh_netifs_start,460
I (1128120) event: >>>esp_event_post_to,848, evt base = IP_EVENT, evt id = 1
I (1128120) event: >>>esp_event_loop_run,640
I (1128130) event: >>>esp_event_post_to,898, send to q
@zhangyanjiaoesp how should we tackle this ?
for reference, i tried the same patches but with our own dynamic/fixed root code and at times it's impossible to form a dynamic mesh after building a fixed one with latest library and code. This was working before, we are missing something else.
Answers checklist.
General issue report
@zhangyanjiaoesp
based on 27ec26d2d3f44bbde5da14c7fdfc82226d567874 To reproduce
Connect to mobile hostspot wifi
Put phone simcard to 2G.
Wait
The issue happens in the ROOT. Once the root is affected, the issue is propagated to all children.
The issue is either temporary or permanent. When permanent, the only way to recover is to reboot.
espressif_wifi_dump_3.txt espressif_wifi_dump.txt espressif_wifi_dump_2.txt
Other related issues with similar behavior: https://github.com/espressif/esp-idf/issues/8953 https://github.com/espressif/esp-idf/issues/10506