espressif / esp-mdf

Espressif Mesh Development Framework. Maintenance is limited; https://github.com/espressif/esp-mesh-lite is recommended instead.

infinite transmit retry #262

Open BartoszKubiak opened 3 years ago

BartoszKubiak commented 3 years ago

ESP32-WROOM-32D

mdf> version
I (2854964) [mdebug_cmd, 53]: ESP-IDF version : v4.3.1-dirty
I (2854965) [mdebug_cmd, 54]: ESP-MDF version : v1.0-48-gdf0a825
I (2854976) [mdebug_cmd, 55]: compile time : Nov 17 2021 11:56:09
I (2854977) [mdebug_cmd, 56]: free heap : 73168 Bytes
I (2854987) [mdebug_cmd, 57]: CPU cores : 2
I (2854988) [mdebug_cmd, 58]: silicon revision : 1
I (2855000) [mdebug_cmd, 64]: feature : /802.11bgn/BLE/BT/External-Flash:4 MB

Mesh topology: fixed root (routerless) + 4 nodes. Each node transmits short data to the root every 10 seconds, and the root broadcasts the time every 30 seconds.

Steps to reproduce: power on everything -> wait for the mesh to build -> power off the root

I've observed that nodes sometimes go into an infinite transmit (retry?) loop when the recipient is unreachable. I think the node starts sending a frame before the disconnect event:

W (335821) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:3, no_wnd_count:0, timeout_count:0
W (337023) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:3, no_wnd_count:0, timeout_count:1
W (338225) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:3, no_wnd_count:0, timeout_count:2

and so on, with timeout_count increasing. Meanwhile the nodes can detect the mesh disconnect and reconnect, and they receive broadcast messages from the root, but transmission is still blocked (maybe because retransmit_enable = y). This symptom spreads to child nodes, and even to the root when it tries to read data directly from an infected node.

In most cases the node correctly detects the root disconnection:

W (1381445) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:27, no_wnd_count:0, timeout_count:0
W (1382647) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:27, no_wnd_count:0, timeout_count:1
W (1383849) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:27, no_wnd_count:0, timeout_count:2
W (1385051) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:27, no_wnd_count:0, timeout_count:3
W (1386253) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:27, no_wnd_count:0, timeout_count:4
I (1386750) [mwifi, 188]: Parent is disconnected, reason: 200
I (1386751) [MAIN, 1073]: event_loop_cb, event: 0x8
I (1386753) [MAIN, 1031]: Parent is disconnected = WIFI_REASON_BEACON_TIMEOUT
W (1387455) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:27, no_wnd_count:0, timeout_count:5
W (1387456) [mwifi, 707]: Node failed to send packets, dest_addr: ff:00:00:01:00:00, flag: 0x28, opt->type: 0x08, opt->len: 13, data->tos: 0, data: 0x3ffd8830, size: 49
W (1387478) [mwifi, 960]: Node failed to send packets, data_flag: 0x28, dest_mac: ff:00:00:01:00:00
W (1387539) [APP, 280]: [[[[[[ MESH DISCONNECTED ]]]]]]
I (1387960) [mwifi, 188]: Parent is disconnected, reason: 2
I (1387961) [MAIN, 1073]: event_loop_cb, event: 0x8
I (1387962) [MAIN, 962]: Parent is disconnected = WIFI_REASON_AUTH_EXPIRE
I (1389168) [mwifi, 188]: Parent is disconnected, reason: 2
I (1389169) [MAIN, 1073]: event_loop_cb, event: 0x8

BartoszKubiak commented 2 years ago

Further investigation shows that the network disconnect is not always detected on the node side. I ran many tests, powering off the root and observing what happens to the mesh network. In most cases the nodes generate the MDF_EVENT_MWIFI_PARENT_DISCONNECTED event and stop retransmitting the packet, but sometimes this does not happen. Generally I observe three cases:
1) a dozen warnings mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd + disconnect
2) infinite warnings mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd
3) lots of [mwifi, 887]: Current network has no root + no disconnect event

My problem is that I call mwifi_write() in my main application task, so when the call blocks it hangs my whole application. I've added a task watchdog as a temporary workaround.
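One way to keep a blocked send from stalling the main task is to hand messages to a separate sender task through a bounded outbox that never blocks the producer. The following is only a host-side sketch of that idea, not ESP-MDF code: `outbox_t`, `outbox_push`, `outbox_pop` and the size constants are hypothetical names, and the buffer is not thread-safe. On a device a FreeRTOS queue written with a zero timeout would likely play this role, with the sender task being the only place that calls mwifi_write().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes chosen for illustration only. */
#define OUTBOX_SLOTS   4
#define OUTBOX_MSG_MAX 64

typedef struct {
    uint8_t data[OUTBOX_MSG_MAX];
    size_t  size;
} outbox_msg_t;

/* Fixed-size FIFO ring buffer. Not thread-safe: a real firmware would use
 * an RTOS queue or add locking around push/pop. */
typedef struct {
    outbox_msg_t slots[OUTBOX_SLOTS];
    size_t       head;   /* index of the oldest message */
    size_t       count;  /* number of queued messages   */
} outbox_t;

void outbox_init(outbox_t *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Producer side (application task): never waits. Returns false when the
 * outbox is full or the message is too large, so the caller can drop or
 * count the message instead of hanging. */
bool outbox_push(outbox_t *q, const void *data, size_t size)
{
    assert(q != NULL && data != NULL);
    if (size > OUTBOX_MSG_MAX || q->count == OUTBOX_SLOTS) {
        return false;
    }
    outbox_msg_t *slot = &q->slots[(q->head + q->count) % OUTBOX_SLOTS];
    memcpy(slot->data, data, size);
    slot->size = size;
    q->count++;
    return true;
}

/* Consumer side (sender task): copies out the oldest message, or returns
 * false when there is nothing to send. */
bool outbox_pop(outbox_t *q, outbox_msg_t *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0) {
        return false;
    }
    *out = q->slots[q->head];
    q->head = (q->head + 1) % OUTBOX_SLOTS;
    q->count--;
    return true;
}
```

With this split, a send that never returns only stalls the sender task: the outbox fills up, further pushes fail fast, and the main task keeps running and can report the backlog.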

shenjun7 commented 2 years ago

In a fixed-root network, only layer-2 nodes detect the disconnection when the root disappears. At that point esp_mesh_send() returns ESP_ERR_MESH_DISCONNECTED. If esp_mesh_send() blocks for a long time on nodes in lower layers, you can call esp_mesh_send_block_time() before esp_mesh_start() to solve the problem.
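Once a send can return with an error instead of blocking forever, the caller still has to decide what to do with that error. Below is a minimal host-side sketch of a bounded retry policy, under stated assumptions: `send_result_t`, `send_bounded` and `scripted_send` are hypothetical names, not ESP-IDF or ESP-MDF APIs, and the result codes are stand-ins for `esp_err_t` values such as ESP_OK and ESP_ERR_MESH_DISCONNECTED. The scripted sender is a test double that replays fixed results so the policy can be run off-device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; on a device these would be derived from the
 * esp_err_t returned by the real send call. */
typedef enum { SEND_OK = 0, SEND_TIMEOUT, SEND_DISCONNECTED } send_result_t;

/* Abstract send function so the policy can be exercised without hardware. */
typedef send_result_t (*send_fn_t)(const uint8_t *data, size_t size, void *ctx);

typedef enum {
    TX_DELIVERED,
    TX_DROPPED_RETRIES,      /* every attempt timed out        */
    TX_DROPPED_DISCONNECTED  /* link reported as disconnected  */
} tx_outcome_t;

/* Try to send at most max_attempts times. A disconnect ends the attempt
 * immediately, because retrying cannot succeed until the node rejoins. */
tx_outcome_t send_bounded(send_fn_t send, void *ctx,
                          const uint8_t *data, size_t size,
                          unsigned max_attempts)
{
    assert(send != NULL && max_attempts > 0);
    for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        switch (send(data, size, ctx)) {
        case SEND_OK:
            return TX_DELIVERED;
        case SEND_DISCONNECTED:
            return TX_DROPPED_DISCONNECTED;
        case SEND_TIMEOUT:
            break; /* fall out of the switch and retry */
        }
    }
    return TX_DROPPED_RETRIES;
}

/* Test double: replays a fixed script of results, then keeps timing out. */
typedef struct {
    const send_result_t *script;
    size_t len;
    size_t calls;
} scripted_link_t;

send_result_t scripted_send(const uint8_t *data, size_t size, void *ctx)
{
    (void)data;
    (void)size;
    scripted_link_t *link = ctx;
    send_result_t result =
        link->calls < link->len ? link->script[link->calls] : SEND_TIMEOUT;
    link->calls++;
    return result;
}
```

The point of the design is that the total time spent in a send is capped at roughly the per-call block time multiplied by `max_attempts`, so the caller always regains control and can wait for a reconnect event before trying again.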