espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.

mesh node stuck and keep repeating packet (IDFGH-3963) #5849

Closed: sinope-db closed this issue 2 years ago

sinope-db commented 4 years ago

Environment

Problem Description

I encountered an issue where the mesh nodes seem to be stuck transmitting a packet on the mesh (see logs).

What is particularly odd is that every time the issue occurred, the log on the root reported this: xseqno:268435455 (0x0FFFFFFF in hex).

Leaving the root node powered down for a whole night didn't solve the issue; it came back the next morning. I had to power down the nodes for it to stop.

On the root node, the WiFi task consumes most, if not all, of the CPU time when the issue is present.
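
(A minimal way to confirm per-task CPU usage, assuming CONFIG_FREERTOS_USE_TRACE_FACILITY, CONFIG_FREERTOS_USE_STATS_FORMATTING_FUNCTIONS and CONFIG_FREERTOS_GENERATE_RUN_TIME_STATS are enabled in menuconfig; dump_cpu_usage is a hypothetical helper name:)

#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

static void dump_cpu_usage(void)
{
    /* One line per task: name, total run time, percentage of total.
     * The formatter needs roughly 40 bytes per task. */
    static char stats[1024];
    vTaskGetRunTimeStats(stats);
    printf("Task            Abs time        %% time\n%s", stats);
}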

Expected Behavior

Retransmissions would eventually stop. (Instead, the packet TTL seems to be reset rather than decremented.)

Actual Behavior

The packet(s) are flooding the mesh, seemingly forever.

Steps to reproduce

My setup is as follows:

Debug Logs

This is part of the log from the root; I don't have a log from the nodes at the moment.

[Wed Sep 09 14:20:40.497 2020] I (17778) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:20:d6:64, len:42<hlen:28, xseqno:4, ttl:49>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.498 2020] I (17798) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.501 2020] I (17818) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.502 2020] I (17828) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.504 2020] I (17848) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.505 2020] I (17868) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:21:a8:d8, len:42<hlen:28, xseqno:1, ttl:49>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.506 2020] I (17888) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.507 2020] I (17908) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.508 2020] I (17928) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.511 2020] I (17938) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.512 2020] I (17958) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:20:ba:ec, len:202<hlen:188, xseqno:268435455, ttl:47>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.513 2020] I (17968) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.515 2020] I (18008) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.516 2020] I (18018) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.517 2020] I (18038) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:21:a8:d8, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.519 2020] I (18058) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.520 2020] I (18078) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.521 2020] I (18088) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.523 2020] I (18108) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.524 2020] I (18128) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:22:e1:b0, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.933 2020] I (18148) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.934 2020] I (18168) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.935 2020] I (18188) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.935 2020] I (18198) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.937 2020] I (18218) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:20:ba:ec, len:202<hlen:188, xseqno:268435455, ttl:47>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.938 2020] I (18238) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.941 2020] I (18258) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.942 2020] I (18268) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.943 2020] I (18288) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.944 2020] I (18308) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:20:ba:ec, len:202<hlen:188, xseqno:268435455, ttl:47>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.945 2020] I (18328) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.947 2020] I (18348) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.948 2020] I (18358) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.950 2020] I (18378) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.951 2020] I (18388) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:21:a8:d8, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.952 2020] I (18408) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.953 2020] I (18428) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.954 2020] I (18448) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.955 2020] I (18458) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.957 2020] I (18478) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:22:e1:b0, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:40.959 2020] I (18498) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.960 2020] I (18518) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.961 2020] I (18538) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:40.962 2020] I (18558) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:40.963 2020] I (18578) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:26:be:08, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.382 2020] I (18598) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.383 2020] I (18618) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.384 2020] I (18628) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.385 2020] I (18648) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.385 2020] I (18658) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:20:ba:ec, len:202<hlen:188, xseqno:268435455, ttl:47>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.387 2020] I (18678) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.391 2020] I (18718) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.392 2020] I (18728) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.393 2020] I (18748) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:22:e1:b0, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.394 2020] I (18768) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.396 2020] I (18788) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.397 2020] I (18798) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.400 2020] I (18818) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.401 2020] I (18838) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:20:ba:ec, len:202<hlen:188, xseqno:268435455, ttl:47>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.402 2020] I (18858) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.403 2020] I (18878) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.404 2020] I (18898) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.405 2020] I (18908) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.407 2020] I (18928) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:26:90:d4, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.409 2020] I (18948) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.410 2020] I (18968) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.411 2020] I (18978) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.412 2020] I (18998) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.413 2020] I (19018) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:22:e1:b0, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.807 2020] I (19038) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.808 2020] I (19058) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.809 2020] I (19078) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.810 2020] I (19098) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.811 2020] I (19108) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:22:e1:b0, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Wed Sep 09 14:20:41.812 2020] I (19128) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.813 2020] I (19148) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.816 2020] I (19158) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.817 2020] I (19178) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0

Other items if possible

sdkconfig-mesh-node.txt sdkconfig-root_mesh.txt

Alvin1Zhang commented 4 years ago

Thanks for reporting and for the detailed report; we will look into it.

sinope-db commented 4 years ago

I was lucky to catch it happening while logging one of the nodes close to the root.

On the root this happened:

[Fri Sep 11 16:01:01.305 2020] I (193321) MeshIf: Routing table update; total_num: 98, table_size: 98
[Fri Sep 11 16:01:03.789 2020] I (195731) [mwifi, 226]: Routing table is changed by removing leave children remove_num: 1, total_num: 97
[Fri Sep 11 16:01:03.789 2020] I (195741) MeshIf: event_loop_cb, event: 6

[Fri Sep 11 16:01:03.805 2020] I (195751) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:21:ad:ac, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
                               I (195761) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Fri Sep 11 16:01:03.805 2020] I (195781) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Fri Sep 11 16:01:03.805 2020] I (195791) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Fri Sep 11 16:01:03.805 2020] I (195811) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Fri Sep 11 16:01:03.836 2020] I (195821) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:21:ad:ac, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d
[Fri Sep 11 16:01:03.836 2020] I (195841) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Fri Sep 11 16:01:03.836 2020] I (195851) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Fri Sep 11 16:01:03.836 2020] I (195871) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Fri Sep 11 16:01:03.836 2020] I (195891) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0

On the node 24:0a:c4:21:ad:ac, this happened at the same time:

[Fri Sep 11 16:00:59.445 2020] I (10194134) MeshIf: Routing table update; total_num: 96, table_size: 54
[Fri Sep 11 16:00:59.523 2020] I (10194424) wifi: n:11 2, o:11 2, ap:11 2, sta:11 2, prof:11
[Fri Sep 11 16:00:59.523 2020] I (10194424) wifi: station: 24:0a:c4:20:c2:a4 join, AID=6, bgn, 40D
[Fri Sep 11 16:00:59.523 2020] WIFI event 16 - 
[Fri Sep 11 16:00:59.523 2020] I (10194454) [mwifi, 220]: Routing table is changed by adding newly joined children add_num: 1, total_num: 55
[Fri Sep 11 16:00:59.773 2020] I (10194454) MeshIf: event_loop_cb, event: 5
[Fri Sep 11 16:00:59.773 2020] I (10194454) MeshIf: Routing table update; total_num: 96, table_size: 55
[Fri Sep 11 16:00:59.852 2020] I (10194734) MeshIf: event_loop_cb, event: 3
[Fri Sep 11 16:00:59.867 2020] <AppMsg MESH_CHILD_CONNECTED>
[Fri Sep 11 16:01:00.586 2020] I (10195614) [mwifi, 226]: Routing table is changed by removing leave children remove_num: 1, total_num: 54
[Fri Sep 11 16:01:00.898 2020] I (10195614) MeshIf: event_loop_cb, event: 6
[Fri Sep 11 16:01:00.898 2020] I (10195614) MeshIf: Routing table update; total_num: 97, table_size: 54

[Fri Sep 11 16:01:02.570 2020] I (10197604) mesh: 1012[xrsp:1]the asked:13, max window:2, force to increase/decrease(up) xseqno:1 for child 24:0a:c4:20:c2:a4, xrsp_seqno:33184, heap:108080
[Fri Sep 11 16:01:02.648 2020] I (10197614) mesh: 1166[recv]cidx[5]24:0a:c4:20:c2:a4, xseqno duplicate check failure, new xseqno:13, in:1, out:1, pending:0
[Fri Sep 11 16:01:02.664 2020] I (10197614) mesh: 1175[recv]cidx[5]24:0a:c4:20:c2:a4 xseqno loss, current/new:0/13, in:1, out:1, pending:0

[Fri Sep 11 16:01:18.679 2020] I (10213434) [mwifi, 226]: Routing table is changed by removing leave children remove_num: 6, total_num: 48
[Fri Sep 11 16:01:18.695 2020] I (10213434) MeshIf: event_loop_cb, event: 6
[Fri Sep 11 16:01:18.695 2020] I (10213444) MeshIf: Routing table update; total_num: 97, table_size: 48
[Fri Sep 11 16:01:35.319 2020] I (10230064) [mwifi, 226]: Routing table is changed by removing leave children remove_num: 1, total_num: 47
[Fri Sep 11 16:01:35.319 2020] I (10230064) MeshIf: event_loop_cb, event: 6
[Fri Sep 11 16:01:35.319 2020] I (10230074) MeshIf: Routing table update; total_num: 97, table_size: 47
[Fri Sep 11 16:01:52.506 2020] I (10247264) [mwifi, 226]: Routing table is changed by removing leave children remove_num: 1, total_num: 46

After this, the node does not print any mesh messages other than children leaving, while the root is stuck receiving.

sinope-db commented 4 years ago

I updated ESP-IDF to v3.3.3 and ESP-MDF to v1.0, and the problem still occurs. I see this morning that ESP-IDF was just updated to v3.3.4; I'll give that a shot as well.

sinope-db commented 4 years ago

@Alvin1Zhang

The issue still occurs w/ ESP-IDF v3.3.4.

Can you provide information about the meaning and/or cause of these messages? Would increasing the WiFi buffers reduce the occurrence?

I found a way to reproduce it: my setup has 10 nodes in the same room as the root (they are probably the last hop for other nodes; 98 nodes in total). If I quickly power cycle 6 of the 10 nodes together (5-10 s off, then back on), the root node gets stuck with the same messages as described above. If I wait a minute or two before bringing the 6 nodes back online, everything is fine.

I also noted that the MAC addresses printed in the root console seem to be from the nodes in the same room (most likely because they're the last hop). How would I determine whether a message originates from these nodes or is merely relayed by them?
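
(For application data I assume the origin could be checked from the from argument of esp_mesh_recv(), which carries the address of the original sender rather than the last-hop relay — a sketch; though if these flooding packets are internal stack announcements, they would presumably never reach the application layer at all:)

#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_mesh.h"

static void mesh_rx_once(void)
{
    static uint8_t rx_buf[1500];
    mesh_addr_t from;
    mesh_data_t data = { .data = rx_buf, .size = sizeof(rx_buf) };
    int flag = 0;

    /* 'from' is the ORIGINATING node's address, so it can be compared
     * against the child MACs printed in the root console logs. */
    if (esp_mesh_recv(&from, &data, portMAX_DELAY, &flag, NULL, 0) == ESP_OK) {
        printf("origin %02x:%02x:%02x:%02x:%02x:%02x, len %d\n",
               from.addr[0], from.addr[1], from.addr[2],
               from.addr[3], from.addr[4], from.addr[5], data.size);
    }
}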

sinope-db commented 4 years ago

I found that blocking the UART/console might help reproduce the errors I'm seeing on my setup.

My application prints a lot of data on the console when devices join/leave (routing tables and the device list), and I've modified the console to forward its output to a TCP socket. I noticed that the issue was harder to reproduce when the forwarding to the TCP socket was disabled. To confirm, I tested by inserting a delay in the uart_write() function of the vfs component; indeed, the problem occurs a lot more often with the delay.

My modified uart_write() for the test:

static ssize_t uart_write(int fd, const void * data, size_t size)
{
    assert(fd >=0 && fd < 3);
    const char *data_c = (const char *)data;
    /*  Even though newlib does stream locking on each individual stream, we need
     *  a dedicated UART lock if two streams (stdout and stderr) point to the
     *  same UART.
     */
    _lock_acquire_recursive(&s_uart_write_locks[fd]);
    for (size_t i = 0; i < size; i++) {
        int c = data_c[i];
        if (c == '\n' && s_tx_mode != ESP_LINE_ENDINGS_LF) {
            s_uart_tx_func[fd](fd, '\r');
            if (s_tx_mode == ESP_LINE_ENDINGS_CR) {
                continue;
            }
        }
        s_uart_tx_func[fd](fd, c);
    }

    //  Test a delay in the uart console output
    vTaskDelay(7);  //  70ms; 1 OS tick is 10ms

    _lock_release_recursive(&s_uart_write_locks[fd]);
    return size;
}

ESP-iPENCIL commented 4 years ago

Hi sinope-db,

Thanks for reporting this issue.

On the root, the log

[Fri Sep 11 16:01:03.805 2020] I (195751) mesh: 480[recv]push to rx q fail, ifx:AP, child:24:0a:c4:21:ad:ac, len:202<hlen:188, xseqno:268435455, ttl:46>, pending:64, err:0x400d

indicates that the "rx q" was full. "len:202<hlen:188," indicates this packet is an ADD/RMV announcement packet. The mesh RX task is responsible for popping the rx q, but it only has a task priority of 13. The following

[Wed Sep 09 14:20:41.409 2020] I (18948) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.410 2020] I (18968) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0
[Wed Sep 09 14:20:41.411 2020] I (18978) mesh: [TXQ]<max:32>up/be:0/0, down/be:0/0, mgmt:0, xon:0, flush up:0, bcast:0, up wnd:0
[Wed Sep 09 14:20:41.412 2020] I (18998) mesh: [RXQ]<max:32,cfg:32+extra:0>SELF:0, <max:32,cfg:32+extra:0>TODS:0

are printed from the WiFi task, which has a higher priority than the mesh RX task.

Three points need your help to confirm:

1) Disable the logs that follow "push to rx q fail" to check whether that helps the mesh RX task. (How can I send you the libraries with these logs disabled?)
2) When "push to rx q fail" occurs on the root, is the mesh RX task blocked forever, or does it resume by itself? If it is blocked forever, the routing table will no longer be updated, so you can power off any mesh node to check whether the root's mesh RX task is blocked forever.
3) Find the reason that causes "push to rx q fail" to occur the first time.
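
For point 1, as a quicker test than swapping libraries, you could also try lowering the mesh log verbosity at runtime — a sketch, assuming those prints go through the standard esp_log machinery under the "mesh" tag (if the WiFi libraries use their own print wrapper, this will have no effect and a patched library would indeed be needed):

#include "esp_log.h"

/* Suppress the INFO-level "mesh: ..." prints at runtime. Assumes the
 * prints honor the standard per-tag log level; otherwise a rebuilt
 * library without the logs is required. */
void quiet_mesh_logs(void)
{
    esp_log_level_set("mesh", ESP_LOG_WARN);
}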

sinope-db commented 4 years ago

@ESP-iPENCIL thank you for the follow up.

  1. I'm not sure I understand correctly. To stop the log after the error, I would need a way to detect the failure.

  2. I think I have seen both behaviors, and it may depend on how long the issue was ongoing. If I reproduce it by power cycling and then remove another unit, the table is updated and eventually the issue goes away. But I have seen cases where the issue was ongoing for a few days and I had to shut down the whole network for it to stop.

  3. I'm running on a single core, and console usage seems to have an impact on the occurrence (see previous comment). I have a large number of devices in relatively close proximity to one another. In my opinion, the system should recover from a queue-full event if it does get to that point.

ESP-iPENCIL commented 4 years ago

@sinope-db Regarding "I'm not sure I understand correctly. To stop the log after the error, I would need a way to detect the failure." — I can give you a WiFi lib without the log, but I don't know how to get it to you.

When the issue was ongoing, did other mesh nodes still send data to the root node, or did they stop sending? If they kept sending data to the root, the issue will never have an opportunity to recover by itself. Regarding "If I reproduce by power cycling and remove another unit, the table is updated and eventually the issue goes away" — I meant power cycling other mesh nodes to check whether the routing table on the root is updated accordingly, to confirm whether the mesh RX task on the root is blocked forever or simply cannot handle so many packets received from a higher-priority task. If the latter, we can consider increasing the priority of the mesh RX task or increasing the size of the RX queue; if the former, we should find the root cause of the mesh RX task blocking.

What is the priority of the console? It should be lower than the mesh RX task. Regarding "In my opinion the system should recover from a queue-full event, if it does get to this point" — yes, the system should recover itself from a queue-full condition.
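
For reference, a sketch of what I mean by keeping application work below the mesh RX task — the priorities (mesh RX at 13, mesh NWK at 15) are the ones mentioned above, and app_worker is a hypothetical task name:

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Hypothetical application worker that does the heavy console/TCP
 * printing; priority 12 keeps it below the mesh RX task (13) and the
 * mesh NWK task (15). */
static void app_worker(void *arg)
{
    for (;;) {
        /* ... print routing tables, forward logs over TCP, etc. ... */
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}

void start_app_worker(void)
{
    xTaskCreate(app_worker, "app_worker", 4096, NULL, 12, NULL);
}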

sinope-db commented 4 years ago

@ESP-iPENCIL

I can give you a WiFi lib without the log, but I don't know how to get it to you.

Can't you just attach it to your reply (drag and drop)? If you want to exchange private information, you could send me a DM on the ESP32 forums; my user is Mr_Red: https://www.esp32.com/memberlist.php?mode=viewprofile&u=4487

When the issue was ongoing, did other mesh nodes still send data to the root node, or did they stop sending? If they kept sending data to the root, the issue will never have an opportunity to recover by itself.

The mesh nodes do not send application data aggressively (maybe once per minute in that state), which should leave plenty of time for the root to recover.

I meant power cycling other mesh nodes to check whether the routing table on the root is updated accordingly, to confirm whether the mesh RX task on the root is blocked forever or simply cannot handle so many packets received from a higher-priority task. If the latter, we can consider increasing the priority of the mesh RX task or increasing the size of the RX queue; if the former, we should find the root cause of the mesh RX task blocking.

This is what I did. Cycling a node results in an updated routing table on the root when the issue starts (I cycle 6 of them to trigger the issue, wait a couple of minutes, then cycle another one). But after the issue has been going on for a while (10+ minutes), I no longer see a routing table update when I cycle one of the nodes.

What is the priority of the console? It should be lower than the mesh RX task.

The modified console I was using forwarded output to a task of priority 17; I have removed that modification. I can still reproduce the issue with a simple delay in the console (see https://github.com/espressif/esp-idf/issues/5849#issuecomment-702293465). The console itself is not a task; it runs at the calling task's priority.
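
(One direction I'm considering to keep console writes from stalling whichever task happens to call printf: hand lines to a dedicated low-priority printer task and drop them when the queue is full — a hypothetical sketch, not what my application does today; log_async and printer_task are made-up names:)

#include <string.h>
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"

#define LOG_LINE_MAX 128

static QueueHandle_t s_log_q;

/* Callers enqueue and return immediately, so a slow UART no longer
 * stalls the calling task. Lines are dropped when the queue is full,
 * which seems preferable to blocking the mesh/WiFi tasks. */
void log_async(const char *line)
{
    char buf[LOG_LINE_MAX];
    strlcpy(buf, line, sizeof(buf));
    xQueueSend(s_log_q, buf, 0);
}

static void printer_task(void *arg)
{
    char buf[LOG_LINE_MAX];
    for (;;) {
        if (xQueueReceive(s_log_q, buf, portMAX_DELAY) == pdTRUE) {
            fputs(buf, stdout);
        }
    }
}

void log_async_init(void)
{
    s_log_q = xQueueCreate(32, LOG_LINE_MAX);
    xTaskCreate(printer_task, "printer", 3072, NULL, 2 /* very low */, NULL);
}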

What puzzles me is that when the issue occurs, the messages always end with "xseqno:268435455" (0x0FFFFFFF) and ttl:43-47. Is that value of xseqno valid? I would have expected varying values for packets sent by the nodes. The TTL seems to reset to 47; I would have expected it to decrease.

ESP-iPENCIL commented 4 years ago

Regarding "What puzzles me is that when the issue occurs, the messages always end with "xseqno:268435455" (0x0FFFFFFF) and ttl:43-47" — "xseqno:268435455" (0x0FFFFFFF) means the discarded packet is not a request packet of the uplink flow control. A TTL of 43-47 means the source node of the packet is in layer 8 to 4.
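
(That layer arithmetic implies an initial TTL of 50 — inferred from these numbers, not a documented constant: a packet originated at layer L reaches the root after L - 1 hops, so the root sees ttl = 50 - (L - 1); ttl 47 gives layer 4 and ttl 43 gives layer 8.)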

sinope-db commented 4 years ago

I tried lowering all my tasks to a maximum priority of 12, using the original console component. When I print the routing table on events, I see the issue. If I remove the routing table printing to reduce console activity, I don't see the issue.

ESP-iPENCIL commented 4 years ago

Regarding "When I print the routing table on events, I see the issue" — do you mean you print the routing table in the mesh_event_cb? If yes, I think the printing will seriously impact the mesh NWK task, which has a priority of 15, and in turn prevent the mesh RX task from being scheduled in time to handle the mesh RX queue. The mesh NWK task manages networking events such as STA connect/disconnect, AP client connect/disconnect, routing table add/delete, and so on. I recommend not executing additional code in the mesh_event_cb. This constraint is removed in release/v4.0 and later, because mesh_event_cb was removed and replaced with the new event library.
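
A rough sketch of the pattern I mean on v3.x — the callback only enqueues the event id and a low-priority worker does the printing (mesh_event_t with an .id field is the pre-v4.0 callback type; adapt for the v4.x esp_event API, and event_worker is a made-up name):

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "esp_mesh.h"

static QueueHandle_t s_evt_q;

/* Runs in the mesh NWK task (priority 15): do nothing slow here. */
static void mesh_event_cb(mesh_event_t event)
{
    xQueueSend(s_evt_q, &event.id, 0);
}

/* Low-priority worker: a safe place for routing table printing. */
static void event_worker(void *arg)
{
    mesh_event_id_t id;
    for (;;) {
        if (xQueueReceive(s_evt_q, &id, portMAX_DELAY) == pdTRUE) {
            /* ... print routing table, device list, etc. ... */
        }
    }
}

void event_worker_init(void)
{
    s_evt_q = xQueueCreate(16, sizeof(mesh_event_id_t));
    xTaskCreate(event_worker, "evt_worker", 4096, NULL, 5, NULL);
}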

sinope-db commented 4 years ago

Do you mean you print the routing table in the mesh_event_cb?

No. I'm using ESP-MDF, so the routing tables are printed in the MDF event loop task. But that task does have a priority of 15, so it probably has the same effect on the mesh RX task.

A TTL of 43-47 means the source node of the packet is in layer 8 to 4.

Why do those nodes get stuck sending those packets, and not the nodes from the lower layers (1-3)? If the lower layers can't process messages, shouldn't they somehow time out / stop sending?

So far I've managed to prevent the issue from occurring by reducing console usage and task priorities. However, I'm worried that the queue may still get full somehow and the system will not recover (as I have seen happen).

ESP-iPENCIL commented 4 years ago

The intermediate nodes have no right to discard forwarded packets, except for packets that exceed a predefined lifetime. If the application task priorities are lower than the mesh RX task, the system will recover.

Alvin1Zhang commented 2 years ago

Thanks for reporting; feel free to reopen.