espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.7k stars 7.3k forks source link

[mesh] v4.4 Warnings when no wifi connection available (IDFGH-9865) #11187

Open KonssnoK opened 1 year ago

KonssnoK commented 1 year ago

Answers checklist.

General issue report

Hello, when we switch off the WIFI in our setups (LTE device acting as fixed ROOT), we start seeing the following warnings:

W (3747249) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1270, no_wnd_count:0, timeout_count:0        
W (3748450) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1270, no_wnd_count:0, timeout_count:1        
W (3749651) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1270, no_wnd_count:0, timeout_count:2        
W (3750852) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1270, no_wnd_count:0, timeout_count:3        
W (3752053) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1270, no_wnd_count:0, timeout_count:4 

What are they related to? i expect the root node to propagate window informations to the children 🤔 Could this impact performances?

KonssnoK commented 1 year ago

more info> i have a setup made of 4 devices, not all of them report this warning at the same time. it happens that only one device reports it and the others no. The devices are all close to each other

some more logs


I (11:58:44.201) mesh_netif: Interface WIFI
W (565327) wifi:scan number is 0
W (11:58:44.210) mesh_hand: scan_done 0 1 0
W (11:58:44.212) mesh_main: <MESH_EVENT_TODS_REACHABLE>reachable:0  
W (566546) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:25      
W (567747) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:26      
W (568948) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:27      
W (570149) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:28      
W (571350) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:29      
W (572551) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:30      
W (573752) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:31      
W (574953) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:32      
W (576154) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:33      
W (577355) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:34      
W (578556) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:35      
W (579757) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:36      
W (580958) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:37      
I (11:59:00.321) monitor: history_pop
W (11:59:00.321) mqtt_app: Disconnecting mqtt_app
I (11:59:00.322) mqtt_client: Client asked to disconnect
I (11:59:00.322) monitor: history_end
I (11:59:00.332) offline: object saved to PSRAM, free 2096323
W (582159) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:38      
W (583360) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:39      
W (584561) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:40      
W (585762) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:41      
W (586963) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:42      
W (588164) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:43      
W (589365) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:44      
W (590566) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:45      
W (591767) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:46      
W (592968) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:47      
W (594169) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:48      
W (595370) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:49      
W (596571) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:50      
W (597772) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:51      
W (598973) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:52      
W (600174) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:53      
W (601375) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:54      
W (602576) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:55      
W (603777) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:56      
W (604978) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:839, xseqno:65, no_wnd_count:0, timeout_count:57      
W (11:59:24.014) safety: emergency message sending not yet implemented!
zhangyanjiaoesp commented 1 year ago

@KonssnoK This log indicates that the mesh node's xon window is zero, and it requests xon window from the parent, but the parent doesn't response, so the request timeout. If the xon window is zero, the node can't send data out.

KonssnoK commented 1 year ago

@zhangyanjiaoesp uhm... i see this happen mainly when the wifi is off, i usually don't see this with wifi on unless devices are looking for the wifi.

Shouldn't this xon window independent from the external wifi?

zhangyanjiaoesp commented 1 year ago

@KonssnoK When the wifi off, the parent node can't send data out ,then it can't update the RX queue, this will affect the packet sending of its children. Please refer to this doc: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/esp-wifi-mesh.html?highlight=mesh#upstream-flow-control

KonssnoK commented 1 year ago

@zhangyanjiaoesp consider we see this when we set the devices to mesh_idle when switching from fixed root to dynamic root. There is no parent 🤔

zhangyanjiaoesp commented 1 year ago

@zhangyanjiaoesp consider we see this when we set the devices to mesh_idle when switching from fixed root to dynamic root. There is no parent 🤔

@KonssnoK Then this log won't impact the performances.

KonssnoK commented 1 year ago

@zhangyanjiaoesp ok, good to know, but is it expected when devices are idle (why would you do such thing if you don't have parents nor children)?

Last week we concentrated on stability for our deployed devices, we are improving but still haven't finished. We improved the code handling the mesh/lte handover, we might need to update the example shared on github. We are still investigating crashes and devices isolating themselves from the rest of the network.

This is why we were not active on github :)

zhangyanjiaoesp commented 1 year ago

@zhangyanjiaoesp ok, good to know, but is it expected when devices are idle (why would you do such thing if you don't have parents nor children)?

Maybe we can optimize it.

KonssnoK commented 1 year ago

Maybe we can optimize it.

Ok, we are just investigating what is going, once we have a running setup also for you should be quite easy to see such behaviors

KonssnoK commented 10 months ago

@zhangyanjiaoesp coming back to these logs.

We are currently testing our system when the WIFI is missing -> Forced fixed root shape

What i see is the following:

W (2997278) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:2240, no_wnd_count:0, timeout_count:0
I (15:51:31.834) processor: send status... (1000ms)
W (2998479) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:2240, no_wnd_count:0, timeout_count:1
W (15:51:32.260) mqtt_client: Written 88B in 2424ms

Whenever this happens, a packet takes more than 2 seconds to be sent out -> which i expect it to be 1200 + 1200 + the actual time it takes to send out the packet

W (2725277) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:2030, no_wnd_count:0, timeout_count:0
W (15:46:59.834) monitor: 3W3P:: rule deactivated
W (2726478) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:2030, no_wnd_count:0, timeout_count:1
W (2727679) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:2030, no_wnd_count:0, timeout_count:2
E (15:47:01.444) mesh_driver: STA: Send err 0x400A ESP_ERR_MESH_TIMEOUT
W (15:47:01.451) mqtt_client: Written 88B in 3615ms

again 1200 + 1200 + 1200 + the actual time to send the data.

More logs of the same issue

W (987780) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:520, no_wnd_count:0, timeout_count:0
W (988981) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:520, no_wnd_count:0, timeout_count:1
W (15:18:02.759) mqtt_client: Written 492B in 2420ms
I (15:18:42.338) processor: send status... (60002ms)
I (15:19:43.336) processor: send status... (60998ms)
I (15:20:43.336) processor: send status... (60000ms)
W (1167780) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:660, no_wnd_count:0, timeout_count:0
W (1168981) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:660, no_wnd_count:0, timeout_count:1
W (15:21:02.782) mqtt_client: Written 492B in 2444ms

Please consider that the device is connected to the fixed parent as layer 2 and connected to MQTT,

On top, it seems that it is also cause of other errors in the wifi, like in the following log where it generates a timeout in the beacon 🤔

W (22553) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:21, no_wnd_count:0, timeout_count:6
W (23754) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:21, no_wnd_count:0, timeout_count:7
W (24955) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:21, no_wnd_count:0, timeout_count:8
E (01:00:24.200) mesh_driver: STA: Send err 0x400A ESP_ERR_MESH_TIMEOUT
I (25007) wifi:ap_probe_send over, resett wifi status to disassoc
I (25008) wifi:state: run -> init (c800)
I (25009) wifi:pm stop, total sleep time: 0 us / 15467078 us

I (25012) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (01:00:24.265) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 200 WIFI_REASON_BEACON_TIMEOU

As already discussed, this log is related to the parent handling the window. What i see is that when the device shows

I (16:11:49.724) processor: send offline message...
W (4227278) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:3250, no_wnd_count:0, timeout_count:0
W (4228479) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:3250, no_wnd_count:0, timeout_count:1
W (16:12:03.151) mqtt_client: Written 726B in 2424ms

in the root node i see

I (16:11:58.186) routing_table: Routing table update
I (4417589) mesh: 4016<scan>self-organized is disabled, users shall read out scan results by themselves.
I (16:12:02.492) mesh_main: <MESH_EVENT_SCAN_DONE>number:59
I (16:12:27.002) mesh_hand: Checking wifi connectivity (count 130)[LOCAL]

Could it be that when the root is scanning the WIFI the mesh is blocked?

Is there a way to do a iterative wifi scan?

KonssnoK commented 9 months ago

@zhangyanjiaoesp can you please look at this? thanks

zhangyanjiaoesp commented 9 months ago

@KonssnoK

Could it be that when the root is scanning the WIFI the mesh is blocked? AP is used for communication for the children STA for searching. Apparently, while searching the AP does not respond to the layer 2, which delays the MQTT messages

yes, if the STA is scanning, it will affect the packet sending.

Inside the wifi, when the station has connected to other ap or the softAP has station connected, if the device is scanning, it will back to home channel to send data between scanning consecutive channels, the default dwell home channel time is 30ms, you can increase this value by setting this parameter https://github.com/espressif/esp-idf/blob/bb8dd9d35b4961a8a7181cd856a41e79e436081e/components/esp_wifi/include/esp_wifi_types.h#L161 However, be careful not to exceed the scanning time for each channel.

zhangyanjiaoesp commented 9 months ago

I (25007) wifi:ap_probe_send over, resett wifi status to disassoc I (25008) wifi:state: run -> init (c800) I (25009) wifi:pm stop, total sleep time: 0 us / 15467078 us I (25012) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1 W (01:00:24.265) mesh_main: reason: 200 WIFI_REASON_BEACON_TIMEOU

This log indicates that the station has not received a beacon from the AP for 6 consecutive seconds.

KonssnoK commented 9 months ago

@zhangyanjiaoesp what is the default scan time for a channel? Does using passive scan change anything?

These are our current settings:

    wifi_scan_config_t scan_config = { 0 };
    scan_config.show_hidden = 1;
    scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;

Thanks

zhangyanjiaoesp commented 9 months ago

The default active scan time for a channel is 120ms, the default passive scan time for a channel is 360ms. home_chan_dwell_time won't be affect when using passive scan.

KonssnoK commented 9 months ago

The default active scan time for a channel is 120ms, the default passive scan time for a channel is 360ms. home_chan_dwell_time won't be affect when using passive scan.

so it makes sense to use active scan 🤔 Ok we'll do some tests

Sherry616 commented 6 months ago

Hi @KonssnoK, could you please share your updates about this issue? Thanks.

KonssnoK commented 6 months ago

hello @Sherry616, we did move to active scanning but i did not have time to check if this is still happening!