espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

[ESP32S3] v4.4 WIFI losing connectivity temporarily or permanently without apparent reason (IDFGH-12977) #13212

Open KonssnoK opened 4 months ago

KonssnoK commented 4 months ago

Answers checklist.

General issue report

@zhangyanjiaoesp

Based on commit 27ec26d2d3f44bbde5da14c7fdfc82226d567874.

To reproduce:

    esp_ping_config_t ping_config = ESP_PING_DEFAULT_CONFIG();
    IP4_ADDR(&ping_config.target_addr, 8,8,8,8);          // target IP address
    ping_config.count = ESP_PING_COUNT_INFINITE;    // ping in infinite mode, esp_ping_stop can stop it
    ping_config.timeout_ms = 3000;

    /* set callback functions */
    esp_ping_callbacks_t cbs;
    cbs.on_ping_success = test_on_ping_success;
    cbs.on_ping_timeout = test_on_ping_timeout;
    cbs.on_ping_end = test_on_ping_end;
    cbs.cb_args = "foo";  // arguments that will feed to all callback functions, can be NULL

    esp_ping_handle_t ping;
    esp_ping_new_session(&ping_config, &cbs, &ping);
    esp_ping_start(ping);

#define DUMP_INTERVAL_MS (1000*60*2)
    int64_t last_dump = 0;
    while (true) {
        if((os_time_get_timestamp() - last_dump) >= DUMP_INTERVAL_MS) {
            last_dump = os_time_get_timestamp();
            esp_wifi_statis_dump(0xFFFF);
        }
        os_thread_sleep(1000);
    }

static void test_on_ping_success(esp_ping_handle_t hdl, void *args)
{
    // optionally, get callback arguments
    // const char* str = (const char*) args;
    // printf("%s\r\n", str); // "foo"
    uint8_t ttl;
    uint16_t seqno;
    uint32_t elapsed_time, recv_len;
    ip_addr_t target_addr;
    esp_ping_get_profile(hdl, ESP_PING_PROF_SEQNO, &seqno, sizeof(seqno));
    esp_ping_get_profile(hdl, ESP_PING_PROF_TTL, &ttl, sizeof(ttl));
    esp_ping_get_profile(hdl, ESP_PING_PROF_IPADDR, &target_addr, sizeof(target_addr));
    esp_ping_get_profile(hdl, ESP_PING_PROF_SIZE, &recv_len, sizeof(recv_len));
    esp_ping_get_profile(hdl, ESP_PING_PROF_TIMEGAP, &elapsed_time, sizeof(elapsed_time));
    ESP_LOGW(TAG, "%d bytes from %s icmp_seq=%d ttl=%d time=%d ms",
           recv_len, ipaddr_ntoa(&target_addr), seqno, ttl, elapsed_time);
}

static void test_on_ping_timeout(esp_ping_handle_t hdl, void *args)
{
    uint16_t seqno;
    ip_addr_t target_addr;
    esp_ping_get_profile(hdl, ESP_PING_PROF_SEQNO, &seqno, sizeof(seqno));
    esp_ping_get_profile(hdl, ESP_PING_PROF_IPADDR, &target_addr, sizeof(target_addr));
    ESP_LOGW(TAG, "From %s icmp_seq=%d timeout", ipaddr_ntoa(&target_addr), seqno);
}

static void test_on_ping_end(esp_ping_handle_t hdl, void *args)
{
    uint32_t transmitted;
    uint32_t received;
    uint32_t total_time_ms;

    esp_ping_get_profile(hdl, ESP_PING_PROF_REQUEST, &transmitted, sizeof(transmitted));
    esp_ping_get_profile(hdl, ESP_PING_PROF_REPLY, &received, sizeof(received));
    esp_ping_get_profile(hdl, ESP_PING_PROF_DURATION, &total_time_ms, sizeof(total_time_ms));
    ESP_LOGW(TAG, "%d packets transmitted, %d received, time %dms", transmitted, received, total_time_ms);
}

espressif_wifi_dump_3.txt espressif_wifi_dump.txt espressif_wifi_dump_2.txt

Other related issues with similar behavior: https://github.com/espressif/esp-idf/issues/8953 https://github.com/espressif/esp-idf/issues/10506

AxelLin commented 3 weeks ago

@Espressif-liuuuu @zhangyanjiaoesp @nishanth-radja Do you have any findings from the logs above?

zhangyanjiaoesp commented 3 weeks ago

@KonssnoK Are you using the ip_internal_network example? Your logs show the ping timing out, but that alone can't confirm that the wifi connection is down. Can you capture packets for this?

KonssnoK commented 3 weeks ago

@KonssnoK Are you using the ip_internal_network example? Your logs show the ping timing out, but that alone can't confirm that the wifi connection is down. Can you capture packets for this?

@zhangyanjiaoesp yes, the base is the ip_internal_network example. how would you check if the wifi connection is down, apart from seeing that no packets are sent/received?

Also, how would you get the packets ? Wireshark connected to a sniffer?

KonssnoK commented 3 weeks ago

So @zhangyanjiaoesp i was able to generate one strange behavior, even if it's not exactly the one reported in this issue.

With the same code (v4.4 top of c0e0af03d153d2c157d1d420831ab33d48888768 )

you can apply patches 1 2 3, which enable monitoring and pinging

03_ip_internal.patch 02_if_dumps.patch 01_packets_dump.patch

by randomly detaching/attaching the layer 2 device i was able to reach this state, where the L2 device is never able to communicate with L1:

timeout_l2.txt

I got an extract of L1 too (i would say MESH_EVENT_CHILD_CONNECTED to track L2 events)

ESP-IDF_test.txt

KonssnoK commented 3 weeks ago

interestingly enough to recover the L2 device i had to reboot both devices, meaning rebooting only the L2 device was not solving the issue, and even rebooting the L1 device while L2 device was stuck (after reboot) did not solve the issue

KonssnoK commented 2 weeks ago

@zhangyanjiaoesp again by simply resetting the 2 devices in different ways, i was able to trigger another case in which one device does not work anymore until reboot. to be noted: once this device is failing, rebooting the other device makes it fail too.

timeout_reset1.txt timeout_reset2.txt

to recover the devices i had to keep them offline enough for the phone to lose the cache of connected devices ( pixel8 )

zhangyanjiaoesp commented 2 weeks ago

@KonssnoK please provide your sdkconfig file, and you are using PSRAM, right?

KonssnoK commented 2 weeks ago

sdkconfig.txt @zhangyanjiaoesp here it is

zhangyanjiaoesp commented 2 weeks ago

So @zhangyanjiaoesp i was able to generate one strange behavior, even if it's not exactly the one reported in this issue.

With the same code (v4.4 top of c0e0af0 )

you can apply patches 1 2 3, which enable monitoring and pinging

03_ip_internal.patch 02_if_dumps.patch 01_packets_dump.patch

by randomly detaching/attaching the layer 2 device i was able to reach this state, where the L2 device is never able to communicate with L1 timeout_l2.txt

I got an extract of L1 too (i would say MESH_EVENT_CHILD_CONNECTED to track L2 events)

ESP-IDF_test.txt

This log shows that the device didn't get an IP address, which causes the ping timeout.

@zhangyanjiaoesp again by simply resetting the 2 devices in different ways, i was able to trigger another case in which one device does not work anymore until reboot. to be noted: once this device is failing, rebooting the other device makes it fail too. timeout_reset1.txt timeout_reset2.txt to recover the devices i had to keep them offline enough for the phone to lose the cache of connected devices ( pixel8 )

And this log shows that the device can't connect to the router; the reason is an auth timeout.

I have tested using a router and can't reproduce this issue. I will test again using a mobile hotspot. Can you provide the model of your phone? Or can any phone reproduce this issue? @KonssnoK

KonssnoK commented 2 weeks ago

@zhangyanjiaoesp i reproduced with a Google Pixel8.

not getting the IP - strange, would mean the IP service is stuck 🤔

KonssnoK commented 1 week ago

@zhangyanjiaoesp i moved to 3 devices and am trying to replicate, but for now without success..

KonssnoK commented 1 week ago

and as soon as i wrote that, something strange happened again: dev3 is not able to connect

dev2.txt disconnected_dev3.txt dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart from each other)

it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

KonssnoK commented 1 week ago

one hour in: W (6444006) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5334

KonssnoK commented 1 week ago

W (13189216) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:10955
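
For reference, the three late log lines above can be checked against the first one: if a timeout fires every 1200 ms and the counter never resets, then timeout_count should equal the elapsed uptime divided by 1200. A minimal sketch in plain C (host-side, not ESP-IDF code) follows. The names `wnd_rx_line_t`, `parse_wnd_rx` and `expected_count` are hypothetical helpers, and the `sscanf` pattern assumes exactly the line layout shown in this thread.

```c
#include <stdio.h>

/* Fields extracted from one "[WND-RX] ... timeout" warning line. */
typedef struct {
    long uptime_ms;      /* number in parentheses after the log level */
    long period_ms;      /* the "1200 ms timeout" figure */
    long timeout_count;  /* trailing timeout_count field */
} wnd_rx_line_t;

/* Returns 1 if the line matches the layout seen in this thread, else 0. */
int parse_wnd_rx(const char *line, wnd_rx_line_t *out)
{
    int matched = sscanf(line,
        "W (%ld) mesh: [mesh_schedule.c,%*d] [WND-RX]max_wnd:%*d, "
        "%ld ms timeout, seqno:%*d, xseqno:%*d, no_wnd_count:%*d, "
        "timeout_count:%ld",
        &out->uptime_ms, &out->period_ms, &out->timeout_count);
    return matched == 3;
}

/* Count expected at `later` if one timeout fired per period since `first`. */
long expected_count(const wnd_rx_line_t *first, const wnd_rx_line_t *later)
{
    long elapsed = later->uptime_ms - first->uptime_ms;
    /* Round to the nearest whole period to absorb a few ms of jitter. */
    return first->timeout_count
           + (elapsed + first->period_ms / 2) / first->period_ms;
}
```

Feeding it the line at 43196 ms (timeout_count:0) and the line at 6444006 ms gives an expected count of 5334, which matches the logged value. That is consistent with the timeout repeating at a fixed 1200 ms interval with no successful retry in between.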

KonssnoK commented 1 week ago

this is instead the log of device1 getting stuck and not trying to connect to the mesh anymore

dev1_stuck.txt

zhangyanjiaoesp commented 1 week ago

and as soon as i wrote that, something strange happened again: dev3 is not able to connect

dev2.txt disconnected_dev3.txt dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart from each other)

it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK In the log, I see that communication among the three devices was initially normal, and then you restarted device2 and device3? The logs of device2 and device3 contain I (42466) wifi:state: run -> init (2c0), which means the wifi connection was dropped. The wifi disconnection causes the W (6444006) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5334 warnings. But I can't find why the wifi disconnected. Can you use wireshark to capture packets from device1/2/3 and send me the logs and captures? Please display the absolute time in the logs.

I have also tested with a Google Pixel5 phone, but I couldn't reproduce the problem.

KonssnoK commented 1 week ago

and as soon as i wrote that, something strange happened again: dev3 is not able to connect dev2.txt disconnected_dev3.txt dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart from each other) it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK In the log, I see that communication among the three devices was initially normal, and then you restarted device2 and device3? The logs of device2 and device3 contain I (42466) wifi:state: run -> init (2c0), which means the wifi connection was dropped. The wifi disconnection causes the W (6444006) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5334 warnings. But I can't find why the wifi disconnected. Can you use wireshark to capture packets from device1/2/3 and send me the logs and captures? Please display the absolute time in the logs.

I have also tested with a Google Pixel5 phone, but I couldn't reproduce the problem.

@zhangyanjiaoesp today i have no way to use wireshark. the only way to recover that device was to reboot it again. Yes, to trigger the issue i simply restarted devices at random. i'm not sure how to display absolute time, considering the logs come from closed-source files. I understand the wifi is disconnected, but i would expect it to retry a connection once disconnected 🤔

zhangyanjiaoesp commented 1 week ago

Yes, to trigger the issue i simply randomly restarted devices

Ok, I will try to restart the device and test again.

KonssnoK commented 1 week ago

@zhangyanjiaoesp in my experience slow data rates help trigger the issues. Please set your phone's cellular technology to 2G, or go to "developer options", networking, "network download rate limit" and set it to the minimum.

KonssnoK commented 1 week ago

please note that the logs are more or less synchronized at the end, not the start! (i extract them more or less at the same time)

240619dev3.txt 240619dev1.txt 240619dev2.txt

@zhangyanjiaoesp i rebooted the root device and it went offline without managing to reconnect.

After a while device 3 managed to change status and directly connect as the root. the other 2 remain disconnected

240619dev3_2.txt 240619dev1_2.txt 240619dev2_2.txt

device 2 at some point manages to recover too.

240619dev3_3.txt 240619dev1_3.txt 240619dev2_3.txt

device one is still disconnected and not able to recover instead.

phone connected in 5G with rate limiter at 128kbps

device 1 does not recover

KonssnoK commented 1 week ago

@zhangyanjiaoesp for reference this setup seems to trigger the issue in the above message quite often. once again the root device is stuck after a reboot, device 2 takes the root in this occasion, device 3 follows 2, but device 1 is stuck.

240619dev3_4.txt 240619dev1_4.txt 240619dev2_4.txt

KonssnoK commented 1 week ago

@zhangyanjiaoesp i tried also today to replicate, to verify if this is consistent:

it's quite easy to create issues in this configuration, please let me know if you manage.

240620dev3.txt 240620dev1.txt 240620dev2.txt

after a while dev 3 recovers and then also dev 2. dev 1 is stuck.

240620dev1_3.txt 240620dev3_3.txt 240620dev2_3.txt

zhangyanjiaoesp commented 1 week ago

@KonssnoK I'm sorry, I have an urgent task recently. I will test your issue next week.

KonssnoK commented 1 week ago

@zhangyanjiaoesp sure, i'll concentrate on another issue meanwhile

zhangyanjiaoesp commented 6 days ago

and as soon as i wrote that, something strange happened again: dev3 is not able to connect

dev2.txt disconnected_dev3.txt dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart from each other)

it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK I have reproduced this issue by rebooting the root device, and I have found the root cause, the following wifi libs can solve the problem. Please replace the wifi libs and test again. wifi_lib_s3_0625.zip (wifi firmware version: f736b07)

For the other issues, I still can't reproduce them although I randomly reboot the device2/3.

AxelLin commented 6 days ago

@KonssnoK I have reproduced this issue by rebooting the root device, and I have found the root cause

What's the root cause?

KonssnoK commented 6 days ago

and as soon as i wrote that, something strange happened again: dev3 is not able to connect dev2.txt disconnected_dev3.txt dev1.txt

I (42466) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (43196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:0
W (44396) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1
W (45596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:2
W (46796) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:3
W (47996) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:4
W (49196) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:5

(devices are 30cm apart from each other) it seems it goes on forever W (1983596) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:43, timeout_count:1617

@KonssnoK I have reproduced this issue by rebooting the root device, and I have found the root cause, the following wifi libs can solve the problem. Please replace the wifi libs and test again. wifi_lib_s3_0625.zip (wifi firmware version: f736b07)

For the other issues, I still can't reproduce them although I randomly reboot the device2/3.

@zhangyanjiaoesp sure I will try them. Should this unblock devices that stop reconnecting? We are currently seeing this kind of issue fairly widely in the field. Thanks

KonssnoK commented 6 days ago

@zhangyanjiaoesp i changed the library but i see no difference in the behavior of the children 🤔 240625dev1_1.txt 240625dev3_1.txt 240625dev2_1.txt

EDIT: i think this might have been related to the fact that the 4th device, which was acting as root, was not updated yet with latest libraries. I will now retry in the nominal configuration

zhangyanjiaoesp commented 6 days ago

@zhangyanjiaoesp i changed the library but i see no difference in the behavior of the children 🤔 240625dev1_1.txt 240625dev3_1.txt 240625dev2_1.txt

EDIT: i think this might have been related to the fact that the 4th device, which was acting as root, was not updated yet with latest libraries. I will now retry in the nominal configuration

It's weird. Here is the device2 log on my side when the root is rebooting. device2.txt

KonssnoK commented 6 days ago

@zhangyanjiaoesp here is dev1 blocked again

240625dev3_2.txt 240625dev2_2.txt 240625dev1_2.txt

W (423934) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:103, no_wnd_count:0, timeout_count:209

zhangyanjiaoesp commented 6 days ago

So @zhangyanjiaoesp i was able to generate one strange behavior, even if it's not exactly the one reported in this issue.

With the same code (v4.4 top of c0e0af0 )

you can apply patches 1 2 3, which enable monitoring and pinging

03_ip_internal.patch 02_if_dumps.patch 01_packets_dump.patch

@KonssnoK My test is based on the ip_internal_network example and added the above patch. And I noticed this print W (106224) mesh_hand: Triggering DYNAMIC MESH handover in your logs. You used a different test code, right?

KonssnoK commented 6 days ago

So @zhangyanjiaoesp i was able to generate one strange behavior, even if it's not exactly the one reported in this issue. With the same code (v4.4 top of c0e0af0 ) you can apply patches 1 2 3, which enable monitoring and pinging 03_ip_internal.patch 02_if_dumps.patch 01_packets_dump.patch

@KonssnoK My test is based on the ip_internal_network example and added the above patch. And I noticed this print W (106224) mesh_hand: Triggering DYNAMIC MESH handover in your logs. You used a different test code, right?

yes sorry it's an evolution of your code. I will put back the old one and reproduce again

KonssnoK commented 6 days ago

dev1 is now blocked in a I (135890) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2 loop

240625dev2_3.txt 240625dev1_3.txt 240625dev3_3.txt

KonssnoK commented 6 days ago

same issue triggered also on dev3

240625dev1_4.txt 240625dev3_4.txt 240625dev2_4.txt

(rate limiter always on)

KonssnoK commented 6 days ago

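with the same procedure another issue appeared:

  • generation of 2 networks that do not merge back together
  • dev1 and dev3 are both ROOT and do not try to merge together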

240625dev1_5.txt 240625dev2_5.txt 240625dev3_5.txt

KonssnoK commented 6 days ago

another case of dev1 getting stuck after reset. the reset was held for some seconds before releasing (i don't always just press/release)

240625dev2_6.txt 240625dev1_6.txt 240625dev3_6.txt

KonssnoK commented 6 days ago

@zhangyanjiaoesp last example of this issue:

240625dev3_7.txt 240625dev1_7.txt 240625dev2_7.txt

now i'll go back to the other one while you fix this

zhangyanjiaoesp commented 5 days ago

dev1 is now blocked in a I (135890) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2 loop

240625dev2_3.txt 240625dev1_3.txt 240625dev3_3.txt

When the root device is connecting to the router and the disconnect reason is 2 (auth expire), the root keeps trying to reconnect to the router, so you see the I (135890) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2 log loop. The user should handle this case in the application layer. By the way, the log I (50960) wifi:state: auth -> init (200) indicates that the device sent an auth request but the router didn't reply with an auth response. It's strange that the router doesn't reply. Could it have something to do with the hotspot used? Do you have this problem if you use another hotspot or router?

KonssnoK commented 5 days ago

dev1 is now blocked in a I (135890) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2 loop 240625dev2_3.txt 240625dev1_3.txt 240625dev3_3.txt

When the root device is connecting to the router and the disconnect reason is 2 (auth expire), the root keeps trying to reconnect to the router, so you see the I (135890) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2 log loop. The user should handle this case in the application layer. By the way, the log I (50960) wifi:state: auth -> init (200) indicates that the device sent an auth request but the router didn't reply with an auth response. It's strange that the router doesn't reply. Could it have something to do with the hotspot used? Do you have this problem if you use another hotspot or router?

how should the user handle this error? and why is it reported continuously, alternating with error 205?

I will try with a samsung device

zhangyanjiaoesp commented 5 days ago

with the same procedure another issue appeared:

  • generation of 2 networks that do not merge back together
  • dev1 and dev3 are both ROOT and do not try to merge together

240625dev1_5.txt 240625dev2_5.txt 240625dev3_5.txt

By default, more than one root is allowed to exist in one mesh network. Please call esp_mesh_allow_root_conflicts(false) to disable this.

zhangyanjiaoesp commented 5 days ago

how should the user handle this error?

For example, if the auth fails because the router moved its position, the user can call esp_mesh_waive_root() to switch to a better root.

and why is it reported continuously alternated to the error 205?

When a connection attempt fails, the STA adds this AP to a blacklist. Reason code 205 indicates that the scan failed because the AP is in the blacklist; after this, the AP is removed from the blacklist. So you will see the reason codes loop between 2 and 205.
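
The alternation described above can be modeled in a few lines of plain C. This is only a toy model of the explanation in this comment, assuming an AP that never answers the auth request; it is not the actual wifi library logic. `sta_model_t` and `next_disconnect_reason` are hypothetical names; the two constants carry the numeric values of WIFI_REASON_AUTH_EXPIRE and WIFI_REASON_CONNECTION_FAIL.

```c
#include <stdbool.h>

enum {
    REASON_AUTH_EXPIRE     = 2,    /* numeric value of WIFI_REASON_AUTH_EXPIRE */
    REASON_CONNECTION_FAIL = 205   /* numeric value of WIFI_REASON_CONNECTION_FAIL */
};

/* Toy state: is the target AP currently on the STA's blacklist? */
typedef struct {
    bool ap_blacklisted;
} sta_model_t;

/* One connection attempt against an AP that never answers auth. */
int next_disconnect_reason(sta_model_t *sta)
{
    if (sta->ap_blacklisted) {
        /* The scan skips the blacklisted AP, then the entry is cleared. */
        sta->ap_blacklisted = false;
        return REASON_CONNECTION_FAIL;
    }
    /* The AP is found, auth times out, and the AP gets blacklisted. */
    sta->ap_blacklisted = true;
    return REASON_AUTH_EXPIRE;
}
```

Starting from an empty blacklist, successive calls return 2, 205, 2, 205, which is the same pattern as the MESH_EVENT_PARENT_DISCONNECTED reasons in the logs.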

KonssnoK commented 5 days ago

For example, if the auth fails because the router moved its position, the user can call esp_mesh_waive_root() to switch to a better root.

well my router is always the same and not moving 🤔 so it is a bit strange

Is it the STA blacklisting the AP or the AP blacklisting the STA? because i would expect the router to blacklist a device that constantly disconnects.

Also i expected the device to check if another device became root, since the rest of the network manages to recover

something like this for the waiving?

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(MESH_TAG,
                 "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d",
                 disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();

        /* as root, give up the root role when the connection to the router fails */
        if (esp_mesh_is_root() && disconnected->reason == WIFI_REASON_CONNECTION_FAIL){
            esp_mesh_waive_root();
        }
    }
    break;

KonssnoK commented 5 days ago

with the same procedure another issue appeared:

  • generation of 2 networks that do not merge back together
  • dev1 and dev3 are both ROOT and do not try to merge together

240625dev1_5.txt 240625dev2_5.txt 240625dev3_5.txt

By default, more than one root is allowed to exist in one mesh network. Please call esp_mesh_allow_root_conflicts(false) to disable this.

@zhangyanjiaoesp should this be called on all devices or only on root? when should it be called ? before mesh_start?

zhangyanjiaoesp commented 5 days ago

with the same procedure another issue appeared:

  • generation of 2 networks that do not merge back together
  • dev1 and dev3 are both ROOT and do not try to merge together

240625dev1_5.txt 240625dev2_5.txt 240625dev3_5.txt

By default, more than one root is allowed to exist in one mesh network. Please call esp_mesh_allow_root_conflicts(false) to disable this.

@zhangyanjiaoesp should this be called on all devices or only on root? when should it be called ? before mesh_start?

Call it before mesh start on all devices.

zhangyanjiaoesp commented 5 days ago

Is it the STA blacklisting the AP or or the AP blacklisting the STA?

The STA adds the AP to the STA's own blacklist.

something like this for the waiving?

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(MESH_TAG,
                 "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d",
                 disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();

        if (esp_mesh_is_root() && disconnected->reason == WIFI_REASON_CONNECTION_FAIL){
            esp_mesh_waive_root();
        }
    }

yes, maybe add a check for the number of failures?
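
A failure-count check of that kind could look roughly like the sketch below. It is plain C that only holds the decision logic, so it can be tried on a host. The names `root_retry_t`, `should_waive_root`, `on_parent_connected` and the threshold value are assumptions made for illustration; nothing here is an ESP-IDF API. The handler would call `should_waive_root()` from MESH_EVENT_PARENT_DISCONNECTED and only call esp_mesh_waive_root() when it returns true.

```c
#include <stdbool.h>

#define REASON_CONNECTION_FAIL 205  /* numeric value of WIFI_REASON_CONNECTION_FAIL */
#define WAIVE_THRESHOLD        4    /* hypothetical: failures tolerated before waiving */

/* Consecutive disconnects seen while this node is root. */
typedef struct {
    unsigned consecutive_failures;
} root_retry_t;

/* Call on every parent-disconnected event; true means "waive the root now". */
bool should_waive_root(root_retry_t *st, bool is_root, int reason)
{
    if (!is_root) {
        st->consecutive_failures = 0;   /* only the root is tracked */
        return false;
    }
    st->consecutive_failures++;
    if (reason == REASON_CONNECTION_FAIL &&
        st->consecutive_failures >= WAIVE_THRESHOLD) {
        st->consecutive_failures = 0;   /* start over after waiving */
        return true;
    }
    return false;
}

/* Call on a successful parent connection to clear the streak. */
void on_parent_connected(root_retry_t *st)
{
    st->consecutive_failures = 0;
}
```

With the 2/205 alternation seen in the logs and a threshold of 4, the first three events return false and the fourth (reason 205) returns true.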

KonssnoK commented 5 days ago

@zhangyanjiaoesp calling esp_mesh_waive_root always returns error

ESP_ERR_MESH_DISCARD

and the device remains stuck.

Patch for latest code 04_waive_root.patch

Example of partial log:


I (455320) mesh: [wifi]disconnected reason:2(auth expire), continuous:239/max:12, root, vote(,stopped)<><>
W (456320) ping: From 8.8.8.8 icmp_seq=114 timeout
E (457320) ping_sock: send error=0
I (457990) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:205
I (457990) mesh: [wifi]disconnected reason:205(), continuous:240/max:12, root, vote(,stopped)<><>
W (457990) mesh_main: esp_mesh_waive_root 16405
I (458080) wifi:new:<11,2>, old:<11,0>, ap:<11,2>, sta:<11,0>, prof:11
I (458080) wifi:state: init -> auth (b0)
I (459080) wifi:state: auth -> init (200)
I (459090) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,0>, prof:11
I (459090) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2
I (459090) mesh: [wifi]disconnected reason:2(auth expire), continuous:241/max:12, root, vote(,stopped)<><>
W (460320) ping: From 8.8.8.8 icmp_seq=115 timeout
E (461320) ping_sock: send error=0
I (461760) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:205
I (461760) mesh: [wifi]disconnected reason:205(), continuous:242/max:12, root, vote(,stopped)<><>
I (461800) wifi:new:<11,2>, old:<11,0>, ap:<11,2>, sta:<11,0>, prof:11
I (461800) wifi:state: init -> auth (b0)
I (462810) wifi:state: auth -> init (200)
I (462810) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,0>, prof:11
I (462810) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:2
I (462810) mesh: [wifi]disconnected reason:2(auth expire), continuous:243/max:12, root, vote(,stopped)<><>
W (464320) ping: From 8.8.8.8 icmp_seq=116 timeout
E (465320) ping_sock: send error=0
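
As a cross-check of the log above, the value 16405 printed after esp_mesh_waive_root is 0x4015 in hex, that is ESP_ERR_MESH_BASE (0x4000) plus 21, which is consistent with the ESP_ERR_MESH_DISCARD error named above. A small sketch of that arithmetic in plain C follows; `mesh_err_offset` is a hypothetical helper, and treating 0x4000 to 0x4FFF as the mesh range is an assumption made here. On the device itself, esp_err_to_name() gives the symbolic name directly.

```c
#define MESH_ERR_BASE 0x4000  /* numeric value of ESP_ERR_MESH_BASE */
#define MESH_ERR_SPAN 0x1000  /* assumed size of the mesh error range */

/* Offset of a mesh error code from the mesh base, or -1 if out of range. */
int mesh_err_offset(int err)
{
    if (err < MESH_ERR_BASE || err >= MESH_ERR_BASE + MESH_ERR_SPAN) {
        return -1;
    }
    return err - MESH_ERR_BASE;
}
```

For the logged value, `mesh_err_offset(16405)` returns 21.
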

zhangyanjiaoesp commented 4 days ago

@KonssnoK

The ESP_ERR_MESH_DISCARD error indicates that the softAP doesn't have children. In your auth failure scenario, I think we should first confirm why the hotspot does not reply with an auth response. Have you tested on another hotspot or router?

KonssnoK commented 4 days ago

@KonssnoK

The ESP_ERR_MESH_DISCARD error indicates that the softAP doesn't have children. In your auth failure scenario, I think we should first confirm why the hotspot does not reply with an auth response. Have you tested on another hotspot or router?

No, I updated the code and retried with the changes. I will go back to testing with the Samsung phone. In any case, it should work with any router/phone 🤔. Different behavior in the AP should in any case be handled correctly by the STA.

KonssnoK commented 4 days ago

with samsung phone:

W (435697) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:26, no_wnd_count:0, timeout_count:244

240627dev1_1.txt 240627dev3_1.txt 240627dev2_1.txt

KonssnoK commented 4 days ago

@zhangyanjiaoesp the failed auth does not happen on the samsung phone but dev2 apparently gets stuck as a child.

240627dev3_2.txt 240627dev1_2.txt 240627dev2_2.txt

but on the Google phone what should be done when the root gets into a 2/205 error loop? Of course it doesn't have children, because the rest of the network reshapes by itself. What should be done to make it recover and search for other devices?