espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.76k stars 7.3k forks source link

WIFI Disconnect with Reason 34 ?! (IDFGH-4410) #6241

Closed alorbach closed 2 years ago

alorbach commented 3 years ago

I am on ESP-IDF release 3.3 with arduino-esp32 master branch

I have a sketch that receives about 60 UDP packages per seconds, with a data size of up to 1024 bytes per packet. At the same time its streaming DTLS data (UDP) with about 25 packets per second with about 70 bytes erach packet.

I have more than enough memory free:

Free: 141KB Heap: 182KB

At a random point WIFI becomes unstable, it disconnects with reason 34 which I can not find any documentation on!

%debug: 1:4803491: wpsWiFiEvent:Disconnected: Disconnected from station reason: 34 (other reason), reconnect handling...

The Wifi error codes go from 0 to 24 and from 200 to 205 as described here: https://github.com/espressif/esp-idf/blob/0b71a0a46d67cce681ec55973b020d950d8596bd/docs/en/api-guides/wifi.rst#wi-fi-reason-code

So what does this error code mean?

This is my reconnect handling after the disconnect

WiFi.disconnect(true);
WiFi.mode(WIFI_OFF);
WiFi.mode(WIFI_MODE_STA);
WiFi.begin();

After the problem has happened, this repeats every few seconds until I reset:

%debug: 0:4803874: esp32debug: I (4805664) wifi: %debug: 0:4803874: esp32debug: new:<13,2>, old:<13,0>, ap:<255,255>, sta:<13,2>, prof:1 %debug: 0:4803878: esp32debug: I (4805668) wifi: %debug: 0:4803882: esp32debug: state: init -> auth (b0) %debug: 0:4803895: esp32debug: I (4805684) wifi: %debug: 0:4803895: esp32debug: state: auth -> assoc (0) %debug: 0:4803907: esp32debug: I (4805697) wifi: %debug: 0:4803907: esp32debug: state: assoc -> run (10) %debug: 0:4803952: esp32debug: I (4805742) wifi: %debug: 0:4803953: esp32debug: connected with wifi, aid = 5, channel 13, 40D, bssid = dc:39:6f:51:74:e4 %debug: 0:4803958: esp32debug: I (4805748) wifi: %debug: 0:4803963: esp32debug: security type: 3, phy: bgn, rssi: -66 %debug: 0:4804038: esp32debug: I (4805828) wifi: %debug: 0:4804039: esp32debug: pm start, type: 0 %debug: 0:4804039: esp32debug: I (4805829) wifi: %debug: 0:4804043: esp32debug: AP's beacon interval = 102400 us, DTIM period = 1 %debug: 1:4804055: wpsWiFiEvent: STA Connected to: wifi

alorbach commented 3 years ago

Another step further, debugging lwip shows me where the mem alloc fails

lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 lwip_getsockopt(57, SOL_SOCKET, SO_ERROR) = 12 ssl_client_udp error: send_ssl_data failed with res 0XFFFFFFB2, errno 12, retry after vTaskDelay! lwip_send(57, data=0x3ffec5f4, size=71, flags=0x0) lwip_sendto(57, data=0x3ffec5f4, short_size=71, flags=0x0 to=0.0.0.0 port=0 ssl_client_udp error: send_ssl_data failed with res 0XFFFFFFB2, errno 12, tear down session! lwip_close(57) lwip_close: is_tcp=0

alorbach commented 3 years ago

Even with LWIP Stats enable an pbub debug enabled, nothing which explains the ERR_MEM:

lwip_send(52, data=0x3ffee904, size=71, flags=0x0) lwip_sendto(52, data=0x3ffee904, short_size=71, flags=0x0 to=0.0.0.0 port=0 pbuf_alloc(length=71) pbuf_alloc(length=71) == 0x3ffd6e0c netconn_send: sending 71 bytes pbuf_header: old 0x3ffd6e70 new 0x3ffd6e68 (8) pbuf_header: old 0x3ffd6e68 new 0x3ffd6e54 (20) pbuf_header: old 0x3ffd6e54 new 0x3ffd6e46 (14) pbuf_free(0x3ffd6e0c) pbuf_free: deallocating 0x3ffd6e0c lwip_send(52, data=0x3ffee904, size=71, flags=0x0) lwip_sendto(52, data=0x3ffee904, short_size=71, flags=0x0 to=0.0.0.0 port=0 pbuf_alloc(length=71) pbuf_alloc(length=71) == 0x3ffd6e0c netconn_send: sending 71 bytes pbuf_header: old 0x3ffd6e70 new 0x3ffd6e68 (8) pbuf_header: old 0x3ffd6e68 new 0x3ffd6e54 (20) pbuf_header: old 0x3ffd6e54 new 0x3ffd6e46 (14) pbuf_free(0x3ffd6e0c) pbuf_free: deallocating 0x3ffd6e0c lwip_send(52, data=0x3ffee904, size=71, flags=0x0) lwip_sendto(52, data=0x3ffee904, short_size=71, flags=0x0 to=0.0.0.0 port=0 pbuf_alloc(length=71) pbuf_alloc(length=71) == 0x3ffd6e0c netconn_send: sending 71 bytes pbuf_header: old 0x3ffd6e70 new 0x3ffd6e68 (8) pbuf_header: old 0x3ffd6e68 new 0x3ffd6e54 (20) pbuf_header: old 0x3ffd6e54 new 0x3ffd6e46 (14) pbuf_free(0x3ffd6e0c) pbuf_free: deallocating 0x3ffd6e0c lwip_getsockopt(52, SOL_SOCKET, SO_ERROR) = 12

LINK xmit: 3 recv: 3 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

ETHARP xmit: 24 recv: 279 fw: 0 drop: 429 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 429 opterr: 0 err: 0 cachehit: 15310

IP_FRAG xmit: 0 recv: 0 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

IPv6 FRAG xmit: 0 recv: 0 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

IP xmit: 15507 recv: 20934 fw: 0 drop: 1009 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

ND xmit: 3 recv: 0 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

IPv6 xmit: 3 recv: 524 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

IGMP xmit: 0 recv: 7 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 proterr: 0 rx_v1: 0 rx_group: 0 rx_general: 7 rx_report: 0 tx_join: 0 tx_leave: 0 tx_report: 0

MLDv1 xmit: 0 recv: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 proterr: 0 rx_v1: 0 rx_group: 0 rx_general: 0 rx_report: 0 tx_join: 0 tx_leave: 0 tx_report: 0

ICMP xmit: 53 recv: 42 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

ICMPv6 xmit: 0 recv: 0 fw: 0 drop: 0 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 0 opterr: 0 err: 0 cachehit: 0

UDP xmit: 15316 recv: 19747 fw: 0 drop: 11 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 11 opterr: 0 err: 0 cachehit: 19

TCP xmit: 87 recv: 130 fw: 0 drop: 1 chkerr: 0 lenerr: 0 memerr: 0 rterr: 0 proterr: 1 opterr: 0 err: 0 cachehit: 112

SYS sem.used: 0 sem.max: 0 pbuf_alloc(length=102)

alorbach commented 3 years ago

It seems to fail at a simple malloc call (traced back to mem_malloc), could this be heap fragmentation? https://github.com/espressif/esp-lwip/blob/61d840ff4778f4946c8743f7e412345abcd537f1/src/core/mem.c#L251

YouDONG-ESP commented 3 years ago

@alorbach It's possible, but it is unlikely to happen if your packet size remains the same.

I didn't find reason code 34 in our code, is that a hexadecimal number? If so, 22 is refer to INVALID_RSN_IE_CAP, it may happens with very few ap, dose it happen with other ap?

alorbach commented 3 years ago

@YouDONG-ESP Yes it happens with other AP as well, I have three different AP's available for testing. However I think that the root cause is, that the mem_malloc calls inside LWIP fail at some random point and causing all kinds of network issues afterwards.

alorbach commented 3 years ago

I had added heap_caps_print_heap_info before and after the failure,:

BEFORE:

Heap summary for capabilities 0x00000804: At 0x3ffbdb60 len 84 free 0 allocated 36 min_free 0 largest_free_block 0 alloc_blocks 1 free_blocks 0 total_blocks 1 At 0x3ffc1bb4 len 2408 free 0 allocated 2024 min_free 0 largest_free_block 0 alloc_blocks 22 free_blocks 0 total_blocks 22 At 0x3ffbdb28 len 52 free 0 allocated 4 min_free 0 largest_free_block 0 alloc_blocks 1 free_blocks 0 total_blocks 1 At 0x3ffb8000 len 6688 free 0 allocated 6480 min_free 0 largest_free_block 0 alloc_blocks 11 free_blocks 0 total_blocks 11 At 0x3ffb0000 len 25480 free 0 allocated 24944 min_free 0 largest_free_block 0 alloc_blocks 31 free_blocks 1 total_blocks 32 At 0x3ffae6e0 len 6192 free 0 allocated 5920 min_free 0 largest_free_block 0 alloc_blocks 14 free_blocks 1 total_blocks 15 At 0x3ffaff10 len 240 free 0 allocated 120 min_free 0 largest_free_block 0 alloc_blocks 5 free_blocks 1 total_blocks 6 At 0x3ffb6388 len 7288 free 0 allocated 6920 min_free 0 largest_free_block 0 alloc_blocks 21 free_blocks 0 total_blocks 21 At 0x3ffb9a20 len 16648 free 24 allocated 15476 min_free 8 largest_free_block 0 alloc_blocks 68 free_blocks 4 total_blocks 72 At 0x3ffca158 len 89768 free 11592 allocated 70876 min_free 208 largest_free_block 5184 alloc_blocks 439 free_blocks 58 total_blocks 497 At 0x3ffe0440 len 15072 free 13568 allocated 1436 min_free 16 largest_free_block 10840 alloc_blocks 1 free_blocks 2 total_blocks 3 At 0x3ffe4350 len 113840 free 76204 allocated 37536 min_free 76204 largest_free_block 75560 alloc_blocks 3 free_blocks 2 total_blocks 5 Totals: free 101388 allocated 171772 min_free 76436 largest_free_block 75560

AFTER FAIL:

heap summary for capabilities 0x00000804: At 0x3ffbdb60 len 84 free 0 allocated 36 min_free 0 largest_free_block 0 alloc_blocks 1 free_blocks 0 total_blocks 1 At 0x3ffc1bb4 len 2408 free 0 allocated 2024 min_free 0 largest_free_block 0 alloc_blocks 22 free_blocks 0 total_blocks 22 At 0x3ffbdb28 len 52 free 0 allocated 4 min_free 0 largest_free_block 0 alloc_blocks 1 free_blocks 0 total_blocks 1 At 0x3ffb8000 len 6688 free 0 allocated 6480 min_free 0 largest_free_block 0 alloc_blocks 11 free_blocks 0 total_blocks 11 At 0x3ffb0000 len 25480 free 0 allocated 24944 min_free 0 largest_free_block 0 alloc_blocks 31 free_blocks 1 total_blocks 32 At 0x3ffae6e0 len 6192 free 0 allocated 5920 min_free 0 largest_free_block 0 alloc_blocks 14 free_blocks 1 total_blocks 15 At 0x3ffaff10 len 240 free 0 allocated 120 min_free 0 largest_free_block 0 alloc_blocks 5 free_blocks 1 total_blocks 6 At 0x3ffb6388 len 7288 free 0 allocated 6920 min_free 0 largest_free_block 0 alloc_blocks 21 free_blocks 0 total_blocks 21 At 0x3ffb9a20 len 16648 free 24 allocated 15476 min_free 8 largest_free_block 0 alloc_blocks 68 free_blocks 4 total_blocks 72 At 0x3ffca158 len 89768 free 12016 allocated 70468 min_free 208 largest_free_block 4332 alloc_blocks 438 free_blocks 58 total_blocks 496 At 0x3ffe0440 len 15072 free 11280 allocated 3708 min_free 16 largest_free_block 10840 alloc_blocks 2 free_blocks 2 total_blocks 4 At 0x3ffe4350 len 113840 free 76204 allocated 37536 min_free 76204 largest_free_block 75560 alloc_blocks 3 free_blocks 2 total_blocks 5 Totals: free 99524 allocated 173636 min_free 76436 largest_free_block 75560

alorbach commented 3 years ago

@alorbach It's possible, but it is unlikely to happen if your packet size remains the same.

I didn't find reason code 34 in our code, is that a hexadecimal number? If so, 22 is refer to INVALID_RSN_IE_CAP, it may happens with very few ap, dose it happen with other ap?

The main UDP buffers used are the same, about 500 bytes for incoming (RAW Data) and about 70 bytes for outgoing (DTLS)

YouDONG-ESP commented 3 years ago

Could you provide a code to reproduce this issue?

alorbach commented 3 years ago

Well you will need a DTLS receiving Server. I am currently testing a few things and I plan to build against esp-idf 4.2 today to check if the problem happens there as well.

If I am not able to reproduce the issue, I will try to greate a minimal sketch that reproduces the issue somehow.

alorbach commented 3 years ago

@YouDONG-ESP:

Interesting development! The problem has not totally gone but somehow it is better now - meaning Wifi (LWIP) is not becoming unstable anymore after the first ERR_MEM occurred.

So I have a Class that does thwe DTLS UDP Streaming which uses HTTPClient from arduino-esp32, and I was creating / deleting HTTPClient objects dynamically after the ERR_MEM occurred.

I remembered that I had trouble (strange exceptions) when deleting dynamic HTTPClient objects before in my OTA Updatecode. So as a workaround, I made the HTTPClient Object static in my OTA update code which was a sufficient at this time for me.

Now I did the same to my Class that does the DTLS Streaming, I made the HTTPClient object static at the top of the Class code like this: HTTPClient mHttp;

And it appears to have fixed the problem mostly - I am running DTLS Streaming now since about 30 minutes without any ERR_MEM till yet.

So I think HTTPClient is messing something badly up in the destructor.

me-no-dev commented 3 years ago

Please give a minimal example that shows the problem.

0xFEEDC0DE64 commented 3 years ago

we have a customer with an esp32 based product of ours and the esp32 is not able to connect to his network, but works literally everywhere else. I get the error 34 (not in hex, but decimal), maybe this is the same issue here? #7593

I cant reproduce it in our office and the customer says he has a devolo "mesh network setup", where only the access points talk mesh to each other and send out a access point (then usable by the customer). But I dont get the whole mesh thematic here :)

alorbach commented 3 years ago

@0xFEEDC0DE64 I have got the problem more or less under control, it happens very very seldom. What I can tell you is, if you have any invalid memory access somewhere like even writting 1 byte to much into a char array, this error will happen sooner than earlier.

I believe this is a following error of issues like this, and the hardest error's to find btw.

SJFriedl commented 2 years ago

FWIW, I can repro WIFI disconnect with reason 34 with a lot of pings to the IP address, and it appears that reason 34 is documented in the 802.11 standard as: "Disassociated because excessive number of frames need to be acknowledged, but are not acknowledged due to AP transmissions and/or poor channel conditions".

Searching for "WLAN_STATUS_DENIED_POOR_CHANNEL_CONDITIONS" will bring up plenty of references.

Asking for a reconnect (assuming the poor channel conditions have recovered) works fine for me.

mhdong commented 2 years ago

hi @alorbach this issue may have fixed at new version, please checked with the newest esp-idf master.

AxelLin commented 2 years ago

hi @alorbach this issue may have fixed at new version, please checked with the newest esp-idf master.

https://github.com/espressif/esp-idf/issues/7593#issuecomment-1022754386

Alvin1Zhang commented 2 years ago

Thanks for reporting, feel free to reopen.