espressif / arduino-esp32

Arduino core for the ESP32
GNU Lesser General Public License v2.1
13.34k stars 7.37k forks source link

wifi connect failures stuck in long timeout loop, autoreconnect on PMK failure is wrong #7968

Open tablatronix opened 1 year ago

tablatronix commented 1 year ago

Board

esp32*

Device Description

devkit

Hardware Configuration

-

Version

v2.0.7

IDE Name

deviot

Operating System

osx

Flash frequency

40

PSRAM enabled

yes

Upload speed

921600

Description

On my Unifi APs If I use wrong password I get this failure, instead of auth failure. Auto reconnect gets stuck here for a very long time this will retry many many times then abort. (60s) Similar issue with some auth protections wifi:Association refused temporarily

[MEM] free: 253092 | max: 45044 | frag:  83% 
*wm:[3] . 
[  7038][V][WiFiGeneric.cpp:362] _arduino_event_cb(): STA Disconnected: SSID: leela, BSSID: f4:92:bf:7f:35:4b, Reason: 15
[  7040][D][WiFiGeneric.cpp:931] _eventCallback(): Arduino Event: 5 - STA_DISCONNECTED
[  7047][W][WiFiGeneric.cpp:955] _eventCallback(): Reason: 15 - 4WAY_HANDSHAKE_TIMEOUT
[  7054][D][WiFiGeneric.cpp:979] _eventCallback(): WiFi AutoReconnect Running

https://github.com/tzapu/WiFiManager/issues/1563

Sketch

WiFi.begin();

Debug Message

*wm:[3] . 
[  7038][V][WiFiGeneric.cpp:362] _arduino_event_cb(): STA Disconnected: SSID:xxxxx, BSSID: f4:92:bf:7f:35:4b, Reason: 15
[  7040][D][WiFiGeneric.cpp:931] _eventCallback(): Arduino Event: 5 - STA_DISCONNECTED
[  7047][W][WiFiGeneric.cpp:955] _eventCallback(): Reason: 15 - 4WAY_HANDSHAKE_TIMEOUT
[  7054][D][WiFiGeneric.cpp:979] _eventCallback(): WiFi AutoReconnect Running
[  7063][V][WiFiGeneric.cpp:97] set_esp_interface_ip(): Configuring Station static IP: 0.0.0.0, MASK: 0.0.0.0, GW: 0.0.0.0
[MEM] free: 252800 | max: 45044 | frag:  83% 
*wm:[3] [EVENT] WIFI_REASON:  15


### Other Steps to Reproduce

2.0.6
2.0.7

### I have checked existing issues, online documentation and the Troubleshooting Guide

- [X] I confirm I have checked existing issues, online documentation and Troubleshooting guide.
tablatronix commented 1 year ago

https://github.com/tzapu/WiFiManager/issues/1563#issuecomment-1433951131

I have not had time to packet sniff anything

Jason2866 commented 1 year ago

@tablatronix Can you try this framework (easy with Platformio)? We (Tasmota) have timeout issues in lwip (DHCP) in some rare cases. This build uses a older lwip commit (IDF). Our issues are gone when using.

platform = https://github.com/tasmota/platform-espressif32/releases/download/2023.02.00/platform-espressif32.zip
platform_packages = framework-arduinoespressif32 @ https://github.com/Jason2866/esp32-arduino-lib-builder/releases/download/1254/framework-arduinoespressif32-lwip_timeout-ed6742e7f0.zip 
tablatronix commented 1 year ago

@Jason2866 I will, I saw that when investigating last night but it did not sound exactly similar, Ill try it.

tablatronix commented 1 year ago

I did not see any change with that repo, Do I need to wipe anything to force lwip in pio? It looks like just an include not requiring an xtensa wipe

*wm:[2] [SYS] WM version:  v2.0.15-rc.1
*wm:[2] [SYS] Arduino version:  2.0.7
*wm:[2] [SYS] ESP SDK version:  4.4.4.20230310
tablatronix commented 1 year ago

unifi ap logs

This could be a host issue, it should not be handling wrong password like this.. wrong pass on my iphone does the same thing.

hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: invalid MIC in msg 2/4 of 4-Way Handshake


Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: : wevent[11872]: wevent.ubnt_custom_event(): EVENT_STA_LEAVE ra0: ac:67:b2:df:85:7c / 21
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.11: authentication OK (open system)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c MLME: MLME-AUTHENTICATE.indication(ac:67:b2:df:85:7c, OPEN_SYSTEM)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c MLME: MLME-DELETEKEYS.request(ac:67:b2:df:85:7c)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.11: authenticated
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.241010] CFG80211_PKT: TX AUTH Frame
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.241062] ra0: AUTH - CFG80211_AuthRespHandler
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.244338] CFG80211_AssocReqHandler(): peer support rate mode = 0x7
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.244492] ap_cmm_peer_assoc_req_action(): peer support rate mode = 0x7
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.11: associated
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: event 1 notification
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: start authentication
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.1X: unauthorizing port
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: sending 1/4 msg of 4-Way Handshake
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.247405] ===[APAssocRespTxSHandler]: (RE=0, LE=0, ME=0), wlan_idx = 21
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.247453] ===[APAssocRespTxSHandler]: wlan_idx = 21,AuthAssocNotInProgressFlag = 1 pEntry->bssid f4:92:bf:7f:35:4b
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.249084] ra0: Send EAPOL-Key M1, DA=ac:67:b2:df:85:7c, SA=f4:92:bf:7f:35:4b, len=113
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.253984] ra0: Recv EAPOL-Key M2, DA=f4:92:bf:7f:35:4b, SA=ac:67:b2:df:85:7c, len=129
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: : wevent[11872]: wevent.ubnt_custom_event(): EVENT_STA_JOIN ra0: ac:67:b2:df:85:7c / 21
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: received EAPOL-Key frame (2/4 Pairwise)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: invalid MIC in msg 2/4 of 4-Way Handshake
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: stahtd: stahtd[11874]: [STA-TRACKER].stahtd_dump_event(): {"message_type":"STA_ASSOC_TRACKER","mac":"18:74:2e:9c:1e:5e","vap":"ra0","event_type":"fixup","assoc_status":"0","event_id":"4","arp_reply_gw_seen":"yes","dns_resp_seen":"yes","avg_rssi":"-53"}```
Jason2866 commented 1 year ago

That's my version ESP SDK version: 4.4.4.20230310 so the issue is not the same and not related to the lwip changes in IDF.

tablatronix commented 1 year ago

It seems this is a normal PSK PMK failure, But this handling is new, so either these events were missing in previous sdk and handled by auth_fail, not sure how you can infer the difference now..

EIther way a 60s autoreconnect is too damn high.. lol

mrengineer7777 commented 1 year ago

@tablatronix Interesting. Are you saying the Autoreconnect is not working, or it's too aggressive and never fails? I agree that the UniFi AP may be sending back a wrong response on a bad password.

tablatronix commented 1 year ago

It looks like adding autoreconnect to reason 15 was a bad idea, as bad passwords (depending on wpa config) will throw this error. Also autoreconnecting for 60s is way too long but we can pass custom timeout so not the main issue.

tablatronix commented 1 year ago

Conclusion waitForConnectResult does not check for WL_CONNECTION_LOST and gets stuck for timeout

WL_CONNECTION_LOST is < WL_DISCONNECTED

tablatronix commented 1 year ago

oh hey @mrengineer7777 you are the person that refactored this, I wasn't pulling git, thanks this needed a major cleaning up.

A couple things I am looking at atm.

tablatronix commented 1 year ago

The good news is that my other deauth issue # is gone now, I can connect in 300ms..

tablatronix commented 1 year ago

My solution atm is to fail on this reason, and it is working ok. Will have to re-read old issues that demonstrate other causes of this reason and breaking effects

       if(reason == WIFI_REASON_ASSOC_LEAVE) {                                     //Voluntarily disconnected. Don't reconnect!
        }
        if(reason == WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT) {                                     //Voluntarily disconnected. Don't reconnect!
            WiFiSTAClass::_setStatus(WL_CONNECT_FAILED);
        }
        else if(first_connect) {                                                    //Retry once for all failure reasons

and exclude it from autoreconnect

bool WiFiGenericClass::_isReconnectableReason(uint8_t reason) {
    switch(reason) {
        case WIFI_REASON_UNSPECIFIED:
        //Timeouts (retry)
        case WIFI_REASON_AUTH_EXPIRE:
        // case WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT:
        case WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT:
mrengineer7777 commented 1 year ago

@tablatronix There was quite a bit of discussion on this. Please read here: https://github.com/espressif/arduino-esp32/issues/7210.

It's been a while, but I believe WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT was one of the common failures I would see on initial connection with a good password. Thus it needs to be in the first retry. I could be convinced that it shouldn't be included in additional retries.

Honestly the whole reconnect mechanism feels kludgy and it could use a re-write. However with the PR I submitted it works "well enough" and I really don't have time to write my own WiFi library from scratch.

I developed and tested _isReconnectableReason() under arduino-esp32 v2.0.3 and v2.0.5. It's possible the behavior is different under v2.0.7. I recently read that IDF 4.4.4 changed the LWIP version and may have introduced a DHCP timeout bug.

Probably unrelated: An API change was also introduced here: https://github.com/espressif/arduino-esp32/pull/7577/files . I don't see how it impacts the disconnect reasons. It could cause disconnects if it introduces the DHCP timeout bug.

tablatronix commented 1 year ago

I figured those errors was due to a bug in PMF, similar to the binary change that was needed here.

https://github.com/espressif/esp-idf/issues/8192

I would be curious to see if your spurious 4way timeouts are still occurring

mrengineer7777 commented 1 year ago

@tablatronix That was an interesting read. I'll have to turn off autoreconnect and see what kind of connection failures I am getting. I manage our UniFi AP at work, so I can also try turning on protected management frames (PMF) and see if that is an issue. Unfortunately I'm not setup for packet capture on wireless. Best I can give you is disconnect reason codes from the Arduino core.

According to IDF release notes, PMF fix may have been added in IDF 4.4.2. Arduino-esp32 v2.0.5 includes 4.4.2. It is very possible it was fixed since I did my initial testing on v2.0.3.

Anecdotally I sometimes still see WiFi disconnects shortly after a reboot, but the autoreconnect logic gets the device right back on the network.

I'll do some testing. Ping me in 2 days if I forget.

tablatronix commented 1 year ago

Yeah those are usually deauth protections and it makes you wait to prevent dos, usually there is a relevant packet descriptor, at least that is my assumption, never found a whitepaper or hostapd docs on that, and its probably not in spec.

Alot of this stuff probably gets copied from cisco

tablatronix commented 1 year ago

I updated the title to clarify this is not a problem with trying to reconnect, its a problem with WPA2 connection fails, and they should not be retried, because this is the only event response to abort the connection. Hence we get stuck in a waitforconnectresult timeout.

If there are legitimate auth or connection issues with 4-way timeouts then they should be handled special, waitFor.. timeoutLength should only be there to timeout in cases of missing events

I would revert this fix and readdress the bug it was intended to fix.

In that case perhaps another single reconnect attempt limiter inside of firstconnect

Another thing to test is does the spec or router implementations send another event after this timeout to explicitly say auth failed..? Maybe we just need to wait for that, I will see what i can find. Need a wifi engineer here...

Jason2866 commented 1 year ago

@tablatronix Released today a new core build based on latest IDF 4.4. A lot of changes are done since last release. Maybe it fixes the issue. Only working with Platformio!

platform = https://github.com/tasmota/platform-espressif32/releases/download/2023.06.04/platform-espressif32.zip
tablatronix commented 1 year ago

@Jason2866 thanks, will check it out