espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
12.93k stars 7.09k forks source link

[TW#13516] Wifi Disconnect Loop #724

Closed Resultfactory closed 6 years ago

Resultfactory commented 7 years ago

Lately after a few hours of running the esp32 goes into a disconnect loop.

A crash used to occur on:

I (22425788) wifi: ap_probe_send over, resett wifi status to disassoc

See log

Ever since esp-idf 058eb26574a41d0be7fcbae554b1236ec6147854, the esp32 is going into a wifi disconnect loop instead of a crash. See log. The last 10 lines in this log file repeat until a reset happens.

We have tested esp_wifi_disconnect(), which properly disconnects and reconnects. See log

Except on those wifi: ap_probe_send over, resett wifi status to disassoc This problem occurs on a DOIT board with the esp-wroom32, Pycom LoPy and SiPy. We also tested this on 4 different routers.

Another disconnect loop on both a DOIT and LoPy log where wifi log level is set to verbose. This happend after we updated esp-idf to 7ed8c665471200613552e113fd0c1e2d017ea8f7

What can we do about this problem? We would like to recover from this disconnect. A crash is less problematic because after a restart the esp32 functions again. The same issue is posted on esp32.com

Resultfactory commented 7 years ago

We added a wifi deinit to try and circumvent the disconnect loop.

static void reinit_wifi(void *param)
{
    ESP_LOGD(TAG, "Stop, Deinit Wifi");
    ESP_ERROR_CHECK(esp_wifi_stop());
    ESP_ERROR_CHECK(esp_wifi_deinit());

    ESP_LOGD(TAG, "Init, Start Wifi");
    ESP_ERROR_CHECK(esp_wifi_init(&cfg));
    ESP_ERROR_CHECK(esp_wifi_set_storage(WIFI_STORAGE_FLASH));
    ESP_ERROR_CHECK(esp_wifi_set_mode(WIFI_MODE_STA));
    ESP_ERROR_CHECK(esp_wifi_set_config(WIFI_IF_STA, &wifi_config));
    ESP_ERROR_CHECK(esp_wifi_start());

    esp_wifi_set_ps(WIFI_PS_NONE); // Disable powersave

    ESP_LOGD(TAG, "Restart Wifi Done");
    vTaskDelete(NULL);
}

This function is started as a task, after 10 consecutive disconnects. We tested this function by restarting the router. This caused 10 consecutive disconnects and then the ESP32 reconnected to the router and the internet. See log It seems I (81906) wifi: ap_probe_send over, resett wifi status to disassoc also occurs when resetting the router.

This build ran from 21-6-2017 17:30 till 23-6-2017 on a DOIT and a LoPy The disconnect loop occurred three times on both devices. LoPy: log DOIT: log

The disconnect loops and subsequent crashes occurred on 6 different times. These are the last time marks in the log files before the crash:

DOIT
I (14567551) Statistics: [21-06-17 20:06:56Z] Uptime: 000.04:02:38  IPv4: 10.0.2.251    Free Heap: 17140    Temperature: 75
I (9736161) Statistics: [22-06-17 16:21:57Z] Uptime: 000.02:42:07   IPv4: 10.0.2.251    Free Heap: 75276    Temperature: 76
I (7312241) Statistics: [22-06-17 18:25:33Z] Uptime: 000.02:01:43   IPv4: 10.0.2.251    Free Heap: 75948    Temperature: 75
LoPy
I (23643342) Statistics: [21-06-17 22:09:54Z] Uptime: 000.06:33:54  IPv4: 10.0.2.250    Free Heap: 55524    Temperature: 41
I (4682685) Statistics: [22-06-17 01:18:33Z] Uptime: 000.01:17:53   IPv4: 10.0.2.250    Free Heap: 62288    Temperature: 41
I (12137265) Statistics: [22-06-17 23:27:11Z] Uptime: 000.03:22:07  IPv4: 10.0.2.250    Free Heap: 75264    Temperature: 41
HubbyGitter commented 7 years ago

My code has a control flow aus outlined below: I can confirm that the connect calls yield a disconnect event with reason 201 very often. It takes e.g. 5 or 8 attempts to successfully connect. However, the first connect attempt after a reset via button appears to be always successful -- but not necessarily after the "hard reset" which make flash announces.

ESP32 Rev 0 here.

startup:   tcpip_adapter_init()   esp_wifi_init()   esp_wifi_set_storage()   esp_wifi_set_mode()   esp_wifi_set_config() ondemand:   WAIT_FOR_DEMAND   esp_wifi_start()   esp_wifi_connect() active:   PROCESS_DEMAND   CHECK_DEMAND_OR_TIMEOUT   IF (demand) GOTO active   esp_wifi_disconnect()   esp_wifi_stop()   GOTO ondemand

Resultfactory commented 7 years ago

@HubbyGitter We have no trouble connecting to the AP. There are certain situations where calling esp_wifi_connect() is not enough. Our Wifi event handler can successfully recover from a disconnect, even the reinit_wifi() has successfully run. mentioned in our previous post:

See log

We enabled CONFIG_SW_COEXIST_ENABLE, this does not have a effect on the occurrence of the disconnect loop. Our system ran this weekend on the DOIT and LoPy. 9 Disconnect loops occurred at these times:

DOIT
I (2339213) Statistics: [24-06-17 17:56:45Z] Uptime: 000.00:38:50   IPv4: 0.0.0.0   Free Heap: 139816   Temperature: 74
I (32786244) Statistics: [25-06-17 03:04:19Z] Uptime: 000.09:06:17  IPv4: 10.0.2.253    Free Heap: 74476    Temperature: 74
I (10469024) Statistics: [25-06-17 06:00:27Z] Uptime: 000.02:53:36  IPv4: 10.0.2.253    Free Heap: 38792    Temperature: 73
I (69754) Statistics: [25-06-17 06:08:20Z] Uptime: 000.00:01:00 IPv4: 10.0.2.253    Free Heap: 78924    Temperature: 74

LoPy
I (38835507) Statistics: [25-06-17 10:28:41Z] Uptime: 000.10:47:07  IPv4: 10.0.2.251    Free Heap: 66060    Temperature: 41
I (11640467) Statistics: [25-06-17 14:45:12Z] Uptime: 000.03:13:51  IPv4: 10.0.2.251    Free Heap: 34316    Temperature: 40
I (674877) Statistics: [25-06-17 14:58:18Z] Uptime: 000.00:11:05    IPv4: 10.0.2.251    Free Heap: 79044    Temperature: 40
I (1662877) Statistics: [25-06-17 20:40:20Z] Uptime: 000.00:27:34   IPv4: 10.0.2.251    Free Heap: 78648    Temperature: 41
I (8036827) Statistics: [25-06-17 23:17:55Z] Uptime: 000.02:13:47   IPv4: 10.0.2.251    Free Heap: 78080    Temperature: 41

In all 9 disconnect loops the WDT triggers ( The Task watchdog is set to 60seconds) (log) These lines from reinit_wifi are never reached:

    ESP_LOGD(TAG, "Restart Wifi Done");
    vTaskDelete(NULL);
HubbyGitter commented 7 years ago

@Resultfactory I'd appreciate if you elaborated a bit. When is calling esp_wifi_connect() not enough?

Maybe I should have mentioned that the control flow stops the WiFi and starts from there when an event occurs which does not lead in the right direction.

Resultfactory commented 7 years ago

@HubbyGitter that is our question as well. Our Wifi works fine, until this Disconnect loop happens Log. On every disconnect event, esp_wifi_connect() is called. After 10 consecutive disconnects we try to restart the wifi with reinit_wifi() which ends up triggering the WDT.

This weekend for example, the system was online for 10hours 48minutes (Log). Then this disconnect loop happens, we call reinit_wifi() and the system crashes.

rojer commented 7 years ago

gentlemen, can i ask you if wifi station reconnect works for you after AP goes down and back up? i just filed https://github.com/espressif/esp-idf/issues/738 and feel like an idiot - it seems so trivial. but "reason 201" loop is the common part, so can i ask you to test and see if this may be an issue for you too?

TimXia commented 7 years ago

@Resultfactory If disconnect event happens, do not need to reinit WiFi, just call esp_wifi_connect() to reconnect.

HubbyGitter commented 7 years ago

@rojer For me, it does not matter why a reconnect takes place -- lost connection or intended disconnect. The first connect after system boot works well, and after that, I'm running into (finite) 201 loops, which do succeed eventually (e.g. as I write this, on the fourth attempt). I'm stopping WiFI but not reinitialising it in my FSM if something goes wrong and also, there'll be a short delay. The latest updates as of now have not changed the behaviour substantially.

FHFS commented 7 years ago

FWIW, we're aware of this bug and we're trying hard to get this fixed before the 2.1 release. - ESP_Sprite on esp32.com

HubbyGitter commented 7 years ago

@FHFS Thank you for sharing this piece of critical information.

Resultfactory commented 7 years ago

@HubbyGitter We tested your workaround for reconnecting to the Wifi AP. You call a esp_wifi_stop & esp_wifi_start. We have tried doing the same. For the record we scan BLE advertisements. (*)On every disconnect event we call a esp_wifi_connect, after 10 disconnects with failed esp_wifi_connect's we call esp_wifi_stop & esp_wifi_start. The call to esp_wifi_start never returns and a WDT is generated.

We also preformed this test over the weekend with BLE turned off. The same 201 disconnect loop occurs, but we are able to recover by calling esp_wifi_stop & esp_wifistart. (*)EDIT clarifaction_

HubbyGitter commented 7 years ago

@Resultfactory I don't see it as a workaround to stop the WiFI if it delivers an unexpected event. It doesn't come with too much extra costs (the most expensive stuff happens on wifi init), saves a case distinction or two, and gets you back to a consolidated state (with reasonable likelihood).

I have no idea what "count to ten" means (not an empty loop from 1 to 10, hopefully) and no, I'm delaying between wifi stop and wifi start, not between error and wifi stop -- wifi stop delivers an asynchronous event, for which I first wait (with a generous timeout). Still, there's a delay before the retry. Aggressive retries are rarely helpful, especially since a retry's justification is the hope that something changed in the meantime.

Yes, I'm in the robust code fan club :registered:... :white_check_mark:

Resultfactory commented 7 years ago

@HubbyGitter Thanks for the suggestion, we will test adding one second delay between the stop and start. Although the test we did without BLE recovered nicely without delay from a 201 disconnect loop with stop start.

liuzfesp commented 6 years ago

Hi @Resultfactory I think this issue already fixed in release 2.1

FayeY commented 6 years ago

Any update? Is this problem solved now?

Resultfactory commented 6 years ago

@FayeY Today we encountered our first disconnect loop in a few weeks. Our code had changed alot(we don't continuously scan ble advertisements anymore), this is why we could not be certain if esp-idf v2.1 fixed this issue. See line 336 of our log The last message came in on

9-8-2017 14:17:59> Device: [dev01], Data:[{"Uptime":"000.00:47:00", "Connects":1, "Disconnects":0, "KeysReceived":0, "OpenDoors":0, "InvalidKeys":0, "FreeHeap":56420, "Temperature":41, "IPv4":"10.0.2.250"}]

The disconnect loop keeps repeating, sometimes it finds the AP and the state is set to run. But then disconnects.

W (3668260) BEAST: restartWifi -> done E (3668260) BEAST: Unhandled event 3 W (3668270) BEAST: network_status_changed: esp_wifi_connect on SYSTEM_EVENT_STA_START I (3668400) wifi: n:4 0, o:1 0, ap:255 255, sta:4 0, prof:1 I (3668400) wifi: state: init -> auth (b0) I (3668400) wifi: state: auth -> assoc (0) I (3668410) wifi: state: assoc -> run (10) I (3669470) wifi: connected with linksysTest, channel 4 D (3669470) BEAST: WiFi RSSI: -61 I (3672460) wifi: state: run -> auth (2c0) I (3672460) wifi: n:4 0, o:4 0, ap:255 255, sta:4 0, prof:1 E (3672460) BEA*ST: SYSTEM_EVENT_STA_DISCONNECTED (107), Reason 2

ricksondpenha commented 6 years ago

Hi is there any update on this? any workarounds? this seems to be a critical bug in our project.. restarting the esp wifi during mid transmission is a big bummer!!

liuzfesp commented 6 years ago

HI @ricksondpenha is this issue easy to reproduce or not? According to the log line88, the RSSI is -71, it means the WiFi signal is week, I guess the AP (the router) is far aways from the device (esp32). It disconnects in 336 because esp32 receives a de-auth packet from AP, then the esp32 can't find the AP. So could you help to check the following things:

  1. How far is the AP from ESP32?
  2. Is the mobile easy to connect the AP?
  3. If the issue is easy to reproduce, check whether the mobile phone can find the AP when ESP32 fails to find the AP (when the reason code is 201).
FayeY commented 6 years ago

Closing due to lack of response. Please feel free to reopen it if the issue persists and you can supply more details.