esp8266 / Arduino

ESP8266 core for Arduino
GNU Lesser General Public License v2.1
16.03k stars 13.33k forks source link

WiFi cannot connect until a power cycle. #5527

Closed Swedish-Coder closed 4 years ago

Swedish-Coder commented 5 years ago

Basic Infos

Platform

-Hardware: ESP8266 -Core Version: 2.4.2 -Development Env: Visual Micro -Operating System: Windows

Settings in IDE

-Module: Generic ESP8266 Module -Flash Mode: qio -Flash Size: 2MB (256k Spiffs) -lwip Variant: v1.4 -Reset Method: ck -Flash Frequency: 40Mhz -CPU Frequency: 80Mhz -Upload Using: Serial -Upload Speed: 921600

Problem Description

Using the reconnect from below works fine 99% of the time. Running this test on ~50 ESP8266. Reconnect is called every 30seconds. After some undefined time (ranging from 1h to 10days..) the Wifi will not be able to reconnect.

The Status stays like this until the device is reset (hardware reboot), have tried letting the devices stay in this mode for days, but they never recover. Rebooting the AP does not help. The signal strength is also not the issue (ran a test with AP in front of ~42 ESP8266 with ~2m away from devices, they all had ~ -65 dba).

Have tried different disconnect / reconnects, all resulting in the same stale state. Currently trying WiFi.disconnect(true); forceSleepBegin(35000); //…do other tasks for 25ms WiFi.forceSleepWake(); WiFi.persistent(false); WiFi.begin(ssid.c_str(), psk.c_str());

This sometimes causes crashes on the WiFi.begin(..) line.

Is there a way to “properly” do a reconnect on WiFi, or a way to reset the WiFi to “hardware boot” status (delete all DHCP information, known AP, etc.)?

MCVE Sketch

void Reconnect(){
WiFi.disconnect(true);
WiFi.mode(WIFI_STA);

While(WiFi.status() != WL_CONNECTED){
//do other things
}
Serial.println(String(F("[WiFi] Connect now... heap: ")) + String(ESP.getFreeHeap()) + String(F(", ts: ")) + String(millis()));

//… load ssid & pass from spiffs
WiFi.begin(ssid.c_str(), psk.c_str());
}
kherNI commented 5 years ago

I'm glad I'm not the only one that has seen this with 2.4.2. I have 2 devices that are installed a long way away and after anywhere from 1-10 days, like you mention, they will not reconnect to the WiFi AP. In this application they are going through a cellular modem and they seem to lose communication fairly regularly (multiple times a day) even though the signal strengths have been pretty good. I was thinking that maybe it had something to do with the changes with lwIP. I haven't been able to try and diagnose them due to their location. In the meantime I rolled back to 2.4.0 (I've had serious problems with 2.4.1) and the reliability seems to have improved. I've also implemented some logic that if after a certain period of time (i.e. 5 minutes) of not being able to reconnect the MCU is restarted. I was hoping to see if things improve with 2.5.0 when it comes out of Beta.

d-a-v commented 5 years ago

I was hoping to see if things improve with 2.5.0 when it comes out of Beta.

We need to reproduce to have a chance to fix. Fixes will be made to 2.5.0-beta2, not to 2.4.x. Please try it, and enable debug on Serial to possibly get firmware messages that would help understanding.

I've been running this sample for several hours with core-2.5.0-beta2 with no issue so far.

#include <ESP8266WiFi.h>

#ifndef STASSID
#define STASSID "your-ssid"
#define STAPSK  "your-password"
#endif

const char* ssid     = STASSID;
const char* password = STAPSK;

const char* host = "10.0.1.253";
const uint16_t port = 80;

void connect ()
{
  Serial.printf("wake = %d\n", WiFi.forceSleepWake());
  WiFi.mode(WIFI_STA);
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");
  Serial.println("IP address: ");
  Serial.println(WiFi.localIP());
}

void disconnect ()
{
  WiFi.disconnect(true);
  Serial.printf("sleep 1us = %d\n", WiFi.forceSleepBegin(1));
}

void setup() {
  Serial.begin(115200);
  Serial.println();
  Serial.print("Connecting to ");
  Serial.println(ssid);
  WiFi.persistent(false);
}

void loop() {
  connect();
  Serial.print("connecting to ");
  Serial.print(host);
  Serial.print(':');
  Serial.println(port);

  WiFiClient client;
  if (!client.connect(host, port)) {
    Serial.println("connection failed");
    delay(5000);
    return;
  }

  Serial.println("sending data to server");
  if (client.connected()) {
    client.println("hello from ESP8266");
  }

  unsigned long timeout = millis();
  while (client.available() == 0) {
    if (millis() - timeout > 5000) {
      Serial.println(">>> Client Timeout !");
      client.stop();
      return;
    }
  }

  Serial.println("receiving from remote server");
  while (client.available()) {
    char ch = static_cast<char>(client.read());
    Serial.print(ch);
  }

  Serial.println();
  Serial.println("closing connection");
  client.stop();

  disconnect();
}
Swedish-Coder commented 5 years ago

An update on the progress of tracking down this bug in the new Version:

-Prepared a new Version using 2.5.0-Beta2 and lwlp v2. -Ran a test on 10 Devices yesterday.

We seem to have some other issues from the switch from 2.4..2 to 2.5.0-Beta 2, the heap is way lower and it seems that we got a memory leak from switching as the heap is in a steady decline, causing a reboot ~ every 2hours. Keep that in mind for the results..

4 Devices did not make it through the night. All of them giving up ~50-60min after the last boot (after which the end up in a state where they cannot establish a connection). These keep on trying to connect every 10 seconds, but have never been able to establish the connection again.

Restarted 2 of the 4 Devices made them able to reconnect on the first try. Leaving the other 2 to keep on trying to connect for now to see if they recover.

The other 6 are still running and reconnect successfully every 10 seconds. (but are also rebooting a lot)

The next steps will be to find and at least reduce the memory leak so that we can conduct a more meaningful test on this. Will increase reconnect timer to 15 seconds to give the chip more time to reconnect. Will also model the disconnect / reconnect after the example d-a-v posted (using force sleep / wake), but will also wait 1.2 seconds between disconnect and reconnect. Since all but one of these devices are not connected via Serial, will also implement that the last "failed wifi" status will be saved to the spiffs for more information. Ill post my findings here soon...

OK will have 4 running with the new version which is a bit more stable and using the changed mentioned above. Will update if the issue happens on any of the 4 Devices.

I will also run 6 devices on Version 2.4.0 with the changes above and reconnect every 15seconds to see if the issue occurs on this version as well.

Update

The issue also happened once on the devices running 2.4.0 and lwIp 1.4.

TD-er commented 5 years ago

At ESPeasy we're getting similar reports on this for many months now. My plan for a next test was to make a call to set the wifi mode to off and on again to see if that will fix this state.

d-a-v commented 5 years ago

The sketch I posted above had been working for two full weeks (I was away and had let it run).

In order to help debugging, we need to be able to understand where the cause is. Ideally, a packet capture facility running on the access point would help to check wether the esp is connecting to wifi (radio, layer1) then sends dhcp requests (or any other IP packet with fixed address, layer2) on the network. A dd/openwrt-like AP is nice because one can run tcpdump -ni <intf> host 192.168.xxx on it.

TD-er commented 5 years ago

@d-a-v It seems there are several issues at hand which overlap in observed symptoms. A quick glance at the ESPeasy WiFi related issues shows it has been reported for quite some time now.

This one ESPeasy - DeepSleep Problem to send Data has some nice Wireshark dumps/screenshots.

But like I said, it seems to be more than one issue, or at least there is not one single remedy (apart from reboot) to solve it. Some reported fixes (e.g. reboot AP or disconnect ESP from within AP management tools) do not seem to work for others.

One of our users is now testing if this suggested fix will work (in his setup): https://github.com/esp8266/Arduino/issues/2186#issuecomment-233853152

Also this closed issue (WiFi Reconnect only after power cycle #2235) does seem to be related to the one we're discussing right now.

d-a-v commented 5 years ago

Please do your tests and reports using the latest "git" version of this core. Remember that espressif radio/layer1 firmware has been updated since core-2.4.2 (last summer).

TD-er commented 5 years ago

You mean the 'stage' branch?

d-a-v commented 5 years ago

We have only one branch. So yes. Or release 2.5.0-beta2

Swedish-Coder commented 5 years ago

The error has now occured since we have implemeted a bit more exception logging. Ammongs other things we now save the last WiFi dc / reconnect Event. The Last WiFi Connection error that we received was "WIFI_DISCONNECT_REASON_AUTH_EXPIRE"

This on 2.4.2 and LwIP 1.4.

TD-er commented 5 years ago

WIFI_DISCONNECT_REASON_AUTH_EXPIRE is a normal disconnect reason and it may occur quite often (depending on the settings in de AP)

Swedish-Coder commented 5 years ago

Another short update We are now logging the last 10 Disconnect / Reconnect Events. The Devices are Auto rebooted after 20 hours with no Wifi.

We have this running on ~50 Devices, with another ~400 comming online soon. The Issue has occured 3 times since then...So it does not occur often. However it happens faster the more we disconnect / reconnect.

Here are the Logs

LastWiFiDisconnectReason=8,2,8,201,2,201,2,8,2,8,;TotalFoundWiFis=7;RSSI=66%(-67) LastWiFiDisconnectReason=2,8,2,8,204,2,201,2,8,2,;TotalFoundWiFis=4;RSSI=64%(-68) LastWiFiDisconnectReason=2,201,2,8,2,8,2,8,2,201,;TotalFoundWiFis=7;RSSI=66%(-67)

The 8 (ASSOC_LEAVE) is mostly triggered when we try to reconnect (disconnect is called) We are seeing a lot of reason 2 (WIFI_DISCONNECT_REASON_AUTH_EXPIRE) which I agree is a normal reason, but not for so many consecutive tries....

Swedish-Coder commented 5 years ago

Short update on what we are currently examining.... We have modified the Code to recreate the WiFi class when we reconnect.

WiFi.~ESP8266WiFiClass();
WiFi = ESP8266WiFiClass();

We have been running this test for the last 10 days on 40+ Devices and the issue has never happened since. This on 1.4.2 and LwIp 1.4.

TD-er commented 5 years ago

@Swedish-Coder That's a nice one to try also.

I'm looking at those classes. This class inherits from a lot of other classes, but it has no destructor defined (thus also not a virtual one)

https://github.com/esp8266/Arduino/blob/9ec03ed3f695a18c4f61fc257604e26a44a753eb/libraries/ESP8266WiFi/src/ESP8266WiFi.h#L57

That should be no issue, if the classes it inherits from do have a virtual destructor defined. But ESP8266WiFiGeneric also has none (also no members, not really a problem I guess) Same for ESP8266WiFiSTA , ESP8266WiFiScan and ESP8266WiFiAPClass.

All these classes have static variables defined, but maybe these are only reset in the constructor?

openfsguruh commented 5 years ago

Hi all, I like to reopen or reconfirm the issue and can hopefully give relevant information! I am using Arduino 1.8.5 on Windows 10 using now ESP8266 2.5.0 (released one no beta) I can still see this reconnect issue on some boards but not on all. This is really wired because all ESP8266 are the same Wemos D1 mini and I buyed them at the same date (around Jan 2018) ESP8266 2.4.0 didn't show me this reconnect issue, I got it since 2.5.0. Some ESP reconnects after the reset switch fine, others don't. The crazy thing is WiFi.status() reports a connected state but it is not working. After playing a while around I found this persistent option is playing for me a key role. Usually (and think most of you) have WiFi.persisten(false) in their sketches. It was for me the same. When I set for test reasons WiFi.persistent(true), I can reproduce the reconnect issue on all(!) ESP8266. So this allowed me to analyse it in more detail on a other working ESP reproducing the problem I have only on some of them with peristent off.

I have seen the reconnect was reportet internally really fast but later on was not working. So when I do two cycles in the startup I got a workaround for this reconnection issue also for my bad ESPs.

So in setup I do following: WiFi.persistent( false ); WiFi.setAutoConnect(true); WiFi.mode(WIFI_STA); WiFi.config(ip,gateway,subnet,channel, bssid); WiFi.setSleepMode(WIFI_LIGHT_SLEEP); delay(100); WiFi.begin(ssid, password); WiFi.disconnect(); // first cycle which may be broken delay(200); WiFi.begin(ssid, password); // now WiFi.status() != WL_CONNECTED cycle as usual

This slows down the startup cycle but solved the reset issue for me. It seems there is something stored in flash from older ESP versions, which I can't reset and seems is still influencing WiFi startup even when not explicitly stated. lwIP variant: v2 lower Memory VTables: flash Exceptions: disabled

AndreKR commented 5 years ago

There is one thing I noticed when I enabled WIFI debugging, maybe it is of interest.

In my case I was waking up from deep sleep and I had auto-connect on (the default). When I was looking at the log, I saw the order of events was:

Now I noticed there was no second "dhcp client start..." in the log.

I guessed that maybe this was the reason that the WiFi connection didn't work sometimes.

I now disable the auto-connect in setup() and handle the WiFi connection myself:

  if (WiFi.SSID() != "") {
    println(F("WiFi credentials set in flash, wiping them"));
    WiFi.disconnect();
  }
  if (WiFi.getAutoConnect()) {
    println(F("Disabling auto-connect"));
    WiFi.setAutoConnect(false);
  }
  WiFi.persistent(false);
  WiFi.mode(WIFI_STA);
  WiFi.begin(...);

And suddenly the connection works every time.

d-a-v commented 5 years ago

@AndreKR @openfsguruh can you temporarily remove your workaround and try the one-line #6057 proposal ?

AndreKR commented 5 years ago

@d-a-v No, that doesn't seem to help. Seems my diagnosis was wrong.

But I was able to reduce the code to a simple repro - this sketch quite reliably1 breaks WiFi on all ESP-12F modules I have:

#include <ESP8266WiFi.h>

void setup() {
  Serial.begin(115200);
  delay(3000);
  WiFi.mode(WIFI_STA);
  WiFi.setAutoConnect(true);
  WiFi.begin("Home", "XXX");
}

void loop() {
  if (WiFi.status() != WL_CONNECTED)
    Serial.println("connected");
  else
    Serial.println("not connected");

  Serial.println(WiFi.status());

  delay(500);
}

Example log output:

SDK:3.0.0-dev(c0f7b44)/Core:2.5.0=20500000/lwIP:STABLE-2_1_2_RELEASE/glue:1.1/BearSSL:6778687
wifi evt: 2
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 5
cnt 

connected with Home, channel 6
dhcp client start...
wifi evt: 0
ip:192.168.1.105,mask:255.255.255.0,gw:192.168.1.1
wifi evt: 3
hello wifi
stopping
starting
connected
6
scandone
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
not connected
3
pm open,type:2 0
not connected
3
not connected
3
not connected
3
...

The hello wifi, stopping and starting are debug output that I added together with your workaround. hello wifi at the beginning of ESP8266WiFiSTAClass::begin and stopping and starting right before your workaround.

1 The WiFi connects maybe one in 10 resets.

d-a-v commented 5 years ago

Thanks for this reproducible MCVE. For now I only see this as a firmware bug and I found no possible way to recover from it (yet). This bug exists with both lwIP-v1.4 and -v2. I tried a hack in lwip2 with half luck. I stayed connected a bit longer: The "status == connected" is lost right after the internal firmware serial message "scandone" (on which we have no control, it's not an exposed event).

(the hack is to call system_station_got_ip_set(old-address) a second time with the new address)

877dev commented 5 years ago

Hi all,

Was there an attempt to fix this issue in 2.5.2? As I am having issues with wifi not reconnecting, and I have to power cycle (using 2.5.2)

Thanks

d-a-v commented 5 years ago

@AndreKR Your sketch

  if (WiFi.status() != WL_CONNECTED)
    Serial.println("connected");
  else
    Serial.println("not connected");

should be

  if (WiFi.status() == WL_CONNECTED)
    Serial.println("connected");
  else
    Serial.println("not connected");

(which works)

@877dev please have a try with latest git master branch (post 2.5.2).

@Swedish-Coder @TD-er @openfsguruh @AndreKR Is the issue still present in master ?

TD-er commented 5 years ago

I have not seen reports of unreachable units running core 2.5.2 But what I have seen on nodes I have is that the reported state of WiFi.status() is (still) not always correct. In ESPeasy there are quite a few (too many if you ask me) workarounds present to restart WiFi if it is detected that there is no actual network connection. Some of these result in crashes, which also solves this symptom, albeit not really elegant.

AndreKR commented 5 years ago

:disappointed: Will investigate again.

d-a-v commented 5 years ago

@TD-er Do you detect loss of connectivity with WiFi.status() or some other mean like ping ?

TD-er commented 5 years ago

Nope. The status keeps returning WL_CONNECTED, even when the connection is apparently lost. This does lead to timeouts or waiting for data that will never arrive. I do test the status always before starting a connection to the network and also when waiting for a reply. But every now and then it does happen the returned status is telling me there is a connection but anything I try to do then does result in timeouts. (DNS lookup, NTP request, http request, waiting for data)

I do send gratuitous ARP requests, so the switch/AP should know how to reach the node.

d-a-v commented 5 years ago

We might learn things from the serial port in debug mode when this connectivity loss happens.

TD-er commented 5 years ago

Is there a way of collecting this by not being connected to this port? Problem is that it may take quite a long time for this to occur, at least here where we don't have a lot of WiFi interference. Or else I must make some kind of setup using a Linux machine which will pipe the serial port to a file or something like that.

d-a-v commented 5 years ago

I think we would need to hijack ets_printf() or ets_putc() to do that.

TD-er commented 5 years ago

OK, I've been working on the WiFi part for almost the whole day today and I now have something that is almost 75% reproducible in the crashes I see.

First a very short description of what I'm seeing here and I am not entirely sure it is fully related to this issue or a side effect of something else.

I did try to simplify the WiFi code of ESPeasy by trying to use the WiFiMulti class. While using parts of this code, I ran into something very similar to this issue. It would connect just fine but after a while it would take 10s of attempts (sometimes over a 100) to reconnect to the access point. After a power cycle it was able to reconnect much faster. But the thing is, that adding some delays after a call to the WiFi.begin(...) or inbetween calls to run of the WiFiMulti class could also help. So it looks like it may not be entirely related to the ESP but maybe the AP was simply not yet aware of the fact the ESP was no longer connected.

With this idea in mind, I tried to disconnect the node from the AP by sending a 202 Auth Fail from the AP to the ESP node. In my code it will almost immediately try to reconnect to the last known AP. It will even use the last known channel and BSSID.

In my test setup, I have the ESP running some code to try to connect to a non-existing host on the network (HTTP connection to host "unknown" at 100 msec interval). If I kick the node from the network, it will do a call to WiFi.begin(...)

  setupStaticIPconfig();
  setConnectionSpeed();
  switch (wifi_connect_attempt) {
    case 0:
      if (lastBSSID[0] == 0)
        WiFi.begin(ssid, passphrase);
      else
        WiFi.begin(ssid, passphrase, last_channel, &lastBSSID[0]);
      break;
    default:
      WiFi.begin(ssid, passphrase);
  }
  delay(100); // FIXME TD-er: Not sure why, but a delay is really needed here or else you will run into timeouts.
  ++wifi_connect_attempt;
  switch (WiFi.status()) {
    case WL_DISCONNECTED: {
      break;
    }

When it crashes, the log I write in this case WL_DISCONNECTED: is the last entry I see. After this, the node should receive 2 events (got connected and got IP) to continue. It looks like either these are missed (or never sent) or the connect attempt does really take over 8 seconds. Or maybe some buffers are not being emptied and thus preventing other actions since all resources are occupied. (max. connections, etc)

This is more of a write down for me to keep track of what's happening here, since I'm not really sure yet what is happening here. One thing that is for sure is that the WiFi.status() is not always in sync with reality. For example even after I received the disconnect event I've seen the status value still reporting WL_CONNECTED. And also the other way around. After receiving the "got IP" event, it will take 40 - 100 msec before you can actually send stuff. If you try to send data before, you are very likely to hit the timeout of the client first. So waiting 40 - 100 msec can make things faster :)

Edit: I added these to the function processing the disconnect event (called from the loop())

  WiFiUDP::stopAll();
  WiFiClient::stopAll();

It does make it a bit harder to reproduce (1 in 4 lead to a crash instead of 3 in 4), so it looks like it may have some impact.

TD-er commented 5 years ago

Just a small update here. I just compared the ESP8266 NON_OS SDK API reference (3.5.15. wifi_station_get_connect_status) with the esp8266/Arduino code of status()

https://github.com/esp8266/Arduino/blob/2d6b680554886ac182801bbcca6dd43f85ee6eeb/libraries/ESP8266WiFi/src/ESP8266WiFiSTA.cpp#L627-L643

The case of STATION_CONNECTING is not handled and thus returning WL_DISCONNECTED. Maybe it is a good idea to also give this as a possible return value? (it may break existing code checking for the WL_DISCONNECTED value) At least it is the only enum value that's missing so it seems.

Also there is a note in the PDF which does kind of looks related to the issue I'm seeing:

In a special case, if you call wifi_station_set_reconnect_policy to disable reconnection, and do not call wifi_set_event_handler_cb to register Wi-Fi event handler, wifi_station_get_connect_status will become invalid and can not get the right status.

The reconnect policy is enabled here, and also I have registered event handler callback functions but still the return value of the status() function may not always be up-to-date.

d-a-v commented 5 years ago

Thanks for your researches, did you happen to get STATION_CONNECTING from wifi_station_get_connect_status() ?

TD-er commented 5 years ago

I have not tested that function directly, but I do get the WL_DISCONNECTED a lot when it fails (and thus WD reboots) to perform a (re)connect. So indirect it is very likely I get that return value since there is no other enum value mapping to that result.

TD-er commented 5 years ago

Just to give some debugging context here:

7516 : WIFI : Disconnected: WiFi.status() = 0
7620 : WIFI : Connecting MikroTik attempt #2
7723 : WIFI : WiFi.status() = WL_DISCONNECTED  SSID: MikroTik
7841 : WIFI : Disconnected: WiFi.status() = 6
8049 : WIFI : Disconnected: WiFi.status() = 6
8254 : WIFI : Disconnected: WiFi.status() = 6
8461 : WIFI : Disconnected: WiFi.status() = 6
8678 : WIFI : Disconnected: WiFi.status() = 6
8883 : WIFI : Disconnected: WiFi.status() = 6
9087 : WIFI : Disconnected: WiFi.status() = 6
9338 : WIFI : Disconnected: WiFi.status() = 6
9542 : WIFI : Disconnected: WiFi.status() = 6
9748 : WIFI : Disconnected: WiFi.status() = 6
9970 : WIFI : Disconnected: WiFi.status() = 6
10174 : WIFI : Disconnected: WiFi.status() = 6
10379 : 
 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U55 : 

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.2-dev(c0eb301), LWIP: 2.1.2 PUYA support)
55 : INIT : Free RAM:33072

And at the time where the connect if successful:

10111 : WIFI : Disconnected: WiFi.status() = 6
10360 : WIFI : Disconnected: WiFi.status() = 6
11405 : WIFI : Entering processConnect()
11507 : WIFI : Connected! AP: MikroTik (B8:69:F4:9F:21:FA) Ch: 1 Duration: 3770 ms
11509 : EVENT: WiFi#ChangedAccesspoint
11536 : backgroundtasks: WiFi connected

The last line is also triggered when the WiFi.status() reports WL_CONNECTED The "WIFI : Disconnected...." lines are printed from within the loop(). "WIFI : Entering ..." is when an event has been received and needs to be processed (boolean flag checked in loop() )

And some debug output when there was a connection and I kicked it from the AP:

21116 : WIFI : Entering processDisconnect()
21217 : EVENT: WiFi#Disconnected
21235 : WIFI : Disconnected! Reason: '(202) Auth fail' Connected for 9666 ms
21236 : WIFI : Disconnected: WiFi.status() = 0
21341 : WIFI : Disconnected: WiFi.status() = 0
21447 : WIFI : Disconnected: WiFi.status() = 0
21551 : WIFI : Disconnected: WiFi.status() = 0
21665 : MQTT : Connection lost, state: Connection lost
21666 : EVENT: MQTT#Disconnected
21686 : WIFI : Disconnected: WiFi.status() = 0
21790 : WIFI : Connecting MikroTik attempt #0
21895 : WIFI : WiFi.status() = WL_DISCONNECTED  SSID: MikroTik

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U54 :

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.2-dev(c0eb301), LWIP: 2.1.2 PUYA support)
55 : INIT : Free RAM:33072
d-a-v commented 5 years ago

There are phy updates in #6257, can you try with it ?

TD-er commented 5 years ago

In PlatformIO I must then use this define? PIO_FRAMEWORK_ARDUINO_ESPRESSIF_SDK22y And of course somehow making sure your PR branch is checked out on the computer where I'm building....

d-a-v commented 5 years ago

PIO_FRAMEWORK_ARDUINO_ESPRESSIF_SDK22y

Yes, but I didn't try it myself

TD-er commented 5 years ago

Since I'm doing this every now and then, I have a feature request.... Can it be made a lot easier for platformIO users to just add some label somewhere to fetch a PR branch? Every time I have to fetch some PR version to test some Arduino code it takes me >1 hour to make a build and sometimes a next build is just not referring to the same branch anymore which takes a lot of time to debug.

d-a-v commented 5 years ago

What kind of label ? Whenever I need to test a PR I have this script called PR:

#!/bin/sh
test -z "$2" && { echo "give # name"; exit 1; }
set -x
git checkout master
git branch -D pr-$1-$2
git fetch origin pull/$1/head:pr-$1-$2 && git checkout pr-$1-$2
git branch | grep pr-$1-$2

which gives:

$ PR 6257 my-sdk222y-testing-branch
[...]
Switched to branch 'pr-6257-my-sdk222y-testing-branch'
[...]

(this script works with any repository on github)

TD-er commented 5 years ago

I did those steps in the terminal by hand and the result is that PlatformIO is detecting it is another core lib and thus renames the folder and start fetching the one it thinks it needs.

So it renames C:\Users\gijs\.platformio\packages\framework-arduinoespressif8266 into C:\Users\gijs\.platformio\packages\framework-arduinoespressif8266@2.5.0-dev

No idea where it gets this 2.5.0-dev from, since it states 2.6.0-dev in the file platform.txt.

Edit: It is in package.json. Changed the version in there to match the one it thinks it is building against and renamed the folders. Now it does have the newer core lib, since it is not able to build due to compiler errors.

Edit2: It is building now, I forgot to run the submodule init call.

d-a-v commented 5 years ago

Maybe @ivankravets could help with the platformio process ?

TD-er commented 5 years ago

OK, now it has been built with the ..SDK22y flag And it is running ESP82xx Core 27327617, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support

So far I have not been able to get it to crash after a disconnect. But it also seems to ignore/miss the WiFi disconnects I send from the MikroTik AP.

Edit: Sadly it still will crash occasionally on a reconnect

187456 : EVENT: WiFi#Disconnected
187495 : WIFI : Disconnected! Reason: '(202) Auth fail' Connected for 4513 ms
187496 : WIFI : Disconnected: WiFi.status() = 0
187602 : WIFI : Disconnected: WiFi.status() = 0
187706 : WIFI : Disconnected: WiFi.status() = 0
187811 : WIFI : Disconnected: WiFi.status() = 0
187915 : WIFI : Disconnected: WiFi.status() = 0
188019 : WIFI : Connecting MikroTik attempt #0
188123 : WIFI : WiFi.status() = WL_DISCONNECTED  SSID: MikroTik

 ets Jan  8 2013,rst cause:4, boot mode:(3,7)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v27327617
~ld
�U69 : 

INIT : Booting version:  (ESP82xx Core 27327617, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support)
70 : INIT : Free RAM:32800
71 : INIT : Warm boot #26 Last Task: Const Interval timer, id: 3 - Restart Reason: Hardware Watchdog
74 : FS   : Mounting...

But good news is it is a lot less often

TD-er commented 5 years ago

Just an observation here. I tried to disconnect it a few times and noticed nothing on the ESP side. So no disconnect events were received. The unit is still reachable from the network etc. but the reported WiFi.status() is 4, That's WL_CONNECT_FAILED right?

N.B. automatic reconnect is enabled.

Edit: Never mind, it is echo'ing an internal value I use, not the output of the WiFi.status().

TD-er commented 5 years ago

Something fishy is going on here and not just regarding this new SDK version. I've also seen it happen a lot yesterday, now that I realize it.

Problem is that one build seems to be able to connect to the AP just fine and a next one is just not able to connect to the AP. Only once in 10's or maybe 100 attempts (and thus crash/reboots)

Yesterday I saw it when changing delay values to get a feeling for their impact, but now I saw it again. I just replaced some simple value in a log string.

if (loglevelActiveFor(LOG_LEVEL_INFO)) {
    String log;
    log.reserve(60);
    log = F("WD   : Uptime ");
    log += wdcounter / 2;
    log += F(" ConnectFailures ");
    log += connectionFailures;
    log += F(" FreeMem ");
    log += FreeMem();
    log += F(" WiFiStatus ");
    log += wifiStatus;
    addLog(LOG_LEVEL_INFO, log);
  }

And then replacing wifiStatus into WiFi.status().

With just this single change the resulting executable was not able to connect at all. Not even with a clean and rebuild. Changed it back, did a clean and a rebuild and it connects just fine.

This is also behavior we're seeing for a long time in ESPeasy builds. Some people report issues about not being able to connect while for others it is working just fine.

This does "smell" like some buffer allocation or initialization which is not right. So either at runtime some buffers are allocated in a different order or there is some compile time effect in play. It just looks like we're addressing something that's either not 0-terminated or just a byte short in allocated space and then try to find the end of a string or something. Also I have no idea if it is in my code or the core library, but these things do happen every now and then.

Is there some compiler flag possible to force any allocated array to be initialized with 0 (or some predefined value) ?

Edit: I also noticed the 'failing' builds (crashing at the initial wifi connect the same way it crashes "normally" when processing a disconnect) were also not running I2C communications well. In the logs I noticed a lot of errors of a plugin not able to read some registers from a connected I2C device. With the proper build I2C is also working fine.

ivankravets commented 5 years ago

Do we have any problems with PlatformIO?

TD-er commented 5 years ago

@ivankravets Well I am not 100% sure. :) At least I have some issues with fetching a specific core lib version using PIO. The use case here is to mark somehow a specific PR or branch from this repository and build a version against it. (which is more of a feature request I guess)

About whether or not we have a problem with PIO... A while ago (about a year ago) you mentioned something regarding linker issues which may be related to something in PIO's build chain. I wonder if it is possible I am seeing the same again with my builds since they act different between builds if something completely unrelated has changed. (e.g. WiFi connection issues and/or I2C issues when changing a log text somewhere else)

d-a-v commented 5 years ago

@TD-er

Is there some compiler flag possible to force any allocated array to be initialized with 0 (or some predefined value) ?

It is the default. A special attribute is needed to not initialize globals or allocated variables (to 0).

This does "smell" like some buffer allocation or initialization which is not right.

In doubt, can you run your tests with disable_extra4k_at_link_time() to be called somewhere in your app (in setup() it is fine).

ivankravets commented 5 years ago

@TD-er

A while ago (about a year ago) you mentioned something regarding linker issues which may be related to something in PIO's build chain.

If I remember correctly, we now use the same build workflow as Arduino IDE, even the same linker scripts. You can take a look at https://github.com/esp8266/Arduino/blob/master/tools/platformio-build.py

Or you mean when a library has extern declarations?

TD-er commented 5 years ago

Or you mean when a library has extern declarations?

As far as I know, we don't use those in this project, so that should not be an issue here.

@d-a-v That one was already present in my code, wrapped in #ifdef ESP8266_DISABLE_EXTRA4K So I will now test some builds with it. For now it doesn't seem to make any difference. A while ago I've already cleaned up the stack allocated stuff in ESPeasy, which is put on the sys stack. So it should not really be an issue anymore, but it's worth a try I guess.

N.B. for my tests I went back to the core 2.5.2 SDK 2.2.1 It does seem a bit faster to me in rendering web pages, but maybe that's too early to tell right now. (SDK 2.2.1 vs. your PR version)

d-a-v commented 5 years ago

What's the status of this issue ? If you still have problem, would you mind to try #6356 which is an attempt to "uniformize" the WiFi sleep-off-on sequence so to have a common base to debug ?

(edit: #6356 is merged)

d-a-v commented 5 years ago

@TD-er @Swedish-Coder Is this issue still relevant in git master ?