espressif / arduino-esp32

Arduino core for the ESP32
GNU Lesser General Public License v2.1

ESP stops answering network requests #2267

Closed slompf18 closed 5 years ago

slompf18 commented 5 years ago

Hardware:

Board: esp32doit-devkit-v1
IDE name: Platform.io
Flash Frequency: 80 MHz
Upload Speed: 115200
Computer OS: Mac OSX

Description:

I'm trying to serve a website for configuring the SSID/password. The site makes several JS/CSS/AJAX requests. Everything works flawlessly in Chrome. In Safari (iOS or macOS) the board stops answering when the site is reloaded (on the second or third reload, not deterministic).

To serve the site I tried ESPAsyncWebServer and my own little web server (essentially a pool of WiFiClients). Both approaches show the same result.

So my question is: how do I debug this problem?

The sketch I'm providing shows the symptoms. I know I could scan the networks async and then it would work in this little sketch. But doing it sync was the only way I found to demonstrate the problem. In the real application the scan is done async, but the symptoms remain.
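
For readers wondering what the async variant mentioned above could look like, here is a minimal sketch, not the code from the real application. It assumes the `WiFi.scanComplete()` / `WiFi.scanNetworks(true)` / `WiFi.scanDelete()` calls of arduino-esp32; `describeScan` and `registerAsyncScanRoute` are hypothetical names introduced only for illustration. The handler never blocks: it reports whatever state the scan is in and lets the page poll again.

```cpp
#include <string>

// Values mirror WIFI_SCAN_RUNNING / WIFI_SCAN_FAILED in arduino-esp32.
constexpr int kScanRunning = -1;
constexpr int kScanFailed  = -2;

// Hypothetical helper: map a WiFi.scanComplete() result to the response body.
std::string describeScan(int scanResult) {
    if (scanResult == kScanRunning) return "Scan in progress, retry shortly";
    if (scanResult == kScanFailed)  return "No results yet, scan requested";
    return "Found networks: " + std::to_string(scanResult);
}

#if defined(ARDUINO_ARCH_ESP32)
#include <WiFi.h>
#include <ESPAsyncWebServer.h>

// Hypothetical replacement for the blocking /getNetworks handler.
void registerAsyncScanRoute(AsyncWebServer &server) {
    server.on("/getNetworks", HTTP_GET, [](AsyncWebServerRequest *request) {
        int n = WiFi.scanComplete();      // returns immediately
        if (n == WIFI_SCAN_FAILED) {
            WiFi.scanNetworks(true);      // true = async, returns immediately
        }
        String body = describeScan(n).c_str();
        request->send(200, "text/plain", body);
        if (n >= 0) {
            WiFi.scanDelete();            // drop results; the next request starts a new scan
        }
    });
}
#endif
```

The page would then repeat the `getNetworks` request until it receives a count instead of a "retry" message.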

Sketch:

#include <Arduino.h>
#include <ESPAsyncWebServer.h>
#include <SPIFFS.h>

AsyncWebServer server(80);

const char* ssid = "YOUR_SSID";
const char* password = "YOUR_PASSWORD";

void setup() {
    Serial.begin(115200);
    WiFi.begin(ssid, password);
    SPIFFS.begin();
    if (WiFi.waitForConnectResult() != WL_CONNECTED) {
        Serial.printf("WiFi Failed!\n");
        return;
    }

    Serial.print("IP Address: "); Serial.println(WiFi.localIP());

    server.serveStatic("/", SPIFFS, "/").setDefaultFile("index.htm");

    server.on("/getNetworkDetails", HTTP_GET, [](AsyncWebServerRequest *request){
        String result = "Network details: ";
        result += WiFi.SSID() + "/" + WiFi.getHostname();
        request->send(200, "text/plain", result);
    });

    server.on("/getNetworks", HTTP_GET, [](AsyncWebServerRequest *request){
        String result = "Found networks: ";
        result += WiFi.scanNetworks();
        request->send(200, "text/plain", result);
    });

    server.begin();
}

void loop() {
}

index.htm:

<html>
<body>
    <div>
        ESP TEST
    </div>
    <div id="networkDetails">loading network details ...</div>
    <div id="networks">loading number of networks ...</div>

    <script>
        document.addEventListener("DOMContentLoaded", function(event) { 
            var xhr1 = new XMLHttpRequest();
            xhr1.open('GET', 'getNetworkDetails');
            xhr1.onload = function() {
                document.getElementById("networkDetails").innerHTML = xhr1.responseText;
            };
            xhr1.send();

            var xhr2 = new XMLHttpRequest();
            xhr2.open('GET', 'getNetworks');
            xhr2.onload = function() {
                document.getElementById("networks").innerHTML = xhr2.responseText;
            };
            xhr2.send();
        });
    </script>
</body>
</html>

Debug Messages:

There are no error messages, even when setting debug level on verbose.

me-no-dev commented 5 years ago

I have caught browsers opening extra connections when requesting a site. I can't guarantee that this is what is going on, but it could be a case of some connection hanging. Generally Async can deal with that and time out, but who knows what exactly happens... :) Maybe with more clues we will come to a conclusion and a fix.

slompf18 commented 5 years ago

There are no extra connections in the network tab of the browsers, and I did not see any extra connections when debugging my own test web server. And the sample is simple enough that it should not prompt the browser to open any.

Do you have any idea how to find more clues? ;)

me-no-dev commented 5 years ago

Ahh, you even have debug enabled... hmmm... ESP8266 users have also complained about this since they switched to the newer LwIP (same as on ESP32)... the question is how to trace down what is causing it.

slompf18 commented 5 years ago

Because the logs I see are not that verbose, I want to make sure we are talking about the same thing when saying "enabling logs". I was setting the following defines at build time:

and calling the following line in setup: esp_log_level_set("*", ESP_LOG_VERBOSE);

Is there something else?

Here are the logs I get, even when the board stops answering:

I (199) wifi: mode : sta (30:ae:a4:20:51:44)
[D][WiFiGeneric.cpp:345] _eventCallback(): Event: 2 - STA_START
[D][WiFiGeneric.cpp:345] _eventCallback(): Event: 0 - WIFI_READY
I (334) wifi: n:10 0, o:1 0, ap:255 255, sta:10 0, prof:1
I (1068) wifi: state: init -> auth (b0)
I (1075) wifi: state: auth -> assoc (0)
I (1080) wifi: state: assoc -> run (10)
I (1106) wifi: connected with Ways, channel 10
I (1111) wifi: pm start, type: 1

[D][WiFiGeneric.cpp:345] _eventCallback(): Event: 4 - STA_CONNECTED
[D][WiFiGeneric.cpp:345] _eventCallback(): Event: 7 - STA_GOT_IP
[D][WiFiGeneric.cpp:389] _eventCallback(): STA IP: 192.168.1.168, MASK: 255.255.255.0, GW: 192.168.1.1
IP Address: 192.168.1.168
[D][WiFiGeneric.cpp:345] _eventCallback(): Event: 1 - SCAN_DONE

luc-github commented 5 years ago

@me-no-dev this looks similar to the behavior I reported with a telnet connection that stops answering after a few exchanges, no?

plewka commented 5 years ago

Hello, my first post... I'm running into this problem, too.

Hardware:

Board: Olimex ESP32-EVB
CPU: ESP32D0WDQ6 (revision 1)
Core Installation/update date: 12/jan/2019
IDE name: Arduino IDE 1.8.8
Cores Frequency: 80..240 MHz
Flash Frequency: 80 MHz
PSRAM enabled: yes/no
Upload Speed: 460800
Partitioning: Standard
Core DebugLevel: Verbose
Computer OS: Ubuntu18.04-AMD64
Wired Ethernet

Description:

ESPAsyncWebServer hangs any network connection. Even a ping to the ESP stops permanently. Events like disconnect (unplugging the cable) etc. are dead, too. No log output at core debug level VERBOSE. It is more difficult to trigger via a WLAN client (I will try to cause the failure over WLAN next). The probability rises with the time needed to process requests. I have seen the stack halt after minutes to hours, but it can be forced to happen immediately. A higher CPU clock, smaller files and non-simultaneous requests decrease the probability.

What I did:

I isolated the problem to a simple web server using ESPAsyncWebServer and SPIFFS over wired Ethernet. These few lines are enough to make it fail. I started with a browser, but a recursive wget works, too.

I even tried putting the files in RAM. This only decreases the probability. If there is only one (big) transfer at a time, it seems to be stable forever... many small ones, too.

If I access via a WLAN-based client, the probability is much lower than with a 1:1 direct cable connection without a switch etc., but it still fails some of the time. I now doubt it is caused by the wired Ethernet.

A simple HTML page that makes the browser load a few pictures with simultaneous requests is enough: one bigger file of 300 kB and some small pictures. Immediate loss of connection.

No success so far: finding a way to detect the failure from within the system itself in order to trigger a reboot...

Sketch:

removed it here, see next post....

plewka commented 5 years ago

WLAN == Ethernet == Hanging Network, while loop() is fine

I just tried with a WLAN client and the ESP on WLAN, too:

Force the bug to happen immediately:

It is more difficult to force the bug over WLAN, but it is there. Some base traffic plus a few reloads with the browser cache deactivated, and it hangs. I used

watch -n 2 wget <url>

to load a "big" file of 300 kB, with Firefox doing some reloads in parallel. In Firefox (Web Developer -> Network analysis) the requests turn out to be less simultaneous than with wired Ethernet. To be sure, I deactivated the cache inside Web Developer to force the browser to fetch the small files on every reload.

SKETCH:

//#include <Arduino.h>
#include <SPIFFS.h>
#include <ESPAsyncWebServer.h>
#include <ETH.h>
#define WLAN

AsyncWebServer server(80);
void setup()
{

  Serial.begin(230400);   

#ifdef WLAN
  WiFi.begin("***", "***");
  WiFi.mode(WIFI_STA);

  while (WiFi.status() != WL_CONNECTED) {
    delay(1000);
    Serial.println("Connecting to WiFi..");
  }
  Serial.println(WiFi.localIP());

#else
  ETH.begin();
  ETH.config(0xc805a8c0, 0x0105a8c0, 0x00ffffff); // 192.168.5.200 / 192.168.5.1 / 255.255.255.0
#endif 

  if (!SPIFFS.begin(true)) {
    Serial.println("An Error has occurred while mounting SPIFFS");
    return;
  }
  server.serveStatic("/", SPIFFS, "/");
  server.begin();
}
void loop()
{
  delay(1000);
  Serial.println(".");
}

slompf18 commented 5 years ago

I did not get this bug fixed, because nobody was able to tell me how to analyse it. I am now looking into RTOS. It seems to run more stably.

plewka commented 5 years ago

If FreeRTOS is stable doing something equivalent, this sounds like an issue at the interface between AsyncTCP and lwIP. AsyncTCP is full of great features you wouldn't implement on your own, though. But something fully freezes the lwIP stack. I tried putting load on my Sonoffs running Tasmota; they limit traffic/connections quite soon but continue to answer, even if it takes a minute or more.

Maybe it is a good idea to disable one core to prevent SMP? I don't need the speed. Does anybody have access to one of the single-core ESP32s, or know how to manipulate the FreeRTOS options?!

slompf18 commented 5 years ago

When working with RTOS I experienced random panics that led to a reboot. The error looks like the one I experienced when working with Arduino.

The problem had its root in the watchdogs running in the background (described here for example). The default timeout in RTOS is 20 seconds, but that doesn't seem to matter: if the system is idle for 19 seconds and then starts a job that takes 3 seconds, the system seems to reboot. I made this work by calling esp_task_wdt_reset() before certain operations.
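
A minimal sketch of that workaround, assuming the ESP-IDF task watchdog API from `esp_task_wdt.h`. `longRunningJob` and `runJobWithWatchdogFed` are hypothetical names standing in for the "certain operations"; this is not the actual application code.

```cpp
#if defined(ESP_PLATFORM)
#include "esp_task_wdt.h"

// Hypothetical stand-in for an operation that runs for several seconds.
void longRunningJob();

void runJobWithWatchdogFed() {
    // Subscribe the calling task to the task watchdog. esp_task_wdt_reset()
    // only succeeds for subscribed tasks; the return value is ignored here
    // because the task may already be subscribed.
    esp_task_wdt_add(NULL);

    esp_task_wdt_reset();   // restart the timeout window before the slow work
    longRunningJob();
    esp_task_wdt_reset();   // and again once it has finished
}
#endif
```

The guard only keeps the fragment from being compiled outside an ESP-IDF or arduino-esp32 build.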

Maybe we do have the same problem here?

plewka commented 5 years ago

No, this is completely different. The system is basically fine and fully responsive, with no reboot and no (verbose) message! It simply doesn't respond to network requests anymore, including ping. It doesn't even detect a cable disconnect when used over cable (though the problem is not related to the cable). To my limited knowledge, I tend to say the stack hangs completely. I also noticed that a delay(250) in the main loop() strongly increases the probability. The main loop contained only an if that never triggers and the delay. Anything that takes some time anywhere seems to be harmful.

Is Arduino really using both cores?!

allex1978 commented 5 years ago

The same issue on the latest (30/01/19) core and AsyncWebServer. The ESP32 randomly stops answering on the network, including ping... but the CPU works fine; I can see it on the display.

malbrook commented 5 years ago

I have a similar problem using AsyncWebServer on a couple of different ESP32 projects. The web browser on a PC is connected to the ESP32 via a WiFi network and displays HTML pages that use regular JavaScript AJAX calls to the server to update sections of the screen without updating the whole screen. These calls happen every 2 seconds on one system and around every 30 seconds on the second project. On both projects the web browser loses its connection after an indeterminate time, after which the ESP32 cannot be detected on the network using a network scanner, ping etc.

Both projects also have an access point running on the ESP32, and this disappears as well. Using verbose debug I find the messages rx timeout and ack timeout just before the WiFi packs up. I can see the ESP32 is still running, as IO signals still work and change LEDs in response to inputs. I also run a second process on the second core and I can see that this is still running too, even outputting via a serial port, so the ESP32 itself is still operating.

Looking for the source of the messages, I found they are generated in AsyncTCP, and I noticed that when these messages are generated there is a call to _close(), which I suspect is closing down the WiFi. As a test I added a global variable in AsyncTCP that I can monitor from the rest of the process; if the message is triggered, this is used to stop and then restart the WiFi. This seems to have reduced the problems by up to 50%. However, I am now getting system crashes after some of the restarts, mainly relating to heap poisoning, so it clearly needs more work to solve the problem.

The problem can be caused reliably as follows: open a web page that contains configuration information, make a change and click the update button. This makes the page send a POST to the server, which in turn causes a write to the preferences, which are stored in SPIFFS. Then immediately select a new page in the browser, so that the browser requests a new page from the server, which in turn is served from the SPIFFS area. Does this imply that writing to the preferences area of the flash while reading from a different area of the flash causes an issue with the WiFi?

System is coded using Arduino IDE 1.8.9 with ESP32 1.0.2rc1 and SDK V3.3, AsyncWebServer and AsyncTCP are latest versions from github.

BlackBird77 commented 5 years ago

Me too... Arduino 1.8.7 with ESP32 1.0.2 freezes the "Network Stack" completely. CPU is fine! Ping no chance! Same with PlatformIO!

Arduino 1.8.7 with ESP32 1.0.1 is better, but some pings are lost. The latency climbs higher and higher, up to 600 ms, and then falls back to 1 ms.

Here is my minimal code to reproduce it. Take the IP from Serial and ping the ESP!

#include "WiFi.h"

void setup()
{
  Serial.begin(115200);
  Serial.setDebugOutput(true);

  Serial.println("Start Wifi..");
  WiFi.begin("***", "***");

  Serial.println("Started.. Wait for IP...");

  while (WiFi.status() != WL_CONNECTED) {
     delay(500);
     Serial.print(".");
  }  
  Serial.println();
  Serial.print("Got Ip: ");
  Serial.println(WiFi.localIP());
}

void loop()
{

}

BlackBird77 commented 5 years ago

OK, it looks to me like the newest platform with ESP-IDF 3.2.0 has the freeze problem!

The slow pings come from WiFi power saving, which can be turned off. With it off, the pings are okay on ESP-IDF 3.1.3.
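
A minimal sketch of turning power saving off, assuming the `WiFi.setSleep(bool)` call of arduino-esp32. This is one way to do it and not necessarily the exact call used for the test above; the credentials are placeholders as in the sketch earlier in this comment thread.

```cpp
#if defined(ARDUINO_ARCH_ESP32)
#include <WiFi.h>

void setup() {
    Serial.begin(115200);

    WiFi.mode(WIFI_STA);
    WiFi.begin("***", "***");
    WiFi.setSleep(false);   // disable modem sleep; needs station mode to be active

    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println();
    Serial.println(WiFi.localIP());
}

void loop() {
}
#endif
```

Disabling modem sleep trades higher power consumption for lower and steadier latency.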

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This stale issue has been automatically closed. Thank you for your contributions.

DhrubojyotiDey commented 4 years ago

So can someone please give some guidance on how to fix this network freezing issue? It really does pose a problem. And is there any MicroPython code for this?

zekageri commented 3 years ago

Still an issue.

aleuarore commented 3 years ago

Still?

phil31 commented 3 years ago

I can confirm this problem too. No update/news? Thanks.

chegewara commented 3 years ago

I tested with example code from this issue and here is result:

64 bytes from 192.168.0.103: icmp_seq=153 ttl=255 time=94.0 ms
64 bytes from 192.168.0.103: icmp_seq=154 ttl=255 time=15.1 ms
64 bytes from 192.168.0.103: icmp_seq=155 ttl=255 time=38.0 ms
^C
--- 192.168.0.103 ping statistics ---
155 packets transmitted, 155 received, 0% packet loss, time 154227ms
rtt min/avg/max/mdev = 12.645/63.517/133.061/30.096 ms

How many pings do I have to send to confirm the issue, or to confirm it works fine?

EDIT: test performed with a commit that was the tip of v4.2 at the time:

commit beedeea4541116106b38fc5c3a03821cdf6fe288 (HEAD, origin/idf-release/v4.2, idf-release/v4.2)

phil31 commented 3 years ago

You're right, my problem is maybe not the same: ping continues to work, as it does for you, but the web server hangs from time to time for 10-20 seconds, then starts working again!

It's not stopping LAN requests; it stops HTTP requests for 10 or 20 seconds.

chegewara commented 3 years ago

What I mean is that maybe 150 ping requests are not enough to reproduce the issue and I should have waited a bit longer.

@phil31 if your problem is different, please open a new issue with minimal code to reproduce it and information about version/branch etc.

Darktemp commented 3 years ago

Hi, it seems that I have a similar issue. I have tried several approaches to avoid it, but it always ends in a situation where I can still ping the ESP (as mentioned, with increasing latency) but cannot connect to it (TCP/UDP), and it cannot connect to MQTT. Is there any hint as to what I could do to narrow down the cause? Annoyingly, it only happens very randomly after around 5-7 days; the longest it took was 41 days until it froze.

Darktemp commented 3 years ago

If someone finds this via Google, I found two things that should improve the situation: https://github.com/espressif/arduino-esp32/pull/5487 which should be part of the next release (hopefully), and issue #4736. It still takes ~41 days until I know whether these were the last causes of the problem 😆

Darktemp commented 2 years ago

OK, I can confirm that it has not frozen again since 16 August!

szerwi commented 1 year ago

Any updates regarding this issue? Is there any fix planned to be implemented?

VojtechBartoska commented 1 year ago

@szerwi Do you still face this on latest Arduino Core version 2.0.9?

szerwi commented 1 year ago

@VojtechBartoska I do have a similar issue on arduino-esp32 2.0.7. Sometimes my ESP32 loses its WiFi connection (I cannot reach the web server or ping it), but WiFi.status() is probably still returning WL_CONNECTED, as the ESP does not try to reconnect (I have my own mechanism that disconnects and reconnects to the network when it detects that it is not connected). This issue happens only when some client is connected to the web server. Sometimes it also causes the ESP32 to crash.
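
For context, a status-based reconnect mechanism of the kind described above might look like the following sketch. It is an assumption about the general pattern, not the actual code: `shouldReconnect`, `maintainWiFi` and the 10-second interval are hypothetical, while `WiFi.status()` and `WiFi.reconnect()` are arduino-esp32 calls. It also shows why such a check cannot fire in this failure mode: as long as `WiFi.status()` keeps reporting `WL_CONNECTED`, the policy never asks for a reconnect.

```cpp
#include <stdint.h>

// Hypothetical policy: retry only while disconnected, at most once per interval.
// Unsigned subtraction keeps the comparison valid when millis() wraps around.
bool shouldReconnect(bool connected, uint32_t nowMs, uint32_t lastAttemptMs,
                     uint32_t intervalMs) {
    if (connected) {
        return false;
    }
    return (nowMs - lastAttemptMs) >= intervalMs;
}

#if defined(ARDUINO_ARCH_ESP32)
#include <WiFi.h>

static uint32_t lastAttemptMs = 0;

// Call this from loop().
void maintainWiFi() {
    const uint32_t now = millis();
    const bool connected = (WiFi.status() == WL_CONNECTED);
    if (shouldReconnect(connected, now, lastAttemptMs, 10000)) {
        lastAttemptMs = now;
        WiFi.reconnect();   // drops the current association and connects again
    }
}
#endif
```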

I've heard that there are many bugs in ESPAsyncWebServer and AsyncTCP libraries and there are some forks of those libraries that are more stable, but I'm not sure which fork is the best available at this time.

zekageri commented 1 year ago

@szerwi Everyone has a problem with ESPAsyncWebServer. Unfortunately it is buggy.

My best shot was these forks

https://github.com/yubox-node-org/AsyncTCPSock and https://github.com/yubox-node-org/ESPAsyncWebServer

These are really stable for me but they inherit the same buggy design from the original library.

WebSocket clients can get stuck in there, since it is possible that the client does not close the socket cleanly.

Edzelf commented 1 year ago

Check the result of heap_caps_get_largest_free_block ( MALLOC_CAP_8BIT ). It should be well over 20k for a stable WiFi connection at any time.
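
A minimal sketch of such a check, assuming `heap_caps_get_largest_free_block()` from `esp_heap_caps.h`. The 20 KiB threshold is taken from the comment above; `largestBlockIsSufficient` and `logHeapHealth` are hypothetical helper names.

```cpp
#include <cstddef>

// Threshold from the comment above: the largest free block should stay
// well over 20k. Exactly 20 KiB is treated as not sufficient.
constexpr std::size_t kMinLargestBlock = 20 * 1024;

// Hypothetical helper: true if the largest free block exceeds the minimum.
bool largestBlockIsSufficient(std::size_t largestFreeBlock,
                              std::size_t minimum = kMinLargestBlock) {
    return largestFreeBlock > minimum;
}

#if defined(ARDUINO_ARCH_ESP32)
#include <Arduino.h>
#include "esp_heap_caps.h"

// Call periodically, e.g. from loop(), to watch for heap fragmentation.
void logHeapHealth() {
    const size_t largest = heap_caps_get_largest_free_block(MALLOC_CAP_8BIT);
    Serial.printf("largest free block: %u bytes%s\n", (unsigned)largest,
                  largestBlockIsSufficient(largest) ? "" : " (below threshold)");
}
#endif
```

Logging this value over time makes it possible to see whether the largest block shrinks before the network stops responding.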