espruino / Espruino

The Espruino JavaScript interpreter - Official Repo
http://www.espruino.com/
Other
2.77k stars 745 forks source link

Sockets get stuck in drain-wait state on ESP32 in AP mode #2297

Open GaryOtt opened 1 year ago

GaryOtt commented 1 year ago

EDIT: Scroll down to https://github.com/espruino/Espruino/issues/2297#issuecomment-1369643325 for later test code.


A failure to pipe data to a TCP socket can occur on the ESP32 build. I have not tested UDP sockets.

The socket may get stuck in a drain wait condition for tens of seconds, sometimes minutes, sometimes seemingly infinitely. This continues in ESP-IDF 3.3.6 based builds.


Details of my test:

test_code.zip

In this test, the WiFi is in AP mode and Espruino is running a web server (esp32_code.js). A script running on my development machine (test_http.js) sends HTTP requests repeatedly. Requests are sent 5 seconds after the completion of the last. The content of a file loaded from Storage is piped to the HTTP response.

The web server code keeps a check on the progress of requests and logs a timeout after 30 seconds of inactivity. When everything works fine, there is a continuous stream of data piped to the client but as the below video shows, sometimes it just stops for minutes at a time.

Results: https://youtu.be/4FyuYrBg8ck

0.00 Script is already uploading to ESP32. 0.23 Upload completes. 0.29 MacBook is connected to the WiFi AP hosted by the ESP32. 0.30 I start the node script to make repeated HTTP requests. 0.36 First HTTP request is completed and we see 'Finished' logged. 0.41 Second HTTP request made. 0.44 Some very minor pauses seen. 0.46 Second HTTP completes. (third, fourth and fifth requests complete perfectly). 1.20 We get some significant pauses. 1.57 30 second timeout elapses. 3.00 Rest of page delivered. 3.13 More pauses. 3.43 30 second timeout elapses. 3.54 End

MaBecker commented 1 year ago

Thanks, for posting, hope you find some time to fix it .

GaryOtt commented 1 year ago

Yes, I have a few things to attend to first but I plan to be working on this by the end of the week.

GaryOtt commented 1 year ago

This issue is client dependent. I experience it when connecting using a MacBook running Big Sir 11.7.1. I have been told of problems with clients using iOS and to a lesser extent Windows. There are no issues when I use my android phone and there were no issues when I used a now dead Dell laptop running Linux Mint.

The problem occurs when the select() call on line 279 of network_esp32.c returns 0. I'm still trying to understand what might cause that.

https://github.com/espruino/Espruino/blob/039ffb54415713e926f31c46c57ec4f170176422/libs/network/esp32/network_esp32.c#L270-L304

I have changed the #include statements to match ESP-IDF 3.3.6 example code. It compiles and runs but the problem persists unchanged.

Now:

#include <string.h>

#include "lwip/err.h"
#include "lwip/sockets.h"
#include "lwip/sys.h"
#include <lwip/netdb.h>

#include "network_esp32.h"
gfwilliams commented 1 year ago

Thanks for the update! Interesting about the client issue - even though they're all just running Node.js?

At least on desktop browsers, I seem to remember having some issue where they could leave the HTTP connection open in the hope that they can make another request. The server now responds with a Connection:close header.

I wonder whether this could be something similar? I guess once the max number of sockets are open, the ESP32 will probably just wait until the last one times out and closes which might produce this behaviour.

I also note your HTTP server code is quite big, and uses Storage. Personally, I'd try and strip it right down to the absolute bare minimum (just creating and sending the Strings directly), and maybe try and work without even using piping. It may be that at some point during that it starts working and might help to find out what the issue is

GaryOtt commented 1 year ago

The clients mentioned used a variety of web browsers. The choice of browser does not affect the issue. The node script is just something I wrote to give me a consistent test that is transparent about what is going on and easy to use. It makes sure to finish each request before starting another so as to avoid concurrent connections. Also worth noting that sometimes it occurs on the first or second request.

Yes, you're right. I shall simplify the web server and see what I get. Quite honestly, coming back to this after 6 weeks or so, I've forgotten what testing I've already done.

MaBecker commented 1 year ago

Here are the simple tests that had been created by one of the ESP32 developer:

https://github.com/espruino/Espruino/tree/master/targets/esp32/tests

GaryOtt commented 1 year ago

This is a much simplified test that I should have shared in the first place. It is exhibiting the same behavior.

//jshint esversion:6
const wifi = require("Wifi");

wifi.startAP(
    "EspruinoESP32",

    {
        authMode: "open"
    },

    (err) => {
        //This is not for meter level errors. This is a system failure.
        console.log("WiFi AP active. IP address:", wifi.getAPIP().ip);

        server = require("http").createServer((req, res) => {
            console.log("HTTP Request received: ", req.url);

            res.writeHead(200, { "Content-Type": "text/plain" });

            let len    = 33074;   //The length of the file
            let offset = 0;
            let chunk  = 0;

            while (offset < len) {
                chunk = Math.min(len - offset, 256);
                offset += chunk;

                res.write(".".padEnd(chunk, "*"));
            }

            res.end();
        });

        server.listen(80);
        console.log("Listening...");
    }
);
GaryOtt commented 1 year ago

The test at https://github.com/espruino/Espruino/blob/039ffb54415713e926f31c46c57ec4f170176422/targets/esp32/tests/simple_web_server.js appears much less problematic. There might be some occasional slowing of responses but at least I can say it is not so obviously problematic. I am guessing that is because the response is so small that it never requires a second packet.

My tests replicate problems seen delivering a web interface consisting of a HTML file (33K), a CSS file (17K), a Javascript file (64K) and a SVG file (1K).

gfwilliams commented 1 year ago

Yes - I'd expect the code in https://github.com/espruino/Espruino/issues/2297#issuecomment-1334025918 to have issues - because JavaScript is expecting to be non-blocking, and by writing 33k of data in one go you're just blocking execution of everything else until sending all the data for that one socket finishes.

... but the piping in your initial socket example should be better - or actually just handling the drain event directly to send another chunk of data: http://www.espruino.com/Reference#l_httpSRs_drain

GaryOtt commented 1 year ago

Just a quick note on where I got to.

I don't hold high hopes that there is any JavaScript code that can side step the issue because all methods eventually use net_esp32_send() to transmit data. But I was going to try code that sets up a TCP socket server anyway, just to make sure that no problem is introduced by the HTTP layer. I will try directly using the drain event (as suggested) because it might surprise me.

Something interesting I have seen is that the TCP socket server example in ESP-IDF does not use select() so I need to investigate if select() can safely be discarded like that or if it foregoes flow control completely. Where I left off (I've been stuck in meetings all day), I was trying to find documentation about how the send() function handles a drain wait scenarios. I might find it blocks until it can send, it might be undefined behavior, or it might return an error. There is a viable solution if an error response can differentiate between drain-wait conditions and connection loss.

gfwilliams commented 1 year ago

It is possible that select actually blocks... On Linux (IIRC - it was a while back), you have to configure select so that it returns immediately regardless of whether the socket is ready or not.

If it hasn't been configured like that on ESP32 I guess that might somehow be able to cause a deadlock (eg it's waiting to send data out, but it can't because buffers are full with data that needs to be read in).

GaryOtt commented 1 year ago

For the purpose of recreating this issue, I recommend these scripts: Socket Drain Wait Test.zip

I have created many different tests. Each test resulted in more questions and more tests, until I eventually run out of avenues of investigation. My testing was conducted on two ESP32 development boards and on a "Data Logger" (a device including an ESP32 from the project which this problem first arose).

My testing was done in the following stages:

My final attempts to fix the problem were to try to pin the WiFi task to Core 1. This achieved no appreciable difference and I subsequently found a comment on a forum somewhere that the ESP-IDF option that I was trying to use, although selectable is not implemented.

That's where I got to before the missus ordered me to have a break over Christmas. I intend to return to this in the coming weeks.

The code, raw data and notes for all my tests are here: Drain Wait Problem.zip

gfwilliams commented 11 months ago

Just wondering - I know you're using ESP32 as an AP - do you get the same behaviour when it's just connected to an existing WiFi network?

GaryOtt commented 11 months ago

Hi Gordon, I have not tried the ESP32 in station mode. I will see if I can try that today.

GaryOtt commented 11 months ago

The problem is not at all apparent in station mode.

Method:

I had an ESP32 respond to new TCP connections by transferring 40,000 bytes and then closing the connection. The client opened 5 connections initially and another each time a connection was closed.

In station mode, I left the test running for 45 minutes without observing any problems. in the order of 150,000 connections were opened.

In AP mode, timeout errors were observed after about 3 minutes.

Just as I was ready to report my results, it occurred to me that I had changed more than one variable. I have previously seen that the client device can impact the severity of the problem with Apple devices being most affected. So I repeated the station mode test after setting up a MacBook as a hotspot. Station mode again worked perfectly for a full hour.

gfwilliams commented 11 months ago

Ok, thanks - that's really interesting. So I think you did narrow it down pretty well before, but that basically clinches it - that it's something buried inside the ESP-IDF, so hopefully an upgraded version may fix this.

It also explains why more people weren't reporting it being an issue - I think the normal use case for most people is when it's in station mode.

MaBecker commented 2 months ago

@GaryOtt Does this also occur with the latest firmware?

GaryOtt commented 2 months ago

Sorry, I am unable to test this right now. It maybe a few days but I shall respond again as soon as possible.

About a month ago, I found this comment containing a potential solution to this issue while working on another project. This lead me to try the then latest firmware, which was encountering a Guru Meditation Error within a few minutes of starting the affected application. Unfortunately, it was about then that I was assigned to a project with a ridiculous deadline so I have been unable to return to working on it.