Interesting!
I will cherry-pick your changes to the fork we maintain at https://github.com/mathieucarbou/ESPAsyncWebServer
Is the PR final? Would you mind squashing all the commits and adding a meaningful commit title and description?
Thanks!
@mathieucarbou
Edit: let me address the two self-comments and the code `// todo:` first; I'll just take care of that now.
@mathieucarbou Ready for cherry-pick.
@nkinnan: before applying your fix, I ran some tests on the fork we actively maintain (based on https://github.com/yubox-node-org/ESPAsyncWebServer). It has received a lot of bug fixes that the ESPHome fork does not have, including some in the AsyncTCP lib, which we also maintain.
Running our SSE example with a delta of 1 ms (one event sent per millisecond), Chrome receives only about 1 message out of every 2-4.
I get some console log errors such as:
[ 25588][E][AsyncEventSource.cpp:197] _queueMessage(): Too many messages queued: deleting message
because, of course, the loop is filling the queue faster than it can be drained.
Setting the delta to 5 ms, I do not see any messages discarded.
Which means that in AP mode, with Chrome connected to a bare ESP32, I can send one (short) message every 5 ms without losing any.
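(For reference, a minimal sketch of this kind of throughput test, assuming the stock ESPAsyncWebServer API; WiFi/AP setup is omitted and this is not the exact example used here:)

```cpp
// Hypothetical reconstruction of the SSE throughput test described above.
// Assumes the standard ESPAsyncWebServer API; the fork's actual example differs.
#include <ESPAsyncWebServer.h>

static AsyncWebServer server(80);
static AsyncEventSource events("/events");

static const uint32_t DELTA_MS = 5;  // 1 ms drops messages, 5 ms does not (per the numbers above)
static uint32_t lastSend = 0;
static uint32_t counter = 0;

void setup() {
  // WiFi/AP setup omitted for brevity
  server.addHandler(&events);
  server.begin();
}

void loop() {
  if (millis() - lastSend >= DELTA_MS) {
    lastSend = millis();
    // One short event per DELTA_MS; the browser counts how many arrive.
    events.send(String(counter++).c_str(), "counter", millis());
  }
}
```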
Is it possible that the delay you observed is caused by the communication time in STA mode?
I've re-created your PR in our fork (https://github.com/mathieucarbou/ESPAsyncWebServer/pull/92). I definitely see an improvement too: in AP mode I can now go down to a rate of one message every 3 ms without any message loss.
@mathieucarbou
1) I was using an 8266 for testing ("worst case scenario") rather than an ESP32.
2) If you're running on Linux you probably have TCP_QUICKACK set, which reduces the time for the OS to respond to TCP packets with an ACK to zero (on Windows it's 50 ms). That really helps close the loop time. I just put in a fix to the API to set this flag on the socket if supported, and then added support for it on Windows to the Python runtime (man, those guys are PICKY with their PRs haha). This alone will cause another 2-3x-ish improvement, and if you're on Linux it may already be in effect, since many distros set TCP_QUICKACK OS-wide (see the sketch after this list).
3) I'm not sure what your firmware build looks like, but ESPHome is limited to 60 loop()s per second by design; it sleeps (in yield) the rest of the time to save power. The network is serviced in yield(), so at most 60 send buffers per second on ESPHome (on a single-core processor like the 8266). If your build doesn't do the same, that will affect things.
These factors in combination probably explain why you saw better performance than I did without this PR, but then the PR still did improve things further.
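For the curious, enabling TCP_QUICKACK on the receiving side is roughly a one-liner on Linux. This is just a sketch, not the actual patch mentioned above; note the flag is Linux-specific and not sticky on all kernels:

```cpp
// Sketch: enabling TCP_QUICKACK on the receiving (client OS) side, Linux only.
// The flag is not permanent: kernels may clear it, so it is typically
// re-applied after each recv() when immediate ACKs are required.
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

bool enable_quickack(int fd) {
  int one = 1;
  return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) == 0;
}
```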
I will definitely take a look at your forks and see if there's anything we want to pull over here, thanks for the heads up! :)
Edit: 4) The TCP library used on ESP32-class chips is more advanced and capable; see the update to the PR description for details.
Point 3 is why my source branch is named async_event_source_yield, by the way: I originally thought that was the problem and added yield()s all over the place to see if they fixed the throughput issue :) Eventually I found the true root cause, but never switched to a new branch name.
Bugfix: AsyncEventSource writes multiple events per TCP send, including partial events that straddle buffers
On line 148 in the diff, client->send() was being called before additional events could be written into the TCP client send buffer. Once a send is initiated, client->canSend() will return false, preventing more events from being written (edit: on the 8266; see the update section for ESP32 behavior). This resulted in only a single event being sent per packet.
Because an ACK must be received before canSend() becomes true again (the send buffer must be held onto in case a TCP retry is needed), this resulted in only one event every 70-80 ms, as observed with Wireshark: the ESP-01 takes ~20 ms to send another packet after the ACK is received, and Windows takes 40-50 ms to send an ACK back after receiving a packet, due to the receive-side equivalent of the Nagle algorithm (delayed ACK).
The call to _client->send() is now on line 267, after as many events as will fit have been written into the client's TCP send buffer (including partial events, to fully maximize throughput). Based on the log output I observed during testing, this improves event throughput per ACK by as much as 10-15x in some cases, but probably around 5x in general (it varies with event size). (Edit: on the 8266; see the update section for ESP32 behavior.)
I call this a bugfix since the code was clearly trying to add multiple events per send; that appears to be the original author's intent. It was just failing to do so because client->send() was called after the first event was written, which locked the TCP send buffer against further writes. (Edit: on the 8266; see the update section for ESP32 behavior.)
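In other words, the fix moves to a pattern like this. This is a simplified sketch using the AsyncTCP client API, not the actual diff; the queue type is an illustrative placeholder, and partial-event writes are omitted here (see the sketch in the next section for those):

```cpp
#include <AsyncTCP.h>  // ESPAsyncTCP.h on the 8266
#include <list>

// Simplified sketch of the batching described above; not the actual diff.
// AsyncClient::canSend(), space(), add() and send() are real AsyncTCP calls;
// the std::list<String> queue is a stand-in for the real message queue.
void runQueueSketch(AsyncClient *client, std::list<String> &queue) {
  if (!client->canSend())
    return;  // a send is already in flight; wait for the ACK (on the 8266)

  size_t written = 0;
  while (!queue.empty() && client->space() >= queue.front().length()) {
    const String &msg = queue.front();
    written += client->add(msg.c_str(), msg.length());
    queue.pop_front();
  }
  if (written)
    client->send();  // one send() for the whole batch, not one per event
}
```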
Improvement: don't hold onto event items until ACK; immediately remove them from the queue once (fully) written into the TCP send buffer
This code was holding onto queued events until an ACK came back, even after they had been written into the TCP send buffer. There is no need to do so: the TCP send buffer is retained by the client for retries (which is why things were stalling out in the first place), so this just duplicates the data in memory needlessly. Removing events from the queue immediately after writing them into the client's TCP send buffer also frees those queue slots for new incoming events, potentially reducing the number of dropped events during event bursts, or just lowering memory usage otherwise.
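A sketch of the bookkeeping this implies (illustrative types and names, not the PR's actual code): each queued event tracks how many bytes have been handed to the TCP layer, and is popped the moment the last byte is buffered rather than on ACK:

```cpp
#include <AsyncTCP.h>  // ESPAsyncTCP.h on the 8266
#include <algorithm>
#include <list>

// Illustrative sketch only: QueuedEvent is a placeholder, not the PR's
// actual message class.
struct QueuedEvent {
  String data;
  size_t sent = 0;  // bytes already handed to the TCP send buffer
};

void writeQueueSketch(AsyncClient *client, std::list<QueuedEvent> &queue) {
  if (!client->canSend())
    return;

  size_t written = 0;
  while (!queue.empty() && client->space() > 0) {
    QueuedEvent &ev = queue.front();
    size_t chunk = std::min((size_t)client->space(),
                            (size_t)ev.data.length() - ev.sent);
    size_t added = client->add(ev.data.c_str() + ev.sent, chunk);
    if (!added)
      break;  // TCP layer took nothing; try again on the next pass
    ev.sent += added;
    written += added;
    if (ev.sent < ev.data.length())
      break;  // buffer full mid-event; resume from ev.sent next time
    queue.pop_front();  // fully buffered: free the slot now, not on ACK
  }
  if (written)
    client->send();
}
```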
Practical effects for ESPHome:
1) Logs will show up much more quickly in the web_server dashboard.
2) Fewer logs and events will be dropped en route to the dashboard in general. This means you are less likely to get the "funky dashboard" effect, where bits are randomly missing or not all of your sensors show up; the initial dashboard load is particularly event-bursty, as all sensors are enumerated and written so that the JavaScript can build the UI.
3) Sensor updates are less likely to be dropped entirely, leaving parts of the dash UX static, if you have a lot of sensors publishing or a lot of log lines filling up the event queue.
And just in general:
1) It's now possible to send a single event larger than the TCP send buffer, since parts of an event can be written iteratively until complete. Not sure why you'd want to do such a thing, but such an event will no longer stall the event source completely and unrecoverably.
2) Lower CPU usage due to batching (fewer packets generated).
3) Potentially lower memory consumption due to earlier queue-item discard.
Update for ESP32 chips:
I'm now looking at an ESP32-C3 in Wireshark and can see that the different TCP library used on ESP32s vs ESP8266s does allow the AsyncEventSource to write multiple events into the TCP send buffer even after a send has been initiated. This more advanced TCP stack will simply send additional packets out, processing the send buffer piecewise as it is written. The act of sending packets incrementally as each event is written will actually reach the receiving side's threshold for sending an ACK (or does on Windows, anyway), even with delayed ACK in effect. This additional capability, and its coincidental effect on ACK frequency, is probably why no one noticed this undesirable behavior earlier. (Plus, performance issues on an 8266 are likely to be blamed on the slower core and dismissed.)
The change in this PR is still an improvement on the ESP32 stack: there is still an increase in event throughput, more efficient use of the full length of the send buffer, less CPU usage due to the batching, and less memory usage due to discarding queue items early. But the improvement will be less noticeable, since the improved behavior and capabilities of the newer TCP stack already mitigated the implementation flaws in AsyncEventSource to a large degree, leaving truly degenerate behavior only on the 8266-class chips.
Additionally, on the ESP32, since the send buffer is larger than the MTU of ~1400 bytes, under high event load two packets will still be sent, thus still bypassing delayed ACK. Even under lower event load, additional events will still generate additional packets (on subsequent loop()s), thanks to the upgraded TCP library's ability to process the send buffer incrementally, also bypassing delayed ACK. So this PR won't cause any regression in that regard due to the "batching effect".
I haven't dug into the TCP libraries themselves to see whether this capability to incrementally process the send buffer and send additional packets as new writes happen could be added to the older stack used on the 8266. The improvement in this PR is probably "good enough" as a workaround for the limitation, the 8266 isn't recommended for new designs at this point, and going that deep into the network stack carries a higher risk of regression, while this change is relatively straightforward.