fredlcore / BSB-LAN

LAN/WiFi interface for Boiler-System-Bus (BSB) and Local Process Bus (LPB) and Punkt-zu-Punkt Schnittstelle (PPS) with a Siemens® controller used by Elco®, Brötje® and similar heating systems
220 stars 83 forks source link

[BUG] v3.1.17 (3.2?) causes many timeouts for web-requests/url-commands #586

Closed schlaubstar closed 1 year ago

schlaubstar commented 1 year ago

BSB-LAN Version 3.1.17-20230520012002

Architecture ESP32

Bus system BSB

Describe the bug When upgrading from v.3.1.15 to 3.1.17 (3.2 according to comments in Ino-file) there is suddenly a massive timeout issue, when using URL-commands to set parameters.

I use a RaspberryPI to query the real temperatures and emulate a room-unit, thus setting parameter 10000 every 3 minutes with the current temperature, using a python script. As I forgot to define a connection timeout my PI crashed due to many open connections, eating up available memory. Looking at the problem, I found out that BSB-Lan on my ESP32 is not replying to many of the URL-requests. I can not find out why but approx. 10% of the Requests do never get a response.

To Reproduce Steps to reproduce the behavior:

Use e.g. this small python script with v3.15 and 3.17

#!/usr/bin/python   
# -*- coding: utf-8 -*-
import requests, time               

def test():
    requrl = "http://192.168.xxx.xx/I10000=21.09"
    jsonResp = requests.post(requrl, auth=('user', 'pass'), timeout=15.00)
    print("Response.status_code", jsonResp.status_code)

if __name__ == '__main__':
    for x in range(100):
        test()
        time.sleep(3)

Logs with 3.15 you will get 100times: Response.status_code 200

with 3.17 you will get approx each 10th call:

Traceback (most recent call last): File "/home/pi/.local/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher yield File "/home/pi/.local/lib/python3.7/site-packages/urllib3/response.py", line 566, in read data = self._fp_read(amt) if not fp_closed else b"" File "/home/pi/.local/lib/python3.7/site-packages/urllib3/response.py", line 532, in _fp_read return self._fp.read(amt) if amt is not None else self._fp.read() File "/usr/lib/python3.7/http/client.py", line 461, in read n = self.readinto(b) File "/usr/lib/python3.7/http/client.py", line 505, in readinto n = self.fp.readinto(b) File "/usr/lib/python3.7/socket.py", line 589, in readinto return self._sock.recv_into(b) socket.timeout: timed out

Expected behavior The reliability of using url-commands should remain the same for version 3.15 and 3.17

Screenshots

Additional context

fredlcore commented 1 year ago

Thanks for this detailed bug report, this really helps. I can reproduce this problem to some extent, but after more than 20 or so calls with a 1 second delay inbetween. Could you do two things please: Test this with release version 3.1 which was published a few days ago under "Releases" and also have a serial monitor running while you are doing this test. I would like to see what messages appear between the last successful call and the moment it fails. Thanks!

fredlcore commented 1 year ago

I have somewhat narrowed down the problem, but I can't explain it: When doing the testing mentioned before, could you change the client.writeline in print2webclient.h's flushToWebClient function to this:

printlnToDebug(PSTR("1"));
    client.write(bigBuff, bigBuffPos);
printlnToDebug(PSTR("2"));

It seems to be that sometimes there is an (unexplicable to me) long delay between the output of 1 and 2, apparently resulting in the timeout you mention. But since this is a general library call, I don't see where I can change that behaviour...

schlaubstar commented 1 year ago

I was just testing so feedback to your initial question:

noMonitoring_BSBCap.txt

I will modify the lines proposed and will check the results

fredlcore commented 1 year ago

Also, are you using WiFi or Ethernet? If you have a possibility to switch (temporarily) to Ethernet, this might also be a helpful indication. However, if you say that it did work rock-solid with a previous version, the only solution I can see right now is to go back in steps and install the software based on each previous pull request and see where the error takes effect. I'm willing to do all the changes necessary to fix this, but since I'm not on site right now, I have no means to do this process.

schlaubstar commented 1 year ago

Maybe your "output 1/2 fix" already gives some indications. Although I did not dig into the code to deep but to me it looks as if the corresponding function sometimes gets called subsequently and "to often"

newlog.txt

The ESP32 has no wired connectivity, so I have to rely on Wifi.

fredlcore commented 1 year ago

No, that the frequent calls are correct because the output is often times too large to fit in the buffer. You can see the contents to be sent if you change the code above to:

printlnToDebug(PSTR("1"));
Serial.write(bigBuff, bigBuffPos);
    client.write(bigBuff, bigBuffPos);
printlnToDebug(PSTR("\r\n2"));

I was a bit suspicious that it could be WiFi-related, but if you say that it worked rock-solid with 3.1.15, then that cannot be it...

schlaubstar commented 1 year ago

It is, for sure, not Wifi related - because I just need to go back to : Version 3.1.15-20230402082920 via OTA without touching anything, and I get 100% reply-rate.

I found out how to retrieve the code for a specific commit, so I am slowly testing the older versions to identify from which to which the issue occurs.

fredlcore commented 1 year ago

You go to this page: https://github.com/fredlcore/BSB-LAN/commits/master And then go to the commit you want to test and click on the <>. Then the repository opens with the code at that time and you can download it as usual via the green download button.

But before you do this you might want to try this: I erased the EEPROM using /NE and then flashed anew using the default BSB_LAN_config.h just modified for the networking parameters. Now I don't have any delays or timeouts anymore. If that is the same with you, it might be easier to isolate the causing factor by re-activating the configuration options from before and see where the problem starts to occur. Of course take note of all your web-configuration settings before erasing the EEPROM :)!

fredlcore commented 1 year ago

I also realized that there is still a bug in the preferred BSSID functionality. If the BSSID parameter is empty, it will lead to a reconnect every 60 seconds. I'll fix this later, but you may want to look out for the serial monitor and see if that is a problem with you, too.

fredlcore commented 1 year ago

And have you disabled Energy Saving for the ESP32 in the web configuration? I just realized that I had it turned on and that also resulted in bad connection.

fredlcore commented 1 year ago

Enery saving seems to be the root cause of this. Try this first before anything else.

schlaubstar commented 1 year ago

ESP32 PowerSave was and is constantly de-activated but I am already on May 16th ;-)

fredlcore commented 1 year ago

Ok, too bad :(. But great that you're making such fast progress :)...

schlaubstar commented 1 year ago

Some insights but no 100% explanation, yet.

schlaubstar commented 1 year ago

Final insight: The problem is caused with the introduction of the ability to lock WiFi to a certain BSSID if you do not really use it and provide the MAC address. As you observed this causes frequent "disconnects" where the ESP32 simply seems to loose connectivity for a short period of time, leaving existing requests "unhandled".

After compiling the latest commit (https://github.com/fredlcore/BSB-LAN/commit/960c326ed723896c0c1841aba5d5332118f3b746) but providing the MAC Address of my 2.4GHz Wifi ... I got 1000 requests with 1 sec. delay without any timeout or loss of connectivity and the Debug output also does not show the Wifi Scans anymore.

From my view you can close this issue, but I think it would make sense to reasonably deal with a "default BSSID"

Thanks for your support to track this issue down. 👍

fredlcore commented 1 year ago

Ok, thanks, the default BSSID issue is fixed in the most recent master repo.