Closed proddy closed 1 year ago
To me it looks like it's the MQTT.
EMS-ESP runs perfectly for 8-10 days, then there are 50-80 MQTT errors and the device goes offline (sometimes there was a simple reboot).
But I only see the number of MQTT errors; there is no error message in the syslog (log level = ALL) or in the MQTT broker log.
2023-10-07 15:15:21.439 W 154: [mqtt] MQTT disconnected: TCP
2023-10-07 15:16:30.528 I 155: [mqtt] MQTT connected
2023-10-07 19:30:13.475 E 156: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 88 33 00 1B
2023-10-07 19:55:21.494 E 157: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 88 14 00 1B
2023-10-07 20:32:50.781 W 158: [mqtt] MQTT disconnected: TCP
2023-10-07 21:57:30.615 E 159: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 90 FF 00 19 01 AF
2023-10-07 22:16:53.426 E 160: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 90 FF 00 19 02 1B
2023-10-07 23:54:27.133 E 161: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 88 33 00 1B
2023-10-08 02:13:07.632 E 162: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 90 FF 00 19 01 AF
2023-10-08 02:32:09.663 E 163: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 88 26 00 1B
2023-10-08 06:38:36.920 E 164: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 88 33 00 1B
2023-10-08 07:47:21.055 E 165: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0B 88 16 00 1B
2023-10-08 08:05:34.490 I 166: [mqtt] MQTT connected
2023-10-08 10:06:11.329 I 167: [shower] finished with duration 425005
2023-10-08 10:13:40.145 E 168: [telegram] Last Tx Write operation failed after 3 retries. Ignoring request: 0B 08 1D 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2023-10-08 10:13:42.606 I 169: [shower] finished with duration 425006
2023-10-08 10:17:05.224 I 170: [shower] finished with duration 175240
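The MQTT offline periods in an excerpt like the one above can be measured with a small script. This is only a minimal sketch, assuming the syslog lines keep the layout shown here (timestamp, level letter, sequence number, `[facility]`, message); the helper names `parse_line` and `mqtt_offline_periods` are made up for illustration and are not part of EMS-ESP.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2023-10-07 15:15:21.439 W 154: [mqtt] MQTT disconnected: TCP
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ([A-Z]) (\d+): \[(\w+)\] (.*)$"
)


def parse_line(line):
    """Split one syslog line into (timestamp, level, seq, facility, message).

    Returns None for lines that do not match the assumed layout.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, level, seq, facility, message = match.groups()
    timestamp = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return timestamp, level, int(seq), facility, message


def mqtt_offline_periods(lines):
    """Pair each 'MQTT disconnected' entry with the next 'MQTT connected' entry."""
    periods = []
    down_since = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, _level, _seq, facility, message = parsed
        if facility != "mqtt":
            continue
        if message.startswith("MQTT disconnected") and down_since is None:
            down_since = timestamp
        elif message.startswith("MQTT connected") and down_since is not None:
            periods.append((down_since, timestamp))
            down_since = None
    return periods


# The four [mqtt] lines from the excerpt above.
SAMPLE = [
    "2023-10-07 15:15:21.439 W 154: [mqtt] MQTT disconnected: TCP",
    "2023-10-07 15:16:30.528 I 155: [mqtt] MQTT connected",
    "2023-10-07 20:32:50.781 W 158: [mqtt] MQTT disconnected: TCP",
    "2023-10-08 08:05:34.490 I 166: [mqtt] MQTT connected",
]

for start, end in mqtt_offline_periods(SAMPLE):
    print(start, "->", end, "offline for", end - start)
```

For this excerpt the first gap is about 69 seconds and the second about 11.5 hours, which covers the whole run of failed Tx telegrams in between.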
Syslog level is now set to "ALL", waiting for the next crash. I checked several values from influxDB right before the crash, but could not find anything helpful.
I don't know if my problem is related, my router says it's not connected to ems-esp :thinking:
@proddy Maybe a useful hint: I have two ems-esp devices, both on version 3.6.3-dev.1. The one on LAN runs without problems; the one on WiFi keeps dropping the connection, although the WiFi signal is sufficiently strong (-8dBm).
My ESP has frozen again. It's been staying up for many days since 3.6.2 but then something triggers a free memory drop, it loses contact with the mqtt broker and the web interface goes down. Mesh WiFi with solid connection to nearest AP without specifying BSSID.
Checking homeassistant and AP logging, three other WiFi devices suffered a brief network interruption at 07:25 and it looks like my mesh reconfigured/re-meshed at this time. The ESP freeze appears related to a momentary WiFi/AP drop. Current version is 3.6.0-dev.0.
could be related to https://github.com/emsesp/EMS-ESP32/issues/1324. MQTT goes spinning into a connect-reconnect loop which blocks all other actions since it's no longer asynchronous.
One thing that seems to be common to everyone with this issue is the use of a WiFi mesh (without setting the BSSID). We know from @JokerGermany that the router says ems-esp is disconnected, but we don't know what connection state ems-esp itself reports (LED). Is this also common to all? And what does the ems-esp LED signal? What about setting the BSSID? Are there any reports of it still freezing with the BSSID set? There are only 2 reports that BSSID is NOT used. I think the mesh reconfigures something every few hours that does not cause a disconnect on the ESP but cuts the connections (auto channel change, handover to another mesh AP, ...). I've added some checks to the WiFi code in https://github.com/MichaelDvP/EMS-ESP32/commit/99992db9ac137e80e3396649a1934f83e9d912d7; maybe it helps. The build is here: https://github.com/MichaelDvP/EMS-ESP32/releases/tag/test
@MichaelDvP I installed your Test Build v3.6.3-dev.2c and recognized that wifi scan does not work anymore. I went back to latest dev.
I did some changes within the Mesh setup for testing: I switched off the dynamic bandwidth optimization for the 2.4 GHz frequency band which is used by ems-esp. This resulted in a disconnect of all WiFi devices (approx. 15 connected to 3 ap). All of them reconnected after approx. 5 seconds, just ems-esp not. The LED was off. (BSSID not used yet).
I just noticed that when selecting an AP from the WiFi scan, all network settings are lost (fixed IP etc.). This is not good. I will now test with BSSID set to a specific AP. With BSSID configured, changing the bandwidth optimization within the mesh network now results in a reconnect of ems-esp after 20-25 seconds.
Hello, I would like to report what I have seen here, hoping it may help to find what is happening.
WiFi, no mesh, dedicated AP for the ems-esp gateway (ems-esp ordered last month, software version 3.5.2 preinstalled). It worked and was accessible for 5+ days without any issues while the heating was operating in emergency mode (probably with little EMS traffic). MQTT was not yet configured, but the WebUI was accessed frequently.
After setting up MQTT (and switching off emergency mode), the device lost connectivity multiple times, each time after a few hours at the latest. Possibly also related to web access.
Heating emergency mode had been set by the installer because the temperature sensors were not yet installed.
Upgraded to 3.6.2-dev2 (and later test/dev versions):
The connection is reliable again, but keeping a browser open or logging on to the device often seems to lead, after only a few hours, to:
- extremely long ping times (from 30 ms to >200 ms) and packet loss
- the WebUI getting slower and unresponsive
- loss of the MQTT connection
- telnet login to su and the restart command no longer being possible; the telnet connection breaks before the command can be finished
Disabling / re-enabling the WiFi AP does not solve the issue; a reboot via power cycle is needed.
Currently I avoid keeping a web browser tab open; instead I check values via an MQTT app on my phone. The device has now been up and running for 32 hours.
The MQTT broker is installed at a remote site and reached via a VPN connection. There might be some packet loss to the broker at times. MQTT failures are not reported, however.
Just in case this is relevant...
I had a WiFi mesh before installing EMS-ESP last year and I don't recall my ESP32 ever staying up for more than a week or two. When running v3.4.x, I'd often notice the uptime and counters had reset but since the device was only ever down momentarily, it had no real impact.
Since the recent MQTT memory leak issues were fixed from around v3.6.2, I've been experiencing similar uptimes to the 3.4.x series; however, now my ESP32 freezes or becomes otherwise unresponsive rather than crashing and rebooting as it did running v3.4.x.
What I'm saying is I think one or more mesh Wifi issues may have existed for a long time but used to be hidden by reboots rather than exposed by freezes.
thanks for providing that feedback @philipherbert - when you say you had the WebUI open, was it on a particular screen? Like the Dashboard or the System Log ?
@MichaelDvP I installed your Test Build v3.6.3-dev.2c and recognized that wifi scan does not work anymore.
There was no change in wifi-scan, on my system it works.
I went back to latest dev.
Can you mention the version and dev number?
Hello,
The open page was mostly the Dashboard, with the Boiler entities opened. The System Log was not involved.
I believe that opening the web page from a second device (or keeping it open) caused more issues.
What also happened: when the device rebooted, it reverted to the old version (the one that was causing more issues). Are there two partitions?
--Philip
This is what happened with my installation as well this evening: the Fritzbox mesh did bandwidth optimisation and the official build 3.6.2 was gone. Anything I can do to help?
One thing that seems to be common for all with this issue is the use of a wifi-mesh (and not setting bssid). We know from @JokerGermany that the router says that ems-esp is disconnected, but don't know the connection state ems-esp reports (led).
Pls forget my report, sorry for the inconvenience https://github.com/emsesp/EMS-ESP32/issues/1264#issuecomment-1755899376
This is what happened with my installation as well this evening Fritzbox mesh did bandwith optimisation and official build 3.6.2 was gone.
Just set ems-esp to lower bandwidth (20MHz).
Regarding WiFi: in my case it is a Cisco 9800 + 3 x AP, and it is controller-based WiFi (not mesh). In the logs I can see that ems-esp is connected to one AP.
In the logs I can see that ems-esp is connected to one AP
The problem is, that you can't see the disconnect in syslog if it does not reconnect. Only a serial log would help.
I will dig a bit in the WLC and check what I can see. I just ticked these WiFi settings:
Maybe it will solve the issue
Only a serial log would help.
Is there an easy way to connect a TTL-UART adapter to the S32 gateway from bbqkees? I've got one available and could try to log serial data.
Or could this gateway easily be upgraded with an RJ45 ethernet interface to exclude wifi issues?
It did not help (wifi sleep and lower bandwidth). WLC logs show disconnected. Although it was up for at least 8 hrs. How can I get serial logs?
It did not help (wifi sleep and lower bandwidth). WLC logs show disconnected. Although it was up for at least 8 hrs. How can I get serial logs?
I think with a wired serial connection to the gateway, see my post above.
There is a serial/USB driver on the board, so you can connect via USB. Or use a TTL-UART adapter (3.3V) on the marked rx/tx pins (io03/io01), which are also connected to USB. For ETH you can connect a cheap LAN8720 module like this: mdc-io23, mdio-io18; the other pin can be set in the ems-esp custom profile. ETH is only supported on esp32 chips, not on esp32-s3/esp32-c3/esp32-s2!
BTW: I still cannot reproduce this. My main esp32 (v3.6.2dev2) has been online for 21 days now. In this time the MQTT server was down twice for half an hour, WiFi was switched off for a few hours, and the WiFi channel was changed; it always reconnected. My fritzbox7530 always uses 40MHz; I've tried different settings. It seems I have too few other APs nearby to trigger bandwidth switching, and I don't know how to force it to 20MHz.
PS: for logging it could be useful to compile with the EMSESP-DEBUG option. If you need a binary, I can prepare one.
I recommend switching off the dynamic bandwidth change in the FritzBox so that WiFi stays permanently on 40 MHz. I noticed that in the meantime a lot of new WiFi networks were using the same channel (4 of them my neighbors' cars). I switched off dynamic channel selection some time ago to avoid disconnects. But I believe that a channel change does not cause any problem, whereas the switch from 40 to 20 MHz seems to cause problems, at least in a mesh, since the APs change bandwidth simultaneously but the time needed to come back differs from 3 to 20 seconds. I am now using the BSSID of the local AP. Let's see if this helps.
The S32 has a USB port on the inside of the enclosure. If you connect it to a PC, you will get a COM port available for serial connection. https://bbqkees-electronics.nl/wiki/gateway/firmware-update-and-downgrade.html#gateway-s32
The E32 V1.5 also has an internal USB port. However, its power supply is very simple, so it may need an external power supply. The oldest E32 has a USB to TTL board included. https://bbqkees-electronics.nl/wiki/gateway/firmware-update-and-downgrade.html#gateway-e32
The S3 has an external USB-C port. https://bbqkees-electronics.nl/wiki/gateway/firmware-update-and-downgrade.html#gateway-s3-and-s3-lr
The EMS-ESP version 3.7.0-dev.2 runs for me for 12-16 days, then EMS-ESP becomes unresponsive.
I can't find any error message in the syslog, only the UniFi Network Application reports that the device has logged out. It doesn't reconnect and it only starts when I restart the power adapter.
Could it be due to the tasmota/platform-espressif32, because I also have this effect on other Tasmota devices?
platform = https://github.com/tasmota/platform-espressif32.git
just a quick update from my side. I do not have mesh, I have tasmota devices running at another site - never failed. I have an ai-at-edge for watermeter running at a remote site with exactly the same ap as for ems-esp32. AI at edge has never (!) failed.
I still wonder, why it was working the first days without any problems (before enabling mqtt).
Before the ems gateway disconnects, I see long ping times, a failing web interface, and now in the syslog MQTT disconnects followed by multiple connects.
If any log I can collect helps to identify the issue, please let me know. I wonder if periodic MARK messages to syslog should include memory information and / or other diagnostics.
--Philip
My 3.6.3-dev.1 is down for the second time today.
@zibous
Could it be due to the tasmota/platform-espressif32, because I also have this effect on other Tasmota devices?
No, you are using 3.7.0-dev2, later renamed to 3.6.1-dev0. This is espressif platform 6.3.2/IDF4.4.4/arduino 2.0.09, this was also used for 3.6.1 final. v3.6.2 uses tasmota platform 6.4.0/IDF4.4.5/arduino 2.0.11. Now we use tasmota 6.4.0/IDF4.4.6/arduino 2.0.14.
@maciejka104 Could you try with my testbuild, i have added some reconnect checks, maybe this helps.
@MichaelDvP testing
More feedback: V3.6.2, running without a BBQKees board, already shows occasional loss of service: a continuous ping shows outages. But the AP log does not show a WiFi disconnect!
I'm using a Ruckus R310 without mesh. When I used my old TP-Link Wr1043 instead: no problems. When I set the Ruckus "background scanning" option to OFF: also no problems!
@MichaelDvP There is a huge improvement: it was up for 16 hrs and then disconnected
it was up for 16hrs and then disconnected
Please be more specific, disconnected?
Red LED on, blue LED off.
The WLC shows only "disconnected", with no retry to connect.
Quick update from my side: I have not accessed the web page once; only one MQTT reconnect, mqttfails 0.
Uptime now: 40 hours, ping time still low. I still think the issue is related to mqtt connection failures or webaccess.
WLC shows disconnected, no reconnect attempts. EMS-ESP: red LED on, blue off.
Another feedback: V3.6.2 Using without BBQKees board, already lack of services sometimes: continue ping gives outage.
@soulman-web can you reproduce an outage or a freeze? Like the continuous ping'ing?
Yes I can; with your test build 3.6.3-dev.2f I get the same results. When I turn on the "background scanning" option (which is on by default) on my Ruckus AP, there are freezes. The uptime of the board is correct.
Sat Oct 14 17:41:31 2023 UIT
Sat Oct 14 17:43:19 2023 AAN
Sat Oct 14 17:46:31 2023 UIT
Sat Oct 14 17:46:47 2023 AAN
Sat Oct 14 18:04:51 2023 UIT
Sat Oct 14 18:17:58 2023 AAN
Sat Oct 14 18:35:51 2023 UIT
Sat Oct 14 18:42:51 2023 AAN
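To put numbers on the ping log above, the length of each outage can be computed from the UIT/AAN pairs. This is a rough sketch, assuming that UIT (Dutch for "off") marks the moment the device stopped answering and AAN ("on") the moment it answered again; the function names are invented for this example.

```python
from datetime import datetime

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

LOG = (
    "Sat Oct 14 17:41:31 2023 UIT Sat Oct 14 17:43:19 2023 AAN "
    "Sat Oct 14 17:46:31 2023 UIT Sat Oct 14 17:46:47 2023 AAN "
    "Sat Oct 14 18:04:51 2023 UIT Sat Oct 14 18:17:58 2023 AAN "
    "Sat Oct 14 18:35:51 2023 UIT Sat Oct 14 18:42:51 2023 AAN"
)


def parse_stamp(fields):
    """Convert ['Sat', 'Oct', '14', '17:41:31', '2023'] into a datetime."""
    _weekday, month, day, clock, year = fields
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def outages(log):
    """Return (start, seconds) for every UIT entry followed by an AAN entry."""
    tokens = log.split()
    # Every entry consists of six tokens: weekday, month, day, time, year, state.
    entries = [tokens[i:i + 6] for i in range(0, len(tokens), 6)]
    result = []
    down_at = None
    for *stamp, state in entries:
        when = parse_stamp(stamp)
        if state == "UIT":
            down_at = when
        elif state == "AAN" and down_at is not None:
            result.append((down_at, (when - down_at).total_seconds()))
            down_at = None
    return result


for start, seconds in outages(LOG):
    print(start, "down for", int(seconds), "s")
```

Under that assumption the four outages last 108 s, 16 s, 787 s and 420 s, so the longest is a little over 13 minutes.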
@soulman-web can you disable MQTT on the EMS-ESP and see if it still freezes when background scanning is on? That will help us pin point the root cause
Same problem. I already thought so, because when I started I did not use MQTT, and there were already outages then.
As far as I understand, background scanning on the access point will always cause packet loss in communication with clients. The access points I mostly work with leave the current channel and scan all channels for clients and/or to analyse which channels are less used. For this reason, I have never worked with background scanning enabled.
Regarding ems-esp: I have not accessed the web interface since the reboot, mqttconnects is still 2, and it has now been up for 2 days and 19 hours. So from my point of view, it must be related to communication errors with the MQTT broker or to accessing the web interface. I do not use the API.
The MQTT broker is reached via a site-to-site VPN over Vodafone cable internet, which has become more and more unreliable in Germany. So there might have been some issues communicating with the broker.
When it was failing: why is there more than one MQTT connect message in the syslog following an MQTT disconnect?
@philipherbert I do not agree about packet loss. Indeed some packets could be lost, but not for 5 or 10 minutes. I am now monitoring an esp8266 and another esp32 s2 (both with esp-easy).
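One way to separate occasional packet loss from a real outage when monitoring several devices side by side is to report a device as down only after a number of consecutive failed probes. The following is a minimal sketch, not something used in this thread: `OutageTracker`, `tcp_probe` and `monitor` are hypothetical names, and probing TCP port 80 (the web interface) instead of ICMP ping is an assumption made so that the script needs no special privileges.

```python
import socket
import time


def tcp_probe(host, port=80, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OutageTracker:
    """Report DOWN only after `down_after` consecutive failed probes."""

    def __init__(self, down_after=5):
        self.down_after = down_after
        self.failures = 0
        self.is_down = False
        self.events = []  # list of (timestamp, "DOWN" or "UP")

    def record(self, ok, now):
        if ok:
            self.failures = 0
            if self.is_down:
                self.is_down = False
                self.events.append((now, "UP"))
        else:
            self.failures += 1
            if not self.is_down and self.failures >= self.down_after:
                self.is_down = True
                self.events.append((now, "DOWN"))


def monitor(hosts, interval=2.0, probe=tcp_probe):
    """Probe all hosts forever and print each DOWN/UP transition."""
    trackers = {host: OutageTracker() for host in hosts}
    while True:
        for host, tracker in trackers.items():
            seen = len(tracker.events)
            tracker.record(probe(host), time.time())
            for stamp, state in tracker.events[seen:]:
                local = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
                print(local, host, state)
        time.sleep(interval)


# Example call (host names are placeholders):
# monitor(["ems-esp.local", "esp-easy-1.local"])
```

With the default settings a device has to miss five probes in a row, roughly ten seconds or more, before it is logged as DOWN, so a single lost packet during an AP background scan does not show up as an outage.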
@proddy is it possible to show "connected since" in network status ?
It's very difficult to guess what is similar across all these different setups. I've updated my test build with 2 changes:
For everyone with this issue: please always include in your messages:
Your latest build dev_2i shows up as dev_2f; did you forget the title? Unfortunately the same result: after turning on "background scanning" there is a loss of service within a few minutes. It went away for 13 minutes. The other ESPs don't complain.
Quick report. My esp32 hasn't come back after an OTA update of your test build. I get solid red led and the usual triple flash blue led (twice) and then a very brief solid blue (0.5s) and then it stays on double flashing blue led. Power cycle no help. No network presence.
I'll have to pull it out later and manually reflash back to working build.
Sorry to ask in this thread, but 1) I cannot flash the esp32 s2; it gives Permission error 13. 2) I did flash an esp32 s3 lolin; the board shows pin 38 is the onboard RGB LED, but the LED doesn't do anything.
@soulman-web
your latest build dev_2i show up as dev_2f , did you forget title ?
No, the title is set automatically. I think this is the old version and the update has failed; try again.
@gwilford
No network presence.
Double flash means no network, the AP should be active.
...after two ISP connection issues this afternoon, freemem has now gone down from 190 to 147, and max_alloc has increased. Ping time is still normal. Syslog reports two MQTT disconnects followed by one MQTT connect; mqttconnects has increased by 2 (now 4), mqttfails is 0.
Very weird: updated 3 times, web interface still shows 3.6.3-dev.2f
console too?
... after 12 days :( Screenshot from yesterday:
My router says the esp is still connected to wifi, although it does not respond to ping. Interface is S32 from bbqkees. LED is on, not blinking. Wifi connection is good (green). Same IP is assigned every time by Fritzbox DHCP, not static.
I will see if I can get some additional information from the mqtt telegrams written to database and external syslog tomorrow.
Originally posted by @Th0maz in https://github.com/emsesp/EMS-ESP32/issues/1264#issuecomment-1751789259