ems-esp is rebooting twice a day

tp1de commented 3 years ago

@proddy Hi Paul, I noticed that my ESP32 gateway reboots roughly every 11-12 hours. I can't see any reason why. Free memory is not the problem. Could polling each device via the API with HTTP GET requests every 15 secs trigger a reboot after a while?

It's not a major issue for me. I just miss some data for approx 90 secs because of my 60 tx-delay on startup. Could I help to resolve this problem?

proddy commented 3 years ago

I had quite some discussion with Bosch

@tp1de I'm sure you're aware but please try and keep EMS-ESP low profile when you talk to Bosch. Although what we're doing is not illegal, it is a bit naughty and I don't want to get shut down by any copyright infringement

tp1de commented 3 years ago

I had quite some discussion with Bosch

@tp1de I'm sure you're aware but please try and keep EMS-ESP low profile when you talk to Bosch. Although what we're doing is not illegal, it is a bit naughty and I don't want to get shut down by any copyright infringement

Don't worry, I had this data already from KM200 and DS18B20 sensors where I polled 10 values every 15 secs. EMS-ESP was not mentioned at all.

I just tried to understand and discuss whether Bosch or an installer could adjust the PID control parameters on my Boiler/Master Controller. Even though it is more than 40 years since I studied electrical engineering / control systems, I understand that the control parameters are not ideal and that my boiler is too large for my two houses, even though I bought the boiler with 25% lower nominal power than calculated by an energy consultant. But so far there has been no good response from Bosch, and my installer has no clue!

BTW: I got a hint about thermal problems with the ESP32 inside the BBQKees housing. I asked in the ioBroker forum, in the thread for my adapter, whether others have experienced reboots and got one reply so far (stable, without reboots). I have seen that the ESP32 has an internal temperature sensor which could be read, with the data sent by API / MQTT.

What do you think: Could it be thermal problems (overheating) as well which causes the reboots? Or could it be my board - AZDelivery ESP32 D1 Mini NodeMCU - which is not stable?

tp1de commented 3 years ago

I changed the MQTT frequency to 60 secs but I had 3 reboots last night. Now I have switched MQTT off.

@proddy I have seen that you updated the firmware and added uptime_sec. Thanks. BUT you changed names as well and removed the "#"; the exceptions are #dallas .... Within my ioBroker adapter I take these names 1:1 for creating ioBroker states and SQL history, so now I have them twice and the history is under the old names. Please do not change names too often, since this needs manual intervention. I would advise changing the dallas info as well and removing the #.

tp1de commented 3 years ago

@proddy Any chance to read internal temperature sensor and publish values? https://circuits4you.com/2019/01/01/esp32-internal-temperature-sensor-example/

MichaelDvP commented 3 years ago

The internal temperature sensor was only present in the first silicon; later ESP32 chips do not have one. Check the old datasheet here and the current datasheet here. The internal sensor was not calibrated and was only suitable for measuring increases and decreases, not absolute temperatures, as described in the old datasheet. It would be better to connect a Dallas sensor and put it into the case.

BTW: my system is running now for 14h without reboots. IoBroker-Api enabled and mqtt with 5 sec for all.

tp1de commented 3 years ago

@MichaelDvP thanks for feedback. Are you using BBQKees housing? Which ESP32 board?

Somehow I feel that it might be a problem with my board. I will open the housing for better air ventilation and see if this makes a difference. I have now opened the case. The ESP32 does not feel hot at all to the touch.

MichaelDvP commented 3 years ago

Are you using BBQKees housing? Which ESP32 board?

Currently a BBQKees S32 Gateway prototype in the closed original housing, but I sometimes switch to a Gateway v1.5 with an MH-ET D1 mini32, also in the closed housing with a printed side panel. But I don't think it's temperature: the ESP32 modules are specified for a -40 to 85 °C environment at full power. Unless you mount it below the boiler insulation, these temperatures are never reached.

tp1de commented 3 years ago

Actually BBQKees S32 Gateway prototype in closed original housing, But i switch sometimes to Gateway v1.5 with MH-ET D1 mini32, also in the closed housing with printed side-panel. But i don't think it's temperature, the esp32 modules are specified for -40 - 85 °C environment at full power, If you don't mount it below the boiler-insulation, these temperatures are never reached.

Since the ESP32 was only lukewarm when I opened the housing, I think you are right. I started with gateway board v1.6 and used it for some months, and for a few weeks I have been using v1.7 with the updated EMS screw terminal.

Rebooting could be caused by my ESP32 - AZDelivery ESP32 D1 Mini NodeMCU - which is not stable or maybe the v1.7 board. Since I do not have a second ESP32 I will test the v1.6 gateway board in the meantime.

tp1de commented 3 years ago

v1.6 gateway board rebooted after approx. 9 1/2 hours. Last try is with jack-powered v1.6 gw-board.

tp1de commented 3 years ago

Before going to bed late I noticed that EMS-ESP had lost the network connection (without a reboot). The WLAN reconnection took 3 minutes! During a reboot it takes just 2-3 seconds! Any idea what happened?

[screenshot]

proddy commented 3 years ago

As I keep saying turn off NTP, MQTT and any other unnecessary services. If WiFi is reconnecting it smells like a power issue.

MichaelDvP commented 3 years ago

I'm stopping my test with the high data rate now. It has run 37 hours without a reboot: 55000 API calls, 203000 MQTT publishes, 163000 telegrams received (5 CRC fails), 41000 telegrams sent, 95000 Dallas reads. It is not the software, not the API, and not MQTT that causes the reboots. @tp1de you should check with stable power and, if this does not help, change the ESP32 module. Maybe it has ESD damage or bad soldering.

tp1de commented 3 years ago

As I keep saying turn off NTP, MQTT and any other unnecessary services. If WiFi is reconnecting it smells like a power issue.

I had everything off and even reduced the API polling rate to 60 secs and still had reboots. This time, powered via the service jack, I do not see reboots anymore, but I had two WiFi connection losses last night. I do not think it's the NTP or MQTT services. I have another idea about what could cause the connection losses:

I have a Mesh WiFi network consisting of AVM Fritzboxes and WLAN repeaters. An FB 6591 is connected to cable internet in my office on the 1st floor and is the Mesh Master. A second Fritzbox 7590 is connected by LAN cable, located in the cellar, and acts as a Mesh Repeater. The FB 7590 is very close to my heating room and the EMS-ESP (< 1 m), with a signal strength of 95%. Another AVM repeater is within the Mesh and connected by WiFi, far away.

I already had problems in the past with EMS-ESP connecting to the Mesh (it connected to the repeater with poor signal strength), but other users seem to have an issue with Mesh as well - see #762.

My WiFi network auto-selects channels and reduces bandwidth dynamically when it detects too many other WiFi networks on the same channel. I have 21 WiFi networks around my house. In the log of my main router I found an entry showing that, shortly before the EMS-ESP WiFi connection was lost, exactly this happened and the WLAN bandwidth was reduced from 40 MHz (EMS-ESP: n/Wi-Fi 4, 40 MHz, WPA2, 1 x 1) to 20 MHz. Then the connection was lost and it took 3 minutes to reconnect (no reboot so far). Other WiFi-connected devices had no issue. I also see that one of my Raspberry Pis connected to the same access point always uses 20 MHz bandwidth and not 40 MHz like EMS-ESP.

I have now switched off this dynamic bandwidth adjustment feature and will see what happens. Uptime is now 13.5 hours.

BTW: It would be good to see local time rather than uptime in the system log.

tp1de commented 3 years ago

I have now switched off this dynamic bandwidth adjustment feature

and after 6:20 hours I had another WiFi connection loss. Reconnection took 16 seconds. Since I connected to the service jack, I only see the WiFi reconnection reported in the system log and my ioBroker adapter misses one HTTP GET, but I have had no reboots for 17:30 hours.

I have more than 10 WLAN devices connected to this access point. As far as I understand it, the document https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/wifi.html#esp32-wi-fi-configuration recommends using HT20 as the standard in this case:

ESP32 supports Wi-Fi bandwidth HT20 or HT40, it doesn’t support HT20/40 coexist. esp_wifi_set_bandwidth() can be used to change the default bandwidth of station or AP. The default bandwidth for ESP32 station and AP is HT40.

In station mode, the actual bandwidth is firstly negotiated during the Wi-Fi connection. It is HT40 only if both the station and the connected AP support HT40, otherwise it’s HT20. If the bandwidth of connected AP is changes, the actual bandwidth is negotiated again without Wi-Fi disconnecting.

Theoretically the HT40 can gain better throughput because the maximum raw physicial (PHY) data rate for HT40 is 150Mbps while it’s 72Mbps for HT20. However, if the device is used in some special environment, e.g. there are too many other Wi-Fi devices around the ESP32 device, the performance of HT40 may be degraded. So if the applications need to support same or similar scenarios, it’s recommended that the bandwidth is always configured to HT20.

Any chance to test this?

proddy commented 3 years ago

Can you build the firmware yourself? If so, add -DEMSESP_WIFI_TWEAK to the pio_local.ini and I'll make sure HT20 is enforced using the call esp_wifi_set_bandwidth(ESP_IF_WIFI_STA, WIFI_BW_HT20). Also, I noticed you have an RPi in your network. It comes with a syslog service, so you could get EMS-ESP to dump all the logs there, which will give you more insight into what is happening than the console or Web System Log.
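A minimal sketch of how such a compile-time switch might look. The flag name EMSESP_WIFI_TWEAK and the call esp_wifi_set_bandwidth(ESP_IF_WIFI_STA, WIFI_BW_HT20) come from the comment above; the function name apply_wifi_tweak and the host-side stand-ins in the #else branch are assumptions, added only so the snippet also compiles without ESP-IDF. This is not the actual EMS-ESP code, and newer ESP-IDF releases may expect WIFI_IF_STA instead of ESP_IF_WIFI_STA.

```cpp
// Normally supplied by the build system as -DEMSESP_WIFI_TWEAK.
#ifndef EMSESP_WIFI_TWEAK
#define EMSESP_WIFI_TWEAK
#endif

#ifdef ESP_PLATFORM
#include <esp_wifi.h> // real ESP-IDF declarations when building for the ESP32
#else
// Host-side stand-ins (assumptions) so the sketch compiles off-device.
typedef int esp_err_t;
const esp_err_t ESP_OK = 0;
enum wifi_interface_t { ESP_IF_WIFI_STA };
enum wifi_bandwidth_t { WIFI_BW_HT20 = 1, WIFI_BW_HT40 };
static wifi_bandwidth_t stub_bandwidth = WIFI_BW_HT40; // HT40 is the documented default
inline esp_err_t esp_wifi_set_bandwidth(wifi_interface_t, wifi_bandwidth_t bw) {
    stub_bandwidth = bw; // remember the request so it can be inspected
    return ESP_OK;
}
#endif

// Hypothetical helper: request a fixed 20 MHz channel width for station mode.
// Returns true when the request was accepted, false when the flag is not set.
bool apply_wifi_tweak() {
#ifdef EMSESP_WIFI_TWEAK
    return esp_wifi_set_bandwidth(ESP_IF_WIFI_STA, WIFI_BW_HT20) == ESP_OK;
#else
    return false; // leave the default bandwidth untouched
#endif
}
```

On the device, a call like this presumably has to run after the Wi-Fi driver has been initialised; where exactly EMS-ESP would place it is not stated in this thread.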

tp1de commented 3 years ago

I haven't compiled it myself for a long time. I downloaded the source from the last build. Is this complete? I just tried compiling but I am getting errors: [screenshot]

BTW: I just got a reboot even with MQTT and NTP switched off and while powered via the service jack. I am getting frustrated.

Yes, I have different syslog services up and running. Which log level shall I use to avoid too many entries? Is "error" good enough? I just switched syslog on towards my Synology NAS with log level "Info".

FredericMa commented 3 years ago

I'm having the same issue. Mine restarts at random intervals. For me, the issue started with version 3.1.0 but I didn't have the time to look at it. I thought it might be a memory issue, so I cleared the flash memory and reflashed EMS-ESP today, but that didn't solve the issue. I reinstalled it around 7PM (yellow mark on the chart) and, as you can see, it restarted 15 minutes ago.

Here is a screenshot of the uptime chart:

tp1de commented 3 years ago

I got another reboot too. Syslog Server has no further information other than the restart info.

I just have seen that my Fritzbox Router has changed WLAN channel due to suddenly 16 new WLAN Networks on old channel 1. (Mobile phones some with accesspoints due to some youth party in the neighborhood ... :) )

I switched automatic channel selection off as well. A WiFi reconnect should not reboot the EMS-ESP, should it?

proddy commented 3 years ago

I haven't compiled it myself for a long time. I downloaded the source from the last build. Is this complete?

@tp1de @FredericMa I've added a new PlatformIO target called 'debug' which compiles with debug strings and uses HT20 WiFi. The easiest way is to use Visual Studio Code with PlatformIO and select Project Tasks->debug->Upload and Monitor with EMS-ESP connected to USB. When EMS-ESP crashes it will show the decoded stack dump and point to the line and object file that caused it. It is best to turn off MQTT and NTP to save on chatter.

tp1de commented 3 years ago

@proddy Thanks, I will wait until tomorrow. I then need to install VSC on my laptop. I use my desktop for SW development.

In my case I am quite confident that I might have found the reason for the reboots. After extending the WLAN log in my Fritzbox router, I noticed heavy activity within my Mesh network: WiFi bandwidth and/or channels were being changed due to the heavy workload of other network devices. This happened 1-2 times an hour until I switched it off yesterday. (It has been on by default for years and I had not noticed it until now.)

Currently everything is stable and I have had no reboots for 17 hours, but I will wait at least another 24 hours.

Could it be that API V3 does not properly recognize when WiFi is temporarily down? Reconnection works within 1-2 seconds. This does not seem to be a problem.

I've added a new PlatformIO target called 'debug' which compiles with debug strings and uses HT20 WiFi.

What is a PlatformIO target? I had problems compiling your code since WWWData.h was missing. Is the code correct now and do I need to compile?

proddy commented 3 years ago

Could it be that API V3 does not properly recognize when WiFi is temporarily down?

I tested, turned off WiFi and EMS-ESP goes into Access Point mode (if configured to do so) and then tries to re-connect. It works and it doesn't crash.

tp1de commented 3 years ago

@proddy

tested, turned off WiFi and EMS-ESP goes into Access Point mode (if configured to do so) and then tries to re-connect. It works and it doesn't crash.

I got another reboot after approx. 20 hours of uptime. In the router log I can see that WiFi was disconnected and reconnected within less than a second. So WiFi reconnection works, but nevertheless there seems to be a watchdog reset which causes EMS-ESP to reboot. (Whether it is a hardware or software reset I can't judge. But I had similar crashes in my ioBroker adapter when doing async HTTP GET/POST requests without proper try/catch error handling.)

tp1de commented 3 years ago

What is a PlatformIO target? I had problems compiling your code since WWWData.h was missing. Is the code correct now and do I need to compile?

@proddy I have installed VSC and platformio plugin on my laptop. So I am ready to do more tests.

I need a link to the source code for compilation. Currently the code references WWWData.h, which is missing in the /lib/framework directory and causes compilation errors. If I take the file from @MichaelDvP's GitHub repository, compilation works. But I can't check whether the downloaded files are the current ones!

And please explain in a bit more detail what a "platformio target" is and which steps I have to take for debugging. I am just used to downloading code into a directory, opening this directory in VSC, then going to the PIO tasks and making an ESP32 build, which I then upload to EMS-ESP via the web UI.

FredericMa commented 3 years ago

@proddy I'll try to test it this evening!

tp1de commented 3 years ago

@FredericMa

I'll try to test it this evening!

So you know how to do it and you can compile?

MichaelDvP commented 3 years ago

@tp1de WWWData.h is generated by the prebuild script and contains the web data. You need to have NPM installed. If you run the build process, it is generated automatically. (To build manually, type: cd interface, npm install, npm run build)

To use the debug-task you have to rename pio_local.ini_example to pio_local.ini or copy the [env:debug] chapter to your pio_local.ini. Then you can select platformio-icon on the left side, choose debug-task and select upload and monitor.

[screenshot: pio]

proddy commented 3 years ago

thanks Michael. And F1->Git: Clone->https://github.com/emsesp/EMS-ESP32.git

and bottom left (where it shows dev2 in Michael's setup) choose the origin/dev branch.

tp1de commented 3 years ago

@proddy @MichaelDvP I got errors with npm: [screenshot]

Do you have an idea what to do?

tp1de commented 3 years ago

to build manual type: cd interface, npm install, npm run build

I found out that I need npm install --force to override the errors on unsupported operating system. I got a couple of error messages of missing dependencies. I hope this has no effect.

Build without debug mode is now working and is building the bin-file. I made a pio_local.ini and I am building now with debug options. Runs through with an error that upload was not working since not yet connected to EMS-ESP. I will connect now and see what is happening .....

proddy commented 3 years ago

ok! so far so good!

tp1de commented 3 years ago

I just need to understand how to select the right USB port for esptool. The wrong one is chosen automatically. And which screen should I watch afterwards, Terminal or Debug Console?

proddy commented 3 years ago

I just need to understand how to select the right USB port for esptool. The wrong one is chosen automatically. And which screen should I watch afterwards, Terminal or Debug Console?

in the PlatformIO menu (icon on left bar) go to PIO Home->Devices and it'll show which COM ports the USB is using. Then add this to your pio_local.ini like upload_port = COM3. After "Upload and Monitor" it will open a new window with the EMS-ESP console. To test it, in EMS-ESP, type "test crash" and it will force EMS-ESP to crash. Hopefully you'll see the stack dump on the screen and also the line where it crashed (test.cpp).
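As a rough illustration, the port settings could look like the fragment below. upload_port = COM3 is taken from the comment above; monitor_port is a standard PlatformIO option that is not mentioned in this thread, and the section name is an assumption, so use the environment you actually build.

```ini
; pio_local.ini (sketch; section name is an assumption)
[env:esp32]
; port shown under PIO Home -> Devices
upload_port = COM3
; assumption: use the same port for the serial monitor
monitor_port = COM3
```

Setting monitor_port as well should avoid having to pick the port again when the monitor starts, provided the board enumerates on the same COM port.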

tp1de commented 3 years ago

in the PlatformIO menu (icon on left bar) go to PIO Home->Devices and it'll show which COM ports the USB is using. Then add this to your pio_local.ini like upload_port = COM3

This worked for upload. Then for monitor I have to select the port manually again. So far so good.

After "Upload and Monitor" it will open a new window with the EMS-ESP console.

This is not happening. On terminal a lot of output so fast that I can't read anything- all telegrams and the EMS-ESP is not connected to Wifi. It seems that the EMS-ESP is permanently rebooting..... I see some register dump but so fast that I can not read it on the screen.

proddy commented 3 years ago

This is not happening. On terminal a lot of output so fast that I can't read anything- all telegrams and the EMS-ESP is not connected to Wifi. It seems that the EMS-ESP is permanently rebooting..... I see some register dump but so fast that I can not read it on the screen

try again without the 'debug' target. Pick the 'esp32' one and see if that works. It uses a different partition.

tp1de commented 3 years ago

Since I had no time yesterday anymore I compiled without debug and uploaded and monitored by COM7. EMS-ESP rebooted in the meantime - on the VSC-console you see (before reboot):

[screenshot]

after new start:

[screenshot]

proddy commented 3 years ago

weird. check that the monitor speed is 115200 in your platformio.ini?

tp1de commented 3 years ago

I got this screen after some hours and before the reboot. The monitor speed is 115200 in platformio.ini. Now, with the b3 update, this is the current picture: [screenshot]

tp1de commented 3 years ago

Just rebooted again after 15 minutes with an empty screen as above .....

tp1de commented 3 years ago

Finally I have a dump for the rebooting:

[screenshot]

tp1de commented 3 years ago

@proddy @MichaelDvP

Since I have the ESP8266 running in parallel to the ESP32 it's just the ESP32 which crashes after a couple of hours. Do you have any idea what could be the root cause?

tp1de commented 3 years ago

A new dump (two screenshots, since it is longer): [screenshot] [screenshot]

MichaelDvP commented 3 years ago

Hm, both crashes are LoadProhibited at PC 0x400014FD, trying to load something from garbage address 0x60c20 in vfprintf.c. I suspect a log message that is passed without a format specifier, contains a %, and is misinterpreted as a format string. Maybe in pretty_telegram().
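The suspected failure mode can be illustrated with a small sketch. It is not the EMS-ESP logging code; log_safe is a hypothetical helper and the sample messages are made up. The point is that the message is passed as an argument to a fixed "%s" format, so a % inside it is copied literally instead of being parsed as a conversion.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a log line into a std::string.
// The message is data, never the format string itself.
std::string log_safe(const char* message) {
    char buffer[128];
    // Output is truncated to fit the buffer and always NUL-terminated.
    std::snprintf(buffer, sizeof(buffer), "%s", message);
    return std::string(buffer);
}

// The unsafe variant is shown only as a comment because it is undefined behaviour:
//   std::snprintf(buffer, sizeof(buffer), message);
// With a message such as "pump at 50%s", the formatter would try to read a
// string pointer that was never passed, which matches a load from a garbage address.
```

Many compilers can flag the unsafe pattern at build time (for example GCC and Clang with -Wformat-security), which is one way such a call could be caught earlier.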

@tp1de have you tried with all logs off?

proddy commented 3 years ago

Yes, it looks like a buffer overflow, either from referencing an empty telegram or from filling a telegram buffer and exceeding the allocated space (32 bytes max, 27 for the message length). We could certainly add more checks in the code to ensure there is no overflow. It would be interesting to see if the crashes still occur with all services turned off (syslog, MQTT, NTP, API) and running only one EMS-ESP on the EMS bus. If that still crashes, then turn the automatic Tx off.
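The kind of check mentioned here could look roughly like the sketch below. The two sizes (32 and 27) are taken from the comment above; the struct, constant names, and store_message function are hypothetical and do not reflect the real EMS-ESP telegram classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sizes from the comment above; the names are assumptions.
constexpr std::size_t TELEGRAM_BUFFER_SIZE = 32; // whole telegram
constexpr std::size_t MAX_MESSAGE_LENGTH   = 27; // message part only
static_assert(MAX_MESSAGE_LENGTH <= TELEGRAM_BUFFER_SIZE, "message must fit in telegram");

struct Telegram {
    std::uint8_t message[MAX_MESSAGE_LENGTH] = {};
    std::size_t  length = 0;
};

// Copies `length` bytes into `out`. Empty, null, or oversized input is
// rejected and `out` is left unchanged, instead of writing past the buffer.
bool store_message(Telegram& out, const std::uint8_t* data, std::size_t length) {
    if (data == nullptr || length == 0 || length > MAX_MESSAGE_LENGTH) {
        return false;
    }
    std::memcpy(out.message, data, length);
    out.length = length;
    return true;
}
```

Rejecting the telegram outright is only one possible policy; truncating to the maximum length would be another, at the cost of processing incomplete data.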

tp1de commented 3 years ago

@tp1de have you tried with all logs off?

Yes I tried before all services off and then switched syslog on again as recommended by Paul. I switched off everything now again except of the API V3.

I connected the second gateway just 3 days ago to test the ioBroker adapter with V2 too; it seems to work now as well. I have had the reboots for a long time, roughly since versions > 3.1.

I will be away for a week from Sunday onwards, but I will watch remotely, just without the laptop console. What is the difference in terms of telegram handling compared to the old v2, which is stable? Or might it be a V3 API topic?

proddy commented 3 years ago

Not entirely related, but I was reading on https://tasmota.github.io/docs/Energy-Saving/ that perhaps putting back the delay(1) we had in emsesp::loop() would save some power consumption

MichaelDvP commented 3 years ago

I think i've finally found the cause, a typo that was not detected by code-checker and compiler (since 23.4.2021).

tp1de commented 3 years ago

I loaded the new firmware and will see if it is stable now. Rebooting was between 1 and 24 hours. I will report tomorrow.

FredericMa commented 3 years ago

I've also uploaded the new firmware. TX-fails is 1 now while it remained 0 before so this looks good. 1h40min uptime and counting.

MichaelDvP commented 3 years ago

@FredericMa, afair you have a Junkers thermostat. As mentioned in #77, I have changed the Junkers-mode published values to match the command values (also to make the selection box in the web UI possible). Can you check if this is all ok?

FredericMa commented 3 years ago

Uptime is now 24h without a reboot. I would consider this as solved since I had at least 5 reboots within 24h before. Thanks!

emsesp / EMS-ESP32

ems-esp is rebooting twice a day #78