emsesp / EMS-ESP32

ESP32 firmware to read and control EMS and Heatronic compatible equipment such as boilers, thermostats, solar modules, and heat pumps
https://emsesp.github.io/docs
GNU Lesser General Public License v3.0
627 stars 107 forks source link

ems-esp is rebooting twice a day #78

Closed tp1de closed 3 years ago

tp1de commented 3 years ago

@proddy Hi Paul, I recognized that my ESP32 gateway is rebooting approx. after 11-12 hours. I can't see any reason why. Free memory is not the problem. Could it be that having http get requests per device every 15 sec per api is initiating the rebooting after a while?

It's not a major issue for me. I just miss some data for approx 90 secs because of my 60 tx-delay on startup. Could I help to resolve this problem?

proddy commented 3 years ago

This is going to hard to find. It's probably the web code (API) but could you try and narrow it down to get it easily reproducible. For starters

proddy commented 3 years ago

well spotted @MichaelDvP !

tp1de commented 3 years ago

@proddy @MichaelDvP I will test and report.

tp1de commented 3 years ago

@proddy @MichaelDvP

I installed the updated version with the fixed memory leak. Now it is getting worse. I get a reboot now approx. every 13 minutes. It is a reboot of the esp32 and not just a connection lost for ems-bus.

I will try to watch the log if I can see anything ....

proddy commented 3 years ago

The fix @MichaelDvP did is valid. The async response buffer is only free'd after a send (https://github.com/me-no-dev/ESPAsyncWebServer#request-life-cycle).

tp1de commented 3 years ago

Now it is running since > 30 minutes. But I recognized the following in respect to web-ui system log:

I switched off the web-ui system log and will wait if this has an influence.

proddy commented 3 years ago

When loglevel is set to error I get 1-2 crc errors shown per minute

That is really bad. What does show ems tell you from the console?

Screenshot 2021-06-29 130333

tp1de commented 3 years ago

What do you mean with bad? Too many crc errors? But why aren't the errors shown in debug mode? Actually the esp is running since 1:20 hours and system log-level is off:

grafik

proddy commented 3 years ago

What do you mean with bad? Too many crc errors? But why aren't the errors shown in debug mode?

1-2 CRC errors every minute is not good. Are you sure you are seeing this? Errors are also shown when the log level is DEBUG. If you're not seeing any probably there aren't any! It's hard to tell without real logs, screenshots etc

MichaelDvP commented 3 years ago

Reboots and crc-errors can be induced by a weak/unstable powersupply, thin cables, etc. Last in #72, and many times before.

tp1de commented 3 years ago

Reboots and crc-errors can be induced by a weak/unstable powersupply, thin cables, etc. Last in #72, and many times before.

Yes I am certainly aware, but I haven't changed anything on wirering and/or power supply. I haven't had this "instability" in older versions for some months. I will test in former builds if I will have the same behaviour. 0.9% crc errors on rx should not be a major problem as far as I can judge.

tp1de commented 3 years ago

The gateway rebooted again after approx 3,5 hours. I had log level on "off". After a reboot it is reset to "info".

grafik

proddy commented 3 years ago

I would ignore the CRC errors and how they are reported. It's normal to see CRC errors. Logs are only shown when the browser window is open and it's always set to INFO. So all normal there.

To find out why it's rebooting, try playing with the API calls (lower the number of calls, turn it off) until you narrow down what is causing it

MichaelDvP commented 3 years ago

The log looks like bus-Id is set to 0x0A and all crc-errors are on the master-echo. Strangely the stat above shows no tx-fails, so the other devices reply to emsesp, only emsesp itself is disturbed while sending. Maybe bus-connection is not good.

I'm trying to reproduce: I have activated the tp1de-ioBroker-adapter in actual version running now for seven and a half hours without reboot (after the fix). Also mqtt is active with my single-value extension and qos 1, both giving a lot of more traffic. Jack-powered Gateway S32 prototype, Jack-cable 0,5m from BBQKees. Rx: 35159, crc-fails 1 Tx: 3915, fails 0 Dallas-reads: 18701, fails 0 Mqtt publishes 14873, fails 0 Devices: Boiler, Controller, Mixer, Solar, Thermostat, 7 Dallassensors powered from 3.3V, 1 analog sensor powered from 5V. Syslog info enabled, AP disabled, NTP enabled, OTA disabled Heap 191k, free block 107k (unchanged in this 7,5 hours)

All looks stable to me, i don't think i will get a reboot in next hours, but we'll see tomorrow.

MichaelDvP commented 3 years ago

Uptime: 001+01:35:15 Rx: 14089, 17 fails Tx: 23549, 0 fails mqtt 33015, 0 fails heap 191k, 107k free block

I can not reproduce the reboots. Software (mqtt/api/ioBroker-adapter) seems stable, i think it's somewhere in hardware/connection/power.

proddy commented 3 years ago

there's a guy on Discord that's also getting reboots. Perhaps we should re-introduce the code that kept the reboot reason and part of the stack to help identify which is causing it. If it's a Watch Dog reset then I would be worried

tp1de commented 3 years ago

I updated again with the fixed version from @MichaelDvP and made a hard reboot. I can confirm that it seems stable now for more then 24 hours.

The higher crc error rate should not be the cause for reboots. My 0.8% crc error rate is related to connection of the ems-esp gateway to the ems-bus outbound of mm100 mixer module (2 m twisted pair cable) to keep service jack free for maintenance activities. That's the reason why I have selected a different bus id to allow the installer on yearly maintenance to connect by service jack and laptop.

But I believe the suggestion of @proddy is good to keep track of reboot reasons. BTW yesterday I could initiate reboots by webui from my mobile phone (android 10 / chrome browser) by requesting a couple of times a request to refresh system status screen. Today it is working.

tp1de commented 3 years ago

Just a small observation: freemem info by api is always lower then by mqtt heartbeat (174 vs. 180).

proddy commented 3 years ago

Just a small observation: freemem info by api is always lower then by mqtt heartbeat (174 vs. 180).

https://github.com/emsesp/EMS-ESP/issues/569

MichaelDvP commented 3 years ago

(2 m twisted pair cable)

how thin? Try with 0,75mm^2. The crc-errors only while sending indicating a voltage drop over the line.

I see that your KM200 produces a lot of traffic, i have ~75 messages per minute, you receive 175 per minute. That is something i can not reproduce in my system.

tp1de commented 3 years ago

how thin? Try with 0,75mm^2. The crc-errors only while sending indicating a voltage drop over the line.

Yes I am aware but this only tempory. (.... since 5 months ... ) I will install the ems-esp close to the mm100 and then cable will then be 10 cm.

I see that your KM200 produces a lot of traffic, i have ~75 messages per minute, you receive 175 per minute. That is something i can not reproduce in my system.

What kind of data for trafic are you refering to Michael? ems telegrams?

MichaelDvP commented 3 years ago

What kind of data for trafic are you refering to Michael? ems telegrams?

Yes, traffic on the ems-bus. But i don't think that the higher telegram-rate can cause resets after a couple of hours. The number of polls on the ems-bus is higher and does not overload the emsesp. I only mentioned it, because it is a difference between your system and mine.

tp1de commented 3 years ago

i have ~75 messages per minute, you receive 175 per minute

Just for interest. Where do you get the figures from and what do you mean with KM200? KM200 gateway with device id 0x48 or the km200 polling out of the ems-esp ioBroker adapter? Where do you identify 175 messages per minute on my system?

MichaelDvP commented 3 years ago

You wrote yesterday:

Actually the esp is running since 1:20 hours and system log-level is off:

and the screenshot shows 14212 received messages, giving 14212/80 = 177 messages per minute.

Because our systems are not very different (me: GB125-MC10, RC35, MM10, SM10, you GB125-MC110, RC300, MM100) i think the km200 makes the difference, reading active the devices.

tp1de commented 3 years ago

OK understood. Actually longterm it is rather 193 messages per minute.

I believe this is the result of polling km200 every 90 seconds within ioBroker Adapter for 119 fields w/o recordings. So approx 80 per minute + km200 is polling all recordable fields every minute to create recording statistics (power + temp). These are another 20 messages.

So km200 is increasing message traffic .

No reboots on ems gateway so far anymore!

tp1de commented 3 years ago

I got an other reboot at 3 am this night. Or could it be that something else is resetting uptime?

proddy commented 3 years ago

when EMS-ESP boots it will send out an MQTT topic called info with a payload like

{"event":"start","version":"3.1.2b0","ip":"192.168.1.134"}

This is a good indication on whether the device has physically rebooted itself.

MichaelDvP commented 3 years ago

@tp1de Have you tested a different power supply or jack-power? Most of the other users don't report reboots, it's pobably not the software, but something in your system.

tp1de commented 3 years ago

when EMS-ESP boots it will send out an MQTT topic called info with a payload like

{"event":"start","version":"3.1.2b0","ip":"192.168.1.134"}

This is a good indication on whether the device has physically rebooted itself.

I will watch this. Since I have restarted manually afterwards I see this mqtt entry from the manual start. I was wondering that my ioBroker Adapter with 15 sec polling have not recognized any errors. I will recheck my code to report http get errors during reboot..

But I believe it would be usefull to have the last restart status and time beeing reported in mqtt and in api v3 in http://ems-esp/api/system unless it is just the uptime which is reset when rebooted / restarted..

tp1de commented 3 years ago

@tp1de Have you tested a different power supply or jack-power? Most of the other users don't report reboots, it's pobably not the software, but something in your system.

I will do starting today with the following steps:

  1. Connect with thicker and shorter wires to mm100 ems-bus out and watch crc errors
  2. Connect to jack but keep it usb powered
  3. connect to jack and jack-powered

I will report crc errors effect and will watch if this has an impact on rebooting. This might take a couple of days. Nevertheless the rebooting does not effect long-term stability of my ioBroker adapter I think we should find the root cause.

tp1de commented 3 years ago

I will report my crc results in #80 In the meantime I have updated my ioBroker adapter to show an error message when no devices found during reboot.

tp1de commented 3 years ago

80 is understood and no issue for me anymore. So let me come back to the rebooting issue.

I just had a reboot after approx. 16 hours. ems-esp now connected with 10 cm wiring to mm100 and usb powered. I will change usb power adapter but I do not believe that this is the reason for reboots.

What else can I do / test?

tp1de commented 3 years ago

I had mqtt explorer running during reboot. The only thing I recognized was a small negative spike on freemem shortly before rebooting: grafik

BTW: It would be good to have uptime in seconds being available in API V3 as well. Will you implement in API or shall I do it in the ioBroker adapter?

proddy commented 3 years ago

We could add the uptime_sec to /api/system/info. It's just one line to add.

proddy commented 3 years ago

I just had a reboot after approx. 16 hours. ems-esp now connected with 10 cm wiring to mm100 and usb powered. I will change usb power adapter but I do not believe that this is the reason for reboots.

I was looking into how to capture the reboot reason and backstack trace on an ESP32 and it's not easy, unlike the ESP8266.

tp1de commented 3 years ago

We could add the uptime_sec to /api/system/info. It's just one line to add.

This would be good. Any idea how to find the reason for rebooting?

MichaelDvP commented 3 years ago

I was looking into how to capture the reboot reason and backstack trace on an ESP32 and it's not easy, unlike the ESP8266.

For the reason there is an example here.

proddy commented 3 years ago

This would be good. Any idea how to find the reason for rebooting?

No idea. As I mentioned in the earlier comment https://github.com/emsesp/EMS-ESP32/issues/78#issuecomment-870266749 the way to find it is by process of elimination, i.e. simply stopping services and slowing down API calls to see if it makes a difference. I suspect its probably the # API calls causing the AsyncWebServer to WDT. There are many reported issues in that's projects GitHub repo. But...we have around 1700 EMS-ESP users out there and no one else has reported reboots like this

tp1de commented 3 years ago

I suspect its probably the # API calls causing the AsyncWebServer to WDT.

@MichaelDvP Since you have installed the ioBroker adapter as well. Do you have rebooting issues?

Since my ESP32 is powered by USB and yours as far as I have understood by service jack, this is the only key difference - isn't it? What else could bring the ESP32 to reboot? Could it depend on different BBQKees board version?

proddy commented 3 years ago

@tp1de if you turn off the API polling, does it still crash? If so we can experiment with lowering the Tx power to help on voltage fluacations and I can create a small stress test. For that test I would need to know which API calls are you sending and the frequency,

MichaelDvP commented 3 years ago

@MichaelDvP Since you have installed the ioBroker adapter as well. Do you have rebooting issues?

No. As written here

Could it depend on different BBQKees board version?

Thanks to BBQKees he has spend some boards to me. I've tested gateway v1.5 with MH-ET32S, v1.7 with a different MH-ET, Gateway S32 prototype and S32 series. Only eth connection is untested with my system. No reboots.

tp1de commented 3 years ago

if you turn off the API polling, does it still crash?

I really do not want to do this, because I will loose the historic data in ioBroker. But I could change polling frequency.

For that test I would need to know which API calls are you sending and the frequency,

The ioBroker adapter has two phases. First all single fields are read per http get but only once on start. Then there is a fixed polling every 15 seconds with /ems-esp/api/system --> returns json with info, status and devices (in my case 3) /ems-esp/api/thermostat /ems-esp/api/boiler /ems-esp/api/mixer

So in total 4 API get requests per 15 seconds. Between each http get request and writing states into iobroker there is an additional 100msec waiting time. I do not get any http request errors returned before rebooting.

Total loop duration with http gets and writing the states into ioBroker (Redis) is approx 1 sec.

MichaelDvP commented 3 years ago

I think the API polling frequency is given by the ioBroker-Adapter and should be the same for me. Have you mqtt enabled? What sending frequency? If it is a power issue, i mostly compile with board_build.f_cpu = 160000000L in pio_local.ini, saving a bit power and is fast enough. (afair current for esp32: 80MHz-40mA, 160Mhz-50mA, 240MHz-70mA)

tp1de commented 3 years ago

Since @MichaelDvP does not have any rebooting issue with the same polling frequency API polling should not be the reason. MQTT is enabled with 10 seconds intervals.

tp1de commented 3 years ago

If it is a power issue, i mostly compile with board_build.f_cpu = 160000000L in pio_local.ini, saving a bit power and is fast enough. (afair current for esp32: 80MHz-40mA, 160Mhz-50mA, 240MHz-70mA)

I do not compile by myself. I use the ready to use builds. https://github.com/emsesp/EMS-ESP32/releases/tag/latest

I will test now further on with the following steps:

  1. I already changed USB power adapter - let's see if this has an influence
  2. I will switch off MQTT
  3. I will increase API polling to 60 sec
MichaelDvP commented 3 years ago

For what do you need api and mqtt in this high frequency? Wifi-sending is power-hungry. Switching off mqtt when using api is a good idea, I don't see any usecase for this, except for testing purpose.

tp1de commented 3 years ago

For what do you need api and mqtt in this high frequency? Wifi-sending is power-hungry. Switching off mqtt when using api is a good idea, I don't see any usecase for this, except for testing purpose.

@MichaelDvP Yes you are right. I basically do not need MQTT anymore. Wasn't 10 seconds set as standard when switching on MQTT?

I started with MQTT and then switched to API V3 to get the single field infos (enum /units etc) and some reconnection problems I had with MQTT. I still have some MQTT-Parts in Nodered - but I can switch them off.

The 15 secs polling in within the Adapter is a result to review the modulation control mechanismen of my boiler. I had quite some discussion with Bosch, since my Boiler always starts with 52% modulation for approx 2 minutes and mixer control is extremly slow. So boiler starts and mixer is not open for 2-4 minutes. As a result boiler is takting much more then necessary. To have "realtime" data I used the graphs from ioBroker and sent them to Bosch/Buderus. No solution from them yet. I will try to influence the mechanismen by myself within the ioBroker adapter to reduce takting. Testing is postponed anyhow until heating period starts again ....

But let's see what happens if I go for (2) and (3). I will report.

proddy commented 3 years ago

something to try later is compiling with the -DEMSESP_WIFI_TWEAK flag which lowers the WiFi power & strength, reducing power compensation of the circuit

proddy commented 3 years ago

and another thing @tp1de if you connect to a laptop via serial, when it crashes you can capture the trace which is printed to the serial console. https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/fatal-errors.html

tp1de commented 3 years ago

I started with new USB power supply and reducing MQTT frequency to 60 seconds. I will report the outcome.

The rebooting itself is not a severe issue for me since it is so fast that the ioBroker adapter only recognize the tx delay in respect to the km200 adapter (now 30 seconds).