emsesp / EMS-ESP32

ESP32 firmware to read and control EMS and Heatronic compatible equipment such as boilers, thermostats, solar modules, and heat pumps
https://emsesp.github.io/docs
GNU Lesser General Public License v3.0
641 stars 109 forks source link

EMS-ESP 3.6.0 crashes after hours/days - no response from ems-esp #1264

Closed yazend closed 1 year ago

yazend commented 1 year ago

I have updated to dev.3.7.0 as a test, after a certain time n 1-3 hours the ems-esp is no longer available, no response, no putty, no ping the whole thing can be reproduced, if I switch back to dev 3.6.17, this behavior is gone . Is this problem known

BR Alex

proddy commented 1 year ago

Strange, haven't seen this behaviour or any other reports. Did you build yourself or take the .bin from the GH release page? Also any other info on your setup? If it's memory you can try capturing the data and seeing what happens over time.

yazend commented 1 year ago

Proddy, I'm using the ems-esp with WiFi and LAN, I downloaded the firmware via the UI so I assume the .bin is the same as on GIT. I just reinstalled the 3.0.7.dev0 and set the syslog to all. As soon as the device does not respond again, I will send the logs.

See different on sysinfo "free RAM"

Sysninfo_V3.0.6dev17.JPG --> free RAM 105KB/61KB Sysninfo_V3 0 6dev17

Sysninfo_V3.0.7dev0.JPG --> free RAM 107KB/59KB Sysninfo_V3 0 7dev0

yazend commented 1 year ago

Now I have the ems-esp freez the situation, if you need further information let me know.

no response anymore... grafik

log.txt log.txt

giovanne123 commented 1 year ago

For information: I had also today 17.08.2023 (~13:45) a freeze, the ems-esp was not reqonding anymore. Have seen it few minutes ago. Had to cut power to get it running again.

I'm on 3.6.0, release installed on 15.08.2023 (19:40) .bin from GH, so was running for ~2days until freeze.

3.5.x: Never had such freeze before the last years with previous versions like 3.5.x 3.6.x: Started with 3.6.x/dev some days ago:

3.7.x Haven't installed 3.7.x/dev so far

Will observe if it will occur again...

image

proddy commented 1 year ago

Thanks for reporting. I'll compare both the old 3.6.0-dev17 with the latest 3.6.0 to try and see what is different

proddy commented 1 year ago

update: I can't see any noticeable changes in the code between 3.6.0_dev.17 and either 3.7.0-dev-0 or the main 3.6.0

are there any specific actions you are performing when it 'freezes'? Like is the Web UI open and logging everything to the System Log window? Or does it crash when there is no web UI or telnet present?

giovanne123 commented 1 year ago

from my side there was no activity in WebUI or telnet. I only sometimes have a look via HA to it (HA via mqtt from ems-esp). But nothing special was done when the freeze occured. Since yesterday it is running: image

Will observe how it behaves the next days and if there will be a freeze again...

vmonkey commented 1 year ago

Hi, I have the very same issue with 3.6.0.

proddy commented 1 year ago

damn. Is there anything particular to your setups? Like are using the API a lot (e.g. via iobroker), or exporting data from the web, or having the Web UI open looking at the web logs, or logging with syslog. I need to trace down why it's not working for you guys.

In the meantime can you try out these two builds:

EMS-ESP-3_6_0-dev_17-ESP32_4M.zip

EMS-ESP-3_6_0-dev_18-ESP32_4M.zip

vmonkey commented 1 year ago

@proddy actually, nothing really special. I am just using it in home assistant via mqtt.

I would be really happy to help but I will not have a physical access to the device for few days. If the issue is not found, I will get back to this on Thursday.

Thanks a lot!

giovanne123 commented 1 year ago

In the meantime can you try out these two builds:

EMS-ESP-3_6_0-dev_17-ESP32_4M.zip

EMS-ESP-3_6_0-dev_18-ESP32_4M.zip

Currently after last restart 3.6.0 is running without freeze since 2 days now... I will wait until tomorrow before flashing one of your linked builds to see if there won't be a freeze until tomorrow with 3.6.0... What is changed in the linked builds? (There was already dev17 before)?

Edit 21.08.2023: Today morning 8:30 after 3 days 3.6.0 crashed again.

Edit 23.08.2023 ~11:30: 3.6.0 freeze again after 2days Now 12:45 installed dev18...

Edit 25.08.2023 ~6:50: dev18 freeze after ~2 days, restarted device again with dev18...

Edit 29.08.2023 15:30: Installed 3.7.0-dev.1 because was pointed out below., will check.... (dev18 wasn't freezed since last comment)

Edit 29.08.2023 18:45: revert back to dev18, because have some other trouble with 3.7.0-dev.1, not related to our issue analysing here... (thermostat data and maybe other are not published/send to mqtt, e.g. damped outdoor temp was missing in HA -> no time to investigate currently so going back to dev18 for a running system)

proddy commented 1 year ago

There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some update web libraries, but that shouldn't cause a crash. Which is strange.

yazend commented 1 year ago

There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some update web libraries, but that shouldn't cause a crash. Which is strange.

I update second device cross the street dev-17 is working, dev-18 as well 3.7.0-dev-0 freez System over time. Two different device my wifi and lan, other one at my parents only wifi.

proddy commented 1 year ago

There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some update web libraries, but that shouldn't cause a crash. Which is strange.

I update second device cross the street dev-17 is working, dev-18 as well 3.7.0-dev-0 freez System over time. Two different device my wifi and lan, other one at my parents only wifi.

Ok! Thanks for testing this, it's super helpful. So something sneeked in between dev-17 and dev-18 so I'll compare the source for those two again and try and figure out what is causing this.

kebabuschi commented 1 year ago

Same issue on my device, too. ESP32, wifi. Updated from 3.5.x latest official to 3.6.0 official. I have 8 digital thermometers attached via one wire. I only use the Web GUI, for OTA updates or customizations. Apart from that I use Home Assistant via MQTT.

JokerGermany commented 1 year ago

Same issue after a little bit more than 25 hours, was runnning fine with EMS-ESP-3_6_0-dev_12-ESP32.bin will try the new dev and if this isn't working will go pack to 3_6_0 dev. (It crashed with the dev version seldom too, but it recovered after some time.)

ESP32 with Lan but deactivated (to save memory) and using wifi instead and HA grafik

€dit: This time it recovered before beeing 1 hour not available and then crashed again... €dit2: Can't get it back to life at the moment... It only shows a white screen when trying to reach the webinterface. But it looks like some sort of connections is still working because firefox is trying to connect.

proddy commented 1 year ago

Sorry @JokerGermany to screw up your system. There is something fishy with the 3.6.0 main build but I'm not sure what it is. Could you try one of the dev17 or dev18 builds from the earlier posts?

JokerGermany commented 1 year ago

Sorry @JokerGermany to screw up your system. There is something fishy with the 3.6.0 main build but I'm not sure what it is. Could you try one of the dev17 or dev18 builds from the earlier posts?

will do when i get it back to live. at the moment the web interface isn't reachable at all. Will try to cut off power.

€dit: okay downgraded to EMS-ESP-3_6_0-dev_18-ESP32_4M.zip

€dit2: Looks like same "white screen interface" with dev 18, but don't know if 3.6.0 saved something which let dev 18 crash, too. Will now try EMS-ESP-3_6_0-dev_17-ESP32_4M.zip

€dit3: After power off/on dev 18 is working now, will test it before trying dev 17. MQTT looks better now, after the downgrade it was connected but in the queue were 90 and orange....

yazend commented 1 year ago

Proddy, I agree with you, this is a very strange behavior, I changed the mqtt setting, since then it has been running without problems, whether 3.0.6-dev-17 or dev-18 also the 3.0.7-def-0 all run without freeze, since I changed the settings.

mqtt setting with freez User: mqtt Password: mqtt now I changed it according to the homeassistant mqtt information

mqtt settings without freez user: homeassistant Password: hvxrdatj6t3r8oggg3689jhrcgs268kjv (random PW) as side efect Homeassistant Information disapered mqtt sensor name changed to none.

I did it as well at my parents device, both are running since I changed the mqtt, I don't know if this helps to find the different on your code, but someone can use it as an work around.

hlavki commented 1 year ago

I have the same issue using HomeAssistant via MQTT on Wifi and Buderus Logamax plus GB192i

JokerGermany commented 1 year ago

Feedback: dev18 is working for me since 1 Day and 18 hours. Had an outage yesterday for a 20 minutes, but i had this with the other dev version before, too. grafik They are acceptable for me as long as they are short as this one and ems-esp is recovering on his own.

vmonkey commented 1 year ago

@proddy Before leaving for vacation, I reverted to dev18. It has been stable since then (5 days), so I believe the issue has been introduced afterwards.

giovanne123 commented 1 year ago

@proddy , for me dev18 also freezed after ~2days (https://github.com/emsesp/EMS-ESP32/issues/1264#issuecomment-1685074139)

yazend commented 1 year ago

Proddy, just for your information... since I changed the mqtt settings, 3.7.0dev0 is running 7 days and 17 hours

Screenshot_20230825_070740_Samsung Internet

JokerGermany commented 1 year ago

Feedback: dev18 is working for me since 1 Day and 18 hours. Had an outage yesterday for a 20 minutes, but i had this with the other dev version before, too. grafik They are acceptable for me as long as they are short as this one and ems-esp is recovering on his own.

crashed 6 hours ago now, restarted it, to look if it was a fluke

Proddy, just for your information... since I changed the mqtt settings

What do you exactly changed?

yazend commented 1 year ago

I changed the mqtt username and password, see six post above

MichaelDvP commented 1 year ago

I think the freezes are memory related. If mqtt disconnects the queue is filled consuming memory until a minimum is reached,

  1. this https://github.com/emsesp/EMS-ESP32/blob/86e29515e789b26384eed4dae210b65d50f6adc4/lib/espMqttClient/src/Config.h#L32 is to small, i think in v3.5 we had ~40k as minimum free before deleting queued messages.
  2. We queue messages while mqtt is disconnected https://github.com/emsesp/EMS-ESP32/blob/86e29515e789b26384eed4dae210b65d50f6adc4/lib/espMqttClient/src/Config.h#L40 in v3.5. publish/queue was stopped on disconnect.

Maybe better to change both values

zibous commented 1 year ago

Same Problem here with EMS-ESP Version 3.7.0-dev.0

Bildschirmfoto 2023-08-28 um 16 08 55

JokerGermany commented 1 year ago

Feedback: dev18 is working for me since 1 Day and 18 hours. Had an outage yesterday for a 20 minutes, but i had this with the other dev version before, too. grafik They are acceptable for me as long as they are short as this one and ems-esp is recovering on his own.

crashed 6 hours ago now, restarted it, to look if it was a fluke

not responsive again since more than 24 hours, will try dev 17

I changed the mqtt username and password, see six post above

okay I have already other credentials

proddy commented 1 year ago

I think the freezes are memory related. If mqtt disconnects the queue is filled consuming memory until a minimum is reached,

  1. this https://github.com/emsesp/EMS-ESP32/blob/86e29515e789b26384eed4dae210b65d50f6adc4/lib/espMqttClient/src/Config.h#L32 is to small, i think in v3.5 we had ~40k as minimum free before deleting queued messages.
  2. We queue messages while mqtt is disconnected https://github.com/emsesp/EMS-ESP32/blob/86e29515e789b26384eed4dae210b65d50f6adc4/lib/espMqttClient/src/Config.h#L40 in v3.5. publish/queue was stopped on disconnect.

Maybe better to change both values

I think we must. Shall I make the change?

MichaelDvP commented 1 year ago

I've set min_free_mem to 40k in my dev. Maybe one of the guys reporting this issues can test if it works. @JokerGermany you seems to have this issue often, could you check? https://github.com/MichaelDvP/EMS-ESP32/releases/tag/latest

My esp32 is now 16 days online without any error, i can not reproduce this issue.

BTW: we had 60k heap check in v3.5, but espMqttClient checks maxAlloc and PSRAM, so it's a bit different, let's try with 40k.

yazend commented 1 year ago

MichaelDeV, I will upgrade the prerelease 3.7.0dev-1, Let you guys know if something happens.

vmonkey commented 1 year ago

I updated too. I will report back in few days. Nevertheless, dev18 was really working for me without a glitch for more than a week.

JokerGermany commented 1 year ago

I've set min_free_mem to 40k in my dev. Maybe one of the guys reporting this issues can test if it works. @JokerGermany you seems to have this issue often, could you check? https://github.com/MichaelDvP/EMS-ESP32/releases/tag/latest

updated to v3.7.0-dev.1

proddy commented 1 year ago

I updated too. I will report back in few days. Nevertheless, dev18 was really working for me without a glitch for more than a week.

I went through line by line comparing dev18 with the main 3.6.0 release and there's nothing in the code that would suggest a memory issue. I think it was just by chance that your MQTT broker was down for a short while (perhaps < 1 min) and the EMS-ESP queue quickly filled up.

MichaelDvP commented 1 year ago

I could reproduce the freeze/crash by stopping the mqtt broker. Heap goes down and somewhere below 40k maxAlloc the esp drops the wifi connection is is not responsable. With setting 40k minMem there seems to be a bug in the espMqtt library, reaching the limit restarts the esp. Also each message allocs ram, but is not set to the outbox, so the queue size display does not count up.

Continue testing, but it always takes more than 2 hours to reach the limit.

proddy commented 1 year ago

I could reproduce the freeze/crash by stopping the mqtt broker. Heap goes down and somewhere below 40k maxAlloc the esp drops the wifi connection is is not responsable. With setting 40k minMem there seems to be a bug in the espMqtt library, reaching the limit restarts the esp. Also each message allocs ram, but is not set to the outbox, so the queue size display does not count up.

Continue testing, but it always takes more than 2 hours to reach the limit.

That's great that you found the root cause Michael. For how long did you stop the broker before the maxAlloc went too low? We should let Bert (espmqttclient) know.

JokerGermany commented 1 year ago

@MichaelDvP crashed 7 hours ago, will go back testing dev17

proddy commented 1 year ago

@MichaelDvP crashed 7 hours ago, will go back testing dev17

I'm 100% sure dev17 will be the same. it's the same code.

MichaelDvP commented 1 year ago

@JokerGermany

@MichaelDvP crashed 7 hours ago, will go back testing dev17

Just a restart or freeze? As mentioned i see the restart too afer reaching the mem limit. I'll update to stop queuing while disconnected. Also i'll change to not resetting the mqtt-errorcounts on reconnect, so you'll see a lot of errors after a mqtt-disconnect. Please try v3.7.0-dev1a from my repo.

@proddy

For how long did you stop the broker before the maxAlloc went too low

As mentioned ~2h with all mqtt publishes set to 5 sec. Depends on free heap, my boiler does not have so much entites.

We should let Bert (espmqttclient) know.

Yes, if we could specify the error better. Maybe with serial logging in the lib enabled. I'm not sure if this in the lib or in ems-esp when the publish is blocked by the lib. The queue count is an enhancement from me,

JokerGermany commented 1 year ago

@JokerGermany

@MichaelDvP crashed 7 hours ago, will go back testing dev17

Just a restart or freeze? As mentioned i see the restart too afer reaching the mem limit. I'll update to stop queuing while disconnected. Also i'll change to not resetting the mqtt-errorcounts on reconnect, so you'll see a lot of errors after a mqtt-disconnect. Please try v3.7.0-dev1a from my repo.

a freeze, restart would be okay for me. grafik

Will make update to v3.7.0-dev1a when i go in the basement, at the moment it's still freezed because i am too lazy^^

@MichaelDvP crashed 7 hours ago, will go back testing dev17

I'm 100% sure dev17 will be the same. it's the same code.

dev12 runned fine :thinking: It never completly freezed

proddy commented 1 year ago

We should let Bert (espmqttclient) know.

Yes, if we could specify the error better. Maybe with serial logging in the lib enabled. I'm not sure if this in the lib or in ems-esp when the publish is blocked by the lib. The queue count is an enhancement from me,

I think we could easily take that small sample app (https://github.com/proddy/mqtt-test) and adjust it to spam lots of publish messages to the MQTT broker very quickly, with serial showing heap/maxalloc, and then we switch off the broker and watch it burn down. And show it to Bert.

MichaelDvP commented 1 year ago

dev12 runned fine 🤔

Dev12 was before changing the mqtt-lib. We also had freezes with the old lib and set a memory limit to 60k for heap, I've now changed to the same for the new lib (check heap instead of maxAlloc (dev1c). Because of the restarts when reaching the limit we have to be carefull not to get in a bootloop if the HA discovery sends out lot of messages on start.

MichaelDvP commented 1 year ago

I think we could easily take that small sample app (https://github.com/proddy/mqtt-test) and adjust it to spam lots of publish messages to the MQTT broker very quickly, with serial showing heap/maxalloc, and then we switch off the broker and watch it burn down. And show it to Bert.

Tested that with:

    if (currentMillis - lastMillis > 10)
    {
        lastMillis = currentMillis;
        static uint32_t messageNo = 0;
        char payload[120];
        sprintf(payload, "MessageNo: %u qwertzuiopüasdfghjklöyxcvbnmqwertzuioasdfghj", ++messageNo);
        uint16_t id = mqttClient->publish("Testmessage", 1, 0, payload);
        Serial.printf("messageNo: %u, publish: %s, heap: %u, maxAlloc: %u\n", messageNo, id > 0 ? "ok" : "fail", ESP.getFreeHeap(), ESP.getMaxAllocHeap());
    }

Result with set limit to 60k heap, but it stops at 53k, strange.

messageNo: 1384, publish: ok, heap: 54860, maxAlloc: 53236
messageNo: 1385, publish: ok, heap: 54796, maxAlloc: 53236
messageNo: 1386, publish: fail, heap: 54796, maxAlloc: 53236
messageNo: 1387, publish: fail, heap: 54796, maxAlloc: 53236

New test with debug set for the library gives stranger things:

messageNo: 1248, publish: ok, heap: 61576, maxAlloc: 53236
[ 15277][I][Packet.cpp:319] _allocate(): Alloc (l:79)
messageNo: 1249, publish: ok, heap: 61416, maxAlloc: 53236
[ 15289][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1250, publish: ok, heap: 61352, maxAlloc: 53236
[ 15302][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1251, publish: ok, heap: 61288, maxAlloc: 53236
[ 15315][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory

After linit is reached the buffer is not allocated, but publish gives valid message_id back and heap goes down? After a while we get socket error and the message fail (id = 0) back and some other strange thing. After that again alloc fails but heap still shrinks:

[ 15562][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1271, publish: ok, heap: 59944, maxAlloc: 53236
[ 15600][E][WiFiClient.cpp:268] connect(): socket error on fd 48, errno: 104, "Connection reset by peer"
messageNo: 1272, publish: fail, heap: 59904, maxAlloc: 53236
messageNo: 1273, publish: fail, heap: 59904, maxAlloc: 53236
[ 15615][I][MqttClient.cpp:717] _clearQueue(): clearing queue (clear session: 0)
Disconnected from MQTT: 7.
[ 15629][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1274, publish: ok, heap: 61312, maxAlloc: 53236
[ 15642][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1275, publish: ok, heap: 61248, maxAlloc: 53236
[ 15655][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1276, publish: ok, heap: 61184, maxAlloc: 53236```

At last there is no debug message from espMqttClient anymore and message-ID is returned as 0-fail:

messageNo: 1375, publish: ok, heap: 54888, maxAlloc: 53236
[ 16955][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1376, publish: ok, heap: 54824, maxAlloc: 53236
messageNo: 1377, publish: ok, heap: 54760, maxAlloc: 53236
messageNo: 1378, publish: ok, heap: 54696, maxAlloc: 53236
messageNo: 1379, publish: fail, heap: 54696, maxAlloc: 53236
messageNo: 1380, publish: fail, heap: 54696, maxAlloc: 53236

I think i'll open an issue for the espMqttClient.

JokerGermany commented 1 year ago

@JokerGermany

@MichaelDvP crashed 7 hours ago, will go back testing dev17

Just a restart or freeze? As mentioned i see the restart too afer reaching the mem limit. I'll update to stop queuing while disconnected. Also i'll change to not resetting the mqtt-errorcounts on reconnect, so you'll see a lot of errors after a mqtt-disconnect. Please try v3.7.0-dev1a from my repo.

Updated to 1c now

yazend commented 1 year ago

Update... 3.7.0-dev-1d It seems like this version is much more stable, no more queue problems since the update

MichaelDvP commented 1 year ago

Bert has found an error in queueing messages, this allows to use EMC_ALLOW_NOT_CONNECTED_PUBLISH again (was disabled in dev1c). But there is still an issue when reaching the limit defined in espMqttClient, in dev1d i use an own limit in ems-esp (as for asyncMqttClient) and set the client limit much lower, so this is never reeached. This was working for me some hours with 1000 messages queued and ~500 skipped due to low memory without a freeze or restart.

MichaelDvP commented 1 year ago

Lates update is now in the official dev2, please check this version (earlier downloads of dev2 from my repo is maybe a intermediate version).

JokerGermany commented 1 year ago

Shall i update to this version or first look if 1c will crash? Is running since 17 hours now

MichaelDvP commented 1 year ago

Shall i update to this version or first look if 1c will crash? Is running since 17 hours now

Please update to this dev2 version. 1c used the buggy library with a workaround, we will not go back to this stage.