Closed yazend closed 1 year ago
Strange, haven't seen this behaviour or any other reports. Did you build yourself or take the .bin from the GH release page? Also any other info on your setup? If it's memory you can try capturing the data and seeing what happens over time.
Proddy, I'm using the ems-esp with WiFi and LAN, I downloaded the firmware via the UI so I assume the .bin is the same as on GIT. I just reinstalled the 3.0.7.dev0 and set the syslog to all. As soon as the device does not respond again, I will send the logs.
See different on sysinfo "free RAM"
Sysninfo_V3.0.6dev17.JPG --> free RAM 105KB/61KB
Sysninfo_V3.0.7dev0.JPG --> free RAM 107KB/59KB
Now I have the ems-esp freez the situation, if you need further information let me know.
no response anymore...
log.txt log.txt
For information: I had also today 17.08.2023 (~13:45) a freeze, the ems-esp was not reqonding anymore. Have seen it few minutes ago. Had to cut power to get it running again.
I'm on 3.6.0, release installed on 15.08.2023 (19:40) .bin from GH, so was running for ~2days until freeze.
3.5.x: Never had such freeze before the last years with previous versions like 3.5.x 3.6.x: Started with 3.6.x/dev some days ago:
3.7.x Haven't installed 3.7.x/dev so far
Will observe if it will occur again...
Thanks for reporting. I'll compare both the old 3.6.0-dev17 with the latest 3.6.0 to try and see what is different
update: I can't see any noticeable changes in the code between 3.6.0_dev.17
and either 3.7.0-dev-0
or the main 3.6.0
are there any specific actions you are performing when it 'freezes'? Like is the Web UI open and logging everything to the System Log window? Or does it crash when there is no web UI or telnet present?
from my side there was no activity in WebUI or telnet. I only sometimes have a look via HA to it (HA via mqtt from ems-esp). But nothing special was done when the freeze occured. Since yesterday it is running:
Will observe how it behaves the next days and if there will be a freeze again...
Hi, I have the very same issue with 3.6.0.
damn. Is there anything particular to your setups? Like are using the API a lot (e.g. via iobroker), or exporting data from the web, or having the Web UI open looking at the web logs, or logging with syslog. I need to trace down why it's not working for you guys.
In the meantime can you try out these two builds:
@proddy actually, nothing really special. I am just using it in home assistant via mqtt.
I would be really happy to help but I will not have a physical access to the device for few days. If the issue is not found, I will get back to this on Thursday.
Thanks a lot!
In the meantime can you try out these two builds:
Currently after last restart 3.6.0 is running without freeze since 2 days now... I will wait until tomorrow before flashing one of your linked builds to see if there won't be a freeze until tomorrow with 3.6.0... What is changed in the linked builds? (There was already dev17 before)?
Edit 21.08.2023: Today morning 8:30 after 3 days 3.6.0 crashed again.
Edit 23.08.2023 ~11:30: 3.6.0 freeze again after 2days Now 12:45 installed dev18...
Edit 25.08.2023 ~6:50: dev18 freeze after ~2 days, restarted device again with dev18...
Edit 29.08.2023 15:30: Installed 3.7.0-dev.1 because was pointed out below., will check.... (dev18 wasn't freezed since last comment)
Edit 29.08.2023 18:45: revert back to dev18, because have some other trouble with 3.7.0-dev.1, not related to our issue analysing here... (thermostat data and maybe other are not published/send to mqtt, e.g. damped outdoor temp was missing in HA -> no time to investigate currently so going back to dev18 for a running system)
There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some update web libraries, but that shouldn't cause a crash. Which is strange.
There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some update web libraries, but that shouldn't cause a crash. Which is strange.
I update second device cross the street dev-17 is working, dev-18 as well 3.7.0-dev-0 freez System over time. Two different device my wifi and lan, other one at my parents only wifi.
There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some update web libraries, but that shouldn't cause a crash. Which is strange.
I update second device cross the street dev-17 is working, dev-18 as well 3.7.0-dev-0 freez System over time. Two different device my wifi and lan, other one at my parents only wifi.
Ok! Thanks for testing this, it's super helpful. So something sneeked in between dev-17 and dev-18 so I'll compare the source for those two again and try and figure out what is causing this.
Same issue on my device, too. ESP32, wifi. Updated from 3.5.x latest official to 3.6.0 official. I have 8 digital thermometers attached via one wire. I only use the Web GUI, for OTA updates or customizations. Apart from that I use Home Assistant via MQTT.
Same issue after a little bit more than 25 hours, was runnning fine with EMS-ESP-3_6_0-dev_12-ESP32.bin will try the new dev and if this isn't working will go pack to 3_6_0 dev. (It crashed with the dev version seldom too, but it recovered after some time.)
ESP32 with Lan but deactivated (to save memory) and using wifi instead and HA
€dit: This time it recovered before beeing 1 hour not available and then crashed again... €dit2: Can't get it back to life at the moment... It only shows a white screen when trying to reach the webinterface. But it looks like some sort of connections is still working because firefox is trying to connect.
Sorry @JokerGermany to screw up your system. There is something fishy with the 3.6.0 main build but I'm not sure what it is. Could you try one of the dev17 or dev18 builds from the earlier posts?
Sorry @JokerGermany to screw up your system. There is something fishy with the 3.6.0 main build but I'm not sure what it is. Could you try one of the dev17 or dev18 builds from the earlier posts?
will do when i get it back to live. at the moment the web interface isn't reachable at all. Will try to cut off power.
€dit: okay downgraded to EMS-ESP-3_6_0-dev_18-ESP32_4M.zip
€dit2:
Looks like same "white screen interface" with dev 18, but don't know if 3.6.0 saved something which let dev 18 crash, too.
Will now try EMS-ESP-3_6_0-dev_17-ESP32_4M.zip
€dit3: After power off/on dev 18 is working now, will test it before trying dev 17. MQTT looks better now, after the downgrade it was connected but in the queue were 90 and orange....
Proddy, I agree with you, this is a very strange behavior, I changed the mqtt setting, since then it has been running without problems, whether 3.0.6-dev-17 or dev-18 also the 3.0.7-def-0 all run without freeze, since I changed the settings.
mqtt setting with freez User: mqtt Password: mqtt now I changed it according to the homeassistant mqtt information
mqtt settings without freez user: homeassistant Password: hvxrdatj6t3r8oggg3689jhrcgs268kjv (random PW) as side efect Homeassistant Information disapered mqtt sensor name changed to none.
I did it as well at my parents device, both are running since I changed the mqtt, I don't know if this helps to find the different on your code, but someone can use it as an work around.
I have the same issue using HomeAssistant via MQTT on Wifi and Buderus Logamax plus GB192i
Feedback: dev18 is working for me since 1 Day and 18 hours. Had an outage yesterday for a 20 minutes, but i had this with the other dev version before, too. They are acceptable for me as long as they are short as this one and ems-esp is recovering on his own.
@proddy Before leaving for vacation, I reverted to dev18. It has been stable since then (5 days), so I believe the issue has been introduced afterwards.
@proddy , for me dev18 also freezed after ~2days (https://github.com/emsesp/EMS-ESP32/issues/1264#issuecomment-1685074139)
Proddy, just for your information... since I changed the mqtt settings, 3.7.0dev0 is running 7 days and 17 hours
Feedback: dev18 is working for me since 1 Day and 18 hours. Had an outage yesterday for a 20 minutes, but i had this with the other dev version before, too. They are acceptable for me as long as they are short as this one and ems-esp is recovering on his own.
crashed 6 hours ago now, restarted it, to look if it was a fluke
Proddy, just for your information... since I changed the mqtt settings
What do you exactly changed?
I changed the mqtt username and password, see six post above
I think the freezes are memory related. If mqtt disconnects the queue is filled consuming memory until a minimum is reached,
Maybe better to change both values
Same Problem here with EMS-ESP Version 3.7.0-dev.0
Feedback: dev18 is working for me since 1 Day and 18 hours. Had an outage yesterday for a 20 minutes, but i had this with the other dev version before, too. They are acceptable for me as long as they are short as this one and ems-esp is recovering on his own.
crashed 6 hours ago now, restarted it, to look if it was a fluke
not responsive again since more than 24 hours, will try dev 17
I changed the mqtt username and password, see six post above
okay I have already other credentials
I think the freezes are memory related. If mqtt disconnects the queue is filled consuming memory until a minimum is reached,
- this https://github.com/emsesp/EMS-ESP32/blob/86e29515e789b26384eed4dae210b65d50f6adc4/lib/espMqttClient/src/Config.h#L32 is to small, i think in v3.5 we had ~40k as minimum free before deleting queued messages.
- We queue messages while mqtt is disconnected https://github.com/emsesp/EMS-ESP32/blob/86e29515e789b26384eed4dae210b65d50f6adc4/lib/espMqttClient/src/Config.h#L40 in v3.5. publish/queue was stopped on disconnect.
Maybe better to change both values
I think we must. Shall I make the change?
I've set min_free_mem to 40k in my dev. Maybe one of the guys reporting this issues can test if it works. @JokerGermany you seems to have this issue often, could you check? https://github.com/MichaelDvP/EMS-ESP32/releases/tag/latest
My esp32 is now 16 days online without any error, i can not reproduce this issue.
BTW: we had 60k heap check in v3.5, but espMqttClient checks maxAlloc and PSRAM, so it's a bit different, let's try with 40k.
MichaelDeV, I will upgrade the prerelease 3.7.0dev-1, Let you guys know if something happens.
I updated too. I will report back in few days. Nevertheless, dev18 was really working for me without a glitch for more than a week.
I've set min_free_mem to 40k in my dev. Maybe one of the guys reporting this issues can test if it works. @JokerGermany you seems to have this issue often, could you check? https://github.com/MichaelDvP/EMS-ESP32/releases/tag/latest
updated to v3.7.0-dev.1
I updated too. I will report back in few days. Nevertheless, dev18 was really working for me without a glitch for more than a week.
I went through line by line comparing dev18 with the main 3.6.0 release and there's nothing in the code that would suggest a memory issue. I think it was just by chance that your MQTT broker was down for a short while (perhaps < 1 min) and the EMS-ESP queue quickly filled up.
I could reproduce the freeze/crash by stopping the mqtt broker. Heap goes down and somewhere below 40k maxAlloc the esp drops the wifi connection is is not responsable. With setting 40k minMem there seems to be a bug in the espMqtt library, reaching the limit restarts the esp. Also each message allocs ram, but is not set to the outbox, so the queue size display does not count up.
Continue testing, but it always takes more than 2 hours to reach the limit.
I could reproduce the freeze/crash by stopping the mqtt broker. Heap goes down and somewhere below 40k maxAlloc the esp drops the wifi connection is is not responsable. With setting 40k minMem there seems to be a bug in the espMqtt library, reaching the limit restarts the esp. Also each message allocs ram, but is not set to the outbox, so the queue size display does not count up.
Continue testing, but it always takes more than 2 hours to reach the limit.
That's great that you found the root cause Michael. For how long did you stop the broker before the maxAlloc went too low? We should let Bert (espmqttclient) know.
@MichaelDvP crashed 7 hours ago, will go back testing dev17
@MichaelDvP crashed 7 hours ago, will go back testing dev17
I'm 100% sure dev17 will be the same. it's the same code.
@JokerGermany
@MichaelDvP crashed 7 hours ago, will go back testing dev17
Just a restart or freeze? As mentioned i see the restart too afer reaching the mem limit. I'll update to stop queuing while disconnected. Also i'll change to not resetting the mqtt-errorcounts on reconnect, so you'll see a lot of errors after a mqtt-disconnect. Please try v3.7.0-dev1a
from my repo.
@proddy
For how long did you stop the broker before the maxAlloc went too low
As mentioned ~2h with all mqtt publishes set to 5 sec. Depends on free heap, my boiler does not have so much entites.
We should let Bert (espmqttclient) know.
Yes, if we could specify the error better. Maybe with serial logging in the lib enabled. I'm not sure if this in the lib or in ems-esp when the publish is blocked by the lib. The queue count is an enhancement from me,
@JokerGermany
@MichaelDvP crashed 7 hours ago, will go back testing dev17
Just a restart or freeze? As mentioned i see the restart too afer reaching the mem limit. I'll update to stop queuing while disconnected. Also i'll change to not resetting the mqtt-errorcounts on reconnect, so you'll see a lot of errors after a mqtt-disconnect. Please try
v3.7.0-dev1a
from my repo.
a freeze, restart would be okay for me.
Will make update to v3.7.0-dev1a when i go in the basement, at the moment it's still freezed because i am too lazy^^
@MichaelDvP crashed 7 hours ago, will go back testing dev17
I'm 100% sure dev17 will be the same. it's the same code.
dev12 runned fine :thinking: It never completly freezed
We should let Bert (espmqttclient) know.
Yes, if we could specify the error better. Maybe with serial logging in the lib enabled. I'm not sure if this in the lib or in ems-esp when the publish is blocked by the lib. The queue count is an enhancement from me,
I think we could easily take that small sample app (https://github.com/proddy/mqtt-test) and adjust it to spam lots of publish messages to the MQTT broker very quickly, with serial showing heap/maxalloc, and then we switch off the broker and watch it burn down. And show it to Bert.
dev12 runned fine 🤔
Dev12 was before changing the mqtt-lib. We also had freezes with the old lib and set a memory limit to 60k for heap, I've now changed to the same for the new lib (check heap instead of maxAlloc (dev1c). Because of the restarts when reaching the limit we have to be carefull not to get in a bootloop if the HA discovery sends out lot of messages on start.
I think we could easily take that small sample app (https://github.com/proddy/mqtt-test) and adjust it to spam lots of publish messages to the MQTT broker very quickly, with serial showing heap/maxalloc, and then we switch off the broker and watch it burn down. And show it to Bert.
Tested that with:
if (currentMillis - lastMillis > 10)
{
lastMillis = currentMillis;
static uint32_t messageNo = 0;
char payload[120];
sprintf(payload, "MessageNo: %u qwertzuiopüasdfghjklöyxcvbnmqwertzuioasdfghj", ++messageNo);
uint16_t id = mqttClient->publish("Testmessage", 1, 0, payload);
Serial.printf("messageNo: %u, publish: %s, heap: %u, maxAlloc: %u\n", messageNo, id > 0 ? "ok" : "fail", ESP.getFreeHeap(), ESP.getMaxAllocHeap());
}
Result with set limit to 60k heap, but it stops at 53k, strange.
messageNo: 1384, publish: ok, heap: 54860, maxAlloc: 53236
messageNo: 1385, publish: ok, heap: 54796, maxAlloc: 53236
messageNo: 1386, publish: fail, heap: 54796, maxAlloc: 53236
messageNo: 1387, publish: fail, heap: 54796, maxAlloc: 53236
New test with debug set for the library gives stranger things:
messageNo: 1248, publish: ok, heap: 61576, maxAlloc: 53236
[ 15277][I][Packet.cpp:319] _allocate(): Alloc (l:79)
messageNo: 1249, publish: ok, heap: 61416, maxAlloc: 53236
[ 15289][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1250, publish: ok, heap: 61352, maxAlloc: 53236
[ 15302][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1251, publish: ok, heap: 61288, maxAlloc: 53236
[ 15315][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
After linit is reached the buffer is not allocated, but publish gives valid message_id back and heap goes down? After a while we get socket error and the message fail (id = 0) back and some other strange thing. After that again alloc fails but heap still shrinks:
[ 15562][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1271, publish: ok, heap: 59944, maxAlloc: 53236
[ 15600][E][WiFiClient.cpp:268] connect(): socket error on fd 48, errno: 104, "Connection reset by peer"
messageNo: 1272, publish: fail, heap: 59904, maxAlloc: 53236
messageNo: 1273, publish: fail, heap: 59904, maxAlloc: 53236
[ 15615][I][MqttClient.cpp:717] _clearQueue(): clearing queue (clear session: 0)
Disconnected from MQTT: 7.
[ 15629][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1274, publish: ok, heap: 61312, maxAlloc: 53236
[ 15642][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1275, publish: ok, heap: 61248, maxAlloc: 53236
[ 15655][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1276, publish: ok, heap: 61184, maxAlloc: 53236```
At last there is no debug message from espMqttClient anymore and message-ID is returned as 0-fail:
messageNo: 1375, publish: ok, heap: 54888, maxAlloc: 53236
[ 16955][W][Packet.cpp:305] _allocate(): Packet buffer not allocated: low memory
messageNo: 1376, publish: ok, heap: 54824, maxAlloc: 53236
messageNo: 1377, publish: ok, heap: 54760, maxAlloc: 53236
messageNo: 1378, publish: ok, heap: 54696, maxAlloc: 53236
messageNo: 1379, publish: fail, heap: 54696, maxAlloc: 53236
messageNo: 1380, publish: fail, heap: 54696, maxAlloc: 53236
I think i'll open an issue for the espMqttClient.
@JokerGermany
@MichaelDvP crashed 7 hours ago, will go back testing dev17
Just a restart or freeze? As mentioned i see the restart too afer reaching the mem limit. I'll update to stop queuing while disconnected. Also i'll change to not resetting the mqtt-errorcounts on reconnect, so you'll see a lot of errors after a mqtt-disconnect. Please try
v3.7.0-dev1a
from my repo.
Updated to 1c now
Update... 3.7.0-dev-1d It seems like this version is much more stable, no more queue problems since the update
Bert has found an error in queueing messages, this allows to use EMC_ALLOW_NOT_CONNECTED_PUBLISH again (was disabled in dev1c). But there is still an issue when reaching the limit defined in espMqttClient, in dev1d i use an own limit in ems-esp (as for asyncMqttClient) and set the client limit much lower, so this is never reeached. This was working for me some hours with 1000 messages queued and ~500 skipped due to low memory without a freeze or restart.
Lates update is now in the official dev2, please check this version (earlier downloads of dev2 from my repo is maybe a intermediate version).
Shall i update to this version or first look if 1c will crash? Is running since 17 hours now
Shall i update to this version or first look if 1c will crash? Is running since 17 hours now
Please update to this dev2 version. 1c used the buggy library with a workaround, we will not go back to this stage.
I have updated to dev.3.7.0 as a test, after a certain time n 1-3 hours the ems-esp is no longer available, no response, no putty, no ping the whole thing can be reproduced, if I switch back to dev 3.6.17, this behavior is gone . Is this problem known
BR Alex