Closed proddy closed 3 years ago
In 1.9.5 its not there. here it started somewhere around V2b9. I have about 2 to 5% incomplete telegrams. As it does not happen frequently, it maybe a bit hard to catch with a scope.
@bbqkees I only tested from b9 onwards. I'll try b8
Tested the latest travis built 8266 without any error for 4 h.
Full version: SDK:2.2.2-dev(38a443e)/Core:2.7.3-3-g2843a5ac=20703003/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-30-g92add50/BearSSL:5c771be
is exactly what i get if compile here.
Tx mode: 4
Bus protocol: Buderus
#telegrams received: 10167
#read requests sent: 2007
#write requests sent: 0
#incomplete telegrams: 0 (0%)
#tx fails (after 3 retries): 0
before i was running v2_cmd in tx_mode 1 compiled here over 12h without errors.
ok, thanks for testing. Same error in b8. In all earlier versions the error was suppressed which is why we never saw it before. So I guess it's always been like that. Same on the ESP32 where the first byte of each telegram would be a 0xFE. Need to get the logic analyzer out and leave it running for 1-2hrs to capture the log
with ESP32 b11 and online for around one week:
With B13 running for 19 hours i have:
I also see a massive increase in TX telegrams, before it was only minimal.
I have tested v2b12 and v2b13 with esp32 for longer times with logging.
To check if we lose telegrams without rx-queue, i added a error conter in rxservice::add
before rx_telegram
is set.
if (rx_telegram) {
increment_telegram_error_count();
}
It was 4 errors in 40000 telegrams, not much, but i think its better to use a small queue again. I added a 3x queue for the next tests.
With the esp32 i see some crc-errors, around 1 per 10000 telegrams, but not as @proddy shortend telegrams mainly as reply, i see other errors mainly in broadcast telegrams:
FE
in front of normal telegramFE
between <src>
and <dst>
i don't see this with esp8266 before.
I get a mqtt disconnect every 12032 seconds. The broker logs: Client [ems-esp] connection closed: Error: read ECONNRESET
I havn't recognized this before, I'll test now with esp8266 to see if it's also there.
This is the log: esp32.log
Before I removed the Rx queue and ran some tests to see the max size of the queue. It never got above 1 which was I removed it. But this was only on the ESP8266. I guess the ESP32 can process faster. I'll add back the queue, no problem.
I'm seeing the same behavior on the ESP32. Garbage telegrams after startup and FE is front of normal telegrams. I don't remember seeing this before and not sure where it is coming from.
That MQTT disconnect isn't good. I'll run some tests here and check the logs to see if I see the same.
Seems garbage-telegrams on startup and mqtt-disconnets are ESP32 specific. Im running now esp8266 for 3h40min without disconnect and no rx-errors until now.
@MichaelDvP
Seems garbage-telegrams on startup and mqtt-disconnets are ESP32 specific.
try going back to use the stable version of the core. In platformio.ini
use platform = espressif32
as opposed to platform = https://github.com/platformio/platform-espressif32.git
.
Tried, but same result, garbage on startup and mqtt reconnect after 3h20 :
Aug 28 12:03:07 ems - 000+00:00:00.000 I 0: [syslog] Log level set to INFO
Aug 28 12:03:07 ems - 000+00:00:00.000 I 1: [emsesp] EMS Device library loaded with 70 records
Aug 28 12:03:07 ems - 000+00:00:00.450 E 2: [telegram] Rx: 10 FC 80 FE FE 80 FE (CRC FE != A7)
Aug 28 12:03:07 ems - 000+00:00:00.503 E 3: [telegram] Rx: 88 11 35 07 00 FC (CRC FC != 1A)
Aug 28 12:03:07 ems - 000+00:00:00.513 I 4: [mqtt] MQTT connected
Aug 28 12:03:07 ems - 000+00:00:00.563 E 5: [telegram] Rx: F8 08 10 11 35 01 02 18 2B 00 01 4B F2 (CRC F2 != 59)
Aug 28 12:03:07 ems - 000+00:00:00.639 E 6: [telegram] Rx: FF 10 08 1A 00 00 00 00 00 91 (CRC 91 != AE)
Aug 28 12:30:57 ems - 000+00:27:50.907 I 7: [boiler] Setting boiler warm water circulation on
Aug 28 12:34:07 ems - 000+00:31:00.902 I 8: [boiler] Setting boiler warm water circulation on
Aug 28 14:38:07 ems - 000+02:35:00.890 E 9: [telegram] Rx: 6F 18 08 00 34 00 30 02 8E 7D 00 21 00 00 03 00 01 19 45 00 18 B8 65 (CRC 65 != 96)
Aug 28 15:09:36 ems - 000+03:06:29.552 I 10: [boiler] Setting boiler warm water circulation on
Aug 28 15:12:41 ems - 000+03:09:34.589 I 11: [boiler] Setting boiler warm water circulation on
Aug 28 15:23:39 ems - 000+03:20:32.465 I 12: [mqtt] MQTT disconnected: TCP
Aug 28 15:23:44 ems - 000+03:20:37.482 I 13: [mqtt] MQTT connected
With the 8266 before i don't get reconnects and had only one rx-error after 20h.
Aug 26 11:02:09 ems - 000+00:00:00.000 I 0: [syslog] Log level set to INFO
Aug 26 11:02:09 ems - 000+00:00:00.000 I 1: [emsesp] EMS Device library loaded with 70 records
Aug 26 11:02:09 ems - 000+00:00:16.988 I 2: [mqtt] MQTT connected
Aug 26 11:03:42 ems - 000+00:01:51.059 I 3: [boiler] Setting boiler warm water circulation on
Aug 26 11:06:52 ems - 000+00:05:00.979 I 4: [boiler] Setting boiler warm water circulation on
Aug 26 11:49:41 ems - 000+00:47:49.925 I 5: [boiler] Setting boiler warm water circulation on
Aug 26 11:52:56 ems - 000+00:51:04.939 I 6: [boiler] Setting boiler warm water circulation on
Aug 26 14:33:30 ems - 000+03:31:39.082 I 7: [telnet] New connection from [192.168.0.26]:62778 accepted
Aug 26 14:33:30 ems - 000+03:31:39.082 I 8: [shell] Allocated console pty0 for connection from [192.168.0.26]:62778
Aug 26 14:33:30 ems - 000+03:31:39.082 I 9: [shell] User session opened on console pty0
Aug 26 14:33:34 ems - 000+03:31:42.556 I 10: [shell] User session closed on console pty0
Aug 26 14:33:34 ems - 000+03:31:42.563 I 11: [telnet] Connection from [192.168.0.26]:62778 closed
Aug 26 14:33:34 ems - 000+03:31:42.563 I 12: [shell] Shutdown console pty0 for connection from [192.168.0.26]:62778
Aug 26 14:49:39 ems - 000+03:47:47.436 I 13: [boiler] Setting boiler warm water circulation on
Aug 26 14:52:54 ems - 000+03:51:02.411 I 14: [boiler] Setting boiler warm water circulation on
Aug 26 15:10:17 ems - 000+04:08:25.985 I 15: [telnet] New connection from [192.168.0.26]:62974 accepted
Aug 26 15:10:17 ems - 000+04:08:25.985 I 16: [shell] Allocated console pty0 for connection from [192.168.0.26]:62974
Aug 26 15:10:17 ems - 000+04:08:25.985 I 17: [shell] User session opened on console pty0
Aug 26 15:19:34 ems - 000+04:17:42.499 I 18: [shell] User session closed on console pty0
Aug 26 15:19:34 ems - 000+04:17:42.505 I 19: [telnet] Connection from [192.168.0.26]:62974 closed
Aug 26 15:19:34 ems - 000+04:17:42.505 I 20: [shell] Shutdown console pty0 for connection from [192.168.0.26]:62974
Aug 26 15:41:50 ems - 000+04:39:58.788 I 21: [boiler] Setting boiler warm water circulation on
Aug 26 15:44:55 ems - 000+04:43:03.774 I 22: [boiler] Setting boiler warm water circulation on
Aug 27 06:58:44 ems - 000+19:56:50.455 E 23: [telegram] Rx: 21 00 18 00 00 18 1B 00 00 00 00 00 00 00 0B 00 00 FF (CRC FF != D5)
Regarding the mqtt-disconnects i've found this issue of AsyncMqttClient
and starting now with a test with the lib https://github.com/dx168b/async-mqtt-client.git
No change, also with ths lib, Mqtt reconnects every 3h20m.
Aug 28 22:09:48 ems - 000+00:00:00.000 I 0: [syslog] Log level set to INFO
Aug 28 22:09:48 ems - 000+00:00:00.000 I 1: [emsesp] EMS Device library loaded with 70 records
Aug 28 22:09:48 ems - 000+00:00:00.000 E 2: [telegram] Rx: 7F 2D 1E 85 AF 7F 14 90 (CRC 90 != C2)
Aug 28 22:09:48 ems - 000+00:00:00.491 E 3: [telegram] Rx: C0 80 F0 2E (CRC 2E != C2)
Aug 28 22:09:48 ems - 000+00:00:00.518 I 4: [mqtt] MQTT connected
Aug 28 22:11:47 ems - 000+00:02:00.140 E 5: [telegram] Rx: 19 00 AF 00 00 E4 00 C9 (CRC C9 != CD)
Aug 29 01:30:20 ems - 000+03:20:32.518 I 6: [mqtt] MQTT disconnected: TCP
Aug 29 01:30:25 ems - 000+03:20:37.534 I 7: [mqtt] MQTT connected
Aug 29 02:41:58 ems - 000+04:32:11.158 E 8: [telegram] Rx: 10 FC 00 A2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 9B (CRC 9B != 64)
Aug 29 04:50:52 ems - 000+06:41:04.896 I 9: [mqtt] MQTT disconnected: TCP
Aug 29 04:50:57 ems - 000+06:41:09.911 I 10: [mqtt] MQTT connected
Aug 29 06:13:58 ems - 000+08:04:11.160 E 11: [telegram] Rx: FC 08 00 14 00 3E 5B 1A 3F (CRC 3F != 35)
Aug 29 06:59:49 ems - 000+08:50:02.161 E 12: [telegram] Rx: FC FC FC 89 (CRC 89 != C6)
I looked at dx168b's fork a while back. Another one worth trying is https://github.com/bertmelis/async-mqtt-client although I neither really address the disconnects. And another is https://github.com/kleini/async-mqtt-client/commit/f1b42054815ad82fed4519bce7febb7f1601560f.
I have been also actively following https://github.com/philbowles/PangolinMQTT which attempts to fix a lot of the short-comings of Marvin's original library and perhaps worth a try but would require a re-write in some areas. I lost interest in Pangolin because or the maintainer's arrogance, poor code quality and his library is not open to feedback or contributions unless you join one of his facebook clubs.
If MQTT is disconnecting because of TCP errors it may be related to the AsyncTCP library and not MQTT. Do you get the same errors when NTP is disabled (NTP=0 in build). It could be the UDP interrupting transmission. https://github.com/me-no-dev/AsyncTCP/issues/94
Disable NTP (FT_NTP=0) does not change the mqtt disconnects. Since the period is 3h20, i.e. 200 minutes it can be 200 hearbeats (every 60 sec) or 400 publishes (every 30 sec). I check now with hearbeat disabled and 10 sec publish time if it influences the reconnects.
Should i make a new issue for the esp32-mqtt-disconnects? I have no idea what causes it. What i know about this issue:
since mqtt reconnects immediately it does not harm much.
I'm still getting 10% Rx Errors (BBQKees) so going to reserve some time to look again into the Rx logic. And perhaps create a new TxMode based on the 1.9 way of working without interrupts.
Hm, i'm getting zero rx and tx errors with 8266 and only 0.01% rx errors on esp32. It's difficult for me to reproduce and understand what is different. I only looked at the 1.9.5 logic, but this was also interrupt based, creates a buffer-full-irq on every byte, reads this to buffer, but completes the telegram also on brk-irq. The difference is only the uses of the hardware-fifo and less interrupt-calls. I have no idea what can cause this rx-errors, except a cold soldering on the boards rx circuit.
I'll start today by looking at previous UART Rx logic until I get one the works then figure out the difference. I've tried 10 versions up until July 27th and they all produce the same results. This is using the same wiring & same circuit as I used in 1.9 so that's at least consistent. The telegrams are cut short, probably because of a false break detection. Interestingly the results are more stable when using USB powered instead of Bus-powered.
USB powered, 25268 and 10 fails in a 12 hour period BUS powered, 25510 and 52 fails in a 12 hour period
It's only 0.2% so no huge deal.
If it is always the missing break. it can be a timing issue: The break interrupt is triggered, but the 0x00 is not actual in the fifo. The next received telegram, which is normaly a poll, starts than with 0 and is ignored.
We can check the telegram for the ending 0x00, than it's length-1
or missing 0, than send full length
. And also skipping a leading 0x00 in all telegrams. Worth a try?
that may work. first I need to do some debugging and see what is happening. didn't have any time today
I saw the rx break change. I'm out for the next few days but will check this weekend when I'm back.
I've updated the uart branch to this check (and lift the branch to b4). Please check if it helps. If it is a timing issue in esp-hardware, than it's maybe better to use the fifo-full-irq like in 1.9.5, but not enable the break-irq , only poll for break after fifo is filled. This produces a lot of interrupt calls (factor 2 for poll, factor 32 for UBAMonitorFast), but in clear structure, whereas this is a workaround. That is what i mean: emsuart_esp8266.zip
for info, today I updated from 1.9.5 to 2.1.0b7 (https://github.com/proddy/EMS-ESP/tree/firmware/firmware) and was wondering about the tx-/rxfails and found this issue-report here. Board: Wemos D1 Mini / esp8266
@giovanne123: What tx_mode do you use? Have you tried other modes? (You can change from web-settings, save, no reboot required). A small number of rx_fails is common and were not measured in v1.9.5, butnumber of tx_fails is much to high.
was using mode 1 before. now switched to 2 (with restart), but already errors came up after few seconds :-(
What can I do?
Edit: switched back to mode 1, because with mode 2 I have more errors like before. So mode 1 seems better, but nevertheless much tx errors :-(
As far as i see in #554 you get the data. Are all your devices detected? Maybe a device is not correct recognized and ems-esp sends conmmades that are rejected by the device, this will also giv tx-errors.
Can you go to terminal and look with log debug
what causes the tx errors.
devices:
ems-esp:/$ show devices
These EMS devices are currently active:
Boiler: BK13/BK15/Smartline/GB1x2 (DeviceID:0x08, ProductID:xx, Version:02.07)
This Boiler will respond to telegram type IDs: 0x10 0x11 0x14 0x15 0x16 0x18 0x19 0x1A 0x1C 0x2A 0x33 0x34 0x35 0xD1 0xE3 0xE4 0xE5 0xE6 0xE9 0xEA
Thermostat: RC30 (DeviceID:0x10, ProductID:xx, Version:02.08) ** master device **
This Thermostat will respond to telegram type IDs: 0xA3 0x06 0xA2 0x3E 0x3D 0x48 0x47 0x52 0x51 0x5C 0x5B 0xA5 0x37
Mixing Module: MM10 (DeviceID:0x21, ProductID:xx, Version:02.00)
This Mixing Module will respond to telegram type IDs: 0xAA 0xAB 0xAC
Switching Module: WM10 (DeviceID:0x11, ProductID:xx, Version:02.00)
Controller: BC10/RFM20 (DeviceID:0x09, ProductID:xx, Version:02.00)
two tx errors in log debug:
000+06:22:44.615 D 37: [emsesp] Received UBAMonitorFast
000+06:22:44.956 D 38: [telegram] Sending read Tx [#71], telegram: 0B 8D 02 00 20 94
000+06:22:45.206 E 39: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request.
000+06:22:47.630 D 40: [emsesp] Received UBAMonitorFast
000+06:22:47.938 D 41: [emsesp] Received RCTime
...
000+06:23:44.160 D 173: [telegram] Sending read Tx [#87], telegram: 0B 8D 02 00 20 94
000+06:23:44.782 D 174: [telegram] Sending read Tx [#88], telegram: 0B 8D 02 00 20 94
000+06:23:45.032 E 175: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request.
000+06:23:46.459 D 176: [emsesp] Received UBAMonitorFast
000+06:23:47.455 D 177: [emsesp] Received UBAMonitorFast
...
Ok, you have a modem device at 0x0D and ems-esp requests version information of this device every minute, but the device does not answer. You can ignore these errors. The bus communication is fine.
You are right, I have an Easycom in the ems-bus. Where have you seen that I have the modem (older threads?)?
Ist there a possibility to prevent the calls and therefore the errors? Looks not so fine to have errors ;-) Maybe new feature to configure devices not to include?
Where have you seen that I have the modem (older threads?)?
0B 8D 02 00 20 94
is a version request from ems-esp to a modem-device.
Is this an original easycom from a Bosch-brand or a 3.party device like hometop?
Original from Buderus with serial connection to PC running Eco-Soft Software. Was used in the past to monitor the heating. I was listening in parallel on the serial interface for analysing data from/to Easycom(Eco-Soft) and writing to mysql db. But I have switched to EMS-ESP to not have the PC running 24/7. Now Easycom/Ecosoft is only used sometimes.
Ok, original Buderus does not reply to version request. Than it makes sense to skip the version request for this device.
@giovanne123: I've made the change, update to get rid of these tx-errors.
looks much better:
tx quality was only 85% before
FYI, not sure If you have changed something from b7 to 2.1.0b10 but after the update looks much more better for me. After 3h still zero errors :-) Edit: 30h still zero errors. Or aren't they counted anymore and there is a new bug 😁 Edit2: 75h still zero errors.
On the ESP8266 and ESP32 I'm seeing the occasional Rx error with the latest v2.
In the logs I get
So I went back to earlier v2 releases, to b9 from July and the same error is there too. Why it wasn't noticed before baffles me.
what I've tried, but didn't help:
Some users are not seeing this behavior. Like Michael and some HT3 users.
what I need to do is use a scope and see why the BREAK is being detected too early on.