(v2) occasional Rx incomplete telegrams

proddy commented 4 years ago

On the ESP8266 and ESP32 I'm seeing the occasional Rx error with the latest v2.

Capture

In the logs I get

Aug 18 00:58:55 ems-esp - 000+04:57:02.114 E 11: [telegram] Rx: 17 0B A8 00 01 00 FF FF 8F (CRC 8F != 75)
Aug 18 01:21:55 ems-esp - 000+05:20:02.241 E 12: [telegram] Rx: 17 0B A8 00 01 00 FF F6 01 06 00 01 0D 01 00 FF FF 01 02 02 03 97 (CRC 97 != DD)
Aug 18 01:41:14 ems-esp - 000+05:39:21.362 E 13: [telegram] Rx: 17 08 1F 89 (CRC 89 != 53)
Aug 18 01:42:55 ems-esp - 000+05:41:02.366 E 14: [telegram] Rx: 17 0B A8 00 01 00 FF F6 01 06 00 01 0D 01 00 FF FF 01 02 02 02 00 00 05 1F 05 1F 02 0E 00 FF 89 (CRC 89 != DF)
Aug 18 02:50:55 ems-esp - 000+06:49:02.351 E 15: [telegram] Rx: 17 0B A8 00 01 00 FF F6 01 06 00 01 0D 01 00 FF FF 01 02 02 02 00 00 05 1F 05 1F 02 0E 00 FF 89 (CRC 89 != DF)
Aug 18 03:12:56 ems-esp - 000+07:11:02.754 E 16: [telegram] Rx: 17 0B A8 00 01 00 08 00 2A 00 00 00 00 00 00 00 00 01 06 00 00 80 00 00 80 00 80 00 80 00 00 B6 (CRC B6 != ED)
Aug 18 03:38:55 ems-esp - 000+07:37:02.253 E 17: [telegram] Rx: 17 0B A8 00 01 00 FF F6 01 06 00 01 0D 01 00 FF FF 01 02 02 02 00 00 05 1F 05 1F 02 0E 00 FF 89 (CRC 89 != DF)
Aug 18 03:54:37 ems-esp - 000+07:52:43.987 E 18: [telegram] Rx: FE 08 00 1C 00 00 2F 24 AC 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 38 (CRC 38 != 05)
Aug 18 04:11:55 ems-esp - 000+08:10:02.375 E 19: [telegram] Rx: 17 0B A8 00 01 00 FF F6 01 07 89 (CRC 89 != EC)
Aug 18 05:23:55 ems-esp - 000+09:22:02.050 E 20: [telegram] Rx: 17 0B 91 00 80 CA 89 (CRC 89 != 5D)

So I went back to earlier v2 releases, to b9 from July and the same error is there too. Why it wasn't noticed before baffles me.

what I've tried, but didn't help:

running the ESP8266 at 80Hz or 160Hz
turning off Tx (tx_mode 0) and physically disconnecting the Tx line into the ESP
rolling back espressif core to 2.6.1 and 2.6.0
using older versions of the UART library (from b9)
bus-powered vs usb-powered, same difference

Some users are not seeing this behavior. Like Michael and some HT3 users.

what I need to do is use a scope and see why the BREAK is being detected too early on.

bbqkees commented 4 years ago

In 1.9.5 its not there. here it started somewhere around V2b9. I have about 2 to 5% incomplete telegrams. As it does not happen frequently, it maybe a bit hard to catch with a scope.

proddy commented 4 years ago

@bbqkees I only tested from b9 onwards. I'll try b8

MichaelDvP commented 4 years ago

Tested the latest travis built 8266 without any error for 4 h. Full version: SDK:2.2.2-dev(38a443e)/Core:2.7.3-3-g2843a5ac=20703003/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-30-g92add50/BearSSL:5c771be is exactly what i get if compile here.

  Tx mode: 4
  Bus protocol: Buderus
  #telegrams received: 10167
  #read requests sent: 2007
  #write requests sent: 0
  #incomplete telegrams: 0 (0%)
  #tx fails (after 3 retries): 0

before i was running v2_cmd in tx_mode 1 compiled here over 12h without errors.

proddy commented 4 years ago

ok, thanks for testing. Same error in b8. In all earlier versions the error was suppressed which is why we never saw it before. So I guess it's always been like that. Same on the ESP32 where the first byte of each telegram would be a 0xFE. Need to get the logic analyzer out and leave it running for 1-2hrs to capture the log

SpaceTeddy commented 4 years ago

with ESP32 b11 and online for around one week:

bbqkees commented 4 years ago

With B13 running for 19 hours i have:

I also see a massive increase in TX telegrams, before it was only minimal.

MichaelDvP commented 4 years ago

I have tested v2b12 and v2b13 with esp32 for longer times with logging. To check if we lose telegrams without rx-queue, i added a error conter in rxservice::add before rx_telegram is set.

if (rx_telegram) {
     increment_telegram_error_count();
 }

It was 4 errors in 40000 telegrams, not much, but i think its better to use a small queue again. I added a 3x queue for the next tests.

With the esp32 i see some crc-errors, around 1 per 10000 telegrams, but not as @proddy shortend telegrams mainly as reply, i see other errors mainly in broadcast telegrams:

complete garbage after startup (one or two telegrams)
a FE in front of normal telegram
a FE between <src> and <dst>

i don't see this with esp8266 before.

I get a mqtt disconnect every 12032 seconds. The broker logs: Client [ems-esp] connection closed: Error: read ECONNRESET I havn't recognized this before, I'll test now with esp8266 to see if it's also there.

This is the log: esp32.log

proddy commented 4 years ago

Before I removed the Rx queue and ran some tests to see the max size of the queue. It never got above 1 which was I removed it. But this was only on the ESP8266. I guess the ESP32 can process faster. I'll add back the queue, no problem.

I'm seeing the same behavior on the ESP32. Garbage telegrams after startup and FE is front of normal telegrams. I don't remember seeing this before and not sure where it is coming from.

That MQTT disconnect isn't good. I'll run some tests here and check the logs to see if I see the same.

MichaelDvP commented 4 years ago

Seems garbage-telegrams on startup and mqtt-disconnets are ESP32 specific. Im running now esp8266 for 3h40min without disconnect and no rx-errors until now.

proddy commented 4 years ago

@MichaelDvP

Seems garbage-telegrams on startup and mqtt-disconnets are ESP32 specific.

try going back to use the stable version of the core. In platformio.ini use platform = espressif32 as opposed to platform = https://github.com/platformio/platform-espressif32.git.

MichaelDvP commented 4 years ago

Tried, but same result, garbage on startup and mqtt reconnect after 3h20 :

Aug 28 12:03:07 ems - 000+00:00:00.000 I 0: [syslog] Log level set to INFO
Aug 28 12:03:07 ems - 000+00:00:00.000 I 1: [emsesp] EMS Device library loaded with 70 records
Aug 28 12:03:07 ems - 000+00:00:00.450 E 2: [telegram] Rx: 10 FC 80 FE FE 80 FE (CRC FE != A7)
Aug 28 12:03:07 ems - 000+00:00:00.503 E 3: [telegram] Rx: 88 11 35 07 00 FC (CRC FC != 1A)
Aug 28 12:03:07 ems - 000+00:00:00.513 I 4: [mqtt] MQTT connected
Aug 28 12:03:07 ems - 000+00:00:00.563 E 5: [telegram] Rx: F8 08 10 11 35 01 02 18 2B 00 01 4B F2 (CRC F2 != 59)
Aug 28 12:03:07 ems - 000+00:00:00.639 E 6: [telegram] Rx: FF 10 08 1A 00 00 00 00 00 91 (CRC 91 != AE)
Aug 28 12:30:57 ems - 000+00:27:50.907 I 7: [boiler] Setting boiler warm water circulation on
Aug 28 12:34:07 ems - 000+00:31:00.902 I 8: [boiler] Setting boiler warm water circulation on
Aug 28 14:38:07 ems - 000+02:35:00.890 E 9: [telegram] Rx: 6F 18 08 00 34 00 30 02 8E 7D 00 21 00 00 03 00 01 19 45 00 18 B8 65 (CRC 65 != 96)
Aug 28 15:09:36 ems - 000+03:06:29.552 I 10: [boiler] Setting boiler warm water circulation on
Aug 28 15:12:41 ems - 000+03:09:34.589 I 11: [boiler] Setting boiler warm water circulation on
Aug 28 15:23:39 ems - 000+03:20:32.465 I 12: [mqtt] MQTT disconnected: TCP
Aug 28 15:23:44 ems - 000+03:20:37.482 I 13: [mqtt] MQTT connected

With the 8266 before i don't get reconnects and had only one rx-error after 20h.

Aug 26 11:02:09 ems - 000+00:00:00.000 I 0: [syslog] Log level set to INFO
Aug 26 11:02:09 ems - 000+00:00:00.000 I 1: [emsesp] EMS Device library loaded with 70 records
Aug 26 11:02:09 ems - 000+00:00:16.988 I 2: [mqtt] MQTT connected
Aug 26 11:03:42 ems - 000+00:01:51.059 I 3: [boiler] Setting boiler warm water circulation on
Aug 26 11:06:52 ems - 000+00:05:00.979 I 4: [boiler] Setting boiler warm water circulation on
Aug 26 11:49:41 ems - 000+00:47:49.925 I 5: [boiler] Setting boiler warm water circulation on
Aug 26 11:52:56 ems - 000+00:51:04.939 I 6: [boiler] Setting boiler warm water circulation on
Aug 26 14:33:30 ems - 000+03:31:39.082 I 7: [telnet] New connection from [192.168.0.26]:62778 accepted
Aug 26 14:33:30 ems - 000+03:31:39.082 I 8: [shell] Allocated console pty0 for connection from [192.168.0.26]:62778
Aug 26 14:33:30 ems - 000+03:31:39.082 I 9: [shell] User session opened on console pty0
Aug 26 14:33:34 ems - 000+03:31:42.556 I 10: [shell] User session closed on console pty0
Aug 26 14:33:34 ems - 000+03:31:42.563 I 11: [telnet] Connection from [192.168.0.26]:62778 closed
Aug 26 14:33:34 ems - 000+03:31:42.563 I 12: [shell] Shutdown console pty0 for connection from [192.168.0.26]:62778
Aug 26 14:49:39 ems - 000+03:47:47.436 I 13: [boiler] Setting boiler warm water circulation on
Aug 26 14:52:54 ems - 000+03:51:02.411 I 14: [boiler] Setting boiler warm water circulation on
Aug 26 15:10:17 ems - 000+04:08:25.985 I 15: [telnet] New connection from [192.168.0.26]:62974 accepted
Aug 26 15:10:17 ems - 000+04:08:25.985 I 16: [shell] Allocated console pty0 for connection from [192.168.0.26]:62974
Aug 26 15:10:17 ems - 000+04:08:25.985 I 17: [shell] User session opened on console pty0
Aug 26 15:19:34 ems - 000+04:17:42.499 I 18: [shell] User session closed on console pty0
Aug 26 15:19:34 ems - 000+04:17:42.505 I 19: [telnet] Connection from [192.168.0.26]:62974 closed
Aug 26 15:19:34 ems - 000+04:17:42.505 I 20: [shell] Shutdown console pty0 for connection from [192.168.0.26]:62974
Aug 26 15:41:50 ems - 000+04:39:58.788 I 21: [boiler] Setting boiler warm water circulation on
Aug 26 15:44:55 ems - 000+04:43:03.774 I 22: [boiler] Setting boiler warm water circulation on
Aug 27 06:58:44 ems - 000+19:56:50.455 E 23: [telegram] Rx: 21 00 18 00 00 18 1B 00 00 00 00 00 00 00 0B 00 00 FF (CRC FF != D5)

MichaelDvP commented 4 years ago

Regarding the mqtt-disconnects i've found this issue of AsyncMqttClient and starting now with a test with the lib https://github.com/dx168b/async-mqtt-client.git

MichaelDvP commented 4 years ago

No change, also with ths lib, Mqtt reconnects every 3h20m.

Aug 28 22:09:48 ems - 000+00:00:00.000 I 0: [syslog] Log level set to INFO
Aug 28 22:09:48 ems - 000+00:00:00.000 I 1: [emsesp] EMS Device library loaded with 70 records
Aug 28 22:09:48 ems - 000+00:00:00.000 E 2: [telegram] Rx: 7F 2D 1E 85 AF 7F 14 90 (CRC 90 != C2)
Aug 28 22:09:48 ems - 000+00:00:00.491 E 3: [telegram] Rx: C0 80 F0 2E (CRC 2E != C2)
Aug 28 22:09:48 ems - 000+00:00:00.518 I 4: [mqtt] MQTT connected
Aug 28 22:11:47 ems - 000+00:02:00.140 E 5: [telegram] Rx: 19 00 AF 00 00 E4 00 C9 (CRC C9 != CD)
Aug 29 01:30:20 ems - 000+03:20:32.518 I 6: [mqtt] MQTT disconnected: TCP
Aug 29 01:30:25 ems - 000+03:20:37.534 I 7: [mqtt] MQTT connected
Aug 29 02:41:58 ems - 000+04:32:11.158 E 8: [telegram] Rx: 10 FC 00 A2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 9B (CRC 9B != 64)
Aug 29 04:50:52 ems - 000+06:41:04.896 I 9: [mqtt] MQTT disconnected: TCP
Aug 29 04:50:57 ems - 000+06:41:09.911 I 10: [mqtt] MQTT connected
Aug 29 06:13:58 ems - 000+08:04:11.160 E 11: [telegram] Rx: FC 08 00 14 00 3E 5B 1A 3F (CRC 3F != 35)
Aug 29 06:59:49 ems - 000+08:50:02.161 E 12: [telegram] Rx: FC FC FC 89 (CRC 89 != C6)

proddy commented 4 years ago

I looked at dx168b's fork a while back. Another one worth trying is https://github.com/bertmelis/async-mqtt-client although I neither really address the disconnects. And another is https://github.com/kleini/async-mqtt-client/commit/f1b42054815ad82fed4519bce7febb7f1601560f.

I have been also actively following https://github.com/philbowles/PangolinMQTT which attempts to fix a lot of the short-comings of Marvin's original library and perhaps worth a try but would require a re-write in some areas. I lost interest in Pangolin because or the maintainer's arrogance, poor code quality and his library is not open to feedback or contributions unless you join one of his facebook clubs.

If MQTT is disconnecting because of TCP errors it may be related to the AsyncTCP library and not MQTT. Do you get the same errors when NTP is disabled (NTP=0 in build). It could be the UDP interrupting transmission. https://github.com/me-no-dev/AsyncTCP/issues/94

MichaelDvP commented 4 years ago

Disable NTP (FT_NTP=0) does not change the mqtt disconnects. Since the period is 3h20, i.e. 200 minutes it can be 200 hearbeats (every 60 sec) or 400 publishes (every 30 sec). I check now with hearbeat disabled and 10 sec publish time if it influences the reconnects.

MichaelDvP commented 4 years ago

Should i make a new issue for the esp32-mqtt-disconnects? I have no idea what causes it. What i know about this issue:

it's only on ESP32
mqtt disconnects every 12032 seconds from boot, reason TCP
disable NTP has no effect
changing publish intervals has no effect
disable heartbeat has no effect
changing QoS has no effect
restarting the mqtt broker in between has no effect
restarting mqtt in between on the esp has no effect.

since mqtt reconnects immediately it does not harm much.

proddy commented 4 years ago

I'm still getting 10% Rx Errors (BBQKees) so going to reserve some time to look again into the Rx logic. And perhaps create a new TxMode based on the 1.9 way of working without interrupts.

MichaelDvP commented 4 years ago

Hm, i'm getting zero rx and tx errors with 8266 and only 0.01% rx errors on esp32. It's difficult for me to reproduce and understand what is different. I only looked at the 1.9.5 logic, but this was also interrupt based, creates a buffer-full-irq on every byte, reads this to buffer, but completes the telegram also on brk-irq. The difference is only the uses of the hardware-fifo and less interrupt-calls. I have no idea what can cause this rx-errors, except a cold soldering on the boards rx circuit.

proddy commented 4 years ago

I'll start today by looking at previous UART Rx logic until I get one the works then figure out the difference. I've tried 10 versions up until July 27th and they all produce the same results. This is using the same wiring & same circuit as I used in 1.9 so that's at least consistent. The telegrams are cut short, probably because of a false break detection. Interestingly the results are more stable when using USB powered instead of Bus-powered.

USB powered, 25268 and 10 fails in a 12 hour period BUS powered, 25510 and 52 fails in a 12 hour period

It's only 0.2% so no huge deal.

MichaelDvP commented 4 years ago

If it is always the missing break. it can be a timing issue: The break interrupt is triggered, but the 0x00 is not actual in the fifo. The next received telegram, which is normaly a poll, starts than with 0 and is ignored. We can check the telegram for the ending 0x00, than it's length-1 or missing 0, than send full length. And also skipping a leading 0x00 in all telegrams. Worth a try?

proddy commented 4 years ago

that may work. first I need to do some debugging and see what is happening. didn't have any time today

proddy commented 3 years ago

I saw the rx break change. I'm out for the next few days but will check this weekend when I'm back.

MichaelDvP commented 3 years ago

I've updated the uart branch to this check (and lift the branch to b4). Please check if it helps. If it is a timing issue in esp-hardware, than it's maybe better to use the fifo-full-irq like in 1.9.5, but not enable the break-irq , only poll for break after fifo is filled. This produces a lot of interrupt calls (factor 2 for poll, factor 32 for UBAMonitorFast), but in clear structure, whereas this is a workaround. That is what i mean: emsuart_esp8266.zip

giovanne123 commented 3 years ago

for info, today I updated from 1.9.5 to 2.1.0b7 (https://github.com/proddy/EMS-ESP/tree/firmware/firmware) and was wondering about the tx-/rxfails and found this issue-report here. Board: Wemos D1 Mini / esp8266

Unbenannt2

MichaelDvP commented 3 years ago

@giovanne123: What tx_mode do you use? Have you tried other modes? (You can change from web-settings, save, no reboot required). A small number of rx_fails is common and were not measured in v1.9.5, butnumber of tx_fails is much to high.

giovanne123 commented 3 years ago

was using mode 1 before. now switched to 2 (with restart), but already errors came up after few seconds :-(

What can I do?

Edit: switched back to mode 1, because with mode 2 I have more errors like before. So mode 1 seems better, but nevertheless much tx errors :-(

MichaelDvP commented 3 years ago

As far as i see in #554 you get the data. Are all your devices detected? Maybe a device is not correct recognized and ems-esp sends conmmades that are rejected by the device, this will also giv tx-errors. Can you go to terminal and look with log debug what causes the tx errors.

giovanne123 commented 3 years ago

devices:

ems-esp:/$ show devices
These EMS devices are currently active:

Boiler: BK13/BK15/Smartline/GB1x2 (DeviceID:0x08, ProductID:xx, Version:02.07)
 This Boiler will respond to telegram type IDs: 0x10 0x11 0x14 0x15 0x16 0x18 0x19 0x1A 0x1C 0x2A 0x33 0x34 0x35 0xD1 0xE3 0xE4 0xE5 0xE6 0xE9 0xEA

Thermostat: RC30 (DeviceID:0x10, ProductID:xx, Version:02.08) ** master device **
 This Thermostat will respond to telegram type IDs: 0xA3 0x06 0xA2 0x3E 0x3D 0x48 0x47 0x52 0x51 0x5C 0x5B 0xA5 0x37

Mixing Module: MM10 (DeviceID:0x21, ProductID:xx, Version:02.00)
 This Mixing Module will respond to telegram type IDs: 0xAA 0xAB 0xAC

Switching Module: WM10 (DeviceID:0x11, ProductID:xx, Version:02.00)

Controller: BC10/RFM20 (DeviceID:0x09, ProductID:xx, Version:02.00)

two tx errors in log debug:


000+06:22:44.615 D 37: [emsesp] Received UBAMonitorFast
000+06:22:44.956 D 38: [telegram] Sending read Tx [#71], telegram: 0B 8D 02 00 20 94
000+06:22:45.206 E 39: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request.
000+06:22:47.630 D 40: [emsesp] Received UBAMonitorFast
000+06:22:47.938 D 41: [emsesp] Received RCTime
...

000+06:23:44.160 D 173: [telegram] Sending read Tx [#87], telegram: 0B 8D 02 00 20 94
000+06:23:44.782 D 174: [telegram] Sending read Tx [#88], telegram: 0B 8D 02 00 20 94
000+06:23:45.032 E 175: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request.
000+06:23:46.459 D 176: [emsesp] Received UBAMonitorFast
000+06:23:47.455 D 177: [emsesp] Received UBAMonitorFast
...

MichaelDvP commented 3 years ago

Ok, you have a modem device at 0x0D and ems-esp requests version information of this device every minute, but the device does not answer. You can ignore these errors. The bus communication is fine.

giovanne123 commented 3 years ago

You are right, I have an Easycom in the ems-bus. Where have you seen that I have the modem (older threads?)?

Ist there a possibility to prevent the calls and therefore the errors? Looks not so fine to have errors ;-) Maybe new feature to configure devices not to include?

MichaelDvP commented 3 years ago

Where have you seen that I have the modem (older threads?)?

0B 8D 02 00 20 94 is a version request from ems-esp to a modem-device.

Is this an original easycom from a Bosch-brand or a 3.party device like hometop?

giovanne123 commented 3 years ago

Original from Buderus with serial connection to PC running Eco-Soft Software. Was used in the past to monitor the heating. I was listening in parallel on the serial interface for analysing data from/to Easycom(Eco-Soft) and writing to mysql db. But I have switched to EMS-ESP to not have the PC running 24/7. Now Easycom/Ecosoft is only used sometimes.

MichaelDvP commented 3 years ago

Ok, original Buderus does not reply to version request. Than it makes sense to skip the version request for this device.

MichaelDvP commented 3 years ago

@giovanne123: I've made the change, update to get rid of these tx-errors.

giovanne123 commented 3 years ago

looks much better:

tx quality was only 85% before

giovanne123 commented 3 years ago

FYI, not sure If you have changed something from b7 to 2.1.0b10 but after the update looks much more better for me. After 3h still zero errors :-) Edit: 30h still zero errors. Or aren't they counted anymore and there is a new bug 😁 Edit2: 75h still zero errors.

emsesp / EMS-ESP

(v2) occasional Rx incomplete telegrams #460