emsesp / EMS-ESP

ESP8266 firmware to read and control EMS and Heatronic compatible equipment such as boilers, thermostats, solar modules, and heat pumps
https://emsesp.github.io/docs
GNU Lesser General Public License v3.0
300 stars 96 forks source link

Protocol / Queue handling issue #151

Closed susisstrolch closed 4 years ago

susisstrolch commented 4 years ago

Today my Logic Analyzer arrived, so I found the time to look what's going on on EMS-Bus. Environment:

Issue 1 - EMS-ESP continues sending, even not being polled again

1. MC10:    08 0B 16 00 FF 5A 64 00 06 FA 0A 01 0F 64 64 02 
            08 F8 0F 0F 0F 0F 1E 05 04 09 09 00 6D 00
2. ESP: 0B 00
3. MC10:       0B 00 89 00 85
4. ESP:       0B 88 14 00 20 E4 00
<200ms pause>
  1. MC10 sends a MC10Parameter message to us.
  2. We acknowledge with "0B " -> ok, fini
  3. MC10 echos our acknowledge??? The from MC10 looks fine, it's approx. 1.035ms
  4. We continue to poll MC10, which is simply ignored by the busmaster by not sending an echo. Selection_023

Issue 2 - pretty similiar to Issue 1

1. RC35:    10 0B 3E 00 00 00 00 7D 00 00 00 00 00 00 00 00
            00 11 05 00 E9 00
2. ESP: 0B 00
3. RC35:       0B 00
4. ESP:       0B 90 3D 00 20 80 00
5. RC35:             0B 90 7A
<200ms pause>
  1. RC35 sends a HK1MonitorMessage to us
  2. We respond with "0B " -> ok, fini
  3. RC35 echos our acknowledge
  4. We start sending w/o request
  5. Busmaster stops echo after 2nd byte and stays silent for 200ms. Selection_020
proddy commented 4 years ago

interesting, if you use tx_mode=2 (your implementation) does it start to clear up a little?

susisstrolch commented 4 years ago

That will be the next step.. first, I have to create triggers for the LA - especially on Rx / Tx BRK. digitalWrite doesn't work in emsuart - that leads to reboot loops from watchdog.

proddy commented 4 years ago

getting digitalWrite to work in an ISR is tricky. although it is atomic its slow. I've managed to do this before in earlier versions using the LED to show BRKs using

WRITE_PERI_REG(PERIPHS_GPIO_BASEADDR + (state ? 4 : 8), (1 << EMSESP_Status.led_gpio)); // 4 is on, 8 is off

susisstrolch commented 4 years ago

Jep.. it was more a "which one to use and how/where to mark the actions". Anyway... Added two markers to visualize whats going on. One is related to Rx, the other to Tx. Short Tx mark is set when we fill the Tx FIFO. Long Tx mark is set when we generate a BRK, after waiting for Tx-FIFO empty. Rx mark (long) is set as soon as we detect a BRK on Rx.

Now, Iet's have a look at the protocol problem (still in tx_mode 0):

  1. MC10 sends a MC10Parameter message to us. Selection_025

Selection_026

  1. At the blue marker you see that ESP acknowledges with a 0B <BRK> - which is imho wrong, because the command isn't a poll request. So we should answer with ACK or NACK.
  2. EMS acknowledges with 0B <BRK> - so far ok - and, because the <BRK terminates a communication - EMS assumes that the bus is free. Selection_027
  3. ESP however continues to send - which is politely ignored by EMS. Selection_028

So, I suggest the following extensions to protocol handler: If we are targeted by the message AND if bit7 isn't set we have a direct telegram, which should be ACKed with the 01 - and after that by sending a 0B <BRK. In addition, ESP should wait for the next poll request before sending the next telegram.

The tx_mode 2 is my next target...

susisstrolch commented 4 years ago

Uups - seems my JPEG rendering is a bit too conservative...

susisstrolch commented 4 years ago

Ok, here's the next one... Switching to alternate Tx/Rx is much too late after reboot. The EMS-Bus keeps pulled low to long. Green Marker: echo from reboot Blue Marker: Tx moved over to alternate Tx As you can see the EMS is pulled down for >4s after reboot... Selection_030

Zoomed in before blue marker: Selection_031

Imho switching over to D8/D7 should occur before WIFI and other stuff...

proddy commented 4 years ago

the uart being initialized and the interrupts enabled at the last minute was by design as to not interfere with anything else that was happening during bootup (serial transmissions, wifi, mqtt, websockets etc). It seemed the sensible thing to do at the time as I didn't want to fill up the Rx buffers. The logic is in line 1569 of ems-esp.cpp. It should be safe to move it forward as you suggested although I'm not sure what advantage it will have?

proddy commented 4 years ago

btw how you finding the dslogic? I have one too from https://www.dreamsourcelab.com/product/dslogic-plus but found it quite cumbersome to use. I often switch over to a 4 EUR USB logic analyzer and using Saleae's free software

susisstrolch commented 4 years ago

Advantage is not to block the EMS bus - there's nothing on the line during this time...

proddy commented 4 years ago

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

susisstrolch commented 4 years ago

Next one... found in tx_mode 2... Between receiving a "ITS_TO_ME" telegram and processing we could be interrupted / delayed by WIFI/MQTT transmission. Currently, the tx doesn't care about the Rx status. This will lead to overlapping / misleading operations on the EMS bus. Green marker: EMS is sending a poll request Dark green marker: Poll request is processed Overview: EMS sends a poll request which is handled ~130ms after the request. Meanwhile, the next device is sending, which gets disturbed by us.

Screenshot from 2019-07-18 23-28-05 Request details: Screenshot from 2019-07-18 23-29-42

Response disturbing bus: Screenshot from 2019-07-18 23-31-41

IMHO, Tx should honour the Rx-Status - simply fail if Rx != IDLE...

susisstrolch commented 4 years ago

btw how you finding the dslogic

It's a really great tool - needs a bit time to find out how to work, but after that you find anything you like!!

IMG_5959

IMG_5960

proddy commented 4 years ago

jealous !

susisstrolch commented 4 years ago

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

I also don't understand why they are pulled down. My first thought was that it happens in the Arduino core - at least the traces looks alike...

susisstrolch commented 4 years ago

jealous !

Playground :) accepted by Susi :)) (won't talk about the 3D printer in background )

IMG_5961

susisstrolch commented 4 years ago

Somehow we must handle the duedate of the received package. Maybe by a timestamp, If delta > 100ms we won't process the Tx anymore...

proddy commented 4 years ago

jealous !

Playground :) accepted by Susi :)) (won't talk about the 3D printer in background )

IMG_5961

3D printer's on my wish list too. Velleman make some nice kits.

susisstrolch commented 4 years ago

Yep... but it needs some tweeks until it's really usable... F.e. the sliders are suboptimal... Simply ping me if you go to K8800...

susisstrolch commented 4 years ago

Totaly off-topic: iPhone8p holder for the bicycle.. printed in PLA+Carbon... IMG_5962 [fixed horrible typo]

bbqkees commented 4 years ago

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

I also don't understand why they are pulled down. My first thought was that it happens in the Arduino core - at least the traces looks alike...

Is the EMS bus itself actually blocked? If I recall correctly when the TX input is floating or 0 the TX transistor is active (so the current is drawn through the resistors). Could be the I/O of the ESP is pulled low at boot so the analyzer sees a logic 0 at those pins but the EMS bus itself might be just fine.

susisstrolch commented 4 years ago

It‘s simply because GPIO15 has a pulldown resistor connected. Otherwise (GPIO15 open) the ESP tries to boot from SD-Card...

Sent by mobile device

Am 25.07.2019 um 10:31 schrieb Kees notifications@github.com:

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

I also don't understand why they are pulled down. My first thought was that it happens in the Arduino core - at least the traces looks alike...

Is the EMS bus itself actually blocked? If I recall correctly when the TX input is floating or 0 the TX transistor is active (so the current is drawn through the resistors). Could be the I/O of the ESP is pulled low at boot so the analyzer sees a logic 0 at those pins but the EMS bus itself might be just fine.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

proddy commented 4 years ago

GPIO15 (D8) which we use for Tx is automatically pulled high by the ESP8266 during boot up, which causes the block on the EMS bus. In the code its forced down so there's only a very short window now (3 bytes I think).

susisstrolch commented 4 years ago

GPIO15 (D8) which we use for Tx is automatically pulled high by the ESP8266 during boot up

No... GPIO15 has a pull down (10k) on (nearly) all boards (Wemos, NodeMCU). The adapter board itself is inverting the Tx signal - pulling current from EMS-Bus when Tx (GPIO15) is low. So, during boot the EMS-Bus was pulled low for (my impression) much to long.

bbqkees commented 4 years ago

Yes indeed. But you only block slave TX for a short moment, the EMS master can still send uninterrupted.

susisstrolch commented 4 years ago

@proddy we have a protocol handling issue when sending to EMS bus. It happens with tx_mode 0 and also with tx_mode 2. Look at those traces:

(00:09:16.948) ems_parseTelegram: 00 01: 8B
(00:09:16.948) emsuart_tx_buffer: 00 01: 0B 90 3E 00 20 8C
(00:09:16.964) ems_parseTelegram: 00 02: 0B 90 3E 00 20 8C
(00:09:16.964) 0B 90 3E 00 20 8C
(00:09:16.964) echo:telegram: 0B 90 3E 00 20 (CRC=8C) #data=1
(00:09:17.078) ems_parseTelegram: 00 02: 10 0B 3E 00 00 00 00 7D 00 00 00 00 00 00 00 00 00 00 05 00 AD
(00:09:17.078) 10 0B 3E 00 00 00 00 7D 00 00 00 00 00 00 00 00 00 00 05 00 AD
(00:09:17.080) emsuart_tx_buffer: 00 01: 0B
(00:09:17.085) ems_parseTelegram: 00 01: 0B
(00:09:17.085) emsuart_tx_buffer: 00 01: 0B 90 3D 00 20 80
[557087] ** error sending buffer: BRK
(00:09:17.297) ems_parseTelegram: 00 01: 89 00 86 90 

(00:09:16.948) ems_parseTelegram: 00 01: 8B Bus master sends poll request (00:09:16.948) emsuart_tx_buffer: 00 01: 0B 90 3E 00 20 8C We send a query for 0x10 (00:09:16.964) ems_parseTelegram: 00 02: 0B 90 3E 00 20 8C We get the echo (00:09:17.078) ems_parseTelegram: 00 02: 10 0B .... We process the data (00:09:17.080) emsuart_tx_buffer: 00 01: 0B We tell Busmaster we're ready (00:09:17.085) ems_parseTelegram: 00 01: 0B Busmaster echo (00:09:17.085) emsuart_tx_buffer: 00 01: 0B 90 3D 00 20 80 We send again, w/o invitation/poll [557087] ** error sending buffer: BRK Bus master signals "shut up"

The issue is that we interpret the echo as a poll request because of masking out the "poll bit" from Target ID.

proddy commented 4 years ago

funny, I noticed that too this morning when I was trying to debug why the 0x19 MonitorSlow messages were getting blocked (buffer of 32 was too small). I saw many telegrams with a length of 1 and a value of 0x19 which was a recursive echo->poll

susisstrolch commented 4 years ago

Seems I have a more stable version at the moment. Much less Rx-idle warnings and no Tx Brk or Tx Timeout during the last hour. Still have to clean up a bit and

Sent by mobile device

Am 26.07.2019 um 15:21 schrieb Proddy notifications@github.com:

funny, I noticed that too this morning when I was trying to debug why the 0x19 MonitorSlow messages were getting blocked (buffer of 32 was too small). I saw many telegrams with a length of 1 and a value of 0x19 which was a recursive echo->poll

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

proddy commented 4 years ago

great, thanks Juergen. Amazing how you can code in these extreme heats. Shall I take a look at fixing the echo/poll issue or have you already squished that bug?

susisstrolch commented 4 years ago

I‘ve already fixed that one. Must be crosschecked by Junkers owner.

Sent by mobile device

Am 26.07.2019 um 18:25 schrieb Proddy notifications@github.com:

great, thanks Juergen. Amazing how you can code in these extreme heats. Shall I take a look at fixing the echo/poll issue or have you already squished that bug?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

susisstrolch commented 4 years ago

02:01:02.556) echo:telegram: 0B 90 3E 00 20 (CRC=8C) #data=1 (02:01:03.290) echo:telegram: 0B 90 3D 00 20 (CRC=80) #data=1 (02:01:03.730) echo:telegram: 0B 90 06 00 20 (CRC=6C) #data=1 (02:01:04.049) echo:telegram: 0B 88 18 00 20 (CRC=D4) #data=1 (02:01:12.102) echo:telegram: 0B 88 19 00 20 (CRC=D0) #data=1 (02:01:12.626) echo:telegram: 0B 88 33 00 20 (CRC=78) #data=1 (02:01:12.886) echo:telegram: 0B 88 16 00 20 (CRC=EC) #data=1 (02:01:13.176) echo:telegram: 0B 88 14 00 20 (CRC=E4) #data=1

Sent by mobile device

Am 26.07.2019 um 18:25 schrieb Proddy notifications@github.com:

great, thanks Juergen. Amazing how you can code in these extreme heats. Shall I take a look at fixing the echo/poll issue or have you already squished that bug?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

proddy commented 4 years ago

I remember from https://github.com/proddy/EMS-ESP/issues/103#issuecomment-487408779 the Junkers/HT3 sent out Polls without the MSB and that seemed to also work on Buderus so I changed it for everyone. What's peculiar for Junkers is if the Poll has no MSB set how does it distinguish from an acknowledgement as they are now identical?

When you're happy with the changes do a PR and I'll ask a few of the Junkers boys to give it a test.

susisstrolch commented 4 years ago

After reading through #103 it‘s more that Junkers always (read, write, poll, ack) has an inverted ID bit.

proddy commented 4 years ago

Yes, on startup it pings for known Junkers models and if one is detected EMS_Sys_Status.emsReverse is set which is used through out the code to unset the MSB.

susisstrolch commented 4 years ago

I‘m thinking of changing the Junkers detection to a more protokol related approach.

This flag will consistently used for Poll, Write, Read and Ack. In Addition I want to send this info, including the txmode, as a MQTT info.

proddy commented 4 years ago

That's a good approach. Do you want some help implementing this?

susisstrolch commented 4 years ago

No, got playtime permission 😁 Will cleanup my changes and start implement it later.

BTW: uptime 1d 3h 38m in txmode 2…

Von meinem iPad gesendet

Am 28.07.2019 um 12:41 schrieb Paul notifications@github.com:

That's a good approach. Do you want some help implementing this?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

proddy commented 4 years ago

Ok. I'd like wrap up 1.8.1 and push it to Master so we can work on the next major 1.9 version with a more optimized web interface. Any objections on finalizing the dev branch today/tomorrow ?

susisstrolch commented 4 years ago

No, should work out so far...

susisstrolch commented 4 years ago

Just pushed a intermediate version, with most of my changes (also Junkers) in. Please have a look...

susisstrolch commented 4 years ago

... also a forgotten initialization and logmode j - like jabber. Tomorrow I‘ll rebase the branch so it‘s easier for you to merge to dev and Master.

proddy commented 4 years ago

ok. You have access to the EMS-ESP github so feel free to push any changes direct to dev

proddy commented 4 years ago

Actually, shall I release 1.8.1 as it is to master now and merge your new changes to the next dev version (1.9) ?

susisstrolch commented 4 years ago

Yes, makes sense

Sent by mobile device

Am 28.07.2019 um 20:20 schrieb Paul notifications@github.com:

Actually, shall I release 1.8.1 as it is to master now and merge your new changes to the next dev version (1.9) ?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

proddy commented 4 years ago

ok, dev branch is now v1.9 and the next playground to experiment with. You can merge your txmode2 branch anytime you want.

susisstrolch commented 4 years ago

I think keeping it separate for a whole will ease debuggung for Junkers devices.

Sent by mobile device

Am 29.07.2019 um 14:53 schrieb Paul notifications@github.com:

ok, dev branch is now v1.9 and the next playground to experiment with. You can merge your txmode2 branch anytime you want.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

susisstrolch commented 4 years ago

@proddy TxMode2 and Junkers autodetect should work so far. So we need a volunteer compiling and testing the txmode2 branch. In this branch you will also find a Description for HT3 and a lot EMS2 types.

Ah, for easier debug I‘ve added ‚log j‘, which also shows raw Tx/RX buffers (Junkers debug).

proddy commented 4 years ago

wonderful! I'll test tonight and reach out the the HT3 guys to see if they can test

proddy commented 4 years ago

Been running the txmode2 branch for 12hrs now with no crashes and hardly any corrupt telegrams. Looking good!

susisstrolch commented 4 years ago

Great! Now we need the result of the Junkers users. About the reboots - I suspect MQTT or telnet. Maybe we should ad a watchdog feed before calling them.

Sent by mobile device

Am 30.07.2019 um 09:47 schrieb Paul notifications@github.com:

Been running the txmode2 branch for 12hrs now with no crashes and hardly any corrupt telegrams. Looking good!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

proddy commented 4 years ago

with 1.8.1 I would get reboots every 10-15 mins with tx_mode 2, but none with tx_mode 0 so I don't expect its telnet/mqtt related. you can switch all that off (set mqtt_ip and set publish_time 0) and it still happens. Anyway seems better in 1.9