calimero-project / calimero-core

Core library for KNX network access and management

ManagementClientImpl: disconnect after groupwrite with same IP Gateway #96

Closed olterion closed 3 years ago

olterion commented 3 years ago

Hello,

we have adapted ManagementClientImpl to update our DIY KNX components. A KNX bootloader runs on the µC and writes the new software into its memory. Here is the modified source code: Selfbus Updater

In a test system with only the IP gateway and the device being flashed, everything works. The complete software is transferred over the KNX bus and can be flashed.

But if the updater is connected to a real system and the IP gateway has an additional tunnel open (e.g. for ETS, EDOMI, ...), a group write in the KNX system causes the connection between the updater and the device being flashed to fail.

So I dug deeper into the system and am now trying to understand the calimero lib. Additionally, I debugged the bootloader on the µC and logged the Ethernet messages with Wireshark.

My observations: The bootloader gets a disconnect message (does it come from calimero?). Wireshark shows the disconnect message going in the same direction. This disconnect message is sent directly after a group write telegram was written on the KNX bus.

The problem is in the function waitForResponse(...), which calls the function indications.wait(remaining). During this wait, the group write telegram is received by the calimero lib (which calls the function group() of class TLListener). Directly after that (without the timeout expiring), the disconnect message is sent to the device being flashed. So I think the "problem" is in the calimero lib?!
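A minimal sketch of the timed-wait pattern described above, assuming a shared queue that a receiver thread fills and notifies. The class IndicationWaitSketch and the method frameReceived are hypothetical names for illustration; this is not the code of the updater or of calimero. It only shows that a wait of this kind returns as soon as any frame arrives, so processing can continue well before the timeout expires.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a timed wait on a shared indication queue (illustrative names only). */
final class IndicationWaitSketch {
    private final Deque<String> indications = new ArrayDeque<>();

    /** Called by the receiver thread for every incoming frame, related or not. */
    void frameReceived(String frame) {
        synchronized (indications) {
            indications.add(frame);
            indications.notifyAll(); // wakes up any thread blocked in waitForResponse
        }
    }

    /** Waits up to timeoutMillis for the next frame; returns null on timeout or interrupt. */
    String waitForResponse(long timeoutMillis) {
        final long end = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (indications) {
            while (indications.isEmpty()) {
                final long remaining = (end - System.nanoTime()) / 1_000_000L;
                if (remaining <= 0)
                    return null; // timeout expired without any frame
                try {
                    indications.wait(remaining); // returns early on notifyAll
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return indications.remove(); // oldest queued frame
        }
    }
}
```

In this sketch the waiting thread cannot tell from the wake-up alone whether the frame belongs to its own connection; that decision is left to whatever processes the returned frame.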

Here is a screenshot of the point where a group write disturbs the connection (192.168.178.164 is my PC with the updater and the calimero lib; 192.168.178.3 is the IP gateway). At the marked line you can see the disconnect message from the PC (.164) to the IP gateway (.3): wireshark_updater_problem

I hope you understand my problem and my analysis so far. Maybe you can help me get this bug fixed, so we can use the updater in a real KNX system.

By the way: Thank you very much for developing and sharing this awesome lib!

Best regards, Olli

bmalinowsky commented 3 years ago

Hello,

at first glance, isn't the problem that there are 2 ACKs from 15.15.192 (No. 4779 and No. 4783)? The transport layer will send a disconnect in that case:

https://github.com/calimero-project/calimero-core/blob/e8a8fccecff849ed99190a8effeb75edc342b9f9/src/tuwien/auto/calimero/mgmt/TransportLayerImpl.java#L512

olterion commented 3 years ago

Hello,

thank you very much for your quick answer. I double-checked the bootloader code on the µC and connected a serial trace interface to get a message whenever an ACK is sent. But only one ACK is sent after the data (e.g. line 4777) was sent. So line 4779 is the real ACK, but I don't know why there is a second ACK.

Do you have any idea how I can trace the messages on the KNX Bus directly?

Maybe the IP interface makes this mistake? Do you know of such behavior? Is there a reason to disconnect the connection if an ACK is sent more than once?

Best regards, Olli

bmalinowsky commented 3 years ago

Why there is a second ACK by itself I don't know by heart. First wild guess would be a repetition on the bus which isn't filtered down the line (or within calimero). The delay seems to be 13 ms. (Note that the ACK thing I mentioned might also just be some unrelated observation.)

For the two ACKs, can you check in the details pane at the bottom of Wireshark whether both have the same L4 sequence number?

> Do you have any idea how I can trace the messages on the KNX Bus directly?

Message tracing works well if you have a knx interface which supports bus monitoring (link layer + bus monitor). For example, in ETS, click Bus Monitor using that interface, or with calimero tools & gradle, something like ./gradlew run --args="monitor 192.168.x.y".

> Is there a reason to disconnect the connection if an ACK is sent more than once?

In general, if the remote L4 endpoint is open and not waiting for an ACK, a disconnect shall be sent upon receiving an ACK.
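The rule above can be sketched as a tiny state machine. This is a simplified, hypothetical model for illustration (the class L4EndpointSketch, its states, and its methods are assumptions, not calimero's TransportLayerImpl): an ACK is only accepted while the endpoint waits for one and the sequence number matches; any other ACK on an open connection triggers a disconnect.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a connection-oriented L4 endpoint (not calimero code). */
final class L4EndpointSketch {
    enum State { CLOSED, OPEN_IDLE, OPEN_WAIT }

    State state = State.OPEN_IDLE;               // assume the connection is already established
    int sendSeq;                                  // 4-bit sequence number of the next data frame
    final List<String> sent = new ArrayList<>();  // log of frames this endpoint would send

    /** Sends one data frame and starts waiting for its ACK. */
    void sendData() {
        if (state != State.OPEN_IDLE)
            throw new IllegalStateException("cannot send in state " + state);
        sent.add("DATA seq=" + sendSeq);
        state = State.OPEN_WAIT;
    }

    /** Handles a received ACK carrying the given sequence number. */
    void onAck(int seq) {
        if (state == State.OPEN_WAIT && seq == sendSeq) {
            // expected ACK: advance the sequence number (wraps after 15) and go idle
            sendSeq = (sendSeq + 1) & 0x0f;
            state = State.OPEN_IDLE;
        } else if (state != State.CLOSED) {
            // open, but not waiting for this ACK (e.g. a repeated one): disconnect
            sent.add("DISCONNECT");
            state = State.CLOSED;
        }
    }
}
```

With this model, calling sendData(), then onAck(0) twice reproduces the sequence seen in the capture: the first ACK completes the transfer step, the repeated ACK arrives while the endpoint is idle and causes the disconnect.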

olterion commented 3 years ago

Hello,

first of all I want to thank you very much for your quick responses. With your explanation I can avoid the disconnection during the flash process. As a workaround, I commented out the disconnect at line 512 in TransportLayerImpl.java. Now I get no disconnects while transmitting the data.

But I investigated the behavior further with Wireshark. There is a difference between the first and the second ACK. Both ACKs are for the same sequence number (the same as the data write message), but the first ACK has the property "Repeat on Error: No" and the second one has the property "Repeat on Error: Yes".

Maybe this can also be observed before the disconnect is done?

I made some screenshots: one of the data message (MemWrite_msg), one of the first ACK (first_ACK), and one of the second ACK (second_ACK).

Best regards, Olli

bmalinowsky commented 3 years ago

Commenting out the disconnect is fine, L4 endpoints auto-disconnect anyway after some timeout. Good it works now!

The second ACK is a repetition on the bus, and the outcome of receiving it at the client-side (calimero) is exactly the problematic sequence you observe: the L4 FSM transitions to closed state if a mismatching seq.no is received (which is mismatching because the first ACK wasn't faulty and the second time the seq.no is already outdated). I also think 15.15.192 is resending it (I know you said there is only one send), but the gateway or any other device shouldn't do that, and calimero is on the other side, so can't be it.

> Maybe this can also be observed before the disconnect is done?

Can you elaborate? I'm not sure I follow.

One more thing, maybe this helps: group communication does not use L4. There are no connects/disconnects for those.

olterion commented 3 years ago

Hello,

I wish I could see the messages on the bus directly, but I have no device with bus monitor functionality. On the other side (in the DIY Selfbus KNX device), a big lib handles the bus communication. Unfortunately, I have no idea where to look for the code that sends a second ACK with "Repeat on Error: Yes".

However, I'm happy to get this project working. I think I can live with this workaround. Thank you very much for the good support.

Best regards, Olli

StefanSverige commented 3 years ago

Hello, maybe Olli and I can investigate this further with my setup on a couple of different KNX interfaces. However, I need to test whether my IP interface produces the same behavior, since Olli's is a different brand. If we manage to find something that seems to be a bug related to calimero, we will report back.

Thanks for the support and great KNX lib that gets our tool on the bus!

Regards Stefan

bmalinowsky commented 3 years ago

Please reopen if there is any news on this, as the current behavior from the client side seems ok.