Serial protocol flooding messages are lost

petero-dk commented 4 years ago

When pushing a lot of messages through a Serial FT1.2 connection at the same time only a few will actually go through to the network.

I.e. when in openHAB sending OFF to a (openHAB) group the binding will send multiple KNX group commands at the same time. The debug log looks like:

03-Jan-2020 21:13:24.021 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/3/10, wait for ack
03-Jan-2020 21:13:24.057 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/3/30, wait for ack
03-Jan-2020 21:13:24.066 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/2/10, wait for ack
03-Jan-2020 21:13:24.099 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/2/20, wait for ack
03-Jan-2020 21:13:24.104 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/3/50, wait for ack
03-Jan-2020 21:13:24.105 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/1/10, wait for ack
03-Jan-2020 21:13:24.113 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/7/10, wait for ack
03-Jan-2020 21:13:24.137 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/4/10, wait for ack
03-Jan-2020 21:13:24.147 [DEBUG] [calimero.link./dev/serial0                        ] - confirmation of 2/3/10
03-Jan-2020 21:13:24.159 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/5/10, wait for ack
03-Jan-2020 21:13:24.166 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/2/30, wait for ack
03-Jan-2020 21:13:24.177 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/6/10, wait for ack
03-Jan-2020 21:13:24.207 [DEBUG] [calimero.link./dev/serial0                        ] - confirmation of 2/3/30
03-Jan-2020 21:13:24.223 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/7/20, wait for ack
03-Jan-2020 21:13:24.236 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/5/20, wait for ack
03-Jan-2020 21:13:24.255 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/6/20, wait for ack
03-Jan-2020 21:13:24.289 [DEBUG] [calimero.link./dev/serial0                        ] - send message to 2/4/20, wait for ack
03-Jan-2020 21:13:24.357 [DEBUG] [calimero.link./dev/serial0                        ] - confirmation of 2/6/20

when running through a TCP connection, this was not an issue. Should calimero buffer the sending of telegrams or should the user of the library (in this case the openHAB binding)

petero-dk commented 4 years ago

Only the messages that received the confirmation actually went through

bmalinowsky commented 4 years ago

You can check traces to see whether the BAU resets itself or sends another function code (although this won't help the issue).

Your link is configured to wait for a successful ack from the device, but from the sequence in your log it is likely several messages are dropped after the ack before the .con is created. What I can do is let the ft1.2 connection stay in the blocking send to check the .con. This will slow down a fast sender (as long as the send is done blocking).

Although queueing within calimero seems tempting, it does not really solve that much (at least a simple approach).

petero-dk commented 4 years ago

For openHAB I know that all sends are blocking, so if you let the ft1.2 connection stay in the blocking, then all will be sent sequentially after a (send, confirmed) pair?

bmalinowsky commented 4 years ago

yes, each ongoing send will either complete with a positive confirmation or a timeout, before the next send request is processed.

petero-dk commented 4 years ago

As soon as you have a change I can immediately test it. I have setup CI on the add-on with your latest code ;-)

bmalinowsky commented 4 years ago

I changed the send method and corresponding notifications to wait for a .con when sending a L-Data.req. As not all frames fit into that category, I also revamped the locking. So this actually needs testing. So please, go ahead and test :)

petero-dk commented 4 years ago

Now it will not connect to the knx bus: I just get:

04-Jan-2020 22:29:37.726 [INFO ] [calimero.serial                                   ] - no serialcom in java.library.path
04-Jan-2020 22:29:37.734 [DEBUG] [calimero.serial.ft12:/dev/serial0                 ] - using rxtx library for serial port access
04-Jan-2020 22:29:37.739 [INFO ] [calimero.serial.ft12:/dev/serial0                 ] - open rxtx (RXTX-2.2pre2) serial port connection for /dev/serial0
04-Jan-2020 22:30:37.946 [DEBUG] [calimero.serial.ft12:/dev/serial0                 ] - using rxtx library for serial port access
04-Jan-2020 22:30:37.951 [INFO ] [calimero.serial.ft12:/dev/serial0                 ] - open rxtx (RXTX-2.2pre2) serial port connection for /dev/serial0
04-Jan-2020 22:31:38.154 [DEBUG] [calimero.serial.ft12:/dev/serial0                 ] - using rxtx library for serial port access
04-Jan-2020 22:31:38.158 [INFO ] [calimero.serial.ft12:/dev/serial0                 ] - open rxtx (RXTX-2.2pre2) serial port connection for /dev/serial0

No change in configuration simply drop in replacement. I will do a little more testing. I have set the log level to DEBUG.

Restart makes no difference. My previous setup was compiled two days ago. I am going to check if something very strange happened anywhere

petero-dk commented 4 years ago

I have now made sure that the rest of my add-on is the same commit, so nothing has been introduced there. Simply changing the calimero library seems to break the connection

bmalinowsky commented 4 years ago

Ok, I remember now that you create the ft1.2 link for cEMI, right? To access a BAOS device (which uses cEMI). I probably should have mentioned the following to avoid confusion:

Implementation notes (commit 9f4101b2bd8810897ebd4c113d097c428431f82c):

frame repetition for ack is as before: repeat as required until ACK
currently EMI1/EMI2 frame destinations are checked for .con (but not cEMI)
no frame repetition on no or negative .con

Any errors in setting up the link are now quite certainly from calimero, and not caused by some other library. I will see to make cEMI frames also work with blocking for .con . But (in theory), while cEMI should fail with a timeout (therefore, still enforcing the spacing between sent frames), it should be able to continue.

The other point is: i found a precondition violation of the lock in resetting the device, so that should be fixed.

P.S. Please post trace levels with low-layer communication of calimero; it helps.

petero-dk commented 4 years ago

Yes, I use cEMI with a BAOS device. Does that mean there is nothing I can test for now?

bmalinowsky commented 4 years ago

You can retest with the latest commit. It is not visible from your log but I think the initial reset failed (therefore, the repeated "using" and "open" entries). That should be fixed.

petero-dk commented 4 years ago

The latest build also stalls. I have enabled TRACE and this is the log:

05-Jan-2020 16:50:31.313 [DEBUG] [calimero.serial.ft12:/dev/serial0                 ] - using rxtx library for serial port access
05-Jan-2020 16:50:31.318 [INFO ] [calimero.serial.ft12:/dev/serial0                 ] - open rxtx (RXTX-2.2pre2) serial port connection for /dev/serial0
05-Jan-2020 16:50:31.326 [DEBUG] [calimero.serial.ft12:/dev/serial0                 ] - setup serial port: baudrate 19200, even parity, 8 databits, 1 stopbits, flow control 0
05-Jan-2020 16:50:31.327 [TRACE] [calimero.serial.ft12:/dev/serial0                 ] - send reset to BCU
05-Jan-2020 16:50:31.389 [TRACE] [calimero.serial.ft12:/dev/serial0                 ] - send reset to BCU
05-Jan-2020 16:50:31.449 [TRACE] [calimero.serial.ft12:/dev/serial0                 ] - send reset to BCU

bmalinowsky commented 4 years ago

So the reset does not fail; it is repeated, but I guess that also happened with the old connection (the ack timeout is somewhat short when doing a device reset).

I fixed the state for sending (this is the stall, I observed it here, too), and added a workaround for cEMI .con detection.

petero-dk commented 4 years ago

I have now tested the new lib, it seems to work perfectly. I have noticed though one thing. I now get one error message from the library:

06-Jan-2020 12:02:54.231 [WARN ] [ab.binding.knx.internal.handler.DeviceThingHandler] - An error occurred on channel knx:device:bridge:bedroom1_lights:Group2_Light: no confirmation reply received for 2/4/20
tuwien.auto.calimero.KNXTimeoutException: no confirmation reply received for 2/4/20
        at tuwien.auto.calimero.serial.FT12Connection.send(Unknown Source) ~[calimero-core-2.5-SNAPSHOT.jar:?]
        at tuwien.auto.calimero.link.KNXNetworkLinkFT12.onSend(Unknown Source) ~[calimero-core-2.5-SNAPSHOT.jar:?]
        at tuwien.auto.calimero.link.AbstractLink.send(Unknown Source) ~[calimero-core-2.5-SNAPSHOT.jar:?]
        at tuwien.auto.calimero.link.AbstractLink.sendRequestWait(Unknown Source) ~[calimero-core-2.5-SNAPSHOT.jar:?]
        at tuwien.auto.calimero.process.ProcessCommunicatorImpl.write(Unknown Source) ~[calimero-core-2.5-SNAPSHOT.jar:?]
        at tuwien.auto.calimero.process.ProcessCommunicatorImpl.write(Unknown Source) ~[calimero-core-2.5-SNAPSHOT.jar:?]
        at org.openhab.binding.knx.internal.client.AbstractKNXClient.sendToKNX(AbstractKNXClient.java:458) ~[bundleFile:?]
        at org.openhab.binding.knx.internal.client.AbstractKNXClient.writeToKNX(AbstractKNXClient.java:418) ~[bundleFile:?]
        at org.openhab.binding.knx.internal.handler.DeviceThingHandler.lambda$7(DeviceThingHandler.java:232) ~[bundleFile:?]
        at org.openhab.binding.knx.internal.handler.DeviceThingHandler.withKNXType(DeviceThingHandler.java:127) [bundleFile:?]
        at org.openhab.binding.knx.internal.handler.DeviceThingHandler.withKNXType(DeviceThingHandler.java:121) [bundleFile:?]
        at org.openhab.binding.knx.internal.handler.DeviceThingHandler.handleCommand(DeviceThingHandler.java:227) [bundleFile:?]
        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_222]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_222]
        at org.eclipse.smarthome.core.internal.common.AbstractInvocationHandler.invokeDirect(AbstractInvocationHandler.java:152) [bundleFile:?]
        at org.eclipse.smarthome.core.internal.common.InvocationHandlerSync.invoke(InvocationHandlerSync.java:59) [bundleFile:?]
        at com.sun.proxy.$Proxy3106.handleCommand(Unknown Source) [?:?]
        at org.eclipse.smarthome.core.thing.internal.profiles.ProfileCallbackImpl.handleCommand(ProfileCallbackImpl.java:74) [bundleFile:?]
        at org.eclipse.smarthome.core.thing.internal.profiles.SystemDefaultProfile.onCommandFromItem(SystemDefaultProfile.java:48) [bundleFile:?]
        at sun.reflect.GeneratedMethodAccessor111.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_222]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_222]
        at org.eclipse.smarthome.core.internal.common.AbstractInvocationHandler.invokeDirect(AbstractInvocationHandler.java:152) [bundleFile:?]
        at org.eclipse.smarthome.core.internal.common.Invocation.call(Invocation.java:52) [bundleFile:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

But checking my setup this KNX group does not actually have a device (it was disconnected). Could this have been the root cause all along? That one group address did not exist or reply?

In either case I think the new feedback is great.

It works
It throws an error on a missing device, which is actually nice in order to find errors.

bmalinowsky commented 4 years ago

Yes, the timeout exception in case of no .con is expected (with blocking send).

I don't think 2/4/20 is the root cause because it is the second to last message in your initial log, so everything before that address should have worked.

petero-dk commented 4 years ago

Cool :-) I am happy now, and this should be closed. Or is there anything else I can do? Specifically test?

bmalinowsky commented 4 years ago

I pushed some cleanup, and increased the timeout for the initial reset (ideally, there is only one reset now). If there is no regression from your side, this can be closed.

calimero-project / calimero-core

Serial protocol flooding messages are lost #91