Open dallmair opened 2 months ago
Are there timestamps available somewhere for this sequence? The current restart
implementation, after getting the A_restart.res
actually waits for a remote T_disconnect
first (if any), before sending its own disconnect based on the TL4 connection timeout.
You bet! With best intentions, I had left out all the potentially confusing information of our low level log. But now that you requested the information, here we go :)
We're using a DIY Selfbus device with a "busmonitor" firmware to capture what we see on the bus. The device records the timing information in microseconds and sends the log lines via UART and a USB adapter to a RasPi which runs cat /dev/ttyACM0 | ts '"%Y-%m-%d %H:%M:%.S"'
Thus, the timestamp in double quotes at the beginning of each line is the timestamp when the RasPi received the line, not necessarily the exact time when the respective frame was sent on the bus. For most accurate timing information with microsecond precision, use the S, E, and dt RX-RX columns. Especially dt RX-RX is tremendously helpful to see what's going on.
Column | Meaning |
---|---|
Timestamp | When the line was received by the RasPi |
S | MCU clock time when transmission of the frame started (start of first byte's start bit), in microseconds since the start of the MCU |
E | MCU clock time when the frame was finished (end of last byte's stop bit), in microseconds since the start of the MCU |
dt RX-RX | Delay between end of last and start of current frame |
ok | For data frames only, whether there was a transmission error (parity error, checksum error) |
Content | The frame data |
The last line is FF
, because our devices pull the bus low for a few microseconds when they restart. Our "busmonitor" categorizes this as start bit and therefore sees a single-byte frame. Not nice from a conformance perspective, but very helpful for low level debugging :/
"2024-05-01 21:34:51.210844" RX : (S1382475000 E1382485623 dt RX-RX:01204182 ok: 0x0000) B0 10 3E 11 32 50 80 92 calimero -> device: T_Connect
"2024-05-01 21:34:51.212912" RX : (S1382487185 E1382488326 dt RX-RX:00001562 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.244011" RX : (S1382503712 E1382518396 dt RX-RX:00015386 ok: 0x0000) BC 10 3E 11 32 53 43 81 07 FF 27 calimero -> device: A_Restart (Master Reset)
"2024-05-01 21:34:51.245937" RX : (S1382519959 E1382521100 dt RX-RX:00001563 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.263506" RX : (S1382526311 E1382536924 dt RX-RX:00005211 ok: 0x0000) B0 11 32 10 3E 60 C2 E0 device -> calimero: T_Ack
"2024-05-01 21:34:51.267802" RX : (S1382538491 E1382539631 dt RX-RX:00001567 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.288008" RX : (S1382545153 E1382561179 dt RX-RX:00005522 ok: 0x0000) BC 11 32 10 3E 64 43 A1 00 00 01 C9 device -> calimero: A_Restart_Response (Master Reset Response)
"2024-05-01 21:34:51.290953" RX : (S1382562744 E1382563885 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.305182" RX : (S1382569095 E1382579719 dt RX-RX:00005210 ok: 0x0000) B0 10 3E 11 32 50 81 93 calimero -> device: T_Disconnect
"2024-05-01 21:34:51.306400" RX : (S1382581285 E1382582425 dt RX-RX:00001566 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.322257" RX : (S1382587636 E1382598249 dt RX-RX:00005211 ok: 0x0000) B0 11 32 10 3E 60 81 A3 device -> calimero: T_Disconnect
"2024-05-01 21:34:51.326083" RX : (S1382599814 E1382600955 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.341203" RX : (S1382606377 E1382616998 dt RX-RX:00005422 ok: 0x0000) B0 10 3E 11 32 50 81 93 calimero -> device: T_Disconnect (!)
"2024-05-01 21:34:51.344190" RX : (S1382618563 E1382619703 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.366379" RX : (S1382631334 E1382641956 dt RX-RX:00011631 ok: 0x0000) B0 10 3E 11 32 50 C2 D0 calimero -> device: T_Ack (!)
"2024-05-01 21:34:51.370010" RX : (S1382643520 E1382644660 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.503153" RX : (S1382777629 E1382778771 dt RX-RX:00132969 ok: 0x0000) FF device restarts
And here is a TRACE level log of the Updater invocation (including calimero entries) in HTML format. I've kept everything until 3sec after calimero's T_Ack, and stripped everything afterwards as it's not relevant.
Here is another log, from another user (unfortunately, the timestamps are without milliseconds). In this case, the device sent the T_Disconnect first, but calimero still sent two T_Disconnect and one T_Ack, in that order.
"2024-05-01 06:39:11" RX : (S3389721773 E3389732379 dt RX-RX:00005710 ok: 0x0000) B0 10 FB 11 29 60 80 7C calimero -> device: T_Connect
"2024-05-01 06:39:11" RX : (S3389733942 E3389735083 dt RX-RX:00001563 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389786982 E3389802996 dt RX-RX:00051899 ok: 0x0000) BC 10 FD 58 0A D4 00 80 66 27 0B E2 (unrelated)
"2024-05-01 06:39:11" RX : (S3389804561 E3389805702 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389811388 E3389826050 dt RX-RX:00005686 ok: 0x0000) BC 10 FB 11 29 63 43 81 07 FF C9 calimero -> device: A_Restart (Master Reset)
"2024-05-01 06:39:11" RX : (S3389827614 E3389828755 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389834143 E3389844757 dt RX-RX:00005388 ok: 0x0000) B0 11 29 10 FB 60 C2 3E device -> calimero: T_Ack
"2024-05-01 06:39:11" RX : (S3389846322 E3389847463 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389853017 E3389869042 dt RX-RX:00005554 ok: 0x0000) BC 11 29 10 FB 64 43 A1 00 00 01 17 device -> calimero: A_Restart_Response (Master Reset Response)
"2024-05-01 06:39:11" RX : (S3389870608 E3389871749 dt RX-RX:00001566 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389877134 E3389887748 dt RX-RX:00005385 ok: 0x0000) B0 11 29 10 FB 60 81 7D device -> calimero: T_Disconnect
"2024-05-01 06:39:11" RX : (S3389889313 E3389890455 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389896072 E3389908040 dt RX-RX:00005617 ok: 0x0000) BC 11 5F 31 6F E1 00 81 33 (unrelated)
"2024-05-01 06:39:11" RX : (S3389909605 E3389910746 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389916299 E3389928266 dt RX-RX:00005553 ok: 0x0000) BC 11 5F 31 6E E1 00 80 33 (unrelated)
"2024-05-01 06:39:11" RX : (S3389929830 E3389930971 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389936907 E3389948865 dt RX-RX:00005936 ok: 0x0000) BC 10 FD 12 6D D1 00 00 00 (unrelated)
"2024-05-01 06:39:11" RX : (S3389950429 E3389951570 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389957296 E3389973310 dt RX-RX:00005726 ok: 0x0000) BC 10 FD 58 0B D4 00 80 01 05 18 B5 (unrelated)
"2024-05-01 06:39:11" RX : (S3389974874 E3389976015 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389981329 E3389991936 dt RX-RX:00005314 ok: 0x0000) B0 10 FB 11 29 60 81 7D calimero -> device: T_Disconnect
"2024-05-01 06:39:11" RX : (S3389993499 E3389994640 dt RX-RX:00001563 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390000162 E3390012142 dt RX-RX:00005522 ok: 0x0000) BC 11 34 12 6D E1 00 40 B8 (unrelated)
"2024-05-01 06:39:11" RX : (S3390013707 E3390014848 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390020566 E3390032533 dt RX-RX:00005718 ok: 0x0000) BC 11 5F 31 65 E1 00 81 39 (unrelated)
"2024-05-01 06:39:11" RX : (S3390034097 E3390035238 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390041073 E3390053040 dt RX-RX:00005835 ok: 0x0000) BC 11 5F 31 66 E1 00 81 3A (unrelated)
"2024-05-01 06:39:11" RX : (S3390054604 E3390055745 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390061707 E3390073666 dt RX-RX:00005962 ok: 0x0000) BC 10 FD 12 6C D1 00 00 01 (unrelated)
"2024-05-01 06:39:11" RX : (S3390075230 E3390076372 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390082089 E3390094048 dt RX-RX:00005717 ok: 0x0000) BC 10 FD 12 68 D1 00 00 05 (unrelated)
"2024-05-01 06:39:11" RX : (S3390095613 E3390096754 dt RX-RX:00001565 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390101683 E3390102824 dt RX-RX:00004929 ok: 0x0000) FF device restarts
"2024-05-01 06:39:11" RX : (S3390104395 E3390105536 dt RX-RX:00001571 ok: 0x0000) LL_NACK
"2024-05-01 06:39:11" RX : (S3390110993 E3390121599 dt RX-RX:00005457 ok: 0x0000) B0 10 FB 11 29 60 81 7D calimero -> device: T_Disconnect (!)
"2024-05-01 06:39:11" RX : (S3390126968 E3390137575 dt RX-RX:00005369 ok: 0x0000) 90 10 FB 11 29 60 81 5D repeat
"2024-05-01 06:39:11" RX : (S3390142992 E3390153598 dt RX-RX:00005417 ok: 0x0000) 90 10 FB 11 29 60 81 5D repeat
"2024-05-01 06:39:11" RX : (S3390158973 E3390169579 dt RX-RX:00005375 ok: 0x0000) 90 10 FB 11 29 60 81 5D repeat
"2024-05-01 06:39:11" RX : (S3390175100 E3390187079 dt RX-RX:00005521 ok: 0x0000) BC 11 34 12 68 E1 00 40 BD (unrelated)
"2024-05-01 06:39:11" RX : (S3390188643 E3390189784 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390196060 E3390208018 dt RX-RX:00006276 ok: 0x0000) BC 10 FD 12 6B D1 00 00 06 (unrelated)
"2024-05-01 06:39:11" RX : (S3390209582 E3390210723 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390216109 E3390226714 dt RX-RX:00005386 ok: 0x0000) B0 10 FB 11 29 60 C2 3E calimero -> device: T_Ack (!)
"2024-05-01 06:39:11" RX : (S3390232106 E3390242712 dt RX-RX:00005392 ok: 0x0000) 90 10 FB 11 29 60 C2 1E repeat
"2024-05-01 06:39:11" RX : (S3390248075 E3390258681 dt RX-RX:00005363 ok: 0x0000) 90 10 FB 11 29 60 C2 1E repeat
"2024-05-01 06:39:11" RX : (S3390264081 E3390274687 dt RX-RX:00005400 ok: 0x0000) 90 10 FB 11 29 60 C2 1E repeat
"2024-05-01 06:39:11" RX : (S3390280316 E3390292292 dt RX-RX:00005629 ok: 0x0000) BC 11 32 12 6C E1 00 40 BF (unrelated)
Corresponding TRACE level log in HTML format, again stripped after the relevant part:
Thank you for the detailed logs! I first thought it might have to do with some odd timeout sequence, but the disconnect happens almost immediately. The reason you are seeing this behavior is because there are two competing TL instances here. It's a little unfortunate, the javadoc for the behavior of L4 connections like in that case is sparse.
You can add another constructor like this
public SBManagementClientImpl(final KNXNetworkLink link, final SBManagementClientImpl mc)
throws KNXLinkClosedException {
super(link, mc.transportLayer());
...
and use it to create the local restart instance mcDevice
; it shares the TL logic and will fix it.
Thanks for the feedback! As I've mostly worked on the device firmware so far and still need to get a better understanding of the Updater implementation, I did not see that we're actually creating another Transport Layer instance for the restart, so your pointer is of course spot-on and highly appreciated.
When I implemented your suggestion in commit 99e8447, I wondered whether it would be possible (and maybe even better?) to use the existing ManagementClientImpl instance for the device restart, and just create another Destination. Thus, I've also implemented that approach (see commit ab8ac6d).
Both of these implementations exhibit the same behavior: calimero does not send any T_Disconnect anymore. Here's our low level bus trace:
"2024-05-06 00:14:09.430514" RX : (S0059200072 E0059210695 dt RX-RX:00005422 ok: 0x0000) B0 10 3E 11 32 50 80 92 calimero -> device: T_Connect
"2024-05-06 00:14:09.432537" RX : (S0059212258 E0059213399 dt RX-RX:00001563 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.454498" RX : (S0059219341 E0059234003 dt RX-RX:00005942 ok: 0x0000) BC 11 07 1B 0B D3 00 80 0C C9 D3 (unrelated)
"2024-05-06 00:14:09.456488" RX : (S0059235569 E0059236709 dt RX-RX:00001566 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.478550" RX : (S0059242443 E0059257128 dt RX-RX:00005734 ok: 0x0000) BC 10 3E 11 32 53 43 81 07 FF 27 calimero -> device: A_Restart (Master Reset)
"2024-05-06 00:14:09.479534" RX : (S0059258691 E0059259832 dt RX-RX:00001563 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.495621" RX : (S0059265043 E0059275655 dt RX-RX:00005211 ok: 0x0000) B0 11 32 10 3E 60 C2 E0 device -> calimero: T_Ack
"2024-05-06 00:14:09.497547" RX : (S0059277223 E0059278363 dt RX-RX:00001568 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.520573" RX : (S0059283885 E0059299910 dt RX-RX:00005522 ok: 0x0000) BC 11 32 10 3E 64 43 A1 00 00 01 C9 device -> calimero: A_Restart_Response (Master Reset Response)
"2024-05-06 00:14:09.522472" RX : (S0059301476 E0059302617 dt RX-RX:00001566 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.538483" RX : (S0059307826 E0059318439 dt RX-RX:00005209 ok: 0x0000) B0 11 32 10 3E 60 81 A3 device -> calimero: T_Disconnect
"2024-05-06 00:14:09.540508" RX : (S0059320005 E0059321146 dt RX-RX:00001566 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.557460" RX : (S0059326568 E0059337190 dt RX-RX:00005422 ok: 0x0000) B0 10 3E 11 32 50 C2 D0 calimero -> device: T_Ack
"2024-05-06 00:14:09.559489" RX : (S0059338754 E0059339894 dt RX-RX:00001564 ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.737517" RX : (S0059516817 E0059517959 dt RX-RX:00176923 ok: 0x0000) FF
Corresponding calimero/Updater log: updater-2024-05-06_00-14-07.157.zip
Is that the expected behavior? Based on your comment a few days ago,
waits for a remote
T_disconnect
first (if any), before sending its own disconnect based on the TL4 connection timeout.
I had expected to still see one T_Disconnect originating from calimero. But I'm not sure it's necessary and I'm just curious.
Also, in the log, the TL logs a disconnectIndicate
at 00:14:10.631
and then a detach
at 00:14:14.642
(i.e. 4 seconds later) before it moves on to connect to another destination. Is this delay expected?
Which approach do you as the author of calimero recommend: Using two different ManagementClientImpl instances with the same TransportLayer, or just a single ManagementClientImpl?
waits for a remote T_disconnect first (if any), before sending its own disconnect based on the TL4 connection timeout.
I had expected to still see one T_Disconnect originating from calimero. But I'm not sure it's necessary and I'm just curious.
Let's rephrase it :) The client waits for a max. of TL connection timeout to see if the device is so nice to send a disconnect itself (before restarting). If not, then on timeout the client will send a disconnect to signal the connection to be closed.
Also, in the log, the TL logs a disconnectIndicate at 00:14:10.631 and then a detach at 00:14:14.642 (i.e. 4 seconds later) before it moves on to connect to another destination. Is this delay expected?
I think thats the wait for the restart plus checking if a device is in programming mode (1 + 3 s)? The detach happens here.
The main reasons to use several mgmt clients for the same link is using
for certain mgmt tasks.
Other than that, there is no real need. IIRC you asked some time ago about a different TL connection FSM style, that would be another use-case. (Also, the mgmt procedures implementation synchronizes on the mgmt client on certain tasks to sequence them out. So you could also use several mgmt procedures (i.e., > 1 mgmt clients, even with the same TL) to parallelize those.) *
* Within reason, I understand you know the KNX spec.
Rereading your initial issue description, 3rd point. This should also be OK now? I think it would only happen if the device does not disconnect itself, and is still restarting (i.e. not available) after 6 seconds?
In the Selfbus Updater, we're using calimero (currently version 2.5.1) to send a Master Reset with special Erase Code and Channel to instruct the device to reboot into its bootloader. The code is basically:
On the TP1-based KNX bus we observe the following frames:
Issues we see here:
negative confirmation of 1.1.50: ...
When it happens, that log entry is confusing our users.To somewhat circumvent point 3, we've already implemented a restart delay of 250ms in our device firmware. Still, I want to point it out, because after a Restart request the device can become unavailable, so things like missing acknowledgements are to be expected. Thus, in this situation, log level warning is too high and should be lowered.