calimero-project / calimero-core

Core library for KNX network access and management
Other
126 stars 64 forks source link

ManagementClient Master Reset behavior (frame order, logging) #130

Open dallmair opened 2 months ago

dallmair commented 2 months ago

In the Selfbus Updater, we're using calimero (currently version 2.5.1) to send a Master Reset with special Erase Code and Channel to instruct the device to reboot into its bootloader. The code is basically:

ManagementClientImpl mcDevice = new ManagementClientImpl(link);
Destination dest = mcDevice.createDestination(individualAddress, /*connectionOriented*/ true, /*keepAlive*/ false, /*verifyMode*/ false);
mcDevice.restart(dest, RESTART_ERASE_CODE, RESTART_CHANNEL);
mcDevice.close();

On the TP1-based KNX bus we observe the following frames:

B0 10 3E 11 32 50 80 92                calimero -> device: T_Connect
LL_ACK
BC 10 3E 11 32 53 43 81 07 FF 27       calimero -> device: A_Restart (Master Reset)
LL_ACK
B0 11 32 10 3E 60 C2 E0                device -> calimero: T_Ack
LL_ACK
BC 11 32 10 3E 64 43 A1 00 00 01 C9    device -> calimero: A_Restart_Response (Master Reset Response)
LL_ACK
B0 10 3E 11 32 50 81 93                calimero -> device: T_Disconnect
LL_ACK
B0 11 32 10 3E 60 81 A3                device -> calimero: T_Disconnect
LL_ACK
B0 10 3E 11 32 50 81 93                calimero -> device: T_Disconnect (!)
LL_ACK
B0 10 3E 11 32 50 C2 D0                calimero -> device: T_Ack (!)
LL_ACK
<Usually, device restarts here>

Issues we see here:

  1. Calimero sends two T_Disconnect messages
  2. Calimero sends the T_Ack that corresponds to the Master Reset Response only after the T_Disconnect(s)
  3. When there is high bus load, calimero's T_Ack or even the second T_Disconnect can be delayed to a time when the device already restarts and does not confirm the data frame anymore. Then, calimero logs warnings negative confirmation of 1.1.50: ... When it happens, that log entry is confusing our users.

To somewhat circumvent point 3, we've already implemented a restart delay of 250ms in our device firmware. Still, I want to point it out, because after a Restart request the device can become unavailable, so things like missing acknowledgements are to be expected. Thus, in this situation, log level warning is too high and should be lowered.

bmalinowsky commented 2 months ago

Are there timestamps available somewhere for this sequence? The current restart implementation, after getting the A_restart.res actually waits for a remote T_disconnect first (if any), before sending its own disconnect based on the TL4 connection timeout.

dallmair commented 2 months ago

You bet! With best intentions, I had left out all the potentially confusing information of our low level log. But now that you requested the information, here we go :)

Setup

We're using a DIY Selfbus device with a "busmonitor" firmware to capture what we see on the bus. The device records the timing information in microseconds and sends the log lines via UART and a USB adapter to a RasPi which runs cat /dev/ttyACM0 | ts '"%Y-%m-%d %H:%M:%.S"'

Thus, the timestamp in double quotes at the beginning of each line is the timestamp when the RasPi received the line, not necessarily the exact time when the respective frame was sent on the bus. For most accurate timing information with microsecond precision, use the S, E, and dt RX-RX columns. Especially dt RX-RX is tremendously helpful to see what's going on.

Column Meaning
Timestamp When the line was received by the RasPi
S MCU clock time when transmission of the frame started (start of first byte's start bit), in microseconds since the start of the MCU
E MCU clock time when the frame was finished (end of last byte's stop bit), in microseconds since the start of the MCU
dt RX-RX Delay between end of last and start of current frame
ok For data frames only, whether there was a transmission error (parity error, checksum error)
Content The frame data

The last line is FF, because our devices pull the bus low for a few microseconds when they restart. Our "busmonitor" categorizes this as start bit and therefore sees a single-byte frame. Not nice from a conformance perspective, but very helpful for low level debugging :/

Annotated Log

"2024-05-01 21:34:51.210844" RX : (S1382475000 E1382485623 dt RX-RX:01204182  ok: 0x0000) B0 10 3E 11 32 50 80 92                calimero -> device: T_Connect
"2024-05-01 21:34:51.212912" RX : (S1382487185 E1382488326 dt RX-RX:00001562  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.244011" RX : (S1382503712 E1382518396 dt RX-RX:00015386  ok: 0x0000) BC 10 3E 11 32 53 43 81 07 FF 27       calimero -> device: A_Restart (Master Reset)
"2024-05-01 21:34:51.245937" RX : (S1382519959 E1382521100 dt RX-RX:00001563  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.263506" RX : (S1382526311 E1382536924 dt RX-RX:00005211  ok: 0x0000) B0 11 32 10 3E 60 C2 E0                device -> calimero: T_Ack
"2024-05-01 21:34:51.267802" RX : (S1382538491 E1382539631 dt RX-RX:00001567  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.288008" RX : (S1382545153 E1382561179 dt RX-RX:00005522  ok: 0x0000) BC 11 32 10 3E 64 43 A1 00 00 01 C9    device -> calimero: A_Restart_Response (Master Reset Response)
"2024-05-01 21:34:51.290953" RX : (S1382562744 E1382563885 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.305182" RX : (S1382569095 E1382579719 dt RX-RX:00005210  ok: 0x0000) B0 10 3E 11 32 50 81 93                calimero -> device: T_Disconnect
"2024-05-01 21:34:51.306400" RX : (S1382581285 E1382582425 dt RX-RX:00001566  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.322257" RX : (S1382587636 E1382598249 dt RX-RX:00005211  ok: 0x0000) B0 11 32 10 3E 60 81 A3                device -> calimero: T_Disconnect
"2024-05-01 21:34:51.326083" RX : (S1382599814 E1382600955 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.341203" RX : (S1382606377 E1382616998 dt RX-RX:00005422  ok: 0x0000) B0 10 3E 11 32 50 81 93                calimero -> device: T_Disconnect (!)
"2024-05-01 21:34:51.344190" RX : (S1382618563 E1382619703 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.366379" RX : (S1382631334 E1382641956 dt RX-RX:00011631  ok: 0x0000) B0 10 3E 11 32 50 C2 D0                calimero -> device: T_Ack (!)
"2024-05-01 21:34:51.370010" RX : (S1382643520 E1382644660 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 21:34:51.503153" RX : (S1382777629 E1382778771 dt RX-RX:00132969  ok: 0x0000) FF                                     device restarts
dallmair commented 2 months ago

And here is a TRACE level log of the Updater invocation (including calimero entries) in HTML format. I've kept everything until 3sec after calimero's T_Ack, and stripped everything afterwards as it's not relevant.

updater-2024-05-01_21-34-48.514.zip

dallmair commented 2 months ago

Here is another log, from another user (unfortunately, the timestamps are without milliseconds). In this case, the device sent the T_Disconnect first, but calimero still sent two T_Disconnect and one T_Ack, in that order.

"2024-05-01 06:39:11" RX : (S3389721773 E3389732379 dt RX-RX:00005710  ok: 0x0000) B0 10 FB 11 29 60 80 7C                calimero -> device: T_Connect
"2024-05-01 06:39:11" RX : (S3389733942 E3389735083 dt RX-RX:00001563  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389786982 E3389802996 dt RX-RX:00051899  ok: 0x0000) BC 10 FD 58 0A D4 00 80 66 27 0B E2    (unrelated)
"2024-05-01 06:39:11" RX : (S3389804561 E3389805702 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389811388 E3389826050 dt RX-RX:00005686  ok: 0x0000) BC 10 FB 11 29 63 43 81 07 FF C9       calimero -> device: A_Restart (Master Reset)
"2024-05-01 06:39:11" RX : (S3389827614 E3389828755 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389834143 E3389844757 dt RX-RX:00005388  ok: 0x0000) B0 11 29 10 FB 60 C2 3E                device -> calimero: T_Ack
"2024-05-01 06:39:11" RX : (S3389846322 E3389847463 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389853017 E3389869042 dt RX-RX:00005554  ok: 0x0000) BC 11 29 10 FB 64 43 A1 00 00 01 17    device -> calimero: A_Restart_Response (Master Reset Response)
"2024-05-01 06:39:11" RX : (S3389870608 E3389871749 dt RX-RX:00001566  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389877134 E3389887748 dt RX-RX:00005385  ok: 0x0000) B0 11 29 10 FB 60 81 7D                device -> calimero: T_Disconnect
"2024-05-01 06:39:11" RX : (S3389889313 E3389890455 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389896072 E3389908040 dt RX-RX:00005617  ok: 0x0000) BC 11 5F 31 6F E1 00 81 33             (unrelated)
"2024-05-01 06:39:11" RX : (S3389909605 E3389910746 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389916299 E3389928266 dt RX-RX:00005553  ok: 0x0000) BC 11 5F 31 6E E1 00 80 33             (unrelated)
"2024-05-01 06:39:11" RX : (S3389929830 E3389930971 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389936907 E3389948865 dt RX-RX:00005936  ok: 0x0000) BC 10 FD 12 6D D1 00 00 00             (unrelated)
"2024-05-01 06:39:11" RX : (S3389950429 E3389951570 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389957296 E3389973310 dt RX-RX:00005726  ok: 0x0000) BC 10 FD 58 0B D4 00 80 01 05 18 B5    (unrelated)
"2024-05-01 06:39:11" RX : (S3389974874 E3389976015 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3389981329 E3389991936 dt RX-RX:00005314  ok: 0x0000) B0 10 FB 11 29 60 81 7D                calimero -> device: T_Disconnect
"2024-05-01 06:39:11" RX : (S3389993499 E3389994640 dt RX-RX:00001563  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390000162 E3390012142 dt RX-RX:00005522  ok: 0x0000) BC 11 34 12 6D E1 00 40 B8             (unrelated)
"2024-05-01 06:39:11" RX : (S3390013707 E3390014848 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390020566 E3390032533 dt RX-RX:00005718  ok: 0x0000) BC 11 5F 31 65 E1 00 81 39             (unrelated)
"2024-05-01 06:39:11" RX : (S3390034097 E3390035238 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390041073 E3390053040 dt RX-RX:00005835  ok: 0x0000) BC 11 5F 31 66 E1 00 81 3A             (unrelated)
"2024-05-01 06:39:11" RX : (S3390054604 E3390055745 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390061707 E3390073666 dt RX-RX:00005962  ok: 0x0000) BC 10 FD 12 6C D1 00 00 01             (unrelated)
"2024-05-01 06:39:11" RX : (S3390075230 E3390076372 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390082089 E3390094048 dt RX-RX:00005717  ok: 0x0000) BC 10 FD 12 68 D1 00 00 05             (unrelated)
"2024-05-01 06:39:11" RX : (S3390095613 E3390096754 dt RX-RX:00001565  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390101683 E3390102824 dt RX-RX:00004929  ok: 0x0000) FF                                     device restarts
"2024-05-01 06:39:11" RX : (S3390104395 E3390105536 dt RX-RX:00001571  ok: 0x0000) LL_NACK
"2024-05-01 06:39:11" RX : (S3390110993 E3390121599 dt RX-RX:00005457  ok: 0x0000) B0 10 FB 11 29 60 81 7D                calimero -> device: T_Disconnect (!)
"2024-05-01 06:39:11" RX : (S3390126968 E3390137575 dt RX-RX:00005369  ok: 0x0000) 90 10 FB 11 29 60 81 5D                repeat
"2024-05-01 06:39:11" RX : (S3390142992 E3390153598 dt RX-RX:00005417  ok: 0x0000) 90 10 FB 11 29 60 81 5D                repeat
"2024-05-01 06:39:11" RX : (S3390158973 E3390169579 dt RX-RX:00005375  ok: 0x0000) 90 10 FB 11 29 60 81 5D                repeat
"2024-05-01 06:39:11" RX : (S3390175100 E3390187079 dt RX-RX:00005521  ok: 0x0000) BC 11 34 12 68 E1 00 40 BD             (unrelated)
"2024-05-01 06:39:11" RX : (S3390188643 E3390189784 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390196060 E3390208018 dt RX-RX:00006276  ok: 0x0000) BC 10 FD 12 6B D1 00 00 06             (unrelated)
"2024-05-01 06:39:11" RX : (S3390209582 E3390210723 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-01 06:39:11" RX : (S3390216109 E3390226714 dt RX-RX:00005386  ok: 0x0000) B0 10 FB 11 29 60 C2 3E                calimero -> device: T_Ack (!)
"2024-05-01 06:39:11" RX : (S3390232106 E3390242712 dt RX-RX:00005392  ok: 0x0000) 90 10 FB 11 29 60 C2 1E                repeat
"2024-05-01 06:39:11" RX : (S3390248075 E3390258681 dt RX-RX:00005363  ok: 0x0000) 90 10 FB 11 29 60 C2 1E                repeat
"2024-05-01 06:39:11" RX : (S3390264081 E3390274687 dt RX-RX:00005400  ok: 0x0000) 90 10 FB 11 29 60 C2 1E                repeat
"2024-05-01 06:39:11" RX : (S3390280316 E3390292292 dt RX-RX:00005629  ok: 0x0000) BC 11 32 12 6C E1 00 40 BF             (unrelated)

Corresponding TRACE level log in HTML format, again stripped after the relevant part:

updater-2024-05-01_06-39-08.079.zip

bmalinowsky commented 2 months ago

Thank you for the detailed logs! I first thought it might have to do with some odd timeout sequence, but the disconnect happens almost immediately. The reason you are seeing this behavior is because there are two competing TL instances here. It's a little unfortunate, the javadoc for the behavior of L4 connections like in that case is sparse.

You can add another constructor like this

public SBManagementClientImpl(final KNXNetworkLink link, final SBManagementClientImpl mc)
        throws KNXLinkClosedException {
    super(link, mc.transportLayer());
    ...

and use it to create the local restart instance mcDevice; it shares the TL logic and will fix it.

dallmair commented 2 months ago

Thanks for the feedback! As I've mostly worked on the device firmware so far and still need to get a better understanding of the Updater implementation, I did not see that we're actually creating another Transport Layer instance for the restart, so your pointer is of course spot-on and highly appreciated.

When I implemented your suggestion in commit 99e8447, I wondered whether it would be possible (and maybe even better?) to use the existing ManagementClientImpl instance for the device restart, and just create another Destination. Thus, I've also implemented that approach (see commit ab8ac6d).

Both of these implementations exhibit the same behavior: calimero does not send any T_Disconnect anymore. Here's our low level bus trace:

"2024-05-06 00:14:09.430514" RX : (S0059200072 E0059210695 dt RX-RX:00005422  ok: 0x0000) B0 10 3E 11 32 50 80 92                calimero -> device: T_Connect
"2024-05-06 00:14:09.432537" RX : (S0059212258 E0059213399 dt RX-RX:00001563  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.454498" RX : (S0059219341 E0059234003 dt RX-RX:00005942  ok: 0x0000) BC 11 07 1B 0B D3 00 80 0C C9 D3       (unrelated)
"2024-05-06 00:14:09.456488" RX : (S0059235569 E0059236709 dt RX-RX:00001566  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.478550" RX : (S0059242443 E0059257128 dt RX-RX:00005734  ok: 0x0000) BC 10 3E 11 32 53 43 81 07 FF 27       calimero -> device: A_Restart (Master Reset)
"2024-05-06 00:14:09.479534" RX : (S0059258691 E0059259832 dt RX-RX:00001563  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.495621" RX : (S0059265043 E0059275655 dt RX-RX:00005211  ok: 0x0000) B0 11 32 10 3E 60 C2 E0                device -> calimero: T_Ack
"2024-05-06 00:14:09.497547" RX : (S0059277223 E0059278363 dt RX-RX:00001568  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.520573" RX : (S0059283885 E0059299910 dt RX-RX:00005522  ok: 0x0000) BC 11 32 10 3E 64 43 A1 00 00 01 C9    device -> calimero: A_Restart_Response (Master Reset Response)
"2024-05-06 00:14:09.522472" RX : (S0059301476 E0059302617 dt RX-RX:00001566  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.538483" RX : (S0059307826 E0059318439 dt RX-RX:00005209  ok: 0x0000) B0 11 32 10 3E 60 81 A3                device -> calimero: T_Disconnect
"2024-05-06 00:14:09.540508" RX : (S0059320005 E0059321146 dt RX-RX:00001566  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.557460" RX : (S0059326568 E0059337190 dt RX-RX:00005422  ok: 0x0000) B0 10 3E 11 32 50 C2 D0                calimero -> device: T_Ack
"2024-05-06 00:14:09.559489" RX : (S0059338754 E0059339894 dt RX-RX:00001564  ok: 0x0000) LL_ACK
"2024-05-06 00:14:09.737517" RX : (S0059516817 E0059517959 dt RX-RX:00176923  ok: 0x0000) FF

Corresponding calimero/Updater log: updater-2024-05-06_00-14-07.157.zip

Is that the expected behavior? Based on your comment a few days ago,

waits for a remote T_disconnect first (if any), before sending its own disconnect based on the TL4 connection timeout.

I had expected to still see one T_Disconnect originating from calimero. But I'm not sure it's necessary and I'm just curious.

Also, in the log, the TL logs a disconnectIndicate at 00:14:10.631 and then a detach at 00:14:14.642 (i.e. 4 seconds later) before it moves on to connect to another destination. Is this delay expected?

Which approach do you as the author of calimero recommend: Using two different ManagementClientImpl instances with the same TransportLayer, or just a single ManagementClientImpl?

bmalinowsky commented 2 months ago

waits for a remote T_disconnect first (if any), before sending its own disconnect based on the TL4 connection timeout.

I had expected to still see one T_Disconnect originating from calimero. But I'm not sure it's necessary and I'm just curious.

Let's rephrase it :) The client waits for a max. of TL connection timeout to see if the device is so nice to send a disconnect itself (before restarting). If not, then on timeout the client will send a disconnect to signal the connection to be closed.

Also, in the log, the TL logs a disconnectIndicate at 00:14:10.631 and then a detach at 00:14:14.642 (i.e. 4 seconds later) before it moves on to connect to another destination. Is this delay expected?

I think thats the wait for the restart plus checking if a device is in programming mode (1 + 3 s)? The detach happens here.

The main reasons to use several mgmt clients for the same link is using

for certain mgmt tasks.

Other than that, there is no real need. IIRC you asked some time ago about a different TL connection FSM style, that would be another use-case. (Also, the mgmt procedures implementation synchronizes on the mgmt client on certain tasks to sequence them out. So you could also use several mgmt procedures (i.e., > 1 mgmt clients, even with the same TL) to parallelize those.) *

* Within reason, I understand you know the KNX spec.

bmalinowsky commented 1 month ago

Rereading your initial issue description, 3rd point. This should also be OK now? I think it would only happen if the device does not disconnect itself, and is still restarting (i.e. not available) after 6 seconds?