eclipse-californium / californium

CoAP/DTLS Java Implementation
https://www.eclipse.org/californium/
Other
729 stars 367 forks source link

Multicast & Unicast issue "via different multicast groups" warning and message dropped #1567

Closed kentsimp closed 3 years ago

kentsimp commented 3 years ago

I am using Californium as a Coap client/server, running on a linux OS, communicating with thread nodes.

My application receives unicast and multicast messages on the default port.

Everything works fine most of the time, but occasionally there appears to be an issue with message delivery up the stack to my endpoint. When it happens I see this message:

15:41:49.009 [CoapServer(main)#4] WARN  o.e.c.core.network.UdpMatcher - [n/a] received request CON-POST   MID=27756, Token=E4C8, OptionSet={"Uri-Path":"logon"} via different multicast groups ([ff03:0:0:0:0:0:0:1%wpan0]:5683 != null)!

I am using:

        <dependency>
            <groupId>org.eclipse.californium</groupId>
            <artifactId>californium-core</artifactId>
            <version>2.4.1</version>
        </dependency>

I found this comment in UdpMulticastConnector.java, but I am not sure how to interpret this, in that it seems to work most of the time for me.

Does this mean it is not possible to have a reliable implementation of unicast and multicast server on the same port?

The message is not delivered to my coap endpoint code.

The problem also appears to fix itself and message delivery starts functioning again without intervention. Is it possible that some network topology change is causing it?

What are my options to treat this problem?

Is there some multicast group configuration I have implemented incorrectly?

My multicast server setup:

    private boolean createEndpoints(final CoapServer server, final int multicastPort, final NetworkConfig config) throws SocketException, UnknownHostException {

        final NetworkInterface networkInterface = NetworkInterfaceUtil.getMulticastInterface();
        if (networkInterface == null) {
            LOGGER.warn("No multicast network-interface found!");
            return false;
        }
        LOGGER.info("Multicast Network Interface: {}", networkInterface.getDisplayName());

        if (NetworkInterfaceUtil.isAnyIpv6()) {
            final Optional<Inet6Address> meshAddress = NetworkInterfaceUtil.getMeshAddress();
            if (!meshAddress.isPresent()) {
                LOGGER.info("No mesh address available");
                return false;
            }

            final UDPConnector udpConnector = new UDPConnector(new InetSocketAddress(meshAddress.get(), multicastPort));
            udpConnector.setReuseAddress(true);
            final CoapEndpoint coapEndpoint = new CoapEndpoint.Builder()
                    .setNetworkConfig(config)
                    .setConnector(udpConnector)
                    .build();

            final InetAddress address = new InetSocketAddress("[ff03::1%wpan0]", 0).getAddress();

            final UdpMulticastConnector.Builder builder = new UdpMulticastConnector.Builder()
                    .setLocalAddress(address, multicastPort)
                    .addMulticastGroup(address, networkInterface);

            createReceiver(builder, coapEndpoint);

            server.addEndpoint(coapEndpoint);
        }

        final UDPConnector udpConnector = new UDPConnector(new InetSocketAddress(InetAddress.getLoopbackAddress(), multicastPort));
        udpConnector.setReuseAddress(true);

        final CoapEndpoint coapEndpoint = new CoapEndpoint.Builder()
                .setNetworkConfig(config)
                .setConnector(udpConnector)
                .build();

        server.addEndpoint(coapEndpoint);

        LOGGER.info("Multicast Server (IPv6) started");

        return true;
    }

TIA

boaks commented 3 years ago

Let me first mention, that if you not plan to use it in "production" soon, I would appreciate, if you merge to the 3.0.0-SNAPSHOT (master), because I slightly redesigned the multicast stuff. (Even if I have to confess, that this can't solve the general issue with java and multicast.)

I found this comment in UdpMulticastConnector.java, but I am not sure how to interpret this, in that it seems to work most of the time for me.

Does this mean it is not possible to have a reliable implementation of unicast and multicast server on the same port?

My experience was, that sometimes it seems to work, and sometimes not. My assumption was, that this depends on the OS (including the version) and the exact usage of java (and the version of java). That assumption may be wrong. But I didn't found an other solution.

In your case, it may be even a different problem.

UdpMatcher - [n/a] received request CON-POST MID=27756, Token=E4C8, OptionSet={"Uri-Path":"logon"} via different multicast groups ([ff03:0:0:0:0:0:0:1%wpan0]:5683 != null)!

For me the MID=27756 and the CON points to some details of the coap-client. I guess, that is not Californium in that case.

RFC7252 - Multicast - Messaging Layer

"Such multicast requests MUST be Non-confirmable."

"the sender MUST avoid using a Message ID that is also still active from this endpoint with any unicast endpoint that might receive the multicast message."

The second one is the difficult one. I'm not sure, if the client implementation is aware of that. In Californium we choose to use a separate range for that (see NetworkConfig.Keys.MULTICAST_BASE_MID ).

The issue may therefore just be, that the MID is somehow simply in use and so the duplicate is a false detection.

I started with, please switch to 3.0.0-SNAPSHOT. If this is possible for you, please provide then new logs. We will see, if this is a setup or a MID issue. See also MIGRATION_HINTS, for some assistance on migrating to 3.0.

boaks commented 3 years ago

Just to mention:

If the logging level DEBUG is activated for the UDPConnector, the logs should also show, if the same requests are received by different local-addresses (unicast/multicast), or not, and the MID collision is caused by the client.

kentsimp commented 3 years ago

Thanks @boaks

I have upgraded to 3.0.0-SNAPSHOT. I will also enable DEBUG logging on org.eclipse.californium.

Our client is Open Thread.

We haven't been able to reproduce the same problem so far on 3.0.0-SNAPSHOT, but I have a couple of other things to debug on 3.0.0 so I will reply in a couple of days with an update.

boaks commented 3 years ago

Let me try to sum up, what may have caused the MID collision:

coap-clients usually use MIDs in the scope of a destination. MIDs must be unique per destination for the EXCHANGE_LIFETIME (default 247s). For multicast-destinations, that must be true as well, which causes some "pain".

Assume a coap-client sends unicast-requests to several servers, then none of the MIDs could be used for a multicast-request, otherwise, the MID gets ambig.

Simple implementations may now process the MID for multicast destinations just in the same way, as for unicast destinations.

coap-client  -- 10.20.30.40:5683 ; MID = 147 -- > coap-server-1
coap-client  -- 10.20.30.50:5683 ; MID = 147 -- > coap-server-2
...
coap-client  -- 224.0.1.187:5683 ; MID = 147 -- > coap-server-n

The first use of MID 147 is always OK. The second is OK, because it's a different server (at least, as long as no "virtual hosts" are involved). The third may be considered by a simple implementation to be OK, but multicast destinations (224.0.1.187) don't address a different server, and so that approach fails.

Assuming "virtual hosts" makes even the second request unsafe. Therefore, sometimes it seems the be better to use separate client endpoints for each destination. In your case, you may fix your communication, if you use two client-endpoints, one for multicast, and one for the unicast. As long, as the coap-server is IP-reachable (NATs makes it sometimes more complex), the two client-endpoints will work better. Only if the some components on the route blocks such a separate client-endpoint, the a common endpoint must be used on the client-side as well. If the common endpoint must be used, the MID for multicast must be handled in a special way, e.g. with a reserved MID scope for multicast destinations. Otherwise it easily collides with MIDs in use.

boaks commented 3 years ago

@kentsimp

Any news? If my assumption about the client is right, I may adapt the warning to point better to the different root-causes.

kentsimp commented 3 years ago

Hi @boaks

Sorry we still haven't been able to confirm as our system has gone into a live pilot (half way around the world) so we have been supporting that on 2.4.1.

We were unable to reproduce the original problem on 3.0.0, but we did have other problems with the client (seemed to be not setting the block option m flag correctly). Although due to the pilot I have been unable to get the necessary tcpdump files to confirm and raise this.

In summary, we are still not sure whether the client ID is the problem, as with 3.0.0 we can't reproduce the issue (good and bad for us).

boaks commented 3 years ago

Can you reproduce it with 2.4.1?

boaks commented 3 years ago

Just to mention:

If my assumption with the MID is right, the probability is not too high. Except, you have blockwise transfers to that peer using unicast. That usually allocates a larger MID range. If now the multicast request is sent, that probability may be quite high.

kentsimp commented 3 years ago

It was quite reproducible previously on 2.4.1, so we can roll back and try that too.

Confirming that we are also doing blockwise transfers.

Thanks again for your help @boaks - I will update the issue as soon as I have more information to share.

boaks commented 3 years ago

If it is reproducible with 2.4.1, we may also adapt the logging there to see some more details. Lets see ...

boaks commented 3 years ago

PR #1581 I added some more information to the logging for 3.0.0. May be , that helps then for other cases, if this one doesn't occur with the 3.0.0. It compares also the tokens and adds the receive-time delta. If the tokens are different, it's mostly a client, which handles MID for multicast a little too simple. Such clients should consider, if a separate coap/udp endpoint could be used. The receive-time delta is similar. Depending on the amount of messages and the cpu-power of the server, the delta should be quite small for "messages received multiple times by different interfaces". If the time is larger, that points more to different request using the same MID (and token).

kentsimp commented 3 years ago

I finally got the TCP dumps for the 3.0.0-SNAPSHOT block option issue (please let me know if you want me to raise this as a new issue).

The issue is that our OTA software update isn't working in 3.0.0 because the more flag is not being set on the block 2 option, after the first block transfer.

We are not able to run the same scenario with 3.0.0 (do an OTA update, then bring a node online) due to this issue, which may be why we can't reproduce the original MID collision issue on 3.0.0.

My code which sets up the response is below, and remains unchanged between the two versions of californium (3.0.0-SNAPSHOT which fails, and 2.4.1 which succeeds):

protected void respond(final CoapExchange exchange, final CoAP.ResponseCode responseCode, final byte[] payload, final BlockOption block2Option) {
    final Response response = new Response(responseCode);
    response.setPayload(payload);
    response.getOptions().setBlock2(block2Option);
    exchange.respond(response);
    final String payloadString = new String(Optional.ofNullable(payload).orElse(new byte[]{}));

    LOGGER.info("[WPCS {}] --> #{} /{}, payload={}, coapStatus={}",
            this.pcsAddressMapper.getWpcsId(exchange.getSourceAddress().getHostAddress()),
            this.threadContext.get(),
            this.endpointName,
            StringUtils.left(payloadString, 24).concat(payloadString.length() > 24 ? "..." : EMPTY),
            responseCode);
}

An extract from the working version: 2.4.1

19:09:21.736509 IP6 Khadas.39399 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 90
    0x0000:  6003 cea7 0062 1140 fd7f 4570 1d38 eda6  `....b.@..Ep.8..
    0x0010:  0000 00ff fe00 5800 fd7f 4570 1d38 eda6  ......X...Ep.8..
    0x0020:  7695 70cf a63a 895a 99e7 1633 0062 b9c6  v.p..:.Z...3.b..
    0x0030:  4802 01a6 9420 bbfa 4a2a 6221 b775 7067  H.......J*b!.upg
    0x0040:  7261 6465 4d13 7570 6772 6164 6555 7269  radeM.upgradeUri
    0x0050:  3d66 696c 652f 4a4b 3663 3134 5459 6e57  =file/JK6c14TYnW
    0x0060:  4765 5154 7162 0d00 7374 6f72 6167 6553  GeQTqb..storageS
    0x0070:  6c6f 743d 300d 0662 6163 6b75 7053 746f  lot=0..backupSto
    0x0080:  7261 6765 536c 6f74 3d31                 rageSlot=1
19:09:21.843928 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 44
    0x0000:  6000 0000 0034 1140 fd7f 4570 1d38 eda6  `....4.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 0034 2bd1  X..V..H..3.3.4+.
    0x0030:  4201 2563 1c49 b466 696c 65c1 04ff 6669  B.%c.I.file...fi
    0x0040:  6c65 6e61 6d65 3d66 696c 652f 4a4b 3663  lename=file/JK6c
    0x0050:  3134 5459 6e57 4765 5154 7162            14TYnWGeQTqb
19:09:21.848993 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 266
    0x0000:  6001 b43b 0112 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 0112 f85d  v.p..:.Z.3.3...]
    0x0030:  6245 2563 1c49 d10a 0cff eb17 a603 0800  bE%c.I..........
    0x0040:  0000 0000 0003 0000 0000 f40a 0af4 1c00  ................
    0x0050:  0000 0200 0000 0100 0000 0000 0000 0000  ................
    0x0060:  0000 0000 0000 0000 0000 0000 0000 f509  ................
    0x0070:  09f5 ec34 0000 0000 0b01 0008 e10f 007c  ...4...........|
    0x0080:  0020 a508 e10f a108 e10f 8909 e10f a108  ................
    0x0090:  e10f a108 e10f a108 e10f a108 e10f a108  ................
    0x00a0:  e10f a108 e10f 203a e10f a108 e10f a108  .......:........
    0x00b0:  e10f 603a e10f a108 e10f a108 e10f 10b5  ..`:............
    0x00c0:  054c 2378 33b9 044b 13b1 0448 aff3 0080  .L#x3..K...H....
    0x00d0:  0123 2370 10bd 0801 0020 0000 0000 d43c  .##p...........<
    0x00e0:  e10f 08b5 034b 1bb1 0349 0348 aff3 0080  .....K...I.H....
    0x00f0:  08bd 0000 0000 0c01 0020 d43c e10f 044b  ...........<...K
    0x0100:  054a 9a60 d3f8 8820 42f4 7002 c3f8 8820  .J.`....B.p.....
    0x0110:  7047 00ed 00e0 0008 e10f 7047 7047 fee7  pG........pGpG..
    0x0120:  0000 08b5 fff7 ebff 00f0 d7f8 0a4a 0a4b  .............J.K
    0x0130:  0b49 8b42 06d3 0a4b 0b4a                 .I.B...K.J
19:09:21.850263 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.39399: UDP, length 12
    0x0000:  6000 0000 0014 1140 fd7f 4570 1d38 eda6  `......@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  0000 00ff fe00 5800 1633 99e7 0014 dfc2  ......X..3......
    0x0030:  6843 01a6 9420 bbfa 4a2a 6221            hC......J*b!
19:09:22.001067 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 51
    0x0000:  6000 0000 003b 1140 fd7f 4570 1d38 eda6  `....;.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 003b b12a  X..V..H..3.3.;.*
    0x0030:  4202 2564 ab05 b864 6f77 6e6c 6f61 64ff  B.%d...download.
    0x0040:  7072 6f67 7265 7373 3d32 3536 2665 7569  progress=256&eui
    0x0050:  3634 3d44 3043 4635 4546 4646 4538 3734  64=D0CF5EFFFE874
    0x0060:  4537 38                                  E78
19:09:22.006332 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 6
    0x0000:  6001 b43b 000e 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 000e 3edc  v.p..:.Z.3.3..>.
    0x0030:  625f 2564 ab05                           b_%d..
19:09:22.016642 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 44
    0x0000:  6000 0000 0034 1140 fd7f 4570 1d38 eda6  `....4.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 0034 472e  X..V..H..3.3.4G.
    0x0030:  4201 2565 f0e9 b466 696c 65c1 14ff 6669  B.%e...file...fi
    0x0040:  6c65 6e61 6d65 3d66 696c 652f 4a4b 3663  lename=file/JK6c
    0x0050:  3134 5459 6e57 4765 5154 7162            14TYnWGeQTqb
19:09:22.022737 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 266
    0x0000:  6001 b43b 0112 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 0112 2909  v.p..:.Z.3.3..).
    0x0030:  6245 2565 f0e9 d10a 1cff 0021 9342 06d3  bE%e.......!.B..
    0x0040:  00f0 65f8 52f8 040f 43f8 040b f1e7 43f8  ..e.R...C.....C.
    0x0050:  041b f3e7 00bf d43c e10f 0001 0020 0801  .......<........
    0x0060:  0020 0801 0020 4401 0020 0020 7047 7047  ......D.....pGpG
    0x0070:  01f0 82bd 01f0 e9be 0000 30b5 1c4b 87b0  ..........0..K..
    0x0080:  01aa 03f1 1005 1868 5968 1446 03c4 0833  .......hYh.F...3
    0x0090:  ab42 2246 f7d1 1b78 164a 2370 d2f8 9430  .B"F...x.J#p...0
    0x00a0:  03f0 0703 022b 1ed0 1369 c3f3 0443 102b  .....+...i...C.+
    0x00b0:  1bd8 06a9 0b44 0f49 13f8 140c 0f4b 4343  .....D.I.....KCC
    0x00c0:  8b42 06d1 1169 0122 c1f3 4161 8a40 b3fb  .B...i."..Aa.@..
    0x00d0:  f2f3 074a d2f8 0001 c0f3 0420 0130 b3fb  ...J.........0..
    0x00e0:  f0f0 07b0 30bd 054b f3e7 054b f1e7 0c3a  ....0..K...K...:
    0x00f0:  e10f 0040 0e40 0009 3d00 4042 0f00 00f0  ...@.@..=.@B....
    0x0100:  4902 c0ea 2101 4ff4 0170 02f0 78bf 2a4b  I...!.O..p..x.*K
    0x0110:  07b5 0122 5a64 294b db6c 1b0e 0fd1 2849  ..."Zd)K.l....(I
    0x0120:  2948 4af6 e852 0a60 0268 42f0 0402 0260  )HJ..R.`.hB....`
    0x0130:  0b60 254a 1368 43f4 0053                 .`%J.hC..S
19:09:22.120106 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 44
    0x0000:  6000 0000 0034 1140 fd7f 4570 1d38 eda6  `....4.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 0034 2bc5  X..V..H..3.3.4+.
    0x0030:  4201 2566 fc51 b466 696c 65c1 24ff 6669  B.%f.Q.file.$.fi
    0x0040:  6c65 6e61 6d65 3d66 696c 652f 4a4b 3663  lename=file/JK6c
    0x0050:  3134 5459 6e57 4765 5154 7162            14TYnWGeQTqb
19:09:22.131180 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 266
    0x0000:  6001 b43b 0112 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 0112 b6a6  v.p..:.Z.3.3....
    0x0030:  6245 2566 fc51 d10a 2cff 1360 01e0 012b  bE%f.Q..,..`...+
    0x0040:  f7d0 224a 136a 23f4 4413 43f4 8813 1362  .."J.j#.D.C....b
    0x0050:  a2f5 8042 1368 43f4 8073 1360 02f5 4652  ...B.hC..s.`..FR
    0x0060:  2432 1368 43f4 f843 1360 194b 4022 1a61  $2.hC..C.`.K@".a
    0x0070:  5a68 5203 fcd4 fff7 7bff 02f0 39ff 01f0  ZhR.....{...9...
    0x0080:  b0fe 30b9 40f2 0120 02f0 39ff 4ff4 0170  ..0.@.....9.O..p
    0x0090:  0ae0 104b 0193 019b 002b 0bdc 40f2 0543  ...K.....+..@..C
    0x00a0:  9842 f3d1 40f2 0320 02f0 29ff 0020 03b0  .B..@.....).....
    0x00b0:  5df8 04fb 019b 013b ece7 0000 0e40 b081  ]......;.....@..
    0x00c0:  e00f 9031 0e40 9831 0e40 6431 0e40 0040  ...1.@.1.@d1.@.@
    0x00d0:  0e40 0030 0e40 0035 0c00 13b5 234b 0124  .@.0.@.5....#K.$
    0x00e0:  8df8 0640 8df8 0740 5b68 5b05 0fd5 02f0  ...@...@[h[.....
    0x00f0:  26ff a0f2 0220 80b2 0728 08d8 8440 14f0  &........(...@..
    0x0100:  c90f 1ebf 0023 8df8 0630 8df8 0730 0423  .....#...0...0.#
    0x0110:  1b68 0133 08d1 4ff4 0270 02f0 e2fe 0023  .h.3..O..p.....#
    0x0120:  8df8 0630 8df8 0730 9df8 0730 5bb1 0020  ...0...0...0[...
    0x0130:  02f0 0dfe 8df8 0600 9df8                 ..........

See OTA good trace block responses above 
6245 2563 1c49 d10a 0cff    = block 0 more flag set
6245 2565 f0e9 d10a 1cff    = block 1 more flag set
6245 2566 fc51 d10a 2cff    = block 2 more flag set

And the bad version with 3.0.0-SNAPSHOT

18:33:10.449982 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 44
    0x0000:  6000 0000 0034 1140 fd7f 4570 1d38 eda6  `....4.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 0034 4785  X..V..H..3.3.4G.
    0x0030:  4201 3798 ee5f b466 696c 65c1 04ff 6669  B.7.._.file...fi
    0x0040:  6c65 6e61 6d65 3d66 696c 652f 4a4b 3663  lename=file/JK6c
    0x0050:  3134 5459 6e57 4765 5154 7162            14TYnWGeQTqb
18:33:10.455111 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 266
    0x0000:  6001 b43b 0112 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 0112 1412  v.p..:.Z.3.3....
    0x0030:  6245 3798 ee5f d10a 0cff eb17 a603 0800  bE7.._..........
    0x0040:  0000 0000 0003 0000 0000 f40a 0af4 1c00  ................
    0x0050:  0000 0200 0000 0100 0000 0000 0000 0000  ................
    0x0060:  0000 0000 0000 0000 0000 0000 0000 f509  ................
    0x0070:  09f5 ec34 0000 0000 0b01 0008 e10f 007c  ...4...........|
    0x0080:  0020 a508 e10f a108 e10f 8909 e10f a108  ................
    0x0090:  e10f a108 e10f a108 e10f a108 e10f a108  ................
    0x00a0:  e10f a108 e10f 203a e10f a108 e10f a108  .......:........
    0x00b0:  e10f 603a e10f a108 e10f a108 e10f 10b5  ..`:............
    0x00c0:  054c 2378 33b9 044b 13b1 0448 aff3 0080  .L#x3..K...H....
    0x00d0:  0123 2370 10bd 0801 0020 0000 0000 d43c  .##p...........<
    0x00e0:  e10f 08b5 034b 1bb1 0349 0348 aff3 0080  .....K...I.H....
    0x00f0:  08bd 0000 0000 0c01 0020 d43c e10f 044b  ...........<...K
    0x0100:  054a 9a60 d3f8 8820 42f4 7002 c3f8 8820  .J.`....B.p.....
    0x0110:  7047 00ed 00e0 0008 e10f 7047 7047 fee7  pG........pGpG..
    0x0120:  0000 08b5 fff7 ebff 00f0 d7f8 0a4a 0a4b  .............J.K
    0x0130:  0b49 8b42 06d3 0a4b 0b4a                 .I.B...K.J
18:33:10.457175 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.55367: UDP, length 12
    0x0000:  6000 0000 0014 1140 fd7f 4570 1d38 eda6  `......@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  0000 00ff fe00 5800 1633 d847 0014 5fcc  ......X..3.G.._.
    0x0030:  6843 7d02 54b8 0d79 46c7 19a8            hC}.T..yF...
18:33:10.631980 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 51
    0x0000:  6000 0000 003b 1140 fd7f 4570 1d38 eda6  `....;.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 003b 3028  X..V..H..3.3.;0(
    0x0030:  4202 3799 19d3 b864 6f77 6e6c 6f61 64ff  B.7....download.
    0x0040:  7072 6f67 7265 7373 3d32 3536 2665 7569  progress=256&eui
    0x0050:  3634 3d44 3043 4635 4546 4646 4538 3734  64=D0CF5EFFFE874
    0x0060:  4537 38                                  E78
18:33:10.639165 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 6
    0x0000:  6001 b43b 000e 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 000e bdd9  v.p..:.Z.3.3....
    0x0030:  625f 3799 19d3                           b_7...
18:33:10.642845 IP6 fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683 > Khadas.5683: UDP, length 44
    0x0000:  6000 0000 0034 1140 fd7f 4570 1d38 eda6  `....4.@..Ep.8..
    0x0010:  7695 70cf a63a 895a fd7f 4570 1d38 eda6  v.p..:.Z..Ep.8..
    0x0020:  5818 f556 18e7 48d8 1633 1633 0034 f9ff  X..V..H..3.3.4..
    0x0030:  4201 379a 2be3 b466 696c 65c1 14ff 6669  B.7.+..file...fi
    0x0040:  6c65 6e61 6d65 3d66 696c 652f 4a4b 3663  lename=file/JK6c
    0x0050:  3134 5459 6e57 4765 5154 7162            14TYnWGeQTqb
18:33:10.647487 IP6 Khadas.5683 > fd7f:4570:1d38:eda6:7695:70cf:a63a:895a.5683: UDP, length 266
    0x0000:  6001 b43b 0112 1140 fd7f 4570 1d38 eda6  `..;...@..Ep.8..
    0x0010:  5818 f556 18e7 48d8 fd7f 4570 1d38 eda6  X..V..H...Ep.8..
    0x0020:  7695 70cf a63a 895a 1633 1633 0112 e3da  v.p..:.Z.3.3....
    0x0030:  6245 379a 2be3 d10a 14ff 0021 9342 06d3  bE7.+......!.B..
    0x0040:  00f0 65f8 52f8 040f 43f8 040b f1e7 43f8  ..e.R...C.....C.
    0x0050:  041b f3e7 00bf d43c e10f 0001 0020 0801  .......<........
    0x0060:  0020 0801 0020 4401 0020 0020 7047 7047  ......D.....pGpG
    0x0070:  01f0 82bd 01f0 e9be 0000 30b5 1c4b 87b0  ..........0..K..
    0x0080:  01aa 03f1 1005 1868 5968 1446 03c4 0833  .......hYh.F...3
    0x0090:  ab42 2246 f7d1 1b78 164a 2370 d2f8 9430  .B"F...x.J#p...0
    0x00a0:  03f0 0703 022b 1ed0 1369 c3f3 0443 102b  .....+...i...C.+
    0x00b0:  1bd8 06a9 0b44 0f49 13f8 140c 0f4b 4343  .....D.I.....KCC
    0x00c0:  8b42 06d1 1169 0122 c1f3 4161 8a40 b3fb  .B...i."..Aa.@..
    0x00d0:  f2f3 074a d2f8 0001 c0f3 0420 0130 b3fb  ...J.........0..
    0x00e0:  f0f0 07b0 30bd 054b f3e7 054b f1e7 0c3a  ....0..K...K...:
    0x00f0:  e10f 0040 0e40 0009 3d00 4042 0f00 00f0  ...@.@..=.@B....
    0x0100:  4902 c0ea 2101 4ff4 0170 02f0 78bf 2a4b  I...!.O..p..x.*K
    0x0110:  07b5 0122 5a64 294b db6c 1b0e 0fd1 2849  ..."Zd)K.l....(I
    0x0120:  2948 4af6 e852 0a60 0268 42f0 0402 0260  )HJ..R.`.hB....`
    0x0130:  0b60 254a 1368 43f4 0053                 .`%J.hC..S
See OTA good trace block responses above 
6245 3798 ee5f d10a 0cff    = block 0 more flag set
6245 379a 2be3 d10a 14ff    = block 1 more flag not set so transfer requests stop
boaks commented 3 years ago

Thanks for reporting that!

The advantage using captures is, that it's easier for me to analyze the packages "on my own". (if possible, please select the records and export them and zip all before upload the capture.)

Do I assume right, that the thread-node is the coap-client? and Californium the coap-server?

boaks commented 3 years ago

About the detection of messages received multiple times. I retested the feature and found, that I missed to update the code using the new "localAddress". PR #1591 fixes that.

boaks commented 3 years ago

if your "response" method is called for single blocks (otherwise I'm curious about your logs), then you have bypassed Californium transparent blockwise mode.

response.getOptions().setBlock2(block2Option);

If the block option2 is wrong, I'm not sure, where this argument is from.

boaks commented 3 years ago

Which value do you use for "MAX_RESOURCE_BODY_SIZE" in "Californium.properties"?

boaks commented 3 years ago

Just to add:

There is a blockwise interoperability test for blockwise transfers between libcoap and Californium. That doesn't show any issue. I guess, the transparent blockwise mode was disabled more or less unintended. If you want to handle each block on your own code, then it's up to your code to provide the right options. But I don't get it, why that has worked with the 2.4.1.

boaks commented 3 years ago

Did you had a look at the cf-simplefile-server? Maybe that helps.

boaks commented 3 years ago

Sorry sorry, but your code snippet and your logs (which are not logged by that snippet) has been irritating me.

I guess getLimitedBlockOption doesn't handle incoming block2 options well. I will investigate and fix that. maybe tomorrow or after Easter.

boaks commented 3 years ago

I checked also some unit test.

BlockwiseServerSideTest

But for me, also this tests doesn't show any issues. My feeling is, without the capture and infos about the used block2Option in the "respond" above, it's unclear, what went wrong. I'm not sure, why the block2Option is passed in at all. Usually, if the transparent blockwise mode is used, that should fill in the block option on it's own.

boaks commented 3 years ago
==[ CoAP Request ]=============================================
MID    : 14232
Token  : EE5F
Type   : CON
Method : 0.01 - GET
Options: {"Uri-Path":"file", "Block2":"(szx=4/256, m=false, num=0)"}
Payload: 30 Bytes
---------------------------------------------------------------
filename=file/JK6c14TYnWGeQTqb

FMPOV: That is not a optimal REST design. You put information into the payload, which are elementary for processing the request.

If you want to use the "transparent mode", please ensure, that the URI (path+query) is unique for each blockwise resources (e.g.: "/file?filename=file/JK6c14TYnWGeQTqb" or "/file/JK6c14TYnWGeQTqb"). Resource can catch all request to sub-resources, see SimpleFileServer - FileResource.

If you want to use the "none transparent mode", then it should be up to your code to provide the right block2 option.

kentsimp commented 3 years ago

Do I assume right, that the thread-node is the coap-client? and Californium the coap-server?

This is correct.

if your "response" method is called for single blocks (otherwise I'm curious about your logs), then you have bypassed Californium transparent blockwise mode.

The answer is... I was not aware there was another way to do this, this was just how I solved the problem for our OTA updates. Since our system has just gone to a live pilot, I would prefer to make as little change as possible to get this working - it sounds like I will need to revisit the transparent mode option later.

Which value do you use for "MAX_RESOURCE_BODY_SIZE" in "Californium.properties"?

I have not set this explicitly anywhere

This is my code which calls the respond method I provided earlier. I have not changed this code when I change versions of Californium, and it does work on 2.4.1.

@Override
public void handleMessage(final CoapExchange exchange) {
    super.preHandleMessage(exchange);
    try {
        final UriComponents uriComponents = AppHelper.decodeURI(DUMMY_HOST, StringUtils.replaceAll(exchange.getRequestText(), "\n", "&"));
        final MultiValueMap<String, String> bodyParams = uriComponents.getQueryParams();

        String filename = getStringParam(bodyParams, "filename");
        filename = StringUtils.substringAfter(filename, "file/");
//          LOGGER.debug("Filename is '{}'", filename);

        // read the options from block 2
        final BlockOption block2Options = exchange.getRequestOptions().getBlock2();
        int szx = block2Options.getSzx();
        final int blockNumber = block2Options.getNum();
        LOGGER.debug("Block options: szx: {}, num: {}", szx, blockNumber);

        szx = Math.min(szx, this.maxBytes);

        // get the block
        final BlockDataDto blockData = this.otaSoftwareUpdater.getBlock(filename, blockNumber, szx);
        LOGGER.debug("Retrieved block data: {}", blockData);

        // build the response  // todo: is szx calc correct or needs reverse of (2^szx + 4) ?
        final int newSzx = size2Szx(blockData.getBlockData().length);

        LOGGER.debug("Request options (block 2): {}", exchange.getRequestOptions().getBlock2());

        this.respond(exchange, ResponseCode.CONTENT, blockData.getBlockData(), new BlockOption(newSzx, blockData.isMoreBlocks(), blockNumber));

    } catch (final Exception e) {
        LOGGER.warn("Exception processing /{} request", PATH, e);
        exchange.respond(ResponseCode.INTERNAL_SERVER_ERROR);
    }
}

And finally, my implementation for this.otaSoftwareUpdater.getBlock() above:

public BlockDataDto getBlock(final String filename, final int blockNumber, final int szx) throws IOException {
    final String absoluteFilename = format("%s%s%s", this.coapResourcesLocation, File.separator, filename);

    final int blockSize = this.getBlockSize(szx);
    final int blockOffset = blockSize * blockNumber;

    final RandomAccessFile file = new RandomAccessFile(absoluteFilename, "r");

    final byte[] blockData = new byte[blockSize + 1];

    // read the block size + 1 byte to determine whether there is a next block or not
    file.seek(blockOffset);
    LOGGER.debug("Seek to pos: {}", blockOffset);

    final int bytesRead = file.read(blockData, 0, blockSize + 1);
    final boolean moreblocks = blockSize + 1 == bytesRead;

    LOGGER.debug("Bytes requested = {}, bytes read = {}, more = {}", blockSize + 1, bytesRead, moreblocks);

    final byte[] data = ArrayUtils.subarray(blockData, 0, blockSize);

    return new BlockDataDto(data, moreblocks);
}

Previously when I checked my logs (and debugged), I could see my isMoreBlocks() method was returning true for the second block where tcpdump shows it as false.

I can retake the traces and attach my logs too - one logistical issue I have (taking me longer to respond) is my laptop was stolen and I have to get another developer to take the traces so I do apologize for the delays in responding.

kentsimp commented 3 years ago

I have a new set of logs covering the issue.

capture.zip

This includes my application logs and the tcp dump file covering the test.

In my app log file, this is the first request (good):

14:21:09.487 [CoapServer(main)#1] INFO  o.e.californium.core.CoapResource - [WPCS 1] <-- #2 /file, payload=filename=file/1Au7K943tMIjTmpi
14:21:09.488 [CoapServer(main)#1] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Block options: szx: 4, num: 0
14:21:09.490 [CoapServer(main)#1] DEBUG c.a.c.coap.OtaSoftwareUpdateService - [n/a] Seek to pos: 0
14:21:09.490 [CoapServer(main)#1] DEBUG c.a.c.coap.OtaSoftwareUpdateService - [n/a] Bytes requested = 257, bytes read = 257, more = true
14:21:09.491 [CoapServer(main)#1] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Retrieved block data: BlockDataDto{blockDataLen=256, moreBlocks=true}
14:21:09.491 [CoapServer(main)#1] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Request options (block 2): (szx=4/256, m=false, num=0)
14:21:09.492 [CoapServer(main)#1] DEBUG o.e.c.c.n.stack.ReliabilityLayer - [n/a] Exchange[R18] send response null-2.05   MID=   -1, Token=null, OptionSet={"Block2":"(szx=4/256, m=true, num=0)"},

Note the OptionSet={"Block2":"(szx=4/256, m=true, num=0)"} log entry from the ReliabilityLayer

And on the second block request:

14:21:09.703 [CoapServer(main)#5] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Block options: szx: 4, num: 1
14:21:09.704 [CoapServer(main)#5] DEBUG c.a.c.coap.OtaSoftwareUpdateService - [n/a] Seek to pos: 256
14:21:09.704 [CoapServer(main)#5] DEBUG c.a.c.coap.OtaSoftwareUpdateService - [n/a] Bytes requested = 257, bytes read = 257, more = true
14:21:09.704 [CoapServer(main)#5] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Retrieved block data: BlockDataDto{blockDataLen=256, moreBlocks=true}
14:21:09.705 [CoapServer(main)#5] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Request options (block 2): (szx=4/256, m=false, num=1)
14:21:09.709 [CoapServer(main)#2] INFO  o.e.californium.core.CoapResource - [WPCS 1] --> #2 /download, payload=, coapStatus=2.31
14:21:09.710 [CoapServer(main)#5] DEBUG o.e.c.c.n.s.Block2BlockwiseStatus - [n/a] cropping response body [size=256] to block (szx=4/256, m=false, num=1)
14:21:09.710 [CoapServer(main)#5] DEBUG o.e.c.c.n.stack.ReliabilityLayer - [n/a] Exchange[R20] send response null-2.05   MID=   -1, Token=null, OptionSet={"Block2":"(szx=4/256, m=false, num=1)"}, 
00 21 93 42 06 d3 00 f0 65 f8 52 f8 04 0f 43 f8 04 0b f1 e7 43 f8 04 1b f3 e7 00 bf 98 43 e1 0f

Just as for the first block, my logic prints:

14:21:09.704 [CoapServer(main)#5] DEBUG c.a.c.c.e.u.OtaBlockRequestEndpoint - [n/a] Retrieved block data: BlockDataDto{blockDataLen=256, moreBlocks=true}

However the ReliabilityLayer prints:

14:21:09.710 [CoapServer(main)#5] DEBUG o.e.c.c.n.stack.ReliabilityLayer - [n/a] Exchange[R20] send response null-2.05 MID= -1, Token=null, OptionSet={"Block2":"(szx=4/256, m=false, num=1)"},

I also noticed the addition of this logging in the bad request:

14:21:09.710 [CoapServer(main)#5] DEBUG o.e.c.c.n.s.Block2BlockwiseStatus - [n/a] cropping response body [size=256] to block (szx=4/256, m=false, num=1)

Is this "cropping" somehow messing up my block options?

kentsimp commented 3 years ago

The problem seems to be occurring here where the m flag is recalculated.

I am not sure why cropping is occurring for this payload though?

boaks commented 3 years ago

Looks like much more documentation is required :-).

The Resource is responsible to pass the chunked payload together with the proper block options.

The Resource is responsible to pass the full payload, the proper block options is handled by Californium. There maybe "random requests", but these requests are intended to be processed either in the same way, so with full payload, and the proper block options is handled by Californium. Or the response contains a block2 option and the payload according that option (both is handled in the "crop").

For 3.0 I redesigned the blockwise in order to include/merge the BERT extension for TCP. What may cause the different behavior is Block2BlockwiseStatus.crop. So the case, where the response to crop contains a block2 option, now adapts the "from", but the "m" may/should then not be calculated by Californium. I need to check, document and test that.

Maybe, just as fast idea, set MAX_RESOURCE_BODY_SIZE to 0 ("Californium.properties") in order to disable the transparent mode.

boaks commented 3 years ago
int szx = block2Options.getSzx();
final int blockNumber = block2Options.getNum();
LOGGER.debug("Block options: szx: {}, num: {}", szx, blockNumber);

szx = Math.min(szx, this.maxBytes);

Block.szx != Block.size, so that Math.min doesn't work.

Please read the javadoc for BlockOption. That contains also the functions to convert from szx to size and reverse.

(OK, though the value of "maxBytes" may be not the bytes ... but then I woull recommend to name it with Szx).

boaks commented 3 years ago

OK, I hope, I understand now the difference:

2.4.1: if a block2 options is contained in the response, that bypasses the "crop". 3.0.0: that redesign has changed that.

Anyway, please try to disable the "transparent mode", I hope that will fix your issues.

boaks commented 3 years ago

PR #1592 may provide a first fix for that response with block2 option. (Some more tests will be required, so it may not be final.)

boaks commented 3 years ago

@kentsimp

Thanks for reporting the behavior change in the blockwise.

I tried to understand the "design idea" of the implementation in 2.4. Using either the disabled transparent blockwise mode with application block options, or the enabled transparent blockwise mode without application block options seems to be working and well defined. But the mixed case, enabled transparent blockwise mode with application block options, seems to be only defined for some "corner" cases as "early block 2" negotiation or "random block access". I'm not sure, if it was intended to use this "mixed case" also for other scenario's. For me this looks more like undefined behavior outside of that. Therefore I added the recommendation to not use this "mixed case".

kentsimp commented 3 years ago

Confirming that if I set MAX_RESOURCE_BODY_SIZE=0, then the Block2Option seems to behave as per 2.4.1 for my transfer.

Now we will try to reproduce the original issue on 3.0.0-SNAPSHOT with transparent mode disabled.

boaks commented 3 years ago

Hi @kentsimp

if your done with the topics here, can we close this issue?

kentsimp commented 3 years ago

Yes, thanks @boaks

boaks commented 3 years ago

Hi @kentsimp

which multicast-group do you use with openthread?

In the nordic forum someone has some trouble. Maybe it's easy for you to help there out.

kentsimp commented 3 years ago

I set the group to [ff03::1%wpan0] (same as that question it seems), but to be honest I'm not sure that this is correct. It does work for us with open thread routers and end devices though.

boaks commented 3 years ago

I'm not sure that this is correct

For multicast, no one can be sure :-).

And it seems, that OpenThread uses a more specific multicast address mapping . The differences is a additional "Realm-Local scope" with "ff03" and the usage of "1/2" instead of "FD" (RFC7252). For the case there, I would assume, the root issue is something else.