eclipse-californium / californium

CoAP/DTLS Java Implementation
https://www.eclipse.org/californium/
Other
730 stars 367 forks source link

blockwise transfer: implementation error, peer requested block offset Y but resource returned block offest 0 #2234

Closed thedevleon closed 7 months ago

thedevleon commented 8 months ago

We're seeing these messages from californium in our thingsboard 3.6.0 logs (which uses californium 3.0 as far as I can tell).

[CoapServer(main)#1] WARN  o.e.c.c.network.stack.BlockwiseLayer - coap resource [xxx] implementation error, peer requested block offset 11264 but resource returned block offest 0

This error appears in the context of performing an OTA with block-wise transfers. We initiate a block-wise transfer on a firmware resource with the BLOCK2 option and a block size of 1024. The majority of OTAs done this way work fine without any issues. However, sometimes we get the above error, and any following repeated requests with the same BLOCK2 offset will only yield the same error, and the block-wise transfer needs to be re-started from the beginning.

For our CoAP client, we have built a small library wrapping the functions that Zephyr offers for building packets. It cannot be ruled out that our implementation is without issues. To be frank, I'm hoping it's a implementation error on our side, and not a bug in californium.

As such, I'm just looking for any explanation on how the requested block offset and the "resource" block offset could not be matching, causing the above error (and an Internal Server Error).

boaks commented 8 months ago

I guess, this occurs, if the request is sent after a quite phase.

Californium tracks the requests, but with a quite phase that times out. A new request is then not longer processed "transparent" but instead passed to the application layer as "random access". Where ever "block2 option" is added to the response, but that doesn't match the requested one. If you want to get rid of it, please investigate, if "ThingsBoard" is adding that block2 option.

thedevleon commented 8 months ago

I guess, this occurs, if the request is sent after a quite phase.

Californium tracks the requests, but with a quite phase that times out. A new request is then not longer processed "transparent" but instead passed to the application layer as "random access". Where ever "block2 option" is added to the response, but that doesn't match the requested one. If you want to get rid of it, please investigate, if "ThingsBoard" is adding that block2 option.

Thanks, that would explain the behavior, especially with coap clients that take longer between requesting blocks (due to packet loss).

Is there a default value for this quiet phase and can you hint me where/how it can be set?

boaks commented 8 months ago

5 minutes.

The implementation idea to distinguish between a transparent (Californium handles the blockwise) and random/non-transparent (Californium may still handle a part of the blockwise, but enables the application layer to care on this as well by forwarding all block-request), is pretty old and "before my time". I already started to consider to change that with the next major update to Cf 4. At least add the possibility to resume.

boaks commented 8 months ago

BlockwiseLayer - receiveRequest

Without status (timed out), the request is forwarded to the application layer.

BlockwiseLayer - sendResponse.

If a block2 option is provided, it must match the requested block. I assume, the ThingsBoard application layer provides that wrong block2 option.

thedevleon commented 7 months ago

Indeed, thingsboard doesn't seem to be setting the block2 option correctly.

For some reason the timeout seems to be much shorter than 5 minutes. With pauses of around 90-160s I can reliably produce the timed out state. Even though the default of 5 minutes seems to be set...

2024-03-23 09:56:18,261 [main] INFO o.e.c.c.network.stack.BlockwiseLayer - coap BlockwiseLayer uses MAX_MESSAGE_SIZE=1024, PREFERRED_BLOCK_SIZE=1024, BLOCKWISE_STATUS_LIFETIME=300000, MAX_RESOURCE_BODY_SIZE=268435456, BLOCKWISE_STRICT_BLOCK2_OPTION=true

We have a workaround in place to re-request blocks more often, so that even with a couple of missed packets, we don't run into pauses of 90s or more.

The correct fix would be for thingsboard to simply set the block2 offset correctly, then we can also perform random access without having to worry about timed out transparent sessions.

boaks commented 7 months ago

For some reason the timeout seems to be much shorter than 5 minutes.

There is a second source of that issue: an ip-address change, maybe caused by a NAT.

Are you using plain CoAP or CoAP/DTLS 1.2? For plain CoAP, if the address is changing, the transfer could not be tracked with 3x or earlier versions. (I plan to adapt that in 4. Tracking will not be possible, but it should be possible to resume an transfer.)

If you use CoAP/DTLS 1.2 CID, then it requires an update to the most recent version of Californium 3.11 and it requires to configure CoapConfig.RESPONSE_MATCHING to PRINCIPAL_IDENTITY.