In the logs uploaded to Discord by @tschak909, there are the following two cases of the out-of-sequence problem:
17:42:21.704 > FileHandlerLocal::seek
17:42:23.705 > Timeout after 2000 milliseconds. Retrying
17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:23.714 > Timeout after 2000 milliseconds. Retrying
17:42:24.701 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 15, Expected: 14
17:42:26.700 > Timeout after 2000 milliseconds. Retrying
17:42:27.702 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:27.742 > FileHandlerLocal::read
and
17:47:21.652 > sending block packet ...
17:47:21.681 > Command Packet:
17:47:21.681 > fc ff ff ff ff c3 81 80 80 80 80 82 81 80 81 83 b0 80 d2 a2 f3 80 88 80 ab ef c8 cc
17:47:21.682 > Decoding 9 bytes
17:47:21.682 > handling read block command
17:47:21.683 > Drive 81 Read block 0073a2
17:47:21.683 > iwm_readblock NORMAL READ
17:47:21.686 > FileHandlerLocal::seek
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3d, Expected: 91
17:47:21.736 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3f, Expected: 91
17:47:21.738 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.739 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: ee, Expected: 92
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.742 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.743 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 48, Expected: 91
17:47:21.744 > TNFS OUT OF ORDER SEQUENCE! Rcvd: e5, Expected: 92
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: d, Expected: 91
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 75, Expected: 92
...
In the first case the session seems to have ultimately recovered; in the second it did not.
In both cases, the same packet is received twice while the client is already expecting the next one:
17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91
I'm not sure what causes the double delivery, but the firmware should be able to retry the request and deal with this problem. However, it seems it can't. There are two issues here.
Root cause - sending response as a request
Let's take a look at the code validating the sequence number:
https://github.com/FujiNetWIFI/fujinet-firmware/blob/1b0c424233fc98e417c9f7c163804fb74850479b/lib/TNFSlib/tnfslib.cpp#L1384-L1389
This causes a fall-through to:
https://github.com/FujiNetWIFI/fujinet-firmware/blob/1b0c424233fc98e417c9f7c163804fb74850479b/lib/TNFSlib/tnfslib.cpp#L1360-L1371
unless recv() returns nothing and the timeout occurs; then it logs the error and goes to:
https://github.com/FujiNetWIFI/fujinet-firmware/blob/1b0c424233fc98e417c9f7c163804fb74850479b/lib/TNFSlib/tnfslib.cpp#L1339-L1346
Please notice that we're sending pkt here. However, it's no longer the request packet: it has been overwritten by tcp->recv() and is now the response packet, the one with the wrong sequence number.
To fix this, we should keep the request and the response in separate buffers. Also, we should go immediately to send() when the received response is wrong.
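A minimal sketch of the separate-buffers idea. FakeNet and tnfs_try are invented names for illustration and do not exist in the firmware; the point is only that the request buffer is never aliased by recv():

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Stand-in for the TCP transport; invented for illustration,
// not the actual fujinet-firmware API.
struct FakeNet {
    std::deque<std::vector<uint8_t>> inbox;   // packets the "server" delivers
    std::vector<uint8_t> last_sent;           // what the client last transmitted

    void send(const std::vector<uint8_t>& pkt) { last_sent = pkt; }
    bool recv(std::vector<uint8_t>& pkt) {
        if (inbox.empty()) return false;      // a timeout in the real code
        pkt = inbox.front();
        inbox.pop_front();
        return true;
    }
};

// The key point: `request` is const and never overwritten by recv(), so a
// retry always resends the original request, never a stale response.
bool tnfs_try(FakeNet& net, const std::vector<uint8_t>& request,
              uint8_t expected_seq, std::vector<uint8_t>& response,
              int max_retries = 3) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        net.send(request);                    // always the pristine request
        if (!net.recv(response)) continue;    // timeout: resend immediately
        if (!response.empty() && response[0] == expected_seq) return true;
        // wrong sequence number: loop around and resend the request
    }
    return false;
}
```

With the duplicate delivery from the first log (packet 14 arriving twice while 15 is expected), this loop keeps resending the unmodified request until the matching packet arrives.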
Room for improvement - exhausting buffer on wrong response
The bug above explains the first log report. The second report is caused by a different issue. Apparently the server sends a large response with a wrong sequence number. The client reads it chunk by chunk and treats each chunk as an atomic response. This causes the random Rcvd numbers: they are not really sequence numbers, but random data from the middle of the data block.
To make this better, we could read all pending data from the TCP connection when a wrong sequence number is detected, and then send the original request again (see the description of the bug above).
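A sketch of the drain-then-resend idea, again with an invented Transport stub (Transport, available(), and drain_and_resend are illustrative names, not the real firmware API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative transport stub; this is not the real fujinet-firmware API.
struct Transport {
    std::deque<std::vector<uint8_t>> inbox;   // data still queued on the socket
    std::vector<uint8_t> last_sent;

    void send(const std::vector<uint8_t>& pkt) { last_sent = pkt; }
    bool recv(std::vector<uint8_t>& pkt) {
        if (inbox.empty()) return false;
        pkt = inbox.front();
        inbox.pop_front();
        return true;
    }
    std::size_t available() const { return inbox.size(); }
};

// On a wrong sequence number: discard everything still queued, so stale
// bytes from the middle of the oversized response are never parsed as
// fresh response headers, then resend the original request.
void drain_and_resend(Transport& tcp, const std::vector<uint8_t>& request) {
    std::vector<uint8_t> junk;
    while (tcp.recv(junk)) {
        // discard: these bytes belong to the mis-numbered response
    }
    tcp.send(request);
}
```

After the drain, the next packet the client parses is the server's reply to the resent request, not a leftover chunk of the bad response.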
Tech debt - method complexity
The _tnfs_transaction method is very complex: over 140 lines with two nested loops and multiple if/else branches. To make it easier to reason about and to avoid bugs like the ones above, we should split it into smaller methods.
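Pulling the pieces together, here is one possible shape for the refactor. Every name below (FakeTcp, await_response, sequence_ok, drain_stale, and this tnfs_transaction signature) is invented for illustration; the toy transport's send() also simulates a well-behaved server so the sketch is runnable:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Toy transport whose send() simulates a well-behaved server: each
// request tagged with sequence N gets a seq-N reply queued. Purely
// illustrative; not the real fujinet-firmware API.
struct FakeTcp {
    std::deque<std::vector<uint8_t>> inbox;
    void send(const std::vector<uint8_t>& p) { inbox.push_back({p[0], 0x42}); }
    bool recv(std::vector<uint8_t>& p) {
        if (inbox.empty()) return false;
        p = inbox.front();
        inbox.pop_front();
        return true;
    }
};

// Each helper has one job, so the retry policy reads at a glance.
static bool await_response(FakeTcp& tcp, std::vector<uint8_t>& resp) {
    return tcp.recv(resp);   // real code would poll with a deadline
}

static bool sequence_ok(const std::vector<uint8_t>& resp, uint8_t expected) {
    return !resp.empty() && resp[0] == expected;
}

static void drain_stale(FakeTcp& tcp) {
    std::vector<uint8_t> junk;
    while (tcp.recv(junk)) {}
}

bool tnfs_transaction(FakeTcp& tcp, const std::vector<uint8_t>& request,
                      uint8_t expected, std::vector<uint8_t>& resp,
                      int retries = 3) {
    for (int i = 0; i < retries; ++i) {
        tcp.send(request);                        // always the untouched request
        if (!await_response(tcp, resp)) continue; // timeout: retry
        if (sequence_ok(resp, expected)) return true;
        drain_stale(tcp);                         // wrong sequence: flush, retry
    }
    return false;
}
```

The top-level loop now states the whole policy in five lines, and each failure mode (timeout, wrong sequence) is handled by a small named helper instead of a branch buried in a 140-line method.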
Plan
So, we need to do the following things:
- Keep the request and the response packets in separate buffers, and resend the request immediately on a wrong response.
- Drain all pending data from the TCP connection before resending.
- Split _tnfs_transaction into smaller methods.