FujiNetWIFI / fujinet-firmware

8-bit systems to ESP32 WiFi Multifunction Firmware
https://fujinet.online
GNU General Public License v3.0
218 stars 66 forks source link

TNFS sequence out of sequence errors are not handled correctly #741

Open trekawek opened 1 month ago

trekawek commented 1 month ago

In the logs uploaded to Discord by @tschak909, there are following 2 cases of the out of sequence problems:

17:42:21.704 > FileHandlerLocal::seek
17:42:23.705 > Timeout after 2000 milliseconds. Retrying
17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:23.714 > Timeout after 2000 milliseconds. Retrying
17:42:24.701 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 15, Expected: 14
17:42:26.700 > Timeout after 2000 milliseconds. Retrying
17:42:27.702 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:27.742 > FileHandlerLocal::read

and

17:47:21.652 > sending block packet ...
17:47:21.681 > Command Packet:
17:47:21.681 > fc ff ff ff ff c3 81 80 80 80 80 82 81 80 81 83 b0 80 d2 a2 f3 80 88 80 ab ef c8 cc 
17:47:21.682 > Decoding 9 bytes
17:47:21.682 > handling read block command
17:47:21.683 > Drive 81  Read block 0073a2
17:47:21.683 > iwm_readblock NORMAL READ
17:47:21.686 > FileHandlerLocal::seek
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3d, Expected: 91
17:47:21.736 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3f, Expected: 91
17:47:21.738 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.739 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: ee, Expected: 92
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.742 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.743 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 48, Expected: 91
17:47:21.744 > TNFS OUT OF ORDER SEQUENCE! Rcvd: e5, Expected: 92
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: d, Expected: 91
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 75, Expected: 92
...

In the first case it seems that the session was ultimately recovered, in the second not really.

In both cases, the same packet is received twice, while the client is expecting the next one already:

17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15 17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91

I'm not sure what's the reason of double delivery, but the firmware should be able to retry the request and deal with this problem. However, it seems it can't do it. There are two issues here.

Root cause - sending response as a request

Let's take a look on the code validating the sequence number:

https://github.com/FujiNetWIFI/fujinet-firmware/blob/1b0c424233fc98e417c9f7c163804fb74850479b/lib/TNFSlib/tnfslib.cpp#L1384-L1389

this causes the fall through to the:

https://github.com/FujiNetWIFI/fujinet-firmware/blob/1b0c424233fc98e417c9f7c163804fb74850479b/lib/TNFSlib/tnfslib.cpp#L1360-L1371

unless the recv() doesn't return anything and the timeout occurs, then it logs

17:42:23.714 > Timeout after 2000 milliseconds. Retrying

and goes to:

https://github.com/FujiNetWIFI/fujinet-firmware/blob/1b0c424233fc98e417c9f7c163804fb74850479b/lib/TNFSlib/tnfslib.cpp#L1339-L1346

Please notice, that we're sending the pkt here. However, it's no longer the request packet. It's been overwritten by the tcp->recv() and now it's the response packet, the one with the wrong sequence number.

To fix this, we should keep request and response separately. Also, we should go immediately to send() if the received response is wrong.

Room for improvement - exhausting buffer on wrong response

The bug above explains the first log report. The second report is caused by a different issue. Apparently server sends a large response with a wrong number. The client reads it chunk by chunk and treats each chunk as a atomic response. This causes the random Rcvd numbers - these are not really sequence numbers, but random data from the middle of the data block.

To make this better, we may try to read all pending data from the TCP connection on the wrong sequence number and then send the original request again (see the description of the bug).

Tech debt - method complexity

_tnfs_transaction method is very complex, over 140 lines with two nested loops and multiple if/else. To make it easier to reason about and avoid bugs as above, we should split it into smaller chunks.

Plan

So, we need to following things:

  1. Split the _tnfs_transaction into smaller methods.
  2. On the wrong sequence number
    • exhaust buffer,
    • send the request again,
    • make sure we don't send response as a request.