FujiNetWIFI / fujinet-firmware

8-bit systems to ESP32 WiFi Multifunction Firmware
https://fujinet.online
GNU General Public License v3.0
218 stars 66 forks source link

TNFS TCP connection not reconnecting when timed out by client #725

Open tschak909 opened 2 months ago

tschak909 commented 2 months ago

The TNFS client is not detecting that the TCP connection has been closed, and attempts to use it.

TNFSD also needs checking.

-Thom @trekawek

trekawek commented 2 months ago

tnfsd offers a built-in mechanism for removing expired sessions. An expired session is the one where the last contact was made in more than SESSION_TIMEOUT, by default 6h. The session is removed then, but the TCP connection is not disconnected, so client may think it's still connected.

Also, mechanism for removing old sessions runs when the new client is connected, so it may give impression that new clients are removing the old ones.

This probably can be reproduced by:

  1. Decreasing the SESSION_TIMEOUT e.g.: to 60s, so the issue is easier to reproduce.
  2. Connect to tnfsd with FujiNet, wait for 60s (client 1).
  3. Connect to tnfsd with another FujiNet or Python client (client 2).

This should allow to confirm whether client 1 and/or client 2 behaves as expected or not.

Once this is reproduced, we can:

  1. Disconnect TCP when the session expires.
  2. Validate that FujiNet tnfs behaves correctly when it's disconnected by tnfsd after such timeout. Correct behaviour is reconnecting with a new session id.
trekawek commented 2 months ago

I investigated the problem a little bit further and now I think the cause lies elsewhere.

The session timeout works fine, it shouldn't disconnect TCP sockets, because a single socket may handle multiple sessions. In fact, we should disable it for the TCP connections completely, because if a connection is alive, so should be the session. This mechanism makes sense for the UDP only, where we don't know if the session is alive or not if it's idle. I created https://github.com/FujiNetWIFI/tnfsd/pull/1/commits/98ea15a86984fd2a8bfd307b87beea14d15dfade to address that, but it's not the root cause of the problem.

There are 2 factors causing problems around timeouts and reconnects:

TNFS server

If the TCP connection is disconnected, TNFS server will remove all related sessions. As a result, if the same client connects again and tries to use the same session id, they'll receive invalid session error (and most likely they won't know how to handle it, see the section about the client).

In the https://github.com/FujiNetWIFI/tnfsd/pull/1 I changed the behaviour of the server - after that it will no longer remove the sessions on TCP disconnection, but only unset the socket FD.

Breaking TCP connections can be tested by staring tnfsd on a non-standard port 16385 and putting https://github.com/Shopify/toxiproxy between tnfsd and FujiNet client:

tnfsd -p 16385 &
toxiproxy-server &
toxiproxy-cli create --listen 0.0.0.0:16384 --upstream localhost:16385 tnfs

With this command, toxiproxy will automatically redirect traffic from its own 16384 port to tnfsd 16385. Now we can bring the connection down/up:

toxiproxy-cli toggle tnfs

or introduce latency:

toxiproxy-cli toxic add -t latency -n tnfs_latency -a latency=100 -a jitter=50 tnfs

After the PR above, TNFS client is able to get it session back, even after TCP connection is lost.

TNFS client

TNFS client currently is not able to handle invalid session error coming from the server. Apart from the disconnected TCP socket (resulting in the removed sessions, as above), this may happen if the server is restarted or the session expires after 6h (see my previous comment), regardless of the used protocol: TCP or UDP.

The code responsible for handling such cases exists, but now it's disabled for the FujiNet hardware (it's only active on PC):

https://github.com/FujiNetWIFI/fujinet-firmware/blob/b832124b07b37ccaf63ce40b59c6fb49d014af3a/lib/TNFSlib/tnfslib.cpp#L1406-L1411

It tries to create a new session if the old one is invalid. I tested it and it's able to handle the lost session gracefully. It's enough to e.g.: keep browsing the TNFS directory tree. Unfortunately, handles to all opened files are lost, as they are kept server-side in the session data.

I created #727 to enable this feature. While testing it, I found another minor problem in the tnfsd, which won't return any response if the provided FD is invalid. It's fixed in the https://github.com/FujiNetWIFI/tnfsd/pull/1/commits/c1a359737e2ff5d40fb3e82a699523f7740b22e0.

Also, https://github.com/FujiNetWIFI/fujinet-firmware/pull/727/commits/b772f99c6fc54fd06daed60d5733e926d560586a adds support for _udp. prefix. It can be useful for the testing purposes.

trekawek commented 1 month ago

I added one final commit https://github.com/FujiNetWIFI/fujinet-firmware/pull/727/commits/874f0be97ecffe574c436932eb66e1703abfa553 to the #727. It starts a periodic timer for every mounted TNFS filesystem and sends a keep-alive message every 60s.

This way:

Long story short: this should allow mounting an ATR image and leave the computer overnight. The next morning the mounted ATR should be still usable.

The only case where it wouldn't be usable is the tnfsd server restart.