DragonMinded / netboot

Utilities for netbooting and talking to a NetDimm installed in a Naomi, Triforce or Chihiro.
The Unlicense

FreeBSD exhibits similar behavior to OSX for socket timeouts #14

Open notsonic opened 2 years ago

notsonic commented 2 years ago

I just set up the tools as a web server in a jail on my TrueNAS server, which runs FreeBSD.

My cabinets were rapidly rebooting when trying to send games. I found this PR (https://github.com/DragonMinded/netboot/pull/10) and swapped out "darwin" for "freebsd12" and all is well. I see that there's an env var to trigger the behavior as well.
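
For anyone who lands here later, a check that covers both platforms could look roughly like the sketch below. The function name and the environment variable name are placeholders, not the ones netboot actually uses. The only thing taken as given is that `sys.platform` is `"darwin"` on macOS and starts with `"freebsd"` on FreeBSD (`"freebsd12"` in my case).

```python
import os
import sys

# Hypothetical name: the real environment variable used by netboot is not
# spelled out in this thread.
OVERRIDE_ENV_VAR = "EXAMPLE_FORCE_BSD_SOCKET_PATH"


def needs_bsd_socket_workaround(platform=None, environ=None):
    """Return True when the Darwin/BSD socket code path should be used."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ

    # An explicit opt-in wins regardless of the detected platform.
    if environ.get(OVERRIDE_ENV_VAR):
        return True

    # Match the "freebsd" prefix rather than one release such as "freebsd12",
    # so the check keeps working after an OS upgrade.
    return platform == "darwin" or platform.startswith("freebsd")
```

Matching on the prefix avoids having to edit the check again for each new FreeBSD major version.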

Raising an issue may not be strictly necessary since I don't really have a problem, but I figured it might be worthwhile to have this in the repo's history in case someone else comes across it.

DragonMinded commented 2 years ago

Maybe we should just make that check for either? Kinda a pain, because I want it to quickly figure out when a game has been turned off remotely, which means you need a timeout, but BSD/Darwin breaks that. Could also move to a dedicated thread with no timeout and a monitor that nukes it when there isn't motion for a while? Dunno. The original script I upgraded worked "better" because it didn't try to be fully in control of the process, but then you lose the ability to treat the game like a kiosk and ensure it is running.
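
A rough sketch of the "dedicated thread plus monitor" idea, to make it concrete. Nothing here is netboot code: `IdleWatchdog`, `receive_all` and the idle limit are made-up names for illustration. The transfer runs on a fully blocking socket and reports progress; a separate monitor thread shuts the socket down if no progress is reported for too long.

```python
import socket
import threading
import time


class IdleWatchdog:
    """Shut a socket down when no progress is reported for idle_limit seconds."""

    def __init__(self, sock, idle_limit):
        self._sock = sock
        self._idle_limit = idle_limit
        self._last_progress = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.fired = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def touch(self):
        """Called by the transfer code whenever bytes actually move."""
        with self._lock:
            self._last_progress = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Check a few times per idle window until stop() is called.
        while not self._stop.wait(self._idle_limit / 4):
            with self._lock:
                idle = time.monotonic() - self._last_progress
            if idle >= self._idle_limit:
                self.fired = True
                try:
                    # shutdown() rather than close(): a plain close() from
                    # another thread is not guaranteed to wake a blocked recv().
                    self._sock.shutdown(socket.SHUT_RDWR)
                except OSError:
                    pass  # the socket was already disconnected
                return


def receive_all(sock, watchdog):
    """Blocking read loop (no socket timeout) that reports progress."""
    chunks = []
    while True:
        try:
            data = sock.recv(4096)
        except OSError:
            break  # some platforms raise instead of returning b"" here
        if not data:
            break  # peer closed, or the watchdog shut the socket down
        watchdog.touch()
        chunks.append(data)
    return b"".join(chunks)
```

The transfer code never sets a socket timeout, so the platform differences in timeout handling should not come into play; whether `shutdown()` reliably unblocks the call on every OS is something that would still need testing.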

notsonic commented 2 years ago

I don't fully understand the implications of the different timeout code paths, to be honest. The behavior for my cabinets seems fine. If I turn them on, the game boots. If I change games while it's running, they receive the new game. The cabinet status is accurately reflected the whole time (maybe with a bit of delay). Is there some behavior lost without the fast timeout?

DragonMinded commented 2 years ago

The idea behind setting a timeout was so that a stalled connection due to a device going offline mid-send could be detected. On some systems, sockets hang forever in that state, and that means you never return control out of the send or recv call. I could experiment with killing the timeout altogether (like the old system had) and seeing whether it still behaves correctly, at least on Linux. I think that might fix things across the board, but it might also have the side effect of getting the state machine stuck.
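
As a minimal illustration of what the timeout buys you (a generic sketch, not the netdimm code; `recv_or_none` is a made-up helper): with a timeout set, a `recv` on a silent connection raises instead of blocking forever, so the caller gets control back.

```python
import socket


def recv_or_none(sock, nbytes, timeout):
    """Return received bytes, or None if nothing arrived within timeout seconds."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:  # an alias of TimeoutError since Python 3.10
        return None


# A connected pair stands in for the server <-> cabinet link.
server_side, cabinet_side = socket.socketpair()

# The "cabinet" says nothing, so the call gives up after 0.1 seconds.
silent_result = recv_or_none(server_side, 16, 0.1)

# Once the "cabinet" sends something, the same call returns the data.
cabinet_side.sendall(b"ok")
data_result = recv_or_none(server_side, 16, 0.1)

server_side.close()
cabinet_side.close()
```

Without the `settimeout` call, the first `recv` would simply sit there until the peer sent data or the connection was torn down.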

notsonic commented 2 years ago

I turned one of my cabinets off while it was loading (again, I'm using the server setup) and I could see that the status hung at the same percentage in the web interface. After turning it back on again, it restarted from 0% and seems to have transferred the game successfully. I don't know if this means there's a dangling thread from the previous boot.

Is the 1 or 10 second timeout maybe just too aggressive? I'll try out some different values and report back.

DragonMinded commented 2 years ago

That's exactly the issue that the timeout was attempting to fix. I didn't want it hung forever (basically until the next time the cab was powered) sitting at the hung percentage. I wanted the state machine to be able to go back to "waiting for cabinet power".
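
Roughly the shape of what I mean, as a toy sketch. The state and event names below are invented for illustration and are not the ones in the actual code.

```python
from enum import Enum, auto


class CabinetState(Enum):
    # Hypothetical states, named only for this example.
    WAIT_POWER = auto()  # cabinet is off or unreachable
    SENDING = auto()     # game transfer in progress
    RUNNING = auto()     # game sent and believed to be running


def next_state(state, event):
    """Return the state that follows `state` when `event` happens."""
    transitions = {
        (CabinetState.WAIT_POWER, "cabinet_up"): CabinetState.SENDING,
        (CabinetState.SENDING, "send_done"): CabinetState.RUNNING,
        # The timeout is what lets a stalled send fall back to waiting.
        (CabinetState.SENDING, "timeout"): CabinetState.WAIT_POWER,
        (CabinetState.RUNNING, "timeout"): CabinetState.WAIT_POWER,
    }
    # Unknown combinations leave the state unchanged.
    return transitions.get((state, event), state)
```

If the `timeout` event never fires because the socket call never returns, the machine stays in `SENDING`, which is the hung percentage you saw.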

A 1 second timeout is FAR FAR too aggressive. Are you netbooting a Chihiro/Triforce? Try a larger timeout. 10 seconds seems fine for Naomi.

notsonic commented 2 years ago

Hey, sorry it's a Naomi. The 1 second timeout I was referring to is this one here: https://github.com/DragonMinded/netboot/blob/trunk/netdimm/netdimm.py#L341

I've been messing with these 3 lines of code but I haven't really used the sockets lib before. Would you be able to explain them (341-343)?

Changing the timeout values doesn't seem to do anything. It really seems like the major difference is setting it to blocking.
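
For reference, here is how `settimeout` and `setblocking` interact in Python's `socket` module in general. This is a standalone sketch and says nothing about what lines 341-343 actually do. The two calls control the same setting, so whichever one runs last wins.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A positive timeout: calls block, but give up after 10 seconds.
sock.settimeout(10)
after_settimeout = (sock.gettimeout(), sock.getblocking())

# setblocking(True) is the same as settimeout(None), so it throws away
# the 10 second timeout that was just configured.
sock.setblocking(True)
after_setblocking_true = (sock.gettimeout(), sock.getblocking())

# setblocking(False) is the same as settimeout(0.0).
sock.setblocking(False)
after_setblocking_false = (sock.gettimeout(), sock.getblocking())

sock.close()

print(after_settimeout)         # (10.0, True)
print(after_setblocking_true)   # (None, True)
print(after_setblocking_false)  # (0.0, False)
```

So if a `setblocking(True)` call comes after a `settimeout(...)` call, the timeout value passed earlier has no effect, which would match timeout changes appearing to do nothing.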

I noticed in the docs that settimeout changed with 3.7, I'm on 3.9. Is this relevant?

Changed in version 3.7: The method no longer toggles [SOCK_NONBLOCK](https://docs.python.org/3/library/socket.html#socket.SOCK_NONBLOCK) flag on [socket.type](https://docs.python.org/3/library/socket.html#socket.socket.type).

DragonMinded commented 2 years ago

Oh, good catch, that would definitely screw things up. Setting the timeout used to also go along with blocking implications. Hmmm, ugh. I really don't know. It's basically impossible to try to test all permutations of Linux/OSX/BSD with Naomi/Triforce/Chihiro, especially given I don't have any Chihiros, Triforces or native BSD devices.

notsonic commented 2 years ago

If it were me, I just wouldn't support BSD lol. I'm only using it because I already had the TrueNAS server running. I could just run this in a Linux VM instead of a jail.

I assume the difference comes down to the native socket implementation differences between Linux and BSD. They must have different defaults or something. I tried using socket.setsockopt to set the SO_RCVTIMEO and SO_SNDTIMEO timeouts (socket.settimeout is something that's in the Python layer only, apparently) and that didn't seem to do anything.
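
For the record, this is roughly the kind of thing I tried. The helper names are made up, and the `"ll"` packing of `struct timeval` is an assumption that should hold on 64-bit Linux and FreeBSD but may be wrong on other platforms.

```python
import socket
import struct

# Assumption: struct timeval is two native longs (tv_sec, tv_usec).
TIMEVAL_FORMAT = "ll"


def set_kernel_timeouts(sock, seconds):
    """Set SO_RCVTIMEO and SO_SNDTIMEO on sock to the given number of seconds."""
    whole = int(seconds)
    micros = int((seconds - whole) * 1_000_000)
    timeval = struct.pack(TIMEVAL_FORMAT, whole, micros)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)


def get_recv_timeout(sock):
    """Read SO_RCVTIMEO back from the kernel, in seconds."""
    size = struct.calcsize(TIMEVAL_FORMAT)
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, size)
    sec, usec = struct.unpack(TIMEVAL_FORMAT, raw)
    return sec + usec / 1_000_000
```

One possible reason it had no visible effect: once `settimeout()` has been given a value other than `None`, CPython puts the underlying descriptor into non-blocking mode and does its own waiting, so the kernel-level timeouts would only matter on a socket left in plain blocking mode. I haven't confirmed that this is what happened in my case.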

I did notice one bad behavior when using the blocking sockets: the server won't come up if one of the cabinets is already on.

I wonder if there's some magic in socket.create_connection that socket.connect might be missing.
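
As far as the documentation goes, `socket.create_connection` resolves the host with `getaddrinfo`, tries each returned address in turn, and applies the optional timeout to the socket before connecting. A sketch of the two approaches side by side (the function names are mine, for illustration only):

```python
import socket


def connect_manually(host, port, timeout):
    """Create the socket by hand; the timeout must be set before connect()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)  # bounds the connect() call as well
        sock.connect((host, port))
    except OSError:
        sock.close()
        raise
    return sock


def connect_with_helper(host, port, timeout):
    """Let the standard library resolve, create, configure and connect."""
    return socket.create_connection((host, port), timeout=timeout)
```

For a single IPv4 address the two should end up equivalent, including the timeout left on the returned socket, so any difference in behavior would more likely come from when the timeout or blocking mode is set relative to `connect()`.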