Tarmslitaren / FrosthavenAssistant

flutter app
GNU Affero General Public License v3.0
182 stars 45 forks source link

Feature/147 Create a separate dockerized server #229

Closed aschneem closed 3 weeks ago

aschneem commented 1 month ago

Network refactoring around the server code. Moved the base server logic without the Flutter dependencies to its own abstract class and refactored the existing Server class / file to inherit from it. I then moved the abstract class to a separate pure dart project folder frosthaven-assistant-server and created an implementation in pure dart. The implementation was able to be fairly simplified due to not fully needing to understand and update the game state itself since it can mostly serve as a relay to the other clients. When containerizing the code I found that the remoteAddress was not necessarily enough information to distinguish between the different socket clients so I made a small change to also include the remote port.

If I missed something please let me know. I wasn't sure if it was necessary to update the github actions for the project

Also I've done some testing throughout the development and tested it during a Gloomhaven session this last weekend. It worked fairly well. It was a little hard to know if issues I saw were transient issues caused by things like my laptop going to sleep and running on a hotel WiFi network. I'd appreciate additional testing and being informed of any issues encountered and steps to reproduce if possible.

Note that I tried to make sure that I didn't make changes that would cause problems with the existing client / server communication contract.

aschneem commented 1 month ago

@Tarmslitaren I know that this is a big one and that you probably weren't expecting it take your time before you merge it. Let me know if you have questions. I'm hoping that we might get a couple people to test at their next X-Haven sessions to try and shake out some additional bugs.

dekimsey commented 1 month ago

This is awesome! I was looking at this myself last weekend and realized it needed a larger refactor to make it work.

I'll definitely give it a spin and see if it works for us! Our next meet isn't till the 20th, so I won't have real world feedback till then. But I'll try building it in advance at least.

Thank you for putting this together!

dekimsey commented 1 month ago

I very excitedly played with this tonight. Ran into a small build issue which I've noted in the PR.

Additionally, a couple of observations:

This is awesome! My group has struggled a bit with connectivity stability, I suspect this could eliminate problems for us. My group will be excited to try it!

aschneem commented 1 month ago

Thanks for looking in to things. I accepted your change. I added listeners for SIGINT and SIGTERM, which do seem to help. I had them call the existing stopServer before exiting it should be more graceful, but there still is probably a better way to do / handle things. When I use docker run attached it still won't respond for ctrl+c for me unless I pass the -it flags. I'm working on Windows if it is potentially related to that. I did see that it does appear to respond correctly to start / stop in Docker desktop. When I was looking at the persistence in the flutter server code it was making use of shared preferences, which wasn't something I could use in pure dart. I basically pulled out the persistence piece and when I started testing, it seemed to be working good enough since it will essentially take the state from the first connected client. It is possible that this is introducing some issues that I haven't diagnosed yet, but I figured that it might be fine operating with just in memory state. I will say that I was seeing some strange transient client disconnected / server unresponsive errors during our testing and tuned down the time between server pings and that seemed to resolve things as best that I can tell currently. Also in case it is relevant the session that my group used it, I was running dart from the command line since it was before I got the dockerfile figured out.

nicholas-st-parker commented 1 month ago

This is really nice. I will try to test this out on Friday. I've been running FrosthavenAssistant on my desktop and connecting with mobile devices for reliability issues. This would make it much more convenient.

Tarmslitaren commented 1 month ago

Looks nice guys! Tell me when you think this is ready and tested ( within reason ) and then I will merge.

I really appreciate the engagement since this is not a feature that's up my alley.

dekimsey commented 1 month ago

So we did run into some trouble with our attempt this weekend. There were cases where the server just stopped listening/responding. Process is still running, but the server is no longer responding. Nothing useful in the logs, but it looked like the usual suspect: one of our players has a (cough, strong) tendency to navigate away from the app. It wasn't a guarantee, but when it happened it would definitely kill the server until I restarted the process.

Thinking aloud, I'm not sure what to do here. The process doesn't log enough information to provide useful diagnostics. So maybe, that's worth doing in this effort? Or perhaps as a second PR on top of this one? Otherwise it was definitely more stable than running it from the phone, that's a success!

Tarmslitaren commented 1 month ago

If you prefer to add diagnostics in a separate MR, then I can merge this now. (provided, regular non-dockerized connections still work ok)

If the server stops responding after a client disconnects and the reconnects frequently, it is of course useful to look into how the reconnecting is working. It is possible a new socket connection is opened every time, which might not be great. (Haven't touched this in a while)

aschneem commented 1 month ago

I'll put in some additional logging and see what I can do to recreate. I feel like I saw some similar things the one real world test that I ran, but I wasn't able to consistently recreate afterwards after some tweaks. I'll try running some additional testing with multiple clients over a longer timeframe to see if I can recreate this. My group typically runs on a monthly cadence. I might be able to get a real world test in a few weeks.

nicholasstparker commented 1 month ago

We tried this out on Friday and ran through two scenarios. It worked well, but there seemed to be some wonkiness when hitting undo more than a few times. I will test it out again later.

aschneem commented 1 month ago

I pushed out some changes the most obvious thing that I've been finding that could cause an issue is trying to access something on a closed socket connection, which could potentially happen in a race condition and could be more likely with many clients connecting and disconnecting. I added some logging about the connection health including how many connections have been made. I suspect some of the server unresponsive things could be how the client is handling the connection when the app is paused and then resumed with switching away. On the undo side I did see that there could be an index access error, but think that it might need a little more work than that potentially. In particular I've noticed some strange behavior if the edit screen is up with one client for a character and a different client undo's the actions done with that edit screen. I personally am okay shipping with potential issues with undo, but would like to confirm that the stability issues have been addressed.

dekimsey commented 4 weeks ago

Definitely fewer crashes. Last night we experienced ~ 2 hard crashes that took the server down, (container needed to be restarted). Only 1 or 2 disconnects.

Noted bugs:

2024-07-27.log

That all being said, this is good work. I got compliments from my players that the changes were definitely improving the game play experience. I know squashing this bugs isn't likely directly server code. But I think the fact this PR separates the server from the UI is letting us identify and tackle them. And that's been 100% worth it.

Tarmslitaren commented 4 weeks ago

@dekimsey changing the client version mismatch message to something like: "Client version mismatch. Please update. (Server Version is $serverVersion. client version is $version.)" would give some more information. What more I see from that log file is that you probably do not use the latest pushed version of the game server code as the crash report lines do not match. This could also be the culprit of the client mismatch error, as the server was initially version 190, while the main game has been updated to 191. Since the game code is the same it should not cause any actual errors as long as all clients have the same version.

Can you say which scenario load caused the crash? It might not be related, since the crash is on a ping message, but it doesn't hurt to check, in case it's a reproducible error.

dekimsey commented 4 weeks ago

@dekimsey changing the client version mismatch message to something like: "Client version mismatch. Please update. (Server Version is $serverVersion. client version is $version.)" would give some more information.

Will do.

What more I see from that log file is that you probably do not use the latest pushed version of the game server code as the crash report lines do not match. This could also be the culprit of the client mismatch error, as the server was initially version 190, while the main game has been updated to 191. Since the game code is the same it should not cause any actual errors as long as all clients have the same version.

Interesting, I know I re-fetched, rebuilt, and pulled it for the evening's game. I checked, and I can honestly say I don't know what happened. Perhaps I failed to reset the head or rebase. But I'm definitely missing the added logging. Very strange, sorry for the bad report!

Can you say which scenario load caused the crash? It might not be related, since the crash is on a ping message, but it doesn't hurt to check, in case it's a reproducible error.

Scenario #10 Crystal Enclosure

aschneem commented 3 weeks ago

Thanks for the info, even if it wasn't the most up to date I did find a few additional places that were accessing the remote address on the socket that could potential be closed in a race condition that from what I've seen has been the source of the hard crashes. I did get a chance to use things out in the wild yesterday and everything seemed to work well. I did see a few server unresponsive messages pop-up, but they did seem to indicate an issue with the server, but I suspect are mostly due to connections breaking and needing to be reestablished when the client app pauses if that is the Flutter terminology for the application lifecycle event when the app is hidden. I did see that with 4 users in about a 2 and a half hour session we saw 66 total connections get established.

If everyone else is I think I'm fine with merging at this point and when I'm able to I'll try to dig in to the client code and see I can make some improvements to how it manages connections to get rid of the server unresponsive errors.

I did push in a couple minor changes today with this comment where I made some of the code accessing the remote address on a Socket safer and I also added the client and server versions to the set network message client mismatch error.

Tarmslitaren commented 3 weeks ago

All right, I'll merge this now. Thank you everyone! Please contact me for any questions regarding the code base if needed, gong forward. And please advise if/when it would be a good idea to backport the separate server for use in the flutter app.

The main valid reason for clients to disconnect is when they go to background is in IOS, the platform does not allow sockets to stay open in background. Android is more open, but there are still limitations.

There is code to automatically reconnect and so get the latest state when app goes back to foreground, however as discussed here it seems to not always work as intended for getting the latest state?