NilanthAnimosus closed this issue 5 years ago.
Thanks for the extensive report again, looks like we still have a lot more work to do...
The "character data spam" and high event counts are to be expected when clients join a round - the server needs to send quite a lot of events to get them up to sync, so I suspect the problems are caused by something else.
"albeit for 1 player who couldn't spectate at all on a campaign round (Perhaps the save file was different?) due to level equality check errors"
This is probably an unrelated issue as well, I think it's probably caused by #848.
Just to clarify, these issues happened when you were playing the campaign mode? It just occurred to me that there's quite a big problem in the way campaigns are synced when a client is hosting the server on the same machine: the server treats the client host the same as any other client, sending them campaign saves which the client then loads. Because both the server and the client are saving the files to the same location, constantly reading and writing to the same file, I think it's highly likely that the files get messed up somewhere along the line, which could explain these mass desync issues. "Oddly these failures only occurred when I myself was trying to play on 127.0.0.1" also seems to imply that it's caused by the client host.
For the record, the save locations were in different places.
I host the game by running the server via Visual Studio under the debugwindows x64 "Server" configuration.
I play as a client on my own server (via 127.0.0.1) using the Barotrauma folder in my Steam directory.
I get save files in both directories. I don't build the project and then host using a debugged client, because I am aiming to catch server issues and server-side crashes; lately I have begun debugging the server itself, as I cannot attach a debugger to two processes at the same time with one instance of Visual Studio.
After a while we switched to the mission game mode but still lost players. We kept creating new campaigns after each failure; about 2-4 players must have given up and I had stopped playing myself, and then the server appeared to stabilize, only dropping random players. They played the campaign later, after a mission or two of the mission game mode, and it was fairly stable except for the one player.
However, I did not play that day. I never got in and gave up; instead I watched the console and switched to trying to debug, resolve, and gather information, as I could tell something seemed very wrong. This is about as much as I could gather from their play sessions and what was shared/stated in the alpha chat.
Oddly, I was the only player not getting disconnected, though if it was an issue with my network, it had the weirdest imaginable timing, stopping when I left.
I did some testing with 8-10 clients with high simulated latency and packet loss, and may have an idea about what's causing this. I believe it is actually what you suggested in #674: the game simply creates so many events in some cases that 64 events per update (which is 20 * 64 = 1280 events per second at the default update rate) is not enough, which causes the clients to fall behind and eventually get kicked out. An easy way to reproduce it is to fire coilguns for an extended period of time in a direction where they have lots of space to keep going without hitting anything.
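As a rough illustration of that failure mode - events being produced faster than the 64-per-update cap can drain them, until a client falls too far behind - here is a minimal sketch. The 64-event cap and 20 updates per second are taken from this discussion; the ~10-second "too far behind" threshold is an assumption based on the errors quoted in the report, and all class and member names are hypothetical, not the actual code:

```csharp
// Sketch of how a per-update event cap can make a lagging client fall behind.
// MaxEventsPerUpdate and the update rate are from the discussion above; the
// 10-second threshold and all names are assumptions for illustration only.
using System;
using System.Collections.Generic;

class ClientEventBacklogSketch
{
    const int MaxEventsPerUpdate = 64;          // events written per network update
    const double KickThresholdSeconds = 10.0;   // assumed age at which the client counts as desynced

    readonly Queue<(int id, DateTime created)> pendingEvents = new Queue<(int, DateTime)>();

    public void QueueEvent(int id) => pendingEvents.Enqueue((id, DateTime.UtcNow));

    // Called ~20 times per second; returns false if the client should be dropped.
    public bool Update()
    {
        for (int i = 0; i < MaxEventsPerUpdate && pendingEvents.Count > 0; i++)
        {
            pendingEvents.Dequeue();  // "send" one event
        }

        // If events are created faster than 64 per update, the backlog grows and the
        // oldest pending event keeps getting older until the client is considered desynced.
        return pendingEvents.Count == 0 ||
               (DateTime.UtcNow - pendingEvents.Peek().created).TotalSeconds < KickThresholdSeconds;
    }
}
```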
It would make sense that you playing on the same machine as the server has an effect on this. EventManager does not create a new event if there's already an identical event waiting to be sent (e.g. a condition update for a launched coilgun bolt), but since the client on the same machine will usually have received all the previous condition updates, it will create a new event almost every time the condition of the bolt changes, which leads to very high event counts.
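To make that deduplication behaviour concrete, here is a minimal sketch of the idea - PendingEvent, EventQueueSketch and their members are hypothetical names for illustration, not Barotrauma's actual EventManager API:

```csharp
// Minimal sketch of per-client event deduplication, as described above.
// All names are hypothetical; this is not the actual Barotrauma code.
using System.Collections.Generic;

class PendingEvent
{
    public int EntityId;      // e.g. a launched coilgun bolt
    public string EventType;  // e.g. "ConditionUpdate"
}

class EventQueueSketch
{
    private readonly List<PendingEvent> pending = new List<PendingEvent>();

    // Only create a new event if no identical event is already waiting to be sent.
    public void CreateEvent(int entityId, string eventType)
    {
        foreach (PendingEvent e in pending)
        {
            if (e.EntityId == entityId && e.EventType == eventType)
            {
                return; // identical event still queued -> don't create another one
            }
        }
        pending.Add(new PendingEvent { EntityId = entityId, EventType = eventType });
    }

    // Once a client has received everything, the queue is empty, so every change
    // creates a brand new event - which is why the local client host inflates the counts.
    public void Acknowledge(PendingEvent received) => pending.Remove(received);
}
```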
Thanks for the help, I have a couple of ideas how this could be remedied, and hopefully we can push out a patch soon!
Wait - the server setting for updates per second actually affects syncing? I thought it was a hardcoded timespan of 150 milliseconds, or about 6.6 updates per second?
It's now been exposed to serversettings.xml and the default value has been increased to 20 updates per second (tickrate="20").
Ah, I see why I thought this: the game client is coded to 150 milliseconds per network update on server join (I assume for sending on their side).
Actually the clients use the same setting as the server (see ServerSettings.ClientRead).
yeah I just read that part 5 seconds ago haha
It's a shame we aren't sending more than one packet per update to get the extra events out - I once did this for items, to let a server multiply its network costs for better syncing capability. Now that the networking is perhaps starting to come under heavy load, it would be nice if we could somehow get past the 64-event limitation - I know a packet could probably fit more events, since some will be incredibly tiny with little info, but at best it's only going to get more restrictive until someone makes a bigger mod or a bigger submarine with more to network at one time.
I don't suppose you have any ideas for getting more events sent out? I thought of attempting it in my old nilmod, but I figured the indicator of when the event packet was written was likely important for ordering or something, so I never did (instead I only handled more position/health and similar updates).
The game certainly uses extremely little bandwidth, and being able to send more packets per update, or more information in general, wouldn't exactly amount to a DDoS or anything.
I just changed the event syncing logic in the development repo so that the EventManager writes as many events per packet as it can fit (there's often room for way more than 64, especially if the events are something that doesn't require writing much data). Another thing we could do is simply sending multiple packets per update if one packet is not enough: most servers should be able to handle more than the current maximum of ~28 KB/s per user (1408 B per packet with 20 packets per second).
The only issue with that is that constantly writing too many events leaves less space for chat and position updates.
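For reference, a quick back-of-the-envelope check of the ~28 KB/s per user figure mentioned above - the 1408-byte packet size and 20 packets per second are from the previous paragraphs, and the snippet itself is purely illustrative:

```csharp
// Rough per-client bandwidth estimate, using the figures quoted above.
class BandwidthEstimate
{
    static void Main()
    {
        const int bytesPerPacket = 1408;    // maximum packet payload
        const int packetsPerSecond = 20;    // default update rate (tickrate="20")

        int bytesPerSecond = bytesPerPacket * packetsPerSecond;   // 28,160 B/s
        double kilobytesPerSecond = bytesPerSecond / 1024.0;      // ~27.5 KB/s

        System.Console.WriteLine($"{bytesPerSecond} B/s ≈ {kilobytesPerSecond:F1} KB/s per client");
    }
}
```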
Personally I'm more for the latter, since most internet connections today have way more than 200kb/s of upload. Mind you, going down that path, maybe it makes sense to restructure how the order of clients receiving network updates is handled, so that the clients needing the most updates are at the top of the list?
I have a 625kb/s upload connection myself - measuring at 110 Mbps down and 6 Mbps up - but I typically don't see more than 0.02 megabits (or megabytes) in usage, even though my event lists could be going mad. There is definitely much more room to write information, I'd say, as the current usage feels like a tiny fraction of what's available.
8 players feels more like less than 50-100kb/s, to say the least (not even sure it's above 30kb/s).
Maybe we should send events entirely in their own packet(s), and send item position/chat information and such in another packet? Not sure if that's a good idea or not, but if events are written first and there are too many, they'll drown out the other data, and over prolonged periods that could start to get weird.
As an additional important note, the serverlog function now goes on top of the chat sending, which eats further into the packets. Chat comes after event writing, and both come before item positions, which would mean that high event counts plus the serverlog may nearly drown out the item position updates for clients on a busy server (?).
The way I implemented it was that the events still leave a bit of room for other types of data (writing up to 1024 bytes per packet), but you're right that it'd make sense to create multiple packets, because we're not really pushing the bandwidth usage that hard at the moment.
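Putting the ordering concern and the 1024-byte event cap together, a rough sketch of the per-packet write order being discussed might look like this - the class, queues and method names are assumptions for illustration, not the real networking code:

```csharp
// Rough sketch of the per-packet write order discussed above: events first
// (capped so other data still fits), then chat/server log, then item positions,
// all within one ~1408-byte packet. Names and structure are hypothetical.
using System.Collections.Generic;

class OutgoingPacketSketch
{
    const int MaxPacketBytes = 1408;  // total payload budget per packet
    const int MaxEventBytes = 1024;   // cap for events so chat/positions aren't drowned out

    readonly Queue<byte[]> queuedEvents = new Queue<byte[]>();
    readonly Queue<byte[]> queuedChatAndServerLog = new Queue<byte[]>();
    readonly Queue<byte[]> queuedPositionUpdates = new Queue<byte[]>();

    public List<byte> WritePacket()
    {
        var packet = new List<byte>();

        // 1. Events, but never past the event cap.
        WriteUpTo(packet, queuedEvents, MaxEventBytes);

        // 2. Chat messages and server log entries.
        WriteUpTo(packet, queuedChatAndServerLog, MaxPacketBytes);

        // 3. Item/character position updates fill whatever space is left.
        WriteUpTo(packet, queuedPositionUpdates, MaxPacketBytes);

        return packet;
    }

    static void WriteUpTo(List<byte> packet, Queue<byte[]> queue, int limit)
    {
        while (queue.Count > 0 && packet.Count + queue.Peek().Length <= limit)
        {
            packet.AddRange(queue.Dequeue());
        }
    }
}
```

With this kind of layout a burst of events can use at most the first 1024 bytes, so chat, server log and position updates would always have at least ~384 bytes left in each packet.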
As a side note, I'm talking about bytes, not bits. When you say 8 players feels like 50-100kb/s, you mean the total bandwidth usage is somewhere around that many bits per second?
For the kb/s figures I was talking in kilobytes, as I recognized that you seemed to be talking in bytes.
When I noticed the "too many events" errors, my first thought was to pull up some software (in this case ROG GameFirst, which is software for my motherboard, so it should be reliable enough) that is intended as a per-application rate limiter but doubles as a very basic network usage monitor.
I hadn't configured it or anything, and the image is up above. I was seeing the game complain, yet I was getting a recording of the network usage for the sake of checking what I was apparently using.
Bytes or bits, it really didn't seem to be a very large number during the yellow "too many events" messages. Judging from the usage, I think it was probably measuring megabits per second. I suppose if it were megabytes per second, 200kb/s download and upload would seem okay, but then 200kb/s of download on the server would be past my expectation?
I know the kb figures are bytes; I even went and double-checked the math by taking 1408 bytes, dividing by 1024 to get kilobytes, and multiplying by the 20 packets per second, which came out to 27.5kb/s - about right for the ~28kb/s figure. Though I was just stating a speed test result, which is usually in megabits (6 megabits per second for me, which translates to 625kb/s as my top upload speed).
OK - this software was measuring in megabits per second.
Changes in 27917ee should help with these issues.
Description
Just a small note, but it's the first time we've had this many concurrent players - possibly even more than the first hosted session of the alpha.
I have seen a lot of mass disconnects, failed round starts, and other problems. At first we had mass unexplained disconnects where it was just a timeout when attempting the campaign - and it was just as possibly my own network as it was the game, given that my client (which was on the same machine, connected via 127.0.0.1) was never booted.
But later in the day small issues started cropping up constantly. I saw endless character data spam and warning messages that events were stacking up (400+ against the limit of 64), sometimes even exceeding 1200 in short bursts.
Often a player would be disconnected by one of these spam waves of events due to excessive desync (events over 10.02 or so seconds of age). Other players started getting level equality check errors on spectate, or "entity does not exist" errors on join or potentially even mid-game. A number of event errors appeared to be occurring over a round's duration or on round start.
At 8-10+ players the server was completely unstable; at 4-5 it was fairly stable.
One major problem is that quite a few details tend not to get logged: servers cannot save console logs, and a lot of useful information is lost because it isn't written to the serverlog either (e.g. "too many events", "writing character data" - information that could really be added). It was coming in so much, so fast, that it would have been nice to bundle up some logs and provide them.
This is with about 6-8 players, not even 10+, and there was far worse before this that lasted 5-8x longer.
I failed to catch the point where it appeared to send out character information a good 400-500+ times in a row; this barely even scratches it: writing char info can get excessive at times.
Once the player counts died down, it generally became very stable, albeit for 1 player who couldn't spectate at all on a campaign round (Perhaps the save file was different?) due to level equality check errors. Finally, I feel like there could be more logged information server-side.
Steps To Reproduce
Unknown - too little information, if I'm honest.
Version
0.8.9.7 closed Steam alpha server on Windows 7 64-bit. For a while it was hosted on my own code with slight edits, which may suffer from merge issues. I was going to try this against a Steam copy of the server, but by the time I had restarted it there were 0 players and they had all headed to bed.
Additional Information
Stuff from logs:
Various images from the very short time period: Oddly these failures only occurred when I myself was trying to play on 127.0.0.1 - though there were in excess of 10 players or so, and we couldn't start a round until we had lost players, including myself, who left to try to diagnose it.
These logs are over the full length of hosting:
Too many events for prolonged periods of time (at times easily exceeding 10-15 seconds):
Very low network usage, despite being very unstable:
Event counts getting high and network errors appearing to occur:
Excessively high event count + another disconnect:
Disconnect error from a client later into the hosting:
Another error from a client that caused a disconnect: