Neos-Metaverse / NeosPublic

A public issue/wiki only repository for the NeosVR project

Desync for Australian/New Zealand users on US headless sessions with high player count #3039

Open Stellanora64 opened 2 years ago

Stellanora64 commented 2 years ago

Describe the bug?

Desync begins to occur, primarily for Australian and New Zealand users, on US headless servers with a higher player count (around 8 to 10 or more).

Note:

Desync begins to affect primarily:

Relevant issues

No apparent relevant issues that haven't already been fixed.

To Reproduce

This issue may be hard to reproduce for users outside Australia and New Zealand, but it occurs consistently in Creator Jams where the player count is around 10 or more and the headless is in the US.

The main requirements for reproducing the bug are to be connected to a headless that is very far from where you are playing (a VPN may help with this) and for the session to have 10 or more users.

Expected behavior

No desync on headless servers that are in a different region from the users who join, and no desync in sessions with a high player count.

Log Files

DESKTOP-F65B6FQ - 2021.9.16.105 - 2021-09-19 10_57_51.log

The issue occurs at 1:28:57 PM.179

Screenshots

No response

How often does it happen?

Always

Does the bug persist after restarting Neos?

Yes

Neos Version Number

2021.9.16.105

What Platforms does this occur on?

Windows, Linux

Link to Reproduction Item/World

No response

Did this work before?

I Don't Know

If it worked before, on which build?

No response

Additional context

No response

Reporters

Neos: Crusher Discord: Crusher#6146

shadowpanther commented 2 years ago

As far as I know, Australia has a very low-bandwidth uplink to the rest of the world, so even if your local connection has high bandwidth, connection to any host outside of Australia would be rate-limited. Desync happens because that rate-limit is saturated with streams from all users coming through the host user. Datamodel changes have lower priority and thus changes get delayed.

I'm not sure what the solution to this would be apart from adding more uplink cables from Australia to the rest of the world to add bandwidth. Maybe mesh networking would help a bit as you would have multiple connections to different hosts instead of one.

Stellanora64 commented 2 years ago

That does make sense, although the only thing that I would believe has a higher priority is the voice mode of other users but this doesn't seem to be the case.

ohzee00 commented 2 years ago

Sadly I'm unsure how much can be done here (besides perhaps the mesh networking mentioned above). The only suggestion I have is asking users to change their settings to Steam Networking Sockets in Neos. Note, though, that this only works if the host has Steam, since it runs through their service; headlesses might need to be configured with that in mind.

Mind you, this isn't a direct fix. I just remember helping some users who were desyncing badly due to latency, and Steam Networking Sockets helped them somewhat; it at least allowed them to play in a semi-usable state.

Frooxius commented 2 years ago

Unfortunately this is a limitation of the transport protocol that we are currently using: it degrades with certain connections (typically high-latency ones, but it can also just be a result of quirks of a given connection).

You can try switching to "Prefer Steam Networking Sockets" in the settings, which can behave a lot better in these scenarios, but it currently doesn't have bandwidth estimation, so it can end up dropping the connection as well.

We are currently waiting on Valve to implement this feature so we can switch to it as the main protocol (or specifically the open version Game Networking Sockets).

https://github.com/ValveSoftware/GameNetworkingSockets

Enverex commented 2 years ago

Would this also be why people on very low speed connections (e.g. ~3Mb/s down, 0.5Mb/s up) are unable to ever reach sync, even in a world with one other person? I have a friend in France and regardless of whether he is host or client, even with a single other person present with almost nothing else happening, he will not be able to maintain sync.

shadowpanther commented 2 years ago

doesn't have bandwidth estimation

Speaking of which, if bandwidth estimation were implemented (by SNS or GNS), could it be used by Neos to downgrade the audio stream bitrate and maybe the pose stream frequency, to allow for eventual sync for clients with very limited bandwidth?

iamgreaser commented 2 years ago

GreaseMonkey here.

The server we were getting desynced on is the main Creator Jam hub server which is hosted in Ukraine. I get a ~305 msec ping to it.

For reference, about 8 years ago I was getting pings of about 200 msec to US West, 260 msec to US East, and 350 msec to Europe. Nowadays I get about 180 msec to US West and I haven't remeasured the rest.

I think I'm hitting the LNL Relay connection in several cases, even though I know that my NAT handles UDP holepunching just fine.

Connection speed is... pretty decent? I suspect it does boil down to latency.

Lexevolution commented 2 years ago

Whenever I had any major desync on some sessions and switched to SNS, I sometimes got this strange side effect where everything seemed like it was synced, but in reality all of my actions, my voice and my perspective of the world were 30+ seconds behind. And to everyone else, I was lagging behind their conversations/actions by 30+ seconds. It looked very different to the regular desyncing issue, which doesn't usually affect voices.

Frooxius commented 2 years ago

@Enverex Don't know, we'd need to gather data on that. Does Steam Networking Sockets make a difference? Could be lots of things with the connection.

@shadowpanther That's unlikely, typically that detail is hidden in the protocol itself and not accessible. Generally those don't pose much of an issue anyways, since those are using streams and can be lost. It's the reliable changes that start queuing up. Usually it's not even the bandwidth itself, but rather packet rate.

@iamgreaser Sometimes it's the combination of connections that just don't work with UDP holepunching. We've had cases where you could have connection A, B and C. A and B would work fine, B and C would work fine with each other, but A and C will always go through the relay.

@Lexevolution Steam Networking Sockets handles things differently, which is why you get that effect. Essentially you hit the fixed bandwidth limit, so everything starts trickling through and getting delayed like that. It's why we need bandwidth estimation to be implemented in the protocol before we can switch to it as the primary one.
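The "everything is 30+ seconds behind" effect described above can be illustrated with a rough sketch. It assumes a sender capped at a fixed rate that queues everything and drops nothing; the function name and the example numbers are hypothetical and are not measurements from Neos or Steam Networking Sockets.

```python
def delivery_delay(seconds_elapsed: float, produced_bps: float, cap_bps: float) -> float:
    """Seconds of delay seen by data produced now, under a fixed send cap.

    Assumes nothing is dropped: whatever exceeds the cap is queued, so the
    backlog (and therefore the delay) keeps growing while production
    stays above the cap.
    """
    if cap_bps <= 0:
        raise ValueError("cap_bps must be positive")
    backlog_bits = max(0.0, (produced_bps - cap_bps) * seconds_elapsed)
    return backlog_bits / cap_bps


# Hypothetical numbers: producing 1.1 Mbit/s against a 1.0 Mbit/s cap.
for minutes in (1, 5, 10):
    delay = delivery_delay(minutes * 60, 1_100_000, 1_000_000)
    print(f"after {minutes} min: about {delay:.0f} s behind")
```

Under these assumptions even a small, sustained overshoot of the cap turns into a delay that affects every stream equally, which would match voice and actions all lagging together.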

sveken commented 2 years ago

Is there any roadmap or really rough date for when we could see that happening? Unfortunately the desync issue for Australians, which I only experience on Neos, pretty much locks us out of things like the MetaMovie (which was a ton of fun when it worked) and other big events.

Frooxius commented 2 years ago

@sveken We asked about bandwidth estimation here: https://github.com/ValveSoftware/GameNetworkingSockets/issues/108 But currently there's no ETA on their end, so we just have to wait and see before the switch. I saw some movement on bandwidth estimation a few months ago.

There are a few things that I'm looking into on our end though that might help improve the network performance, mainly with combining smaller messages into a bigger one to reduce the overall packet-rate.
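A minimal sketch of the idea of combining smaller messages into bigger packets, assuming a simple 2-byte length prefix per message and a hypothetical payload budget of 1200 bytes. It illustrates the general technique only and is not how Neos or LiteNetLib actually frames data.

```python
from typing import Iterable, List

MAX_PAYLOAD = 1200  # assumed per-packet payload budget in bytes (hypothetical)


def coalesce(messages: Iterable[bytes], max_payload: int = MAX_PAYLOAD) -> List[bytes]:
    """Pack length-prefixed messages into as few packets as possible, in order."""
    packets: List[bytes] = []
    current = bytearray()
    for msg in messages:
        framed = len(msg).to_bytes(2, "big") + msg  # 2-byte length prefix
        if len(framed) > max_payload:
            raise ValueError("message does not fit in a single packet")
        if len(current) + len(framed) > max_payload:
            packets.append(bytes(current))  # current packet is full, start a new one
            current = bytearray()
        current += framed
    if current:
        packets.append(bytes(current))
    return packets


def split(packet: bytes) -> List[bytes]:
    """Recover the individual messages from one coalesced packet."""
    out: List[bytes] = []
    i = 0
    while i < len(packet):
        n = int.from_bytes(packet[i:i + 2], "big")
        out.append(packet[i + 2:i + 2 + n])
        i += 2 + n
    return out


small_messages = [b"x" * 20] * 100
packets = coalesce(small_messages)
print(f"{len(small_messages)} messages sent as {len(packets)} packets")
```

The total number of bytes barely changes, but the packet count drops sharply, which is the quantity that matters if the bottleneck is packet rate rather than bandwidth.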

Frooxius commented 2 years ago

I pushed an upgraded LNL library in 2021.10.25.1351, which has a number of improvements that should help with this.

Can you give that a go and see if it's any better please?

sveken commented 2 years ago

Will give it a little test tonight, and I have booked another MetaMovie ticket for the 30th to test there, as that is where I ran into the most desync issues with all the cool things that go on. Will report back afterwards.

Just to confirm, am I best to disable "Prefer Steam Sockets" now with the new update?

Stellanora64 commented 2 years ago

I'll do the same this evening, and I'll see how it goes with the creatorjam this weekend as it is really consistent with desync there.

@sveken yes. Disable preferring steam sockets as the updated libraries only affect LNL networks. (from what I understand)

sikirebirth commented 2 years ago

To continue/answer the questions asked in the other issue,

Frooxius commented 2 years ago

@sikirebirth If you can get screenshots of the user list in the essentials, that can help! Did you check the queued packets yourself on your end, or did the host check?

Ideally if you can get the host to check what it says for you that can help, because the value you're seeing won't be quite up to date, due to the data model being delayed.

sikirebirth commented 2 years ago

The queued packets were always checked by the hosts of the worlds I was in, and by other people in the world. I am unsure when I can jump back on Neos this week, but I will post the requested things whenever I do.

Frooxius commented 2 years ago

Sweet thanks for the info! We'll see if we get some more data from others in the meanwhile too.

BigRedWolfy commented 2 years ago

This weekend I'd like to see how the changes to networking affect both the DeSync and ReSync sessions. Another temporary solution I'll eventually get around to, to try to limit the issue, is splitting both sessions into multiple smaller sessions with tighter controls on the number of people in each one (around 5, maybe more). This would allow many more people to be connected to the headless, spread over multiple worlds, with items that may be present in more than one session synchronised between them, similar to VBLFC using nested sessions. Currently this is much preferable to organising a private intercontinental network bridge between the two headless sessions currently running, since such an arrangement would be very expensive.

sveken commented 2 years ago

I have only done limited testing so far (still waiting for the weekend). I do think there is definitely an improvement: I was able to interact normally in a world with 17 people perfectly fine. However, as soon as the 18th person joined, I noticed the queued packets quickly starting to climb and then stabilize at 24,000, as reported by the user list thingy in the world.

As soon as the user count dropped to 17 or lower the queued packets rapidly started to drop down and go back to 0. Neos was only using 1.5Mbits/s of my bandwidth, this was a headless server world. Will report back with how Metamovie goes.

Frooxius commented 2 years ago

@sveken Thanks for the info!

It almost sounds like it hit some bandwidth limit on the host.

@BigRedWolfy Limiting the session can definitely help as it lowers the overall amount of bandwidth, but we'd like to make sure there can be as many people as possible.

There's still an aspect that the updated LiteNetLib doesn't help much with: it uses the sliding window algorithm, which doesn't scale super well with large latency. The Game Networking Sockets should work much better in this regard, but we're still waiting on bandwidth estimation.
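The reason a sliding window scales poorly with latency can be sketched as a simple upper bound: with a fixed window, at most one window of reliable packets can be acknowledged per round trip. The window size of 64 below is an assumed illustrative value, not a confirmed LiteNetLib setting, and the ping figures quoted in this thread are treated as round-trip times, which is also an assumption.

```python
def max_reliable_rate(window_size: int, rtt_seconds: float) -> float:
    """Upper bound on reliable packets per second for a fixed sliding window.

    At most `window_size` packets can be unacknowledged at once, and an
    acknowledgement takes one round trip to arrive, so the sender can
    complete at most one full window per RTT.
    """
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_size / rtt_seconds


def backlog_after(seconds: float, produced_per_second: float,
                  window_size: int, rtt_seconds: float) -> float:
    """Reliable packets still queued after `seconds` of steady production."""
    surplus = produced_per_second - max_reliable_rate(window_size, rtt_seconds)
    return max(0.0, surplus * seconds)


WINDOW = 64  # assumed window size, for illustration only

for rtt_ms in (140, 305):  # ping values mentioned in this thread
    rate = max_reliable_rate(WINDOW, rtt_ms / 1000)
    print(f"RTT {rtt_ms} ms: at most {rate:.0f} reliable packets/s")
```

Under this model the ceiling depends only on window size and RTT, not on available bandwidth, so a session that generates reliable changes faster than the ceiling builds a queue that only drains once the change rate falls again (for example when users leave).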

sveken commented 2 years ago

I can't remember, unfortunately. Is there a previously visited section I can check for you? My download/upload is 30/30Mbits/s over 4G. I was the only one experiencing large amounts of consistent queued packets; some other users would get a brief 300-800, but it would disappear. The 1.5Mbits/s is just what was reported by Task Manager; I can monitor the specific up/down on the router next time if that helps.

Frooxius commented 2 years ago

Thanks for the info! Do you at least know where they were located and what the ping was?

Stellanora64 commented 2 years ago

After some testing, desync is still occurring (but only at around 16 to 18+ users in the session), and once the session reaches a certain amount of network traffic the queued packets increase by around 100 packets every 2 to 3 seconds without stopping, unless the player count decreases.

But the updated libraries have certainly helped, as I only start getting desync once the player count is around 16 to 18 users, which is nearly 10 more users than before the update. The queued packets also catch up relatively quickly, taking only around 30 to 40 seconds until I'm fully synced once the player count is below the threshold.

The session had ~140 ping when testing.

Here are the logs from the session. Most of the desync occurred at 6:28:39 PM.075

DESKTOP-F65B6FQ - 2021.10.26.9 - 2021-10-28 18_03_48.log

sveken commented 2 years ago

Just did the MetaMovie again. The start was much better, with what felt like only 5 seconds of desync compared to 5 minutes before the update. However, further along in the story the desync did get worse as more things happened; I was told my high score was 50K queued packets. Bandwidth did not go over 1.3mb/s on the download for Neos. The world had 14 people in it.

Definitely an improvement however.

rabbuttz commented 2 years ago

I don't know if my issue is related to this, but I have recently been experiencing out-of-sync issues in certain sessions. This has been happening more frequently since the recent update regarding the network came in. Specifically, item locations, Logix processing, etc. are out of sync. Strangely enough, the voice and user locations seem to be perfectly in sync, and when I check the QueuedPackets in Neos, it shows 0. Sometimes I can't see the user even though they are supposed to be there. When I go to the dash menu and look at the session details, it looks like the user is not there. This problem did not seem to be related to whether or not I was using SteamNetworkingSockets. My internet speed is 368.0Mbps↓/29.5Mbps↑. The logs are attached. The problem occurs when I'm in a session called "SLOT開発室", around 3:00. https://www.dropbox.com/s/69oi5kvwi3tvj5f/DESKTOP-CB6HJR0%20-%202021.10.30.625%20-%202021-11-01%2020_12_06.log?dl=0

Nutcake commented 2 years ago

Hello, I've also been experiencing this issue the past two weeks and just wanted to add another datapoint.

I've been trying to play on a headless server located in the US (Washington to be specific) and I am located in Germany with a 500 MBit/s down, 50 MBit/s up DOCSIS 3.1 based connection. My latency to the server according to the userlist is around 60-80ms (though I have been told that number is just a single direction, so double that for RTT I guess?). I can play fine with around 6-8 people on the server, but with more than that I quickly start getting a massive packet queue in the hundreds of thousands and rising, along with extreme desync. The connection is an LNL connection, and I've tried both connecting directly via IP and joining the session through my contacts list, which seems to use NAT punchthrough.

I've also tried to use German and US-based servers from a high speed VPN service to connect just as an experiment and that made no difference.

Another user from Germany has experienced the same issue in that server and I know that we both use the same ISP provided router (Arris TG3442DE), so since I wanted to get a better router anyways I'm going to buy a new one soon and check if that has any influence on this issue.

I hope this can be resolved soon, as this issue completely prevents me from taking part in the weekend events my community is hosting.

Edit: An update to this: we figured out that the world the headless server was using had a clock in it with extremely unoptimized Logix that was spamming network packets every tick. Removing that clock seems to have fixed the issue for me entirely, though we were "only" able to test it with around 16 people.
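How the clock in that world was actually built is not known, but the general failure mode can be sketched. The `SyncedField` class below is a hypothetical stand-in for a networked value, and the sketch assumes that every write to it produces one outgoing change; it compares writing on every tick with writing only when the displayed value changes.

```python
class SyncedField:
    """Hypothetical stand-in for a networked value; counts outgoing updates."""

    def __init__(self) -> None:
        self.value = None
        self.updates_sent = 0

    def write_always(self, value) -> None:
        # Assumption: every write is sent, even if the value did not change.
        self.value = value
        self.updates_sent += 1

    def write_if_changed(self, value) -> None:
        # Only send when the value actually differs from the last one sent.
        if value != self.value:
            self.value = value
            self.updates_sent += 1


def run_clock(write, ticks_per_second: int = 60, seconds: int = 10) -> None:
    """Drive a clock that only displays whole seconds."""
    for tick in range(ticks_per_second * seconds):
        write(tick // ticks_per_second)


naive, gated = SyncedField(), SyncedField()
run_clock(naive.write_always)
run_clock(gated.write_if_changed)
print(f"every tick: {naive.updates_sent} updates, on change: {gated.updates_sent} updates")
```

Under these assumptions the per-tick version generates one update per frame for every connected user to receive, while the gated version generates one per second, which is why a single object like this can dominate the reliable traffic of a session.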