Neos-Metaverse / NeosPublic

A public issue/wiki only repository for the NeosVR project
197 stars 9 forks source link

Unexplainable Connection Failures #3357

Open Enverex opened 2 years ago

Enverex commented 2 years ago

Describe the bug?

When trying to join sessions, it occasionally seems to fail, trying again may fail again but it will eventually work. The host sees the client so it gets that far, but doesn't complete.

From the client's perspective, you never enter the session. The connection messages (in the bottom left) disappear but the session never loads and there's no error message shown in-game.

Relevant issues

It could be related to the "kicked from world" reports in https://github.com/Neos-Metaverse/NeosPublic/issues/3265 but I've no evidence that's the case, just adding here just in case.

To Reproduce

Try joining a session. I can't really elaborate more than that unfortunately.

Expected behavior

For sessions to join or fail clearly, or in cases like this where there's no reason for it to fail, for the join to succeed.

Log Files

Example log from both the client and server side of this happening. On the client side, I saw messages about Punchthrough (attempt 1-5) then LNL Relay then the messages disappeared, but nothing had happened.

The headless I was connecting to has no NAT, firewall or anything else, it's a direct IP assignment to that machine.

== Client
19:02:27.981 (119 FPS)  Resolving SessionID: 
19:02:27.981 (119 FPS)  Joining session: lnl-nat:///S-U-MetaMovieCore1:MMLobbyNew
19:02:27.983 (119 FPS)  NetworkInitStart
19:02:27.984 (119 FPS)  Network manager: NetX.LNL_Manager - priority: 0
19:02:27.984 (119 FPS)  Network manager: FrooxEngine.SteamNetworkManager - priority: -100
19:02:27.987 (119 FPS)  Connecting to: lnl-nat:///S-U-MetaMovieCore1:MMLobbyNew
19:02:27.989 (119 FPS)  Punchthrough attempt: 0
19:02:29.040 (119 FPS)  Starting GatherJob for https://operationaldata.neos.com/thumbnails/f7ec743d-8d12-4847-96df-1f0d66e8067c-v2.webp, AttemptsLeft: 5
19:02:29.062 (119 FPS)  Punchthrough attempt: 1
19:02:29.635 (119 FPS)  Finished Gather for: https://operationaldata.neos.com/thumbnails/f7ec743d-8d12-4847-96df-1f0d66e8067c-v2.webp. Elapsed: 0.5948495 s
19:02:30.064 (119 FPS)  Punchthrough attempt: 2
19:02:31.066 (119 FPS)  Punchthrough attempt: 3
19:02:32.069 (118 FPS)  Punchthrough attempt: 4
19:02:33.071 (118 FPS)  NAT Punchthrough failed, Connecting to Relay
19:02:34.254 (118 FPS)  Starting GatherJob for https://operationaldata.neos.com/thumbnails/06458b38-1bb2-4f8a-bfa0-6341a5cc749f-v2.webp, AttemptsLeft: 5
19:02:34.256 (118 FPS)  Starting GatherJob for https://operationaldata.neos.com/thumbnails/143fab93-29be-455f-aaa5-4c6811aac3fe-v2.webp, AttemptsLeft: 5
19:02:34.734 (118 FPS)  Finished Gather for: https://operationaldata.neos.com/thumbnails/06458b38-1bb2-4f8a-bfa0-6341a5cc749f-v2.webp. Elapsed: 0.4795639 s
19:02:34.782 (118 FPS)  Finished Gather for: https://operationaldata.neos.com/thumbnails/143fab93-29be-455f-aaa5-4c6811aac3fe-v2.webp. Elapsed: 0.5257852 s
19:02:37.755 (118 FPS)  NAT INTRODUCTION SUCCESS: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew
19:02:37.755 (118 FPS)  NAT INTRODUCTION SUCCESS: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew
19:02:37.755 (118 FPS)  NAT INTRODUCTION SUCCESS: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew
19:02:37.755 (118 FPS)  NAT INTRODUCTION SUCCESS: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew
19:02:37.755 (118 FPS)  NAT INTRODUCTION SUCCESS: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew
19:02:37.767 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.767 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.767 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.768 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.768 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.768 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.769 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.769 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.770 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.770 (118 FPS)  NETWORK RECEIVE UNCONNECTED: 46.4.158.30:47581, token: s-u-metamoviecore1:mmlobbynew, connection token: S-U-MetaMovieCore1:MMLobbyNew
19:02:37.786 (118 FPS)  LNL_Connection OnPeerConnected: 46.4.158.30:47581, IsOpen: False
19:02:37.786 (118 FPS)  Connecting to relay but connected to host directly. Switching connection state back to direct
19:02:37.787 (118 FPS)  Connected to: lnl-nat:///S-U-MetaMovieCore1:MMLobbyNew
19:02:37.792 (118 FPS)  Waiting for join grant
19:02:38.127 (118 FPS)  Join Granted, Begin Data Model Init
19:02:39.153 (119 FPS)  Disconnected: 52.165.32.220:12601, reason: ConnectionFailed, socketErrorCode: Success
== Server
7:02:37 PM.546  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.546  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.546  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.546  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.547  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.547  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.547  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.547  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.548  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.548  NAT INTRODUCTION SUCCESS: 91.125.80.36:50283, token: s-u-metamoviecore1:mmlobbynew
7:02:37 PM.562  LNL Listener - connection request: 91.125.80.36:50283
7:02:37 PM.562  Peer Connected: 91.125.80.36:50283
7:02:37 PM.889  Verifying Join Request. UserID: U-Enverex, Username: Enverex, AccessLevel: FriendsOfFriends
7:02:37 PM.889  Join Granted For UserID: U-Enverex, Username: Enverex
7:02:37 PM.900  User Joined MetaMovie: Alien Rescue (Live). Username: Enverex, UserID: U-Enverex, AllocID: 10, AllocIDstart: 1,  MachineID: gxy6myuoy0easqkfxbkwuq
7:02:37 PM.901  User Spawn MetaMovie: Alien Rescue (Live). Username: Enverex, UserID: U-Enverex, MachineID: gxy6myuoy0easqkfxbkwuq
7:02:37 PM.901  Spawning User Enverex (IDDFB3200)
7:02:37 PM.901  User Enverex Role: Admin, HasFingerTracking: False, HasEyeTracking: False, HasLipTracking: False
7:02:38 PM.969  Peer Disconnected: 91.125.80.36:50283, reason: RemoteConnectionClose, socketErrorCode: Success
7:02:38 PM.986  User Left MetaMovie: Alien Rescue (Live). Username: Enverex, UserID: U-Enverex, AllocId: 10, AllocIDstart: 1, MachineID: gxy6myuoy0easqkfxbkwuq
7:02:38 PM.989  Destroying User: User IDDFB3200 (Alloc: 10) - UserName: Enverex, UserId: U-Enverex, MachineId: gxy6myuoy0easqkfxbkwuq, Role: Admin
Currently updating user: User ID2C00 (Alloc: 0) - UserName: MetaMovieCore1, UserId: U-MetaMovieCore1, MachineId: 98fkoxwp5ukzs_47cqe2sw, Role: Admin

   at void BaseX.UniLog.Log(string message, bool stackTrace)
   at void FrooxEngine.UserRoot.Slot_OnPrepareDestroy(Slot slot)
   at void FrooxEngine.Slot.PrepareDestruction()
   at void FrooxEngine.World.OnSlotRemoved(SyncBagBase<RefID, Slot> bag, RefID key, Slot slot)
   at void FrooxEngine.SyncBagBase<BaseX.RefID, FrooxEngine.Slot>.ElementRemoved(RefID key, Slot element)
   at bool FrooxEngine.SyncBagBase<BaseX.RefID, FrooxEngine.Slot>.InternalRemove(RefID key, bool sync, bool change)
   at bool FrooxEngine.SyncBagBase<BaseX.RefID, FrooxEngine.Slot>.Remove(RefID key)
   at void FrooxEngine.World.InternalRemoveSlot(RefID key)
   at void FrooxEngine.World.RemoveSlot(Slot slot)
   at void FrooxEngine.Slot.DestroySelf()
   at void FrooxEngine.Slot.Destroy(bool sendDestroyingEvent)
   at void FrooxEngine.Slot.RelocateOrDestroyEmpty(Func<Slot> getRelocationTarget)
   at void FrooxEngine.Slot.DestroyPreservingAssets(Slot relocateAssets, bool sendDestroyingEvent)
   at void FrooxEngine.SimpleUserSpawn.OnUserLeft(User user)
   at void FrooxEngine.World.RunWorldEvents()
   at bool FrooxEngine.World.RefreshStep()
   at bool FrooxEngine.World.Refresh()
   at void FrooxEngine.WorldManager.UpdateStep(double maxMilliseconds)
   at bool FrooxEngine.WorldManager.RunUpdateLoop(double maxMilliseconds)
   at void FrooxEngine.Engine.UpdateStep(double maxMilliseconds)
   at void FrooxEngine.Engine.RunUpdateLoop(double maxMilliseconds)
   at void FrooxEngine.StandaloneFrooxEngineRunner.UpdateLoop()
   at void System.Threading.ThreadHelper.ThreadStart_Context(object state) in /build/mono/src/mono/mcs/class/referencesource/mscorlib/system/threading/thread.cs:line 71
   at void System.Threading.ThreadHelper.ThreadStart() in /build/mono/src/mono/mcs/class/referencesource/mscorlib/system/threading/thread.cs:line 109
7:02:38 PM.992  Exception running asynchronous task:
System.AggregateException: One or more errors occurred. (Object reference not set to an instance of an object) ---> System.NullReferenceException: Object reference not set to an instance of an object
  at FrooxEngine.User.get_AvatarAccessKey () [0x00011] in <5c4d7b9711784234a3fa14befde4fcf9>:0
  at FrooxEngine.User.get_HasNewAvatarAccessKey () [0x00000] in <5c4d7b9711784234a3fa14befde4fcf9>:0
  at FrooxEngine.CommonAvatarBuilder.SpawnCloudAvatar (FrooxEngine.User user, FrooxEngine.CommonAvatar.AvatarManager avatarManager) [0x00365] in <5c4d7b9711784234a3fa14befde4fcf9>:0
  at FrooxEngine.CommonAvatarBuilder+<>c__DisplayClass36_0.<BuildAvatar>b__5 () [0x00073] in <5c4d7b9711784234a3fa14befde4fcf9>:0
   --- End of inner exception stack trace ---
---> (Inner Exception #0) System.NullReferenceException: Object reference not set to an instance of an object
  at FrooxEngine.User.get_AvatarAccessKey () [0x00011] in <5c4d7b9711784234a3fa14befde4fcf9>:0
  at FrooxEngine.User.get_HasNewAvatarAccessKey () [0x00000] in <5c4d7b9711784234a3fa14befde4fcf9>:0
  at FrooxEngine.CommonAvatarBuilder.SpawnCloudAvatar (FrooxEngine.User user, FrooxEngine.CommonAvatar.AvatarManager avatarManager) [0x00365] in <5c4d7b9711784234a3fa14befde4fcf9>:0
  at FrooxEngine.CommonAvatarBuilder+<>c__DisplayClass36_0.<BuildAvatar>b__5 () [0x00073] in <5c4d7b9711784234a3fa14befde4fcf9>:0 <---

   at void BaseX.UniLog.Error(string message, bool stackTrace)
   at void FrooxEngine.CoroutineManager.CheckExceptions(Task task)
   at void System.Threading.Tasks.ContinuationTaskFromTask.InnerInvoke() in /build/mono/src/mono/external/corert/src/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs:line 48
   at void System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() in /build/mono/src/mono/external/corert/src/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs:line 2334
   at bool System.Threading.ThreadPoolWorkQueue.Dispatch() in /build/mono/src/mono/mcs/class/referencesource/mscorlib/system/threading/threadpool.cs:line 785
   at bool System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() in /build/mono/src/mono/mcs/class/referencesource/mscorlib/system/threading/threadpool.cs:line 1261

Screenshots

No response

How often does it happen?

Sometimes

Does the bug persist after restarting Neos?

Yes

Neos Version Number

2021.11.10.1253

What Platforms does this occur on?

Windows, Linux

Link to Reproduction Item/World

No response

Did this work before?

Yes

If it worked before, on which build?

No response

Additional context

I'm a little confused by reason: ConnectionFailed, socketErrorCode: Success which sounds a bit like a "Task Failed Successfully" type message.

Reporters

No response

Enverex commented 2 years ago

To add to this, Neos won't then let me join the session after a failure as it seems to think I'm there. I'm not and can't switch to it, but the button changes to "Focus" rather than "Join". Though I am showing as Away rather than present. This eventually times out I believe.

image

DeliriousJax commented 2 years ago

I want to add on my logs here. I have a very similar bug where both this happens as well as people loading in, but being so desynced that they remain as headsets at spawn. It seems that in their perspective they seem to be wandering around and grabbing objects. (Even objects they grab in world appears by their headset at spawn). What others see is them at spawn as either a headset and hands, or their avatar will load in and it remains in a crushes, unmoving pose. You can sometimes see their context menu and if they grab something in the world form their perspective, it appears in their hand momentarily until they put it back at which point it reappear in the world where they drop it from their perspective.

Here are some logs from a session I'm hosting now where it happened multiple times.

DESKTOP-VNDI8OR - 2021.11.5.486 - 2021-11-05 17_26_47.log . To elaborate, their voices come in fine and the rest of the world reacts fine to them. But it causes a lot of odd desync issues such as people being invisible or stuck elsewhere in world for one another while for some they are having conversations with multiple people who are unaware of one another.

inzanesonikku commented 2 years ago

here are a collection of logs from last night that these spawn bugs kept on happenin

DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 17_22_13.log DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 17_40_39.log DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 19_17_50.log DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-12 23_12_21.log g.

inzanesonikku commented 2 years ago

here are a collection of logs from last night that these spawn bugs kept on happenin

DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 17_22_13.log DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 17_40_39.log DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 19_17_50.log DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-12 23_12_21.log g.

EDIT - DESKTOP-6T9887R - 2021.11.10.1265 - 2021-11-13 21_33_51.log

this is one from tonight

MrMegaTronic commented 2 years ago

I'm having issues with clients connecting to my headless server. This has only been an issue within the last two and a half weeks. Some have intermittent connections while others try to connect, it attempts multiple NAT punch-throughs, then LNL relay and then fails. This is affecting our business' use of Neos.

Is there clear guidance on how to get past this LNL-relay connection issue? It will be expensive to set up a dedicated IP for my headless server just to have it not solve the problem. We are going to be having a meeting with our client soon and need to get this working.

A consistent observation is that the successful connections seem to evoke the FrooxEngine.SteamNetworkManager and the failed connections don't. Also, shown in the attached screenshot, the failed connection doesn't resolve to the correct IP address but the one using the Steam Network Manager does. Aside from being 1,000 miles apart from each other, all clients are all using the Steam version of Neos.

Screenshot 2021-11-22 125418

Enverex commented 2 years ago

Also, shown in the attached screenshot, the failed connection doesn't resolve to the correct IP address but the one using the Steam Network Manager does.

The "wrong IP" in your screenshot appears to be the IP of the Microsoft hosted relay server. I'm seeing it doing this fallback even when it shouldn't be using the relay at all (then failing anyway). It's affecting a lot of our clients at random with seemingly no workaround.

bontebok commented 2 years ago

@Enverex Tonight I did some diagnostics where I forced myself to use LNL Relay by removing the Static Port NAT from my OPNsense (pfSense) router and closing down Steam to force the LNL Relay to be used.

However what surprised me is that I found that I could not establish any connection through the LNL Relay. I believe right now the LNL Relay is not online or working correctly. It may be not working or potentially under heavy load. I live very close to the datacenter where the relay server resides and I'm not having any success. (I will be submitting a new GitHub issue shortly with logs)

Early in November, I updated the Networking page on the Wiki to provide some details on optimal router configuration, is it possible that LNL Relay is needed for some clients due to having a strict NAT? Wiki

Regarding the disconnect issue, is it possible that the fallback is being caused by a failed connection attempt in the early stages of the connection and falling back to the LNL Relay this causing the clients to be disconnected? I have experienced what you described and typically the only time that the disconnect happens to me is if I join the host very quickly after starting a session. If I wait a bit longer or retry my connection, I can believe I can avoid being disconnected.

Enverex commented 2 years ago

Both of my test servers are dedicated servers in a proper data centre, they are direct connected to the internet on dedicated IP addresses with no firewalls or NAT (of any type) in place, thus port forwarding and NAT issues and such should never come into play. For the same reason the relay should also never be needed.

Regarding the "never connected in the first place, but shows as connected on the host side" - again this is to servers that should never require the relay, but in these cases, neither the host nor the client have recently started a session.

bontebok commented 2 years ago

You're right, even a Type-3 NAT would be able to connect to a Type-1 (public IP). But oddly, it isn't working on my end when I try connecting to your headless. I assume that your headless is the one hosting the worlds Catband and Cineworld/Cineplex? I set my NAT up to be a Type-3 and closed Steam to not allow Steam Networking Sockets to connect and was unsuccessful in connecting.

What I observed in the packet capture is clearly a direct contact with your server (46.4.x.x) on an outbound port of 12105, I also see your reply clearly coming back to my client sourced from port 12105 and in the packet it contains details the session ID.

However during the client connection, the client doesn't establish any sort of direct connection and instead attempts to UDP hole punch, eventually reaching LNL Relay and failing. See below, your IP is redacted for privacy.

image

It should never be the case that a Type-3 NAT should not be able to establish a UDP connection to a Type-1. Clearly based on the response from your server we are immediately establishing a connection and I am receiving your packets direct without the need for a relay or any kind of UDP hole punch techniques. It looks like your end keeps sending a response to my request but they appear to be ignored in my client.

However, if I switch to a Type-2 Static Port NAT (traditional consumer NAT), I connect almost instantaneously and I never even see the word "LNL" on my screen, it goes directly from Connecting to Synching Initial World State.

I believe Neos is port dependent on the incoming source port for all clients. This makes sense as even a Type-2 to Type-2 would use static ports and only a Type-3 would not. I don't believe this should be the case as the client should be able on a randomized incoming port and the host would simply validate the client based on a token or something rather than requiring that the client matches the source port it is connecting outbound on.

Do you know if a client behind a Type-3 NAT would be able to connect direct to your server w/o the need to use LNL Relay or Steam Networking Sockets prior to the networking changes in October?

Enverex commented 2 years ago

Prior to the changes in October I'd not had any complaints about connection issues on either of the servers I'm managing and I'd not experienced the issue myself prior to that point (which I have now several times).

shiftyscales commented 2 years ago

This issue is currently being investigated and reviewed internally by the QC team, who have also been assisting in documenting temporary work-arounds users can utilize until this issue is able to be reviewed, and resolved by the development team.

Thank you for your patience, and I apologize on the lack of communication regarding this frustrating issue. I will see to it that it is prioritized as soon as it's able to be.