cwi-dis / VR2Gather

Unity application framework for immersive social VR
MIT License
4 stars 6 forks source link

Orchestrator connections get closed unexpectedly #173

Closed jackjansen closed 5 months ago

jackjansen commented 6 months ago

We see that some times the connection to the orchestrator gets closed.

It's somewhat random, but there are a number of things that seem to influence how often and how quick it happens.

It may be that this has to do with socketio ping/pong packets.

Going to create a branch and try some experiments:

Also want to play with BestHTTP logging (see https://besthttp-documentation.readthedocs.io/en/dev/7.GlobalTopics/Logging/) once I can repeat the problem.

And there are BestHTTP options that are a bad idea, such as AutoReconnect which we should disable.

Also need to see whether we can influence the PingTimer and PingTimeout.

jackjansen commented 6 months ago

Added the option to print (very) verbose Best.HTTP logging. Tried it with a 2-user session, where 1 user has NoRepresentation (and that's the user where I enabled logging.

Noticed two interesting things:

jackjansen commented 6 months ago

The _placeholder stuff is (I think) due to the way we're encoding the data, I think it's a socketIO thing. @troeggla ?

But what we're sending is the 100ms interval PlayerNetworkControllerSelf.SendPlayerData() messages about the player position and such.

But: that should also happen in the MVP case, or the 3-user case.

jackjansen commented 6 months ago

The SendPlayerData() calls may be the reason we are not sending the pings as soon as the session has been created: If I understand the code in best.http OverHTTP1.cs and OverHTTP2.cs correctly the pings are only sent if nothing has been received in the last pingFreq interval, and if pingFreq != 0.

The pingFreq and SendPings are initialised from WebSocketTransport.Manager.Options.WebsocketOptions?.PingIntervalOverride

jackjansen commented 6 months ago

Interesting... The default for Best.SocketIO.WebsocketOptions.PingIntervalOverride appears to be Timespan.Zero... That may indicate (but I'm not sure? @troeggla do you understand this stuff?).

Yes! Found some documentation that is pointing this way: https://benedicht.github.io/BestHTTP-Documentation/pages/best_http2/protocols/socketio/socketio.html#websocketoptions

jackjansen commented 6 months ago

Added code to set the PingIntervalOverride to 20s and not reconnect. I'm not seeing extra ping messages, but that may be because we're still receiving a lot, so there's no need to send a ping to check whether the orchestrator is still alive (we know, because it's sending us data).

Hmm, could it be that there are two ping/pong protocols in SocketIO/WebSockets? One client->server and one server->client, and that it's the other one that is bothering us? @troeggla ?

jackjansen commented 6 months ago

Tried with CulturalHeritage. Disconnects after a while.

It turns out that the machine does not send any SendPlayerData for approximately 30 seconds, probably because it is too busy with all the incoming data.

Need to confirm this.

jackjansen commented 6 months ago

Not really fixed, but worked-around in 89a0596 by being much cleaner about what to do when the orchestrator is disconnected.

The root cause of the issue is that the disconnecting machine is so overloaded that it doesn't send any messages to the orchestrator, therefore the orchestrator disconnects.

We should really see if we can handle this in a better way (for example by reducing the overall Unity frame rate, and ensuring that if our unity frame rate is less than our point cloud stream frame rate we drop frames early.

troeggla commented 6 months ago

If it's a performance problem, maybe this copy here is part of the problem: https://github.com/cwi-dis/VR2Gather/blob/89a0596f29a4882d6cd2f2d00695d8fe8a835ab7/nl.cwi.dis.vr2gather/Runtime/VRTOrchestrator/API/OrchestratorWrapping/OrchestratorWrapper.cs#L544

jackjansen commented 5 months ago

Reopening. The problem still exists. @troeggla can you please add the findings you communicated over Slack?

jackjansen commented 5 months ago

But I'm deleting the branch as it no longer serves a useful purpose.

jackjansen commented 5 months ago

@ashutosh3308 can you please try whether the various ways we got into this problem in Brussels still cause problems?

ashutosh3308 commented 5 months ago

@jackjansen : okay. I will test it on Monday

jackjansen commented 5 months ago

We haven't seen this problem at all over the last week, since switching to the new socketio implementation.