MovingBlocks / Terasology

Terasology - open source voxel world
http://terasology.org
Apache License 2.0
3.68k stars 1.34k forks source link

Substantial delay in connecting a player to a multiplayer server running JoshariasSurvival #4154

Open Cervator opened 4 years ago

Cervator commented 4 years ago

What you were trying to do

Connect normally with a 2nd client to an already running game server without extensive load times. Varies based on modules enabled.

A hosting client doesn't take an inordinate amount of time standing up the server and connecting to it. Issue does not present in plain CoreSampleGameplay.

With ManualLabor enabled (identified as a culprit in #4126) connecting took a moderate 30-45 seconds for player 2. With the rest of JS enabled it went to a full 3 minutes in one test.

What actually happened

Logging in for JoshariasSurvival did succeed after #4148 was merged, and didn't slow down the host, however it took a full 2-3 minutes for no apparent constructive reason. @keturn has done some initial research noted on #4126 and on Discord (in #stabilization I think) that points to more potential issues like what #4148 fixes, that might be rooted in how Gestalt handles some stuff. So this may be a fix needed over there (either as a point release for the 5.x line or eventually by upgrading to 7.x)

Progress bar may hold some hints as to which parts are causing the delays. Asset scanning and system initialization seemed two big ones. Just trying to look real quick it seemed like about 45s for asset scanning. 1m for initializing systems for the first player to join a headless server. Think initialize systems took longer with a hosting client + client.

How to reproduce

keturn commented 4 years ago

Recapping some of my ramblings from #stabilization:

Watching the connection/loading process with a profiler does reveal some slow things.

When I looked at some of those things (which do relate to asset scanning and initialization), I found some places I believe have potential for optimization.

The hitch is that I'm new to these tools for profiling multi-threaded Java applications, and I'm new to those parts of the source, and I have very little surety that the things I was looking at are the same things responsible for making multiplayer so much dramatically worse than it was a release or two ago. After all, it's not like gestalt v5 changed in that time.

(though now that I mention it, it's possible we did change something about the gestalt dependency in the TeraNUI branch.)

Bisecting the history to discover when the issue started would be very useful. Not only would it provide more confidence that we were looking at the right problem, I think it would do wonders for morale and maintainability to be able to say "this is what changed and this is the impact it had" to replace my current vague dread that something changed but no-one can say what.

I'm daunted by the thought of conducting a search that requires navigating not only engine's version history, but that of JoshariasSurvival and its dependencies as well. I don't know if there's someone more fearless who is willing to tackle that, or if we first need to set up some tooling to facilitate that kind of multi-repo search process.

pollend commented 4 years ago

so I ran the game with the CPU profiler and I think it might be a problem with how we load classes between NUI and the gestalt sandbox. wonder if nui is trying to scan for all classes from one module to N - 1 modules. might be in a process of constantly scanning and re-scanning the entire class-path. This is executed from a lamdba so I'm not sure where the call chain starts.

image

I added some debugging to record times in ModuleEnviromentSandbox. Not sure why its just InventoryComponent and InventoryAccessComponent?

22:47:40.181 [main] INFO  o.t.p.t.r.ModuleEnvironmentSandbox - org.terasology.logic.inventory.InventoryComponent :37505
22:47:41.970 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"eafca496-b166-42dc-8af5-3afbc8c04448"}
22:47:46.929 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"8d4918b6-83ab-449a-8869-cdbd844a1e89"}
22:47:51.938 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"04c7d5b7-8a2f-4a25-88ad-c63126af53bc"}
22:47:56.953 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"6063d887-4c3b-409e-b0b7-2e6b48e53018"}
22:48:01.928 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"df9d927b-be4f-47c0-8222-068f62ae3dc4"}
22:48:06.939 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"1f7bfebd-6ced-4756-b5d1-b6cd313c2409"}
22:48:11.953 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"02a39811-3ce4-4e31-ab0d-388998908e34"}
22:48:15.692 [main] INFO  o.t.p.t.r.ModuleEnvironmentSandbox - org.terasology.logic.inventory.InventoryAccessComponent :35497
22:48:15.760 [main] ERROR o.t.e.e.internal.PojoEntityManager - Adding a component (class org.terasology.logic.common.DisplayNameComponent) over an existing component for entity 237
22:48:15.760 [main] WARN  o.t.l.nameTags.NameTagClientSystem - Can't create player based name tag for character as owner has no client component
22:48:15.766 [main] ERROR o.t.e.e.internal.PojoEntityManager - Adding a component (class org.terasology.logic.common.DisplayNameComponent) over an existing component for entity 274
22:48:15.766 [main] WARN  o.t.l.nameTags.NameTagClientSystem - Can't create player based name tag for character as owner has no client component
22:48:16.962 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"2db7f5cf-7c79-42a6-aab4-9bfedcf35c73"}
22:48:21.971 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"4c188f8c-ee51-427a-b6a5-0b7f427887db"}
22:48:26.940 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"d3e7879a-cad3-410a-9ce1-69c639982325"}
22:48:31.958 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"457993e1-b07d-4b2f-b73a-27d0b2c97ab5"}
22:48:36.967 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"73b9ab56-98bd-4279-b91f-87a39cf12420"}
22:48:41.934 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"2c07788c-4331-498e-acf2-a927b7fccf09"}
22:48:46.943 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"f18f0b7e-ef67-4275-ab25-22c71a23bef4"}
22:48:50.067 [main] INFO  o.t.p.t.r.ModuleEnvironmentSandbox - org.terasology.logic.inventory.InventoryComponent :34297
22:48:51.967 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"87d03130-0362-46a9-8d42-1d5b18a90086"}
22:48:56.977 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"e0fd31dd-4280-41a3-8dd9-a08a5003f8c1"}
22:49:01.936 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"b33d61e6-1f62-4e2d-9f81-ee02dbe125e8"}
22:49:06.949 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"198b2272-2873-4a92-aa39-78d867a3e800"}
22:49:11.974 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"0c80ea1a-1bdf-4380-9dc7-1bc3e0c52b78"}
22:49:16.982 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"cfae9a7f-4486-4c23-abc1-7f86bb862ad5"}
22:49:21.951 [Thread-3] DEBUG c.j.d.entities.pipe.UnixPipe - Received packet: Pkt:FRAME{"evt":null,"data":{"assets":{"large_image":"515746954660151306"},"timestamps":{"start":1602049582000},"name":"Terasology","details":"Name: Player32747","state":"In Lobby","secrets":{},"application_id":"515274721080639504"},"cmd":"SET_ACTIVITY","nonce":"136a8a55-0f1a-40bc-92f9-134f922c40b6"}
22:49:25.157 [main] INFO  o.t.p.t.r.ModuleEnvironmentSandbox - org.terasology.logic.inventory.InventoryAccessComponent :35084

Here is also the relevant call-stack.

image

keturn commented 4 years ago

Hmm, that debug log is information I didn't check before.

All those "received packet" lines are five seconds apart. To within 1/100th of a second, usually. Is that expected, as some kind of keepalive packet, or does something else have a 5 second delay or timeout specifically built in to it?

pollend commented 4 years ago

Look at that ModuleEnvironmentSandbox I added some debug and timed the method there. I got a clue from running this from the profiler. 34297 is in milliseconds so each one took a half a minute to instantiate those components. So that is why it took so long to load the world. I don't even understand what those events are for since they only instantiate a Component that is fed somewhere and its by name so that might resolve to another component with the same name. This codepath from the stacktrace is only called when the world is being loaded. I think @skaldarnar might have a better clue but this narrows it down to NetworkEntitySerializer line 262.

keturn commented 4 years ago

The stack trace looks like the mirror of the one we dealt with in #4148; this is the deserialization side of the serialization that was stalling the server.

and we could try doing some optimizations here too, to the class loader and type handler stuff. If we're lucky, we'll find some shortcut that skips through 34,297 ms to something usable, but that's a lot. Thing is, the class loader code hasn't changed recently.

Oh, the ComponentClassTypeHandler, on the other hand: #4016 (oh look who approved the changes to ComponentClassTypeHandler.getFromString? It was me!)

I'm not convinced that's the problem yet, but it's by far the closest thing I've seen yet to a change in the right timeframe to a relevant class.

keturn commented 4 years ago

Other TypeHandler stuff that first appeared in v4.0: #3903 (118b63a16cfb6ecc9da2ff2f2b15ed655ccc52e2)

Oh, and the NUI merge. Actually changed quite a bit of the stuff involving the type registry and reflections, largely in this commit: https://github.com/MovingBlocks/Terasology/pull/3819/commits/d3b20a7c195e02b02a9dfbd2971327dcb3fd5313

because a lot of the reflections stuff went to nui-reflect: https://github.com/MovingBlocks/TeraNUI/tree/v1.0.0/nui-reflect/src/main/java/org/terasology/reflection

heck, okay, that's a lot. When I think of NUI I think of the input and widget stuff, I didn't realize so much of the type handling moved. 😐

pollend commented 4 years ago

That is really kind of strange, why is all the reflection logic moved to there. sounds like it would fit better in gestalt.

keturn commented 4 years ago

I was looking for a way to blame #4016 because adding a findSubTypeOf where there wasn't one before would explain a lot. I thought "maybe that class wasn't unused after all, maybe all TypeHandler classes get auto-registered somehow?" — but no, that doesn't look true, only the ones with the registration annotation are auto-registered and that one didn't have it.

3913 might be interesting, because

  1. it's linked to https://github.com/Terasology/Machines/pull/29 and Machines and is in response to something that showed up when used with ManualLabor:

    This issue was observed through the ManualLabor:AssemblyTable not showing the assembly slots (InventoryGrids) when placing the assembly table.

  2. the discussion specifically mentions InventoryAccessComponent https://github.com/MovingBlocks/Terasology/pull/3913#issue-410368314

that lines up with our observations very well!

keturn commented 4 years ago

so that's probably why it only showed up recently, because this is a new kind of thing that wasn't serialized before that change to Machines,

and my guess is that the reason why it's so good at exposing this performance problem is because it covers all subclasses of Component, and there are a heckton of those.

I'm not sure any other type handler does that findSubTypeOf thing, and certainly not with a type that has as many subtypes as Component!

The recommendation to use findSubTypeOf and getSubTypeIdentifier came from @eviltak here: https://github.com/Terasology/Machines/pull/29#issuecomment-620912691

sooooo, having started the evening by saying "gosh, I'm not really sure this is a problem we should solve by trying to performance-optimize those things that came up in the profile," I have now come full circle back to "gosh, those things in the profile are being used in ways they weren't before, maybe we need to optimize them." 😂

keturn commented 4 years ago

One optimization possibility I had sketched out earlier:

ModuleEnvironmentSandbox.findSubtypeOf https://github.com/MovingBlocks/Terasology/blob/eff0c0551dfd7221a77a3a8a1fec4812979a8006/engine/src/main/java/org/terasology/persistence/typeHandling/reflection/ModuleEnvironmentSandbox.java#L42-L43
calls https://github.com/MovingBlocks/TeraNUI/blob/82556b66810b48c349352dad2443a8678e77f247/nui-reflect/src/main/java/org/terasology/reflection/TypeRegistry.java#L129-L131
calls https://github.com/MovingBlocks/TeraNUI/blob/b4c243adf0ff856a7403aff82524093be6221e61/nui-reflect/src/main/java/org/terasology/reflection/ReflectionUtil.java#L471-L473

which is doing this clever thing to parallelize the class loading which should work great, unless you have a custom class loader that's not parallel-capable: https://discord.com/channels/270264625419911192/714313228271747152/754988388272177232

I think it would be safe to turn that on for ModuleClassLoader, so we should probably do that, but needs to be thought through.

but the other thing is, walking through that chain of events, it's running through classloaders for a ton of classes (every Component!) only to then filter most of them out by name:

https://github.com/MovingBlocks/Terasology/blob/eff0c0551dfd7221a77a3a8a1fec4812979a8006/engine/src/main/java/org/terasology/persistence/typeHandling/reflection/ModuleEnvironmentSandbox.java#L45

but it had to know their names before even loading them in the first place, so it really seems like we should be able to avoid most of those loadClass calls.

and we do want to avoid loadClass calls, because each one results in one, maybe two calls for every loaded module, and we have a lot of modules.

keturn commented 4 years ago

the other thing in ModuleClassLoader.loadClass I'm suspicious of is that I think a class loader should check to see if the class is already loaded before calling up the chain, as described for https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ClassLoader.html#loadClass(java.lang.String,boolean)

but I'm not sure ModuleClassLoader checks that first.

I don't yet understand the ModuleClassLoader thing of each classloader having a parent but ModuleClassLoaders also have a baseClassLoader, so I'm really not sure what order things happen in.

keturn commented 4 years ago

(also alllllllllll of this stuff is subject to change with #4144)

keturn commented 3 years ago

Delay on first connection to server

keturn commented 3 years ago

Now that Terasology is using gestalt-module v7, this is up for re-investigation.

See if we can still reproduce this. Maybe it was fixed by v7 itself.