Performance: Optimize frame time of network correction

fishfolk / jumpy

Tactical 2D shooter in fishy pixels style. Made with Rust-lang 🦀 and Bevy 🪶

https://fishfolk.org/games/jumpy/

Performance: Optimize frame time of network correction #846

Open MaxCWhitehead opened 1 year ago

MaxCWhitehead commented 1 year ago

Description

While profiling a networked match in the dev-optimized profile, the client often rewinds / resims during correction after receiving another player's input while it has predicted ahead.

A rewind/resim that advances four steps can sometimes take 30ms+, dropping the game's framerate below the 60fps target. Times are worse when a correction requires simulating more than four frames. Our max prediction window is 8 frames, and I often see corrections simulating 5-6 frames with frame times pushing 40+ms.

Below is screenshot of network update loop.

On average, advancing the game a single frame takes ~4.5ms, loading game state for a rewind takes ~2.7ms, and saving game state takes ~2.6ms. The average cost of each frame in a correction is roughly 7ms. Improving times in any of these scopes should improve conditions here.
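As a rough cross-check of these numbers, here is a minimal sketch of the cost model. It assumes a correction loads state once for the rewind and then advances and saves once per resimulated step; that split is an assumption about the rollback loop, not something measured here, and the names are illustrative.

```rust
/// Average per-scope timings quoted above, in milliseconds.
const ADVANCE_MS: f64 = 4.5;
const LOAD_MS: f64 = 2.7;
const SAVE_MS: f64 = 2.6;

/// Frame budget at the 60fps target, in milliseconds.
const FRAME_BUDGET_MS: f64 = 1000.0 / 60.0;

/// Estimated cost of one correction that resimulates `steps` frames.
/// Assumption: one load for the rewind, then advance + save for each step.
fn correction_cost_ms(steps: u32) -> f64 {
    LOAD_MS + f64::from(steps) * (ADVANCE_MS + SAVE_MS)
}

/// True if a correction of `steps` frames is estimated to blow the frame budget.
fn exceeds_budget(steps: u32) -> bool {
    correction_cost_ms(steps) > FRAME_BUDGET_MS
}
```

With these averages, four steps comes to about 31ms and six steps to about 45ms, which is in line with the 30ms+ and 40+ms frames described above.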

zicklag commented 1 year ago

Maybe this is the problem we are looking for. Maybe we're just too slow.

If one client slows down while running prediction frames, I wonder if it exacerbates the issue by delaying the sending of its inputs across the network, too.

I really would have thought that the save and load times should be lower, since it's literally just cloning the world. I wonder if the allocator is causing the bulk of the slowdown there, or if it's something else.

MaxCWhitehead commented 1 year ago

I was thinking this wasn't causing latency, but thinking about it more: if a frame takes 30-40ms, we won't queue the next frame's input for that long, which would delay things a bit... If one client hits a slow frame, its input goes out slightly later, the other client predicts further ahead, corrects further back, has a slower frame itself, and then the first client gets delayed input. If both clients start getting slower frames, that does buy time for input to replicate, vs one getting further ahead, so I don't think this is a full feedback loop/spiral or anything. But it probably contributes to latency and impacts all other clients for at least the next frame or two, slowing everyone down.

I did find that raising the predicted frame window to 10 locally makes freezes much less frequent, so we def aren't spiraling out of control, just spiking slightly above the window, at least for my local setup / network conditions. Raising to 10 does increase the potential worst case for these resim frames, tho I didn't often see it go much past 6 steps in a resim, so it's good that we don't just get worse with a bigger window. But these corrections were slow enough to impact local framerate, which is why I opened this issue. I am considering maybe we should allow 10 frames or something, the impact seems like a net positive. Would like to get more data and see how other people's games fare though, before trying to just tune the problem away. Improving perf will benefit all conditions tho; it's the safer path forward and only good things should come from it.
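To make the window trade-off concrete, a minimal sketch of the gating rule, assuming the client freezes once it is a full window ahead of the last confirmed frame. `can_predict` and its parameters are hypothetical names, not the ggrs API.

```rust
/// Hypothetical check: may the client simulate another predicted frame?
/// `confirmed` is the last frame for which all remote inputs have arrived.
/// Assumption: the client stalls once it is `max_prediction` frames ahead.
fn can_predict(current: u32, confirmed: u32, max_prediction: u32) -> bool {
    current.saturating_sub(confirmed) < max_prediction
}
```

Under this rule a client that is 8 frames ahead stalls with a window of 8 but keeps running with a window of 10, at the price of a longer worst-case resim when the late input finally arrives.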

MaxCWhitehead commented 1 year ago

I know you posted about allocator / clone on discord, haven't had a chance to dig into it yet. Looking at how ComponentStore and Resources are set up is a good first approach. Having them allocated from a pool would mean fitting more in each cache line and loading less memory to clone all of this. Seems like components of same type are at least stored contiguously (at a glance anyway, not sure, didn't look in depth). But if there are lots of different types of components / lots of resources, that's still a lot to clone.

I think the first step is to look at how many component stores and resources there are to clone, then see how long a clone takes for each of those. Maybe we can add scopes and instrument pieces of the clone to get a breakdown of costs.
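A minimal sketch of that kind of per-store breakdown, using plain `std::time::Instant` rather than profiler scopes. `Store` is a hypothetical stand-in for a type-erased component store or resource, not the bones_ecs type.

```rust
use std::time::{Duration, Instant};

/// Stand-in for a type-erased component store or resource (hypothetical).
#[derive(Clone)]
struct Store {
    name: &'static str,
    bytes: Vec<u8>,
}

/// One row of the breakdown: store name, size in bytes, time spent cloning.
type Timing = (&'static str, usize, Duration);

/// Clone each store separately and record how long each clone took.
fn clone_with_timings(stores: &[Store]) -> (Vec<Store>, Vec<Timing>) {
    let mut cloned = Vec::with_capacity(stores.len());
    let mut timings = Vec::with_capacity(stores.len());
    for store in stores {
        let start = Instant::now();
        let copy = store.clone();
        timings.push((store.name, store.bytes.len(), start.elapsed()));
        cloned.push(copy);
    }
    // Slowest first, so the expensive stores stand out.
    timings.sort_by(|a, b| b.2.cmp(&a.2));
    (cloned, timings)
}
```

Printing the sorted `timings` next to the byte sizes would show both how many stores there are and which ones dominate the clone.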

Before we start optimizing memory storage, we may want to consider whether what is in the world is actually necessary for the ggrs snapshot. The current method is definitely simpler, but we could also separate the game state/components required for simulation / ggrs snapshot from auxiliary stuff, for example the NetworkDebug resource I use for the network debugging tool. It has a pretty massive buffer of frame data for the graph, and is definitely not needed in game state. I should probably switch this to a global or something and get it out of the ECS. I bet we have a number of things not actually required for the game that are taking up memory.
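Roughly what that separation could look like, as a sketch only. `SimState`, `DebugData` and `Game` are hypothetical types for illustration; the real world/resource layout is different.

```rust
/// State the simulation depends on; this is what gets snapshotted.
#[derive(Clone, Debug, PartialEq)]
struct SimState {
    frame: u32,
    positions: Vec<(f32, f32)>,
}

/// Debug-only data (hypothetical stand-in for something like NetworkDebug).
/// Deliberately not Clone, so it cannot end up in a snapshot by accident.
struct DebugData {
    frame_history: Vec<u32>,
}

struct Game {
    sim: SimState,
    debug: DebugData,
}

impl Game {
    /// Only the simulation state is cloned for a save.
    fn save(&self) -> SimState {
        self.sim.clone()
    }

    /// Restoring a snapshot leaves the debug data untouched.
    fn load(&mut self, snapshot: SimState) {
        self.sim = snapshot;
    }
}
```

The cost of a save then scales with the simulation state only, however large the debug buffers grow.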

Pulling stats on size of component stores/resources may also help, possible some are bigger than we expect. I'll probably spend some time on these things soon here, tho I did start rabbithole-ing on some optimizations for tilemap collision resolution in advance/update_kinematics, tho this may not be fruitful haha, will see if I pull the plug on that. It is one of the larger costs of the game step tho.

zicklag commented 1 year ago

> I am considering maybe we should allow 10 frames or something, the impact seems like a net positive.

That could be a good idea. Theoretically, if the CPU performance is good enough, increasing the window allows us to handle greater network latency without freezing, which is good for people with worse networks.

> Seems like components of same type are at least stored contiguously (at a glance anyway, not sure, didn't look in depth)

Yes, each component type is stored contiguously with other components of the same type. For every component type we also have a bitmap stored in a Vec and (currently) taking up 512 bytes. The bitset has a fixed size and allows up to 4096 entities by default.
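The 512 bytes follow from one bit per entity slot: 4096 / 8 = 512. A minimal sketch of such a fixed-size bitset, illustrative only and not the actual bones_ecs type:

```rust
/// Default entity capacity mentioned above.
const MAX_ENTITIES: usize = 4096;
const BITS_PER_WORD: usize = 64;

/// Fixed-size bitset with one bit per entity slot (illustrative type).
#[derive(Clone)]
struct FixedBitSet {
    words: Vec<u64>,
}

impl FixedBitSet {
    fn new() -> Self {
        Self { words: vec![0; MAX_ENTITIES / BITS_PER_WORD] }
    }

    fn set(&mut self, entity: usize) {
        self.words[entity / BITS_PER_WORD] |= 1u64 << (entity % BITS_PER_WORD);
    }

    fn contains(&self, entity: usize) -> bool {
        self.words[entity / BITS_PER_WORD] & (1u64 << (entity % BITS_PER_WORD)) != 0
    }

    /// Bytes copied on every clone, regardless of how many bits are set.
    fn size_in_bytes(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}
```

Since the size does not depend on how many entities actually have the component, a sparsely used store pays the full 512 bytes on every clone.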

I was thinking of replacing the fixed-size bitsets with a compressed Roaring bitmap instead, so that it only takes up as much room as necessary for each component, which might speed up copies and maybe iteration, too.

> Pulling stats on size of component stores/resources may also help, possible some are bigger than we expect.

That's a good idea. I'm reworking some of the stuff in bones ECS for the new asset system which will include a way to get complete details of all the components' memory layout/fields, and we could use that to make a GUI inspector for the bones world.

> I'll probably spend some time on these things soon here

Just a warning, I'm touching a lot of the bones ECS in the reflection branch right now. If you want to mess with stuff in bones_ecs to see how we might be able to get perf gains that's fine, but maybe don't spend too much time since some details will be changed with the new stuff I'm working on, and I wouldn't want your work to get wasted.

> I did start rabbithole-ing on some optimizations for tilemap collision resolution in advance/update_kinematics,

Feel free to take a hack at it! Most of the game and bones, while coded with some performance perspective in mind, haven't really been profiled or optimized much, so there might be great improvements lying around that we haven't thought of yet.

MaxCWhitehead commented 1 year ago

My plan was to make simple / hacky local changes to get some numbers, not do any real memory / debugging work. I know Tracy supports memory profiling, but I'm unsure what the required instrumentation looks like. It could be useful, but in-game stuff in our own tooling makes good sense to me; the memory layout stuff sounds useful.

Thanks for the heads up - won't start any projects in core systems without touching base.

MaxCWhitehead commented 1 year ago

With #848, we are now getting a frame involving six simulation steps at 33ms (vs 4 steps at 30ms in original post here). Progress! Average time of AdvanceWorld now reduced to ~2.7ms from ~4.5ms.

MaxCWhitehead commented 1 year ago

Here is a high-level breakdown of the costs of save game state:

These percentages vary a bit in each call, but the bulk is cloning components; resources are cheap. Acquiring lock / copying are stats I added here.

What is interesting is that "copy" stat is large. We clone our world into an option that is passed to ggrs save func, which then assigns this option into ggrs state.

This was captured in dev-optimized. In a release build the situation is not so bad, but the ratio of time between the two is similar.

I would expect only one clone, written directly to the destination via move semantics, i.e. the compiler optimizing the copy away. However, I'm guessing the Mutex lock in ggrs prevents this from happening, probably forcing an ordering such that we clone before the lock is acquired, vs waiting for the lock and then cloning directly into the target memory. Not sure if that is accurate, but it's a guess. Not an expert on this lol.
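To make the two orderings concrete, a sketch with `std::sync::Mutex` standing in for the ggrs state cell. `World`, `save_via_temp` and `save_in_place` are hypothetical, and whether the second shape is even possible through the real ggrs save path is an open assumption.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the cloned game world.
#[derive(Debug, PartialEq)]
struct World {
    data: Vec<u8>,
}

impl Clone for World {
    fn clone(&self) -> Self {
        World { data: self.data.clone() }
    }

    /// Reuses the destination's existing buffer where possible.
    fn clone_from(&mut self, source: &Self) {
        self.data.clone_from(&source.data);
    }
}

/// Shape described above: clone into a temporary Option first, then lock
/// the cell and assign the temporary into it.
fn save_via_temp(cell: &Mutex<Option<World>>, world: &World) {
    let snapshot = Some(world.clone());
    *cell.lock().unwrap() = snapshot;
}

/// Alternative: take the lock first, then clone straight into the slot,
/// reusing the previous snapshot's allocation when there is one.
fn save_in_place(cell: &Mutex<Option<World>>, world: &World) {
    let mut slot = cell.lock().unwrap();
    match slot.as_mut() {
        Some(existing) => existing.clone_from(world),
        None => *slot = Some(world.clone()),
    }
}
```

Both end with the same stored state; the second just avoids building a temporary outside the lock. Whether that accounts for the "copy" stat here would need measuring.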

Probably ok to just focus on component cloning.

MaxCWhitehead commented 1 year ago

The good news: we are much faster in release builds! I think the difference in optimization between jumpy and ggrs in dev-optimized is actually adding a lot of overhead. Here we see a 7 step frame is 16.7ms! I think we are actually in better shape here than I thought.

zicklag commented 1 year ago

I had another thought about improving the performance of tile_collision_filtered in the fall-through step:

We're using the rapier query pipeline to do an ad-hoc collision test, but we've also got rapier sending us start-colliding and stop-colliding events, so I wonder if we could make use of those events more efficiently. I bet there are probably some algorithm improvements that could be made.
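One way the event idea could look, as a sketch: keep a set of currently touching pairs up to date from start/stop events, and answer "are these two touching?" with a lookup instead of a fresh query. `ContactEvent` is a hypothetical type only loosely modelled on start/stop collision events, and the `u32` ids stand in for collider handles.

```rust
use std::collections::HashSet;

/// Hypothetical event shape: two collider ids started or stopped touching.
enum ContactEvent {
    Started(u32, u32),
    Stopped(u32, u32),
}

/// Set of collider pairs that are currently in contact.
#[derive(Default)]
struct Contacts {
    pairs: HashSet<(u32, u32)>,
}

impl Contacts {
    /// Order the pair so (a, b) and (b, a) map to the same key.
    fn key(a: u32, b: u32) -> (u32, u32) {
        if a <= b { (a, b) } else { (b, a) }
    }

    /// Update the set from one event.
    fn apply(&mut self, event: &ContactEvent) {
        match *event {
            ContactEvent::Started(a, b) => {
                self.pairs.insert(Self::key(a, b));
            }
            ContactEvent::Stopped(a, b) => {
                self.pairs.remove(&Self::key(a, b));
            }
        }
    }

    /// Lookup instead of an ad-hoc collision query.
    fn touching(&self, a: u32, b: u32) -> bool {
        self.pairs.contains(&Self::key(a, b))
    }
}
```

One caveat with rollback: a set like this is simulation state, so it would have to be saved and restored with the snapshot or rebuilt after a rewind.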

I mean, rapier has a whole physics dynamics system that we're not using, but it's built on the same collision system we're using, and it can run thousands of objects at a time at 60fps, so there's probably a lot of room to improve this, and maybe ways to make the usage nicer or more efficient, too.

For an example of API niceness that could be better or maybe have a more efficient solution, to detect if something is resting up against a wall or the floor, we have this weird trick where we create a collider expanded by a pixel on each side and then see if that collider is intersecting with map tiles.

https://github.com/fishfolk/jumpy/blob/main/core/src/elements/crate_item.rs#L210-L221
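For readers who don't want to open the link, the trick looks roughly like this with plain axis-aligned boxes. This is a sketch of the idea, not the code at that URL; `Aabb` and `is_resting_against` are hypothetical.

```rust
/// Axis-aligned box in pixel units (hypothetical helper type).
#[derive(Clone, Copy)]
struct Aabb {
    min_x: f32,
    min_y: f32,
    max_x: f32,
    max_y: f32,
}

impl Aabb {
    /// Grow the box by `margin` on every side.
    fn expanded(self, margin: f32) -> Aabb {
        Aabb {
            min_x: self.min_x - margin,
            min_y: self.min_y - margin,
            max_x: self.max_x + margin,
            max_y: self.max_y + margin,
        }
    }

    /// Strict overlap test: boxes that only share an edge do not intersect.
    fn intersects(self, other: Aabb) -> bool {
        self.min_x < other.max_x
            && self.max_x > other.min_x
            && self.min_y < other.max_y
            && self.max_y > other.min_y
    }
}

/// A body counts as resting against the map if its box, grown by one pixel,
/// overlaps any solid tile.
fn is_resting_against(body: Aabb, tiles: &[Aabb]) -> bool {
    let probe = body.expanded(1.0);
    tiles.iter().any(|tile| probe.intersects(*tile))
}
```

A body sitting exactly on a tile shares an edge with it, so the plain overlap test is false and only the one-pixel probe reports the contact.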