fishfolk / jumpy

Tactical 2D shooter in fishy pixels style. Made with Rust-lang 🦀 and Bevy 🪶
https://fishfolk.org/games/jumpy/
Other
1.59k stars 116 forks source link

Desync in network play #964

Closed MaxCWhitehead closed 2 months ago

MaxCWhitehead commented 3 months ago

Description

If I boot up 3-4 clients (4 is more frequent) can pretty easily desync one or multiple of them with simple gameplay.

The client performance is terrible with so many locally, and my ping is high, so this is kind of worst case condition, but something seems to be wrong here. Going to try using some of the ggrs tools to track this down. This is no main without any of my local WIP stuff.

To Reproduce

  1. Boot up 4 clients
  2. Start online match with all 4
  3. Play around, switch between clients, observe desync.

Expected Behavior

No response

Additional Context

EDIT: I reproduced this at ca53ba98d747f3438728448dd82ec4d281e434b5 (before package + physics updates), so can probably rule out some of these recent changes. May be a more fundamental issue.

Log Messages

No response

zicklag commented 3 months ago

It may be worth setting up component hashing on Bones Schema so that we can produce hashes of the world snapshots for comparison.

I can't remember how deeply I integrated it, but we already have a hash_fn() that exists on schemas, but it's not automatically implemented for hashable types in the HasSchema derive macro.

We'd need to do something like a #[schema(derive_hash)] or something like that to opt-in to hashing the type, and then it would add a Hash bound to the struct, and use that to implement the hash_fn for the schema. We might also need to add a RawHash trait, since it's essentially the same deal as the clone implementation, it would just be opt-in instead of opt-out like clone.

Then we can implement a function on the World that will loop through all it's components data and hash all the ones that have a hash_fn, ignoring the ones that don't. This means that any component that can't be hashed, won't be detected by the de-sync checker, but if one of the components that is hashed is different between two game runs, then we'll see it in debugging.

MaxCWhitehead commented 3 months ago

Great info, thanks. Yeah I started looking at producing hash/checksum we can use with ggrs to benefit from their desync detection (ggrs supports passing this in with the saved cell, and will check for desyncs between players at specified interval).

I hacked in a callback user can specify to generate hash (in lieu of global solution), and hashed only player transforms, found desync'd on 2nd frame (even in 2 player match, apparently)

2024-04-08T01:21:43.764751Z ERROR bones_framework::networking: Network de-sync detected frame=2 local_checksum=58697 remote_checksum=13528 player=1

Added a debug log in networking to dump log inputs to be activated with RUST_LOG=bones_framework::networking=debug cargo run

and found that one player gets:

2024-04-08T01:21:43.733934Z DEBUG bones_framework::networking: Net player(0) input: DensePlayerControl { .0: 66560, jump_pressed: false, shoot_pressed: false, grab_pressed: false, slide_pressed: false, ragdoll_pressed: false, move_direction: DenseMoveDirection(Vec2(0.0, 0.0)) }
2024-04-08T01:21:43.733946Z DEBUG bones_framework::networking: Net player(1) input: DensePlayerControl { .0: 66560, jump_pressed: false, shoot_pressed: false, grab_pressed: false, slide_pressed: false, ragdoll_pressed: false, move_direction: DenseMoveDirection(Vec2(0.0, 0.0)) }

Other gets:

2024-04-08T01:21:43.097494Z DEBUG bones_framework::networking: Net player(0) input: DensePlayerControl { .0: 66560, jump_pressed: false, shoot_pressed: false, grab_pressed: false, slide_pressed: false, ragdoll_pressed: false, move_direction: DenseMoveDirection(Vec2(0.0, 0.0)) }
2024-04-08T01:21:43.097507Z DEBUG bones_framework::networking: Net player(1) input: DensePlayerControl { .0: 0, jump_pressed: false, shoot_pressed: false, grab_pressed: false, slide_pressed: false, ragdoll_pressed: false, move_direction: DenseMoveDirection(Vec2(-1.0, -1.0)) }

DenseMoveDirection is different on match start for one player, possibly a bug in input management. Tracking that down now.

On the note of advanced desync debugging: ggrs has a SyncTestSession that will rollback-resim every frame, and compare original hash with re-simulated hash. We could also take the state of last confirmed frame not desync'd after detection in normal play, dump the inputs for future frames that desync'd, and run a sync test session and local client can test for non-determinism (if re-simulating gives different results with same input).

We could even do something fancy in which we manage data structure that breaks down hashes by subdividing the data, like hash only resources or only components, drill down by type, and compare those to determine what exact data is non-deterministic. Haven't really thought that through and don't need a real solution yet. Full world hashing would be a great first step. But at least this issue seems to have some clues without getting too far in the weeds.

MaxCWhitehead commented 3 months ago

Ah so it seems that on first frame when no previous input, GGRS provides a blank / predicted input of dense input type that is bytemuck::zeroeable::zeroed()

DenseMoveDir encodes vec value between -1 and 1 as quantized integer, so 0b000000 is -1 and 0b111111 is 1.

So predicted input at first frame is not no input.

MaxCWhitehead commented 3 months ago

Perhaps it would make sense to consider if supporting Default trait in ggrs would be a good solution.

Otherwise we could quantize each move dir value between 0 and 1 with 5 bits, and then include a sign bit, as the range is symmetric. Should be the same precision I think, but zero'd would be 0.

MaxCWhitehead commented 3 months ago

Opened a PR that changes representation so zeroed is no input: https://github.com/fishfolk/bones/pull/380

Is a possible solution, seems to fix this bug. Though I still can easily desync clients in 4 player matches, so there are other issues at play.

MaxCWhitehead commented 2 months ago

I opened a draft PR in GGRS that seems to fix a isolated test case I setup.

In smoke testing with 3+ clients in Jumpy I am no longer reproducing desyncs between players. Definitely some more work to do before this would be ok to merge upstream + it needs some review for sure. Will test jumpy a bit more in depth next couple of days and see if can repro any other desync, but barring more issues I'll likely put us on a fork so can validate for a bit and see if any other issues pop up.

https://github.com/gschup/ggrs/pull/77

MaxCWhitehead commented 2 months ago

967 should fix the remaining desync issues (using the fix I linked above on fork of ggrs), no longer reproducing this, gonna close it.

Regarding desync detection - opened a issue tracking that on bones in the meantime until I get to wrapping that up.

zicklag commented 2 months ago

Super awesome, this is a really big deal! 🎉

I'd never actually gotten the game not to de-sync after some playing, so this is exciting to finally not be able to re-produce de-syncs.