foxssake / netfox

Addons for building multiplayer games with Godot
https://foxssake.github.io/netfox/
MIT License
399 stars 16 forks source link

Server-authoritative multiplayer without network-wide rollback #197

Closed bryanmylee closed 1 month ago

bryanmylee commented 9 months ago

Server-authoritative multiplayer with local synchronization

In our multiplayer architecture, we take advantage of the fact that game state will always have to flow through an authoritative server. If state is only computed on the server based on valid player inputs, we can eliminate an entire class of exploits, specifically game state manipulation. This is known as server-authoritative multiplayer.

The server will only accept input information from player clients e.g. their input direction vector, aim vector, and other input events. These input events will be used to update state in a trusted manner before being broadcasted to all peers.

However, this introduces latency between user input and the player character moving locally. To solve this, we optimistically update local state on the client. When the input is acknowledged by the server and a verified state response is received on the client, we only need to verify that local state for the current character is consistent enough to continue as usual. Otherwise, we simply snap state back locally.

We call this approach server-authoritative multiplayer with local synchronization, or authoritative local sync.

Differences to server-authoritative multiplayer with input prediction and rollback

Server-authoritative multiplayer is usually implemented with input prediction and rollback. This provides a viable path to making state consistent across the entire network at any given point in time. However, input prediction and rollback comes with many issues, mainly:

  1. vulnerability to lag spikes on any client,
  2. vulnerability to high latency on any client,
  3. inability to scale up for large number of input-sending players, and
  4. requirement for predictable input

For example, assuming P1, P2, P3 have a server latency of 2, 5, and 10 ticks, and that input is unpredictable (thereby always triggering rollbacks), for a single tick:

  1. t=0: P1, P2, P3 send input to the server
  2. t=2: the server receives A1(t=0); server rolls back and predicts past 2 ticks; sends S1(t=0..2) to P2 and P3
  3. t=5: the server receives A2(t=0); server rolls back and predicts past 5 ticks; sends S2(t=0..5) to P1 and P3
  4. t=7: P1 receives S2(t=0..5), P2 receives S1(t=0..2); P1 and P2 roll back past 7 ticks; P1 predicts S2(t=6..7), P2 predicts S1(t=3..7).
  5. t=10: the server receives A3(t=0); server rolls back and predicts past 10 ticks; sends S3(t=0..10) to P1 and P2
  6. t=12: P1 receives S3(t=0..10), P3 receives S1(t=0..2); P1 and P3 roll back past 12 ticks; P1 predicts S3(t=11..12), P3 predicts S1(t=3..12)
  7. t=15: P2 receives S3(t=0..10), P3 receives S2(t=0..5); P2 and P3 roll back past 15 ticks; P2 predicts S3(t=11..15), P3 predicts S2(t=6..15)

The number of processing ticks per network tick is proportional to N x K where N is the number of players and K is the maximum latency between two clients. If latency is unstable or high on any client, all clients suffer. If the number of players increases, the number of rollbacks increase quadratically. If input is dynamic and unpredictable, the percentage of ticks that have to be rolled back increases.

Solving these issues with local synchronization

In our authoritative local sync design, each client only needs to synchronize its own state against the server. A client will never attempt to rollback or synchronize state for another player, accepting the latency of their state as part of the network's limitations. This means lag spikes on one client will never cause a cascading lag spike across all players.

Using the same example above, assuming P1, P2, P3 have a server latency of 2, 5, and 10 ticks, and that input is unpredictable, for a single tick:

  1. t=0: P1, P2, P3 send input to the server
  2. t=2: the server receives A1(t=0) as A1(t=2); sends S1(t=2) to P2 and P3
  3. t=5: the server receives A2(t=0) as A2(t=5); sends S2(t=5) to P1 and P3
  4. t=7: P1 receives S2(t=5) as S2(t=7), P2 receives S1(t=2) as S2(t=7)
  5. t=10: the server receives A3(t=0) as A3(t=10); sends S3(t=10) to P1 and P2
  6. t=12: P1 receives S3(t=10) as S3(t=12), P3 receives S1(t=2) as S1(t=12)
  7. t=15: P2 receives S3(t=10) as S3(t=15), P3 receives S2(t=5) as S2(t=15)

The fundamental difference of authoritative local sync versus input prediction with rollback is that network latency is accepted. This approach keeps the number of processing ticks per network tick proportional to each client's individual latency k, eliminates latency cascades, and ensures that only server-verified state is broadcasted to the network.

Syncing local state on state drift

Because game state is intrinsically inconsistent across the network due to latency, there is a high chance that client state will drift from server state due to inconsistent physics simulations and more. To alleviate this, we sync local state to the server verified state whenever the drift is too high.

A client with server latency k and max history of Q at tick time t will have states[t-Q:t], input history inputs[t-Q:t], and verified_state(t-k).

If state(t-k) is not consistent with verified_state(t-k), then sync state(t) to verified_state(t-k) + inputs[t-k:t].

bryanmylee commented 9 months ago

@elementbound Please correct me if I'm wrong here, my understanding of prediction and rollback is surface level at best so I might have made some mistakes in the description.

bryanmylee commented 9 months ago

My current implementation strategy is:

For an originating tick k on client:

  1. before_tick_loop, for all clients x, gather input for control Ax(k)
  2. on_tick, for all clients x, apply Ax(k) to state Sx(k) to produce predicted state S*x(k+1)
  3. after_tick_loop, for all clients x, send Ax(k) to server

For the server handling the request at tick k1:

  1. for any originating tick j, server receives input Ax(j) and treats it as Ax(k1)
  2. on_tick, for all clients x, server applies Ax(k1) to S(k1) to produce S(k1+1)
  3. after_tick_loop, server broadcasts S(k1+1) to all players x

On all peers handling the response at tick k2:

  1. before_tick_loop, client x receives S(k1+1)
  2. on_tick, client x reads node states Sy(k1+1) for all nodes y in S(k1+1)
  3. on_tick, if x != y, treat Sy(k1+1) as confirmed Sy(k2)
  4. on_tick, if x == y, check S*x(k1+1) against Sx(k1+1)
  5. on_tick, if k1+1 < k2 - Q, tick is too late and we have no history, ignore
  6. on_tick, if S*x(k1+1) is_approx_equal Sx(k1+1), confirm S*x(k1+1) as Sx(k1+1)
  7. on_tick, otherwise, confirm Sx(k1+1) and resimulate S*x(k1+2..k2+1) with Ax(k1+1..k2)
elementbound commented 9 months ago

Thanks @bryanmylee, much appreciated!

One of my concern is coding an older input as current on the server ( e.g. taking Ax(t=0) as Ax(t=2) ). Imagine the following timeline:

  1. P1 sends input A1(t=0) to move right, updates state S1(t=0) = (1, 0)
  2. P1 sends input A1(t=1) to move right, updates state S1(t=1) = (2, 0)
  3. Server receives input A1(t=0), uses as current input A1(t=2), broadcasts state S1(t=2) = (1,0)
    1. Let's say P1 is still going right, sends input A1(t=2), updates state S1(t=2) = (3, 0)
  4. Server receives input A1(t=1), uses as current input A1(t=3), broadcasts state S1(t=3) = (2,0)
    1. Let's say P1 is going north, sends input A1(t=3), updates state S1(t=3) = (3, 1)
  5. P1 receives state S1(t=2) = (1, 0), local state for t=2 is (3, 0), no match, let's do a course correct
  6. P1 receives state S1(t=3) = (2, 0), local state for t=3 is (1, 1) after course correction, no match, let's do a course correct

So if I'm right, we would bake in the delay into the simulation, since we get the same state, but with some delay. I think this can be solved by retaining the original time code received from the client, and either:

  1. Apply the client's old input on the simulation's current state, and hope it's not too inconsistent to glitch out
  2. Rewind the game state and apply the player's input

The second option has a "rollback lite" sound to it, while the first one might break down in more complex scenarios with multiple game state affecting things interacting with each other. I'm just guessing though.


For the implementation approach, it makes sense and reflects the plan well. I would prefer to gather and send the inputs for every tick, to avoid issues where people do something "unexpected" with their input classes ( like not using BaseNetInput and rolling their own logic, which might not fit gathering input only at the start of the loop, but still be a valid approach ).

The other is to have the same way to configure synced properties as RollbackSynchronizer does, but I'm lenient on that - feel free to just implement the data that you need for your specific project, I can just add the property config part after.


on_tick, otherwise, confirm Sx(k1+1) and resimulate S*x(k1+2..k2+1) with Ax(k1+1..k2)

I was also considering a delta mechanism here. So, instead of resimulating ticks, we could also store deltas between local states. Then, when we receive a new state from the server, we can just apply all the deltas to get an up-to-date, predicted state. This would still incur some kind of rollback, but I expect it to be way faster, since no actual simulation is happening.

The slight drawback would be that it would need a class similar to Interpolators, that would have a type-based registry on how to calculate deltas and apply deltas. But I think that's ok for this project.

This way we could also just remember the latest authoritative tick received from the server, and if we receive anything older than that, just ignore. This would also mean that we can forget delta- and input history that's older than the state received from the server.

Let me know what you think!

bryanmylee commented 9 months ago

@elementbound Thanks for the feedback, and yeah I realized during implementation that broadcasting Sx(k1+1) to players means that the original client would have to reconcile state for k1+1 with input A(k) and state S(k).

For the updated approach, the server handling the request at tick k1:

  1. for any originating tick k, server receives input Ax(k) and treats it as Ax(k1)
  2. on_tick, for all clients x, server applies Ax(k1) to S(k1) to produce S(k1+1)
  3. after_tick_loop, for all clients x, for other clients y, server broadcasts Sx(k1+1) to y but Sx(k+1) to x

On all peers handling the response at tick k2:

  1. before_tick_loop, client x receives either S(k+1) or S(k1+1)
  2. on_tick, client x reads node states Sy(k1+1) for all nodes y in S(k1+1) and Sx(k+1) for itself
  3. on_tick, if x != y, treat Sy(k1+1) as confirmed Sy(k2)
  4. on_tick, if x == y, check *`Sx(k+1)againstSx(k+1)`**
  5. on_tick, if k+1 < k2 - Q, tick is too late and we have no history, ignore
  6. on_tick, if S*x(k+1) is_approx_equal Sx(k+1), confirm S*x(k+1) as Sx(k+1)
  7. on_tick, otherwise, confirm Sx(k+1) and resimulate S*x(k+2..k2+1) with Ax(k+1..k2)

I've managed to get state replication to work well on the server and other clients, but the biggest issue I have right now is getting the local client to sync properly and getting single-fired events like "jump" to be recognized consistently on the server.

Sometimes, the players jitter rapidly for extended periods, which makes me think there's some time sync issues.

I'm also not sure how to integrate NetworkRollback into my approach and I might be doing the resimulations incorrectly. I basically run resimulations by running my on_tick function multiple times:

# verification_synchronizer.gd

func _ready() -> void:
    NetworkTime.before_tick.connect(# ... records input for local
    NetworkTime.on_tick.connect(_run_tick)
    NetworkTime.after_tick.connect(# ... run the rpcs

func _run_tick(delta: float, tick: int) -> void:
    # Set state to tick `t`
    var state := _get_latest_until_tick(_states, tick)
    PropertySnapshot.apply(state, _property_cache)
    # Set input to tick `t`
    var input := _get_latest_until_tick(_inputs, tick)
    PropertySnapshot.apply(input, _input_property_cache)

    # Applying input `t` to state `t`
    for node in _nodes:
        node._verified_tick(delta, tick)

## Only gets called on the clients, never on the server.
@rpc("unreliable_ordered", "call_remote")
func _submit_confirmed_state(state: Dictionary, tick: int) -> void:
    if tick < NetworkTime.tick - NetworkRollback.history_limit and _latest_confirmed_state_tick >= 0:
        _logger.debug("Ignoring state %s older than %s frames" % [tick, NetworkRollback.history_limit])
        return

    if state.is_empty():
        _logger.warning("Received invalid state for tick %s" % [tick])
        return

    # I've defined a separate input root to help determine authority over each player.
    # For not-my-player, just sync the received state naively. This works great.
    if not input_root.is_multiplayer_authority():
        _latest_confirmed_state_tick = NetworkTime.tick
        _states[NetworkTime.tick] = PropertySnapshot.merge(_states.get(NetworkTime.tick, {}), state)
        return

    if tick <= _latest_confirmed_state_tick:
        #_logger.warning("Already received confirmed state for tick %s" % [tick])
        return

    _latest_confirmed_state_tick = tick
    # Compare S*x(k+1) to Sx(k+1) (`state` passed into this RPC)
    var predicted_state = _states.get(tick, {}) # S*x(k+1)
    var invalid_server_state := {}
    for property in predicted_state:
        var value = state[property]
        var predicted_value = predicted_state[property]
        # _verifiers just checks that a is_approx_equal b with some margin.
        var is_property_valid = _verifiers[property].call(value, predicted_value)
        if not is_property_valid:
            invalid_server_state[property] = value

    # Apply invalid server state and resimulate to current tick.
    if not invalid_server_state.is_empty():
        # Apply state from server and resimulate to current tick.
        _states[tick] = PropertySnapshot.merge(_states.get(tick, {}), invalid_server_state)
        # For NetworkTime.tick = t, we will already be processing state t and action t to produce state t+1 later on `NetworkTime.on_tick`.
        # If we receive a confirmed state tick = k, we need to produce state k+1 up to state t non-inclusive.
        if tick < NetworkTime.tick:
            # Apply actions from k up to NetworkTime.tick non-inclusive.
            NetworkTime.before_tick_loop.emit()
            for resim_tick in range(tick, NetworkTime.tick):
                _run_tick(NetworkTime.ticktime, resim_tick)
            NetworkTime.after_tick_loop.emit() # clears jump input from simulations.

On my input node:

func _ready() -> void:
    set_process(is_multiplayer_authority())
    set_process_input(is_multiplayer_authority())
    if is_multiplayer_authority():
        NetworkTime.before_tick_loop.connect(_gather_input)
        NetworkTime.after_tick_loop.connect(_clear_input)

func _gather_input() -> void:
    is_crouching = Input.is_action_pressed("mod_crouch")
    is_running = Input.is_action_pressed("mod_run")
    direction = Input.get_vector("move_left", "move_right", "move_forward", "move_backward")

func _clear_input() -> void:
    just_jumped = false

func _process(_delta: float) -> void:
    if Input.is_action_just_pressed("jump"):
        just_jumped = true
bryanmylee commented 9 months ago

I think I've figured out the issue somewhat.

Verified state is sent from server to client at a consistent tick rate. This works fine if input messages are also received on the server at a consistent tick rate to be processed before each server-to-client tick.

There's no issue if server-to-client state sync messages are inconsistent since each message has a tick number attached and the client verifies the state by looking into history.

However, if client-to-server input messages have any latency variance, it causes the server's state computation to be incorrect because the inputs between client and server mismatch. If server-to-client state sync messages are sent in this period, we run into the sync issues mentioned before.

I'm going to test two solutions:

  1. only send server-to-client state sync messages if a new input message is received. I already track a _latest_received_input_tick for the tick number of server-to-original-client sync messages. I'll add a _latest_confirmed_input_tick to only send messages when _latest_received_input_tick increments.
  2. perform a localized rollback on the server for a player x whenever inputs for x are received. Unlike the current implementation where rollback occurs across the entire game state, this should minimize the cascading effects of having one player rollback the entire game.
elementbound commented 9 months ago

Thanks @bryanmylee, keep us posted!

bryanmylee commented 8 months ago

Just a quick update since it's been awhile, but I never got this working properly and had to move on with other features of my project :/

elementbound commented 8 months ago

No worries, I'll keep this issue open, in case I can get around to working on something similar

TheYellowArchitect commented 2 months ago

only send server-to-client state sync messages if a new input message is received. I already track a _latest_received_input_tick for the tick number of server-to-original-client sync messages. I'll add a _latest_confirmed_input_tick to only send messages when _latest_received_input_tick increments.

Send only new inputs, and the server extrapolates existing inputs for ticks which it hasn't received? I have implemented this system for my personal rollback system. I can confirm it works, but it increases code complexity by a lot, as every input must have a flag extrapolated, and every input property must have an "is equal" cap value. For example if your aim.x is 0.000020 and new aim is 0.0000022, you don't want to send a new input as the difference is negligent. There is quite the code complexity with input extrapolation, and the code currently is too clean.

The results are fruitful, as more than 50% extrapolations (movement predictions of last movement) end up accurate, than corrected via rollback.

The second option has a "rollback lite" sound to it, while the first one might break down in more complex scenarios with multiple game state affecting things interacting with each other. I'm just guessing though.

Latency should never become part of an input, the tick timestamp is enough for it. Also input delay should help most scenarios of the OP.