boardgameio / boardgame.io

State Management and Multiplayer Networking for Turn-Based Games
https://boardgame.io
MIT License

Split client/master reducers #950

Open shaoster opened 3 years ago

shaoster commented 3 years ago

Context

While attempting to set up the additional client/server plumbing to enable https://github.com/boardgameio/boardgame.io/issues/723, I ran into a handful of issues with the underlying data model that make life pretty hard. These issues all seemed to stem from the divergent data flows in the various boardgame.io runtime "modes". Roughly speaking, they are as follows:

  1. Single-player local setup runs through a Client instance that owns its own redux-managed state machine. Actions resolve synchronously and the results of dispatches are seen directly in the client. This setup is easy to augment with callback plumbing, but...

  2. Multiplayer remote setups run in a split fashion between a client and master. While the client and master end up running the same reducer code, they maintain substantially different state machines and encounter substantially different actions during the course of a game. This is made more complex in cases where the client optimistically applies updates to its own state machine, but then may end up diverging from that state upon receiving a patch/update from the master. Correlating action results with triggering actions becomes quite complicated in this setup without adding a bunch more actions whose semantics diverge between the client and master.

    Per https://github.com/boardgameio/boardgame.io/issues/950#issuecomment-874191076, assigning correlation ids for actions upon creation would be an improvement regardless of the proposed architectural changes; the splitting of master and client reducers merely makes the action lifecycle easier to reason about.

  3. In addition to divergent action semantics between the client and master, the separate client/master wrappers for the core redux reducer lead to some fairly suboptimal duplicated logic. For example, multiplayer "local" setups should still be able to run with super-user permissions that are currently enforced in the master wrapper layer rather than in the core reducer, yet the reducer still has a number of check-or-noop instances for these kinds of cases, presumably to avoid corrupting game state.

    Per https://github.com/boardgameio/boardgame.io/issues/950#issuecomment-874191076, we don't need to support this debug/super-user behavior for local multiplayer, so we may be able to just delete much of the duplicated master/reducer validation logic.

This issue was created to make it easier to discuss/review the upcoming PRs intending to address these items.

Proposal

  1. Implement a clearer separation of concerns between the master and client state machines by forking their reducers and eliminating duplicated responsibilities between the client and master abstractions. In particular:
     a. The client wrapper should forward all "actions" to a master rather than processing state transitions itself. The client's reducer/state machine should be focused purely on messaging/error semantics and on staying synchronized with the master's state. G/ctx updates should exclusively be the result of updates from the master.
     b. The master wrapper should be responsible for pretty much all of the existing reducer logic, less the client-only actions.
  2. Clean up/formalize the "debug" actions so that they can be processed on the new master when configured.
  3. Refactor/eliminate much of the master wrapper logic into more concise middleware layers.
  4. Re-plumb the existing user-exposed APIs against the new setup where there is always a master. Local prototyping/debug workflows will use a local master with a local transport. We may need to be careful to mitigate breaking changes for existing users that depend on synchronous dispatches for their single-player games.

    Per https://github.com/boardgameio/boardgame.io/issues/950#issuecomment-874191076, this isn't necessarily a documented contract we need to preserve, so we could be a bit more flexible in terms of how we proceed, and we could also choose to preserve this behavior by default for local multiplayer, perhaps behind a flag?

  5. Extend the plugins API to handle the master/client separation. (This is probably going to be pretty hard, given the flexibility of the existing API)
delucis commented 3 years ago

Hi @shaoster — sorry it took me so long to find time to give you some feedback. In general this all sounds solid. If I follow, we’d basically be formalising the debug powers of the client in the master (presumably behind some isLocal flag), so that the reducer is always run inside a master and a client is always just communicating with a master (even if the master is autoinstantiated internally by the client in single-player situations).

In general it all sounds good — so I’d suggest tackling whichever part you’d like to start with and seeing how it goes. Here are a few notes.

multiplayer "local" setups should still be able to run with super-user permissions

If you mean current uses of the Local multiplayer transport — I don’t think this should be the case. That should behave as similarly to the SocketIO instance as possible to serve as a testing ground for multiplayer games without spinning up a server. The extra debugging power is only needed as currently for single-player clients.
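For context, by current uses of the Local transport I mean setups roughly like this (MyGame being a placeholder game definition):

```js
import { Client } from 'boardgame.io/client';
import { Local } from 'boardgame.io/multiplayer';

// Two clients for the same match, wired together through the in-memory
// Local transport, with no server process involved.
const client0 = Client({ game: MyGame, playerID: '0', multiplayer: Local() });
const client1 = Client({ game: MyGame, playerID: '1', multiplayer: Local() });
client0.start();
client1.start();
```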

Correlating action results with triggering actions becomes quite complicated

Regardless of how exactly the architecture ends up, might we want to add some kind of action ID to help track actions? The action creators would generate a random id field in the action object, which the reducer could then spit back out as a “transient” like the errors you added. That way each state update would also report the ID of the action that caused it.
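Very roughly, a sketch of the idea (the payload shape and the actionID/transients field names are illustrative, not the current implementation):

```js
// Hypothetical action creator: attach a random correlation id at creation time.
function makeMove(type, args, playerID) {
  return {
    type: 'MAKE_MOVE',
    payload: { type, args, playerID },
    actionID: Math.random().toString(36).slice(2), // any random id scheme would do
  };
}

// Hypothetical reducer wrapper: echo the triggering action's id back out as a
// transient on the resulting state, alongside any error transients.
function withActionID(reducer) {
  return (state, action) => {
    const next = reducer(state, action);
    return { ...next, transients: { ...next.transients, actionID: action.actionID } };
  };
}
```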

We may need to be careful to mitigate breaking changes for existing users that depend on synchronous dispatches for their single-player games.

Yes. Although the flux model is not strictly synchronous in any case: the model currently is not “X then Y”, it’s “Request X… State Update then Y”. That said, you’re right that things currently run synchronously locally, which means something like moves.A(); console.log('after A') will first complete the move, fire the state update and then log “after A”. That might change, but maybe it doesn’t have to with a local master?
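To make the compatibility concern concrete, this is the kind of single-player code that relies on the synchronous behaviour today (MyGame is a placeholder; getState() is the vanilla client's accessor):

```js
import { Client } from 'boardgame.io/client';

const client = Client({ game: MyGame });
client.start();

// Today, for a single-player client, the move is applied synchronously:
// by the time the next line runs, the state already reflects move A.
client.moves.A();
console.log(client.getState().G);

// With an always-present (even if local) master, code like this might
// instead need to wait for the update, e.g. via a subscription or the
// proposed promise-returning move dispatchers.
```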

shaoster commented 3 years ago

Sounds great. I'll probably have some more time next week to give this a shot.

shaoster commented 3 years ago

@delucis: Continuing work on this beginning today. I updated the original issue description with your comments.

shaoster commented 3 years ago

Proposed Updated Data Flow for State-modifying Actions

Client API:

Call Sequence

Client Part I

  1. Client creates a credentialed action.
  2. Client wraps the credentialed action in a master-bound action, perhaps called "SEND_ACTION" and dispatches it to the reducer.
  3. Client middleware (perhaps TransportMiddleware) intercepts this master-bound action and triggers a side effect, forwarding the contained credentialed action to the master instance.
  4. This middleware also assigns a correlation id (perhaps UUID) to this outbound action and registers a callback on the Client instance keyed on the correlation id.
  5. The client reducer finally receives the dispatched action and records the outbound intent in a metadata field of the client state, perhaps as a queue or map from correlation id to outbound client time.
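Roughly, steps 2-4 could look something like the following sketch (SEND_ACTION, TransportMiddleware, and the client.callbacks/client.transport fields are all placeholder names from the description above, not existing APIs):

```js
// Hypothetical client middleware: intercept master-bound actions, assign a
// correlation id, register a callback keyed on that id, and forward the
// inner credentialed action to the master over the transport.
const TransportMiddleware = (client) => (store) => (next) => (action) => {
  if (action.type === 'SEND_ACTION') {
    const correlationID = crypto.randomUUID();
    client.callbacks.set(correlationID, action.callback);
    client.transport.sendAction({ ...action.payload, correlationID });
    // The reducer still sees the action so it can record the outbound
    // intent (correlation id -> outbound client time) in client state.
    return next({ ...action, correlationID, sentAt: Date.now() });
  }
  return next(action);
};
```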

Master

  1. The master instance receives the credentialed action and passes the remaining undo/redo/reset action to the master reducer.
  2. The master reducer performs the requested state update if credentials and authorizations (i.e. debug mode enabled/correct player) are legit.
  3. The master sends an update/patch action to the client with an ActionResult nested field added to the state transients field. This ActionResult field can contain a coded error payload, a serialized response, and the original correlation id for the action.
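As a sketch, the update/patch the master sends back might carry a payload shaped roughly like this (all field names illustrative):

```js
// Illustrative shape of a master -> client update/patch carrying an
// ActionResult transient correlated with the triggering action.
const update = {
  type: 'PATCH',
  state: {
    // ...patched G/ctx/plugins fields...
    transients: {
      actionResult: {
        correlationID: 'a1b2c3', // id assigned by the client middleware
        error: null,             // or a coded error payload
        response: undefined,     // optional serialized response
      },
    },
  },
};
```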

Client Part II

Case 1, Receive Update within timeout

  1. Client receives the update/patch action, dispatches it to the client reducer.
  2. LogMiddleware will expose these updates to the log, but not modify the update/patch.
  3. An additional middleware (perhaps an updated SubscriptionMiddleware), upon finding that the update/patch contains a transients.actionResult field, triggers the specific callback registered to the original correlation id. Note: I'm punting on the design of how that callback interacts with the existing subscription to trigger UI updates/promise resolution/synchronous waits, since with the correlation->callback map, it seems we can do almost anything we want here depending on what API we want to expose to the developer. Regardless of the trigger mechanism, the callback is then removed from the client's callback map.
  4. The client reducer finally sees the update/patch and applies the state modification. Additionally, if there is a transients.actionResult field, the correlation id for the action is dropped from the queue/map.
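A sketch of the callback-resolution step (again, SubscriptionMiddleware and the callback map are placeholder names):

```js
// Hypothetical client middleware: when an update/patch carries an
// actionResult transient, resolve and discard the callback registered
// under that correlation id.
const SubscriptionMiddleware = (client) => (store) => (next) => (action) => {
  const result = next(action);
  const actionResult = action.state?.transients?.actionResult;
  if (actionResult && client.callbacks.has(actionResult.correlationID)) {
    const callback = client.callbacks.get(actionResult.correlationID);
    client.callbacks.delete(actionResult.correlationID);
    callback(actionResult); // resolve a promise, trigger a UI update, etc.
  }
  return result;
};
```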

Case 2, No update/patch received within timeout

  1. Client registers a recurring job (strategy/duration might be client configurable; see implementation notes at https://github.com/boardgameio/boardgame.io/issues/950#issuecomment-879330516) that checks the outbound client timestamps for each action against the client's current local time. If any outbound timestamp is older than the timeout, the job dispatches a "timeout" action to the client reducer containing an action body with the retired action's correlation id.
  2. A client middleware (perhaps a new one?) sees the dispatched timeout and triggers a timeout exception on the corresponding callback and drops the callback.
  3. The client reducer sees the timeout and drops the retired action's correlation id from the outbound queue/map.
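And a minimal sketch of the timeout sweep (the interval, timeout, and the outbound map on client state are all illustrative):

```js
// Hypothetical recurring job: retire outbound actions older than a timeout
// by dispatching a TIMEOUT action keyed by correlation id.
const ACTION_TIMEOUT_MS = 10000;
setInterval(() => {
  const now = Date.now();
  for (const [correlationID, sentAt] of store.getState().outbound) {
    if (now - sentAt > ACTION_TIMEOUT_MS) {
      store.dispatch({ type: 'TIMEOUT', payload: { correlationID } });
    }
  }
}, 1000);
```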
delucis commented 3 years ago

Thanks Philip, here are some notes

Client API

Call Sequence

Otherwise I think this all looks good unless I’ve missed something! Look forward to seeing the PR.

vdfdev commented 3 years ago

" G/ctx updates should exclusively be the result of updates from master."

Does that mean that the client will stop optimistically updating the state in multiplayer games? I'm just worried some games might feel sluggish to the user, like our chess example, which would take up to a second to update the move after the user does the action.

delucis commented 3 years ago

Good question @vdfdev. @shaoster & I have definitely discussed the importance of optimistic updates before, so the intention is to keep that feature, although it’s true I forgot about that while looking at this proposal. Any thoughts on how we could handle that @shaoster?

shaoster commented 3 years ago

That's a great point that I haven't thought through.

The current plan above handles the following cases:

  1. Single player will use a locally-instantiated master and no optimistic updates (or at least no distinction between optimistic and "true" updates). There is negligible performance impact in this case because the transport is local and shimmed.

  2. Local multiplayer works in the same fashion as single player.

  3. Remote multiplayer would register a callback that would be fired upon resolution of the action by the remote master. Per @vdfdev, if the action takes a long time to resolve (or there is a flaky network), existing game implementations may suffer from performance issues.

From an implementation perspective, it's straightforward to retain the existing optimistic behavior in case 3 by also creating a (write-only) local master that synchronously returns the optimistically updated state. However, much of the semantic clarification I want to achieve with this change (and the subsequent Action Result/Error API) becomes quite muddy when having to deal with both "optimistic" and "true" updates in the remote multiplayer case.

While there's a bunch of possible API clarifications that could work, I think the best choice among them really depends on the answers to a couple questions to which I probably have non-representative answers:

I relied entirely on synchronous updates for single player games, and hated the automatic/optimistic updates in multiplayer games, mostly because of the difficulty of attaching side effects, like animations, to state changes that might be corrected later. I was very (and still remain somewhat) confused about how to write my core UI/game logic that is meaningfully reusable between single player and multiplayer setups.

I was pretty inexperienced with the framework when I last built a multiplayer game using bg.io. Are there best/common practices the API should be designed to accommodate? Expanding on the above, I tended to handle optimistic cases manually, so that when there were no "corrections", my side effects were applied as expected. But when there were corrections, they were easy to detect and I could then add the appropriate animations or errors to indicate them.

Finally, while I think it's of paramount importance to define the requirements based on actual use cases per the questions above, here are some options I see for the API:

  1. Formalize synchronous optimistic updates and asynchronous "true" updates as two distinct concepts. These could have either the same (or similar) Action Result/Action Error APIs. (They might not be the same because only "true" updates might have certain error classes like Timeouts or Network Errors). This has the best chance of remaining backwards compatible, but could be quite messy/surprising in corner cases. (Is a "correction" to an optimistic update always an error case?)

  2. Make optimistic updates an opt-in mechanism. Optimistic updates can be retrieved synchronously from a field in a decorated promise returned by an event/move/plugin action (i.e. https://gist.github.com/domenic/8ed6048b187ee8f2ec75; see the sketch after this list). All updates to G/ctx/plugins in the UI are "real". This is probably how I would've designed the API to start with, based on my usage of the API in the games I've developed.

  3. Create some client-level feature flag for what kind of behavior the UI expects. If the feature flag is off, you get a high-fidelity emulation of the legacy behavior with no Action Result/Error/etc... If the feature flag is on, you get a new backwards-incompatible API tailor-made per the answers to the usage questions above.
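A rough sketch of what option 2's call site could feel like (the decorated-promise field name is purely hypothetical):

```js
// Hypothetical opt-in optimistic API: the move dispatcher returns a promise
// decorated with a synchronously computed optimistic result when the move
// could be simulated locally.
const pending = client.moves.A();

if (pending.optimisticState) {
  render(pending.optimisticState.G); // render() is an illustrative UI helper
}

// The "real" update always resolves asynchronously from the master.
pending.then(({ G }) => render(G));
```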

delucis commented 3 years ago

I think we should stick as closely to the current behaviour as possible: for consumer purposes there is no distinction between an optimistic and a server update. In practice, it’s not really a question of the server “correcting” an optimistic update. We have various heuristics, like the client: false flag on moves and the noClient plugin method, that mean the client knows when it should wait for the server and not update optimistically. If the client does update optimistically, it will then ignore the server update (by checking the update’s state ID).
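For reference, in current game code the move-level heuristic looks like this (long-form move syntax; plugins can signal the analogous thing via their noClient method):

```js
const game = {
  moves: {
    // Long-form move definition: `client: false` tells the client not to run
    // this move locally and to wait for the server's update instead.
    drawCard: {
      move: (G, ctx) => {
        G.hand.push(G.deck.pop());
      },
      client: false,
    },
  },
};
```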

I think an optimistic update could only be incorrect if a) a client had an out of date version of the game logic or b) a client had somehow been hacked by a user to behave differently. In both cases trying to correct state is probably futile.

So the two cases in practice are

  1. Client processes the update and none of the “no client” heuristics are triggered
  2. Client updates optimistically
  3. Client ignores server update when it arrives

Or

  1. Client processes the update and one or more “no client” heuristics is triggered
  2. Client state is unchanged
  3. Client receives and applies the server update
shaoster commented 3 years ago

So there are a couple practical cases where I see meaningful distinctions between optimistic updates and possible corrections:

  1. As a case related to the stale-version scenario: if there is a timeout, network issue, or other ActionError, the move/event would have to be rolled back.

  2. Another realistic case related to the stale-version scenario: simultaneous moves. This actually seems like the classic case where the model I outlined above, which creates an explicit distinction between optimistic and "true" updates, is critical.

  3. If the move outcome contains RNG. In practice, however, I think this is just a special case of playerView/imperfect information since the RNG is modeled as a deterministic player-hidden number sequence, so perhaps optimistic updates don't make sense in this case. While there may still be useful ways to potentially indicate an optimistic vs "true" state update in cases of imperfect information, I think these would be better handled manually by the developer using the async action callbacks.

delucis commented 3 years ago

Ah, I hadn’t considered 1 and 2 (which just aren’t handled currently — we’ll only be able to handle them better thanks to this refactor). We already don’t run moves that use the PRNG on the client (the randomness plugin tells the flow reducer the action shouldn’t be processed on the client if it was used), so we shouldn’t need to worry about 3.

I’m still doubtful about a separate update type. >99% of the time 1 or 2 won’t apply, so we should prefer having the common case (optimistic updates working seamlessly) be basically automatic and handle those edge cases when they arise. I’d suggest treating 1 and 2 basically as errors. We would keep the optimistic/authentic distinction private in the client implementation and leave consumers to worry about updates and errors rather than distinguishing between a variety of update types.

shaoster commented 3 years ago

So just to summarize:

delucis commented 3 years ago

Could we get optimistic errors as well? Seems like many errors (invalid moves, player out of turn, etc.) could reliably be returned from the client, so it would be nice for that to be optimistic too.

Consider something like this:

```js
import { INVALID_MOVE } from 'boardgame.io/core';

const game = {
  moves: {
    // Move that is sometimes invalid.
    A: (G, ctx, isInvalid) => {
      if (isInvalid) return INVALID_MOVE;
    },
    // Move that will not run on the client if it uses the PRNG.
    B: (G, ctx, isInvalid) => {
      if (isInvalid) return INVALID_MOVE;
      G.value = ctx.random.D6();
    },
  },
};

// Move can be resolved optimistically.
client.moves.A()
  .then(() => console.log('resolved optimistically'));

// Move can error optimistically.
client.moves.A(true)
  .then(({ error }) => console.log('optimistic error', error));

// Move has to resolve asynchronously because of PRNG use.
client.moves.B()
  .then(() => console.log('resolved asynchronously'));

// Move can error optimistically (PRNG wasn’t used).
client.moves.B(true)
  .then(({ error }) => console.log('optimistic error', error));
```

The exact handling of optimistic updates was left unresolved in my sketches in #723, but I actually think for this API, trusting the optimistic updates makes a lot of sense. This does suggest a split in error types though: a) game errors that can either be optimistic/asynchronous depending on the specific game logic (like above) and b) infrastructure/network errors where things timeout, get overridden because of out-of-sync state etc. The split seems convincing to me because b) requires resetting high-level state/retrying requests whereas a) likely requires localised error UI, so they have a different flavour. Also this would mean that in move dispatcher promises (like above) errors would be surfaced in direct response to the move instead. Errors of type b) could instead pass through some higher level onError method. Does that sound plausible?
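To illustrate the split, consumption might look something like this (onError is the suggested higher-level hook, not an existing Client option; the helper functions are illustrative):

```js
import { Client } from 'boardgame.io/client';

// (b) Infrastructure/network errors route through a higher-level handler.
const client = Client({
  game,
  onError: (error) => showConnectionBanner(error), // hypothetical client option
});
client.start();

// (a) Game errors surface at the call site, optimistically when the client
// can determine them locally, asynchronously otherwise.
client.moves.A(true).then(({ error }) => {
  if (error) showInvalidMoveHint(error); // e.g. localised error UI
});
```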

shaoster commented 3 years ago

In general this seems like it could work. A few caveats:

  1. I'm a bit confused about the use of the same promise resolution syntax for both the definitely synchronous and definitely asynchronous updates. I might be missing something, but this mixing could make it hard to write UI components that can be reused between local and remote game modes.

  2. Relatedly, optimistic errors should probably resolve through a synchronous mechanism (rather than via a promise) and not forward actions to the remote. Optimistic errors should never mutate developer-visible game state so this should be safe.

  3. "Correction" errors are a bit different from timeout/connection issues in that it's still typically useful for the UI to correlate the error to the triggering action. In simultaneous move setups, being able to do this correlation seems critical to correct animations/effects.

delucis commented 3 years ago
  1. The move.then Promise syntax is a new API as part of this proposal, so we don’t need to be as careful about synchronous updates. If users want to adopt the new possibilities of the call-site API they can, and we always do it via a Promise exactly to support compatibility between local and remote games. In something like move B in my examples, the caller can’t know if the move requires a server roundtrip, so even if the result will sometimes be processed synchronously, always getting errors by resolving the Promise makes sense.

  2. We should be able to keep the internals synchronous I guess and only the call-site API would actually involve a Promise resolving.

  3. I still think that the correction errors you highlighted are more global errors like “Hey you got out of sync with your opponents” or “Hey your connection timed out. Try again?” so that’s why I’m thinking it might be OK for that not to resolve at the call-site. (I do also see the drawbacks there but it seemed the best way to support optimistic call-site resolution.)

shaoster commented 3 years ago

Sweet. I think this puts us in a pretty good position. Whether simultaneous move corrections are "global" errors or not seems pretty minor and we can discuss those details more once this refactoring is complete.

I'll update the proposed data flow above ASAP and incorporate these changes into my dev branch in the next few days.