shaoster opened this issue 3 years ago
Hi @shaoster — sorry it took me so long to find time to give you some feedback. In general this all sounds solid. If I follow, we'd basically be formalising the debug powers of the client in the master (presumably behind some `isLocal` flag), so that the reducer always runs inside a master and a client only ever communicates with a master (even if the master is autoinstantiated internally by the client in single-player situations).
In general it all sounds good — so I’d suggest tackling whichever part you’d like to start with and seeing how it goes. Here are a few notes.
> multiplayer "local" setups should still be able to run with super-user permissions
If you mean current uses of the `Local` multiplayer transport — I don't think this should be the case. That should behave as similarly to the `SocketIO` instance as possible, to serve as a testing ground for multiplayer games without spinning up a server. The extra debugging power is only needed, as currently, for single-player clients.
> Correlating action results with triggering actions becomes quite complicated
Regardless of how exactly the architecture ends up, might we want to add some kind of action ID to help track actions? The action creators would generate a random `id` field in the action object, which the reducer could then spit back out as a "transient" like the errors you added. That way each state update would also report the ID of the action that caused it.
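The idea can be sketched roughly like this. Note that `makeMove` and the `transients` shape here are illustrative assumptions for this proposal, not existing boardgame.io internals:

```javascript
// Action creator attaches a random correlation ID at dispatch time.
const makeMove = (type, args) => ({
  type,
  args,
  id: Math.random().toString(36).slice(2),
});

// The reducer echoes the triggering action's ID back as a "transient",
// so every state update reports which action produced it.
const reducer = (state, action) => ({
  ...state,
  transients: { actionID: action.id },
});

const action = makeMove('clickCell', [3]);
const next = reducer({ G: {} }, action);
// next.transients.actionID now matches action.id
```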
> We may need to be careful to mitigate breaking changes for existing users that depend on synchronous dispatches for their single-player games.
Yes. Although the flux model is not strictly synchronous in any case. The model currently is not "X then Y", it's "Request X… State Update… then Y". That said, you're right that things currently run synchronously locally, which means something like `moves.A(); console.log('after A')` will first complete the move, fire the state update, and then log "after A". That might change, but maybe it doesn't have to with a local master?
Sounds great. I'll probably have some more time next week to give this a shot.
@delucis: Continuing work on this beginning today. I updated the original issue description with your comments.
Client Part I

Master

The master's update includes a `transients` field. This ActionResult field can contain a coded error payload, a serialized response, and the original correlation id for the action.

Client Part II

Case 1: update/patch received within timeout

On receiving the `transients.actionResult` field, the client triggers the specific callback registered to the original correlation id. Note: I'm punting on the design of how that callback interacts with the existing subscription to trigger UI updates/promise resolution/synchronous waits, since with the correlation->callback map, it seems we can do almost anything we want here depending on what API we want to expose to the developer. Regardless of the trigger mechanism, the callback is then removed from the client's callback map. Once the `transients.actionResult` is processed, the correlation id for the action is dropped from the queue/map.

Case 2: no update/patch received within timeout

The client detects the `timeout`, triggers a timeout exception on the corresponding callback, and drops the callback. It likewise detects the `timeout` and drops the retired action's correlation id from the outbound queue/map.

Thanks Philip, here are some notes
`undo` and `redo` are fully fledged actions that can be validly used by games in production, so they shouldn't be part of the debugging API.
The current client has an `overrideGameState` method that allows setting a custom game state. The primary use is for the debug panel, which uses this to set an alternative state when stepping through the game log. While designing the debug API, we should probably account for this, potentially following something like the API described in https://github.com/boardgameio/boardgame.io/issues/892#issuecomment-762767433. (This might seem like a separate concern, but log stepping also runs the reducer, so we should bear that in mind even if we keep it in the client rather than formalising it in the master.)
We should probably support customisation of the `timeout` duration on the client.
There’s no need to run a timeout check unless any actions are outstanding, we should do something like:
When dispatching an action, register the recurring timeout check if not currently running (something that a timeout monitor abstraction could do for itself, so we just call `monitor.start()` and it's a no-op if already running).
When removing an outstanding action from the map, check if it’s the last one and stop the recurring check.
The timeout check is cheap, so it's probably OK to run it relatively frequently, but there's also no point in running it too frequently, because things can't time out until the entire timeout duration has elapsed. Setting the interval with something like `max(timeout / n, minInterval)` would probably make sense. Not sure what an optimal `n` would be, but `n = 5` and `minInterval = 150` might be OK.
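A minimal sketch of that monitor, assuming hypothetical names (`TimeoutMonitor`, `add`/`remove`) and the interval heuristic above:

```javascript
class TimeoutMonitor {
  constructor({ timeout = 1000, n = 5, minInterval = 150, onTimeout }) {
    this.timeout = timeout;
    // Interval derived as suggested: max(timeout / n, minInterval).
    this.interval = Math.max(timeout / n, minInterval);
    this.onTimeout = onTimeout;
    this.pending = new Map(); // correlation id -> dispatch timestamp
    this.timer = null;
  }

  // Called when dispatching an action.
  add(id) {
    this.pending.set(id, Date.now());
    this.start(); // no-op if the recurring check is already running
  }

  // Called when an outstanding action resolves.
  remove(id) {
    this.pending.delete(id);
    if (this.pending.size === 0) this.stop(); // last one stops the check
  }

  start() {
    if (this.timer !== null) return;
    this.timer = setInterval(() => this.check(), this.interval);
  }

  stop() {
    if (this.timer === null) return;
    clearInterval(this.timer);
    this.timer = null;
  }

  check() {
    const now = Date.now();
    for (const [id, dispatchedAt] of this.pending) {
      if (now - dispatchedAt >= this.timeout) {
        this.remove(id);
        this.onTimeout(id); // trigger timeout exception on the callback
      }
    }
  }
}
```

The recurring check only exists while actions are outstanding, so idle clients pay nothing for it.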
Otherwise I think this all looks good unless I’ve missed something! Look forward to seeing the PR.
> G/ctx updates should exclusively be the result of updates from master.
Does that mean that the client will stop optimistically updating the state in multiplayer games? I'm just worried some games might feel sluggish to the user, like our chess example, which could take up to a second to reflect a move after the user performs the action.
Good question @vdfdev. @shaoster & I have definitely discussed the importance of optimistic updates before, so the intention is to keep that feature, although it’s true I forgot about that while looking at this proposal. Any thoughts on how we could handle that @shaoster?
That's a great point that I haven't thought through.
The current plan above handles the following cases:
Single player will use a locally-instantiated master and no optimistic updates (or at least no distinction between optimistic and "true" updates). There is negligible performance impact in this case because the transport is local and shimmed.
Local multiplayer works in the same fashion as single player.
Remote multiplayer would register a callback that would be fired upon resolution of the action by the remote master. Per @vdfdev, if the action takes a long time to resolve (or there is a flaky network), existing game implementations may suffer from performance issues.
From an implementation perspective, it's straightforward to retain the existing optimistic behavior in case 3 by also creating a (write-only) local master that synchronously returns the optimistically updated state. However, much of the semantic clarification I want to achieve with this change (and the subsequent Action Result/Error API) becomes quite muddy when having to deal with both "optimistic" and "true" updates in the remote multiplayer case.
While there are a bunch of possible API clarifications that could work, I think the best choice among them really depends on the answers to a couple of questions, to which I probably have non-representative answers:
How do developers currently consume the optimistic state update in remote multiplayer games?
In the same manner that the current recommendation for Move Results/Errors is to just create a separate (pure) method that's called by both the UI and bg.io, I found it easier to just use a helper method to explicitly handle optimistic updates. I explicitly updated the UI to the optimistic state, and treated any updates to G/ctx/plugins from bg.io as asynchronously-updated sources of truth.
How heavily do developers rely on the "synchronous" behavior of the optimistic updates?
I relied entirely on synchronous updates for single player games, and hated the automatic/optimistic updates in multiplayer games, mostly because of the difficulty of attaching side effects, like animations, to state changes that might be corrected later. I was very (and still remain somewhat) confused about how to write my core UI/game logic that is meaningfully reusable between single player and multiplayer setups.
I was pretty inexperienced with the framework when I last built a multiplayer game using bg.io. Are there best/common practices the API should be designed to accommodate? Expanding on the above, I tended to handle optimistic cases manually, so that when there were no "corrections", my side effects were applied as expected. But when there were corrections, they were easy to detect and I could then add the appropriate animations or errors to indicate them.
Finally, while I think it's of paramount importance to define the requirements based on actual use cases per the questions above, here's some options I see for the API:
Formalize synchronous optimistic updates and asynchronous "true" updates as two distinct concepts. These could have either the same (or similar) Action Result/Action Error APIs. (They might not be the same because only "true" updates might have certain error classes like Timeouts or Network Errors). This has the best chance of remaining backwards compatible, but could be quite messy/surprising in corner cases. (Is a "correction" to an optimistic update always an error case?)
Make optimistic updates an opt-in mechanism. Optimistic updates can be retrieved synchronously from a field in a decorated promise returned by an event/move/plugin action (i.e. https://gist.github.com/domenic/8ed6048b187ee8f2ec75). All updates to G/ctx/plugins in the UI are "real". This is probably how I would've designed the API to start with, based on my usage of the API in the games I've developed.
Create some client-level feature flag for what kind of behavior the UI expects. If the feature flag is off, you get a high-fidelity emulation of the legacy behavior with no Action Result/Error/etc... If the feature flag is on, you get a new backwards-incompatible API tailor-made per the answers to the usage questions above.
I think we should stick as closely to the current behaviour as possible: for consumer purposes there is no distinction between an optimistic and a server update. In practice, it's not really a question of the server "correcting" an optimistic update. We have various heuristics, like the `client: false` flag on moves and the `noClient` plugin method, that mean the client knows when it should wait for the server and not update optimistically. If the client does update optimistically, it will then ignore the server update (by checking the update's state ID).
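For reference, the `client: false` heuristic is set on a long-form move in the game definition, roughly like this (the move names are made up, and `INVALID_MOVE` is stubbed here so the snippet is self-contained):

```javascript
const INVALID_MOVE = 'INVALID_MOVE'; // stand-in for boardgame.io's constant

const game = {
  moves: {
    // Short form: runs optimistically on the client and again on the master.
    clickCell: (G, ctx, id) => {
      if (G.cells[id] !== null) return INVALID_MOVE;
      G.cells[id] = ctx.currentPlayer;
    },
    // Long form with `client: false`: never run optimistically, so the
    // client always waits for the master's update for this move.
    drawCard: {
      move: (G, ctx) => {
        G.hand.push(G.deck.pop());
      },
      client: false,
    },
  },
};
```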
I think an optimistic update could only be incorrect if a) a client had an out of date version of the game logic or b) a client had somehow been hacked by a user to behave differently. In both cases trying to correct state is probably futile.
So the two cases in practice are: the client knows it should wait for the server and doesn't update optimistically, or the client updates optimistically and then ignores the server update.
So there are a couple practical cases where I see meaningful distinctions between optimistic updates and possible corrections:
As a related case to the stale version, if there is a timeout, network issue or other ActionError, the move/event would have to be rolled back.
Another realistic related case to stale version: simultaneous moves. This actually seems like the classic case where the model I outlined above, which creates an explicit distinction between optimistic and "true" updates, is critical.
If the move outcome contains RNG. In practice, however, I think this is just a special case of playerView/imperfect information since the RNG is modeled as a deterministic player-hidden number sequence, so perhaps optimistic updates don't make sense in this case. While there may still be useful ways to potentially indicate an optimistic vs "true" state update in cases of imperfect information, I think these would be better handled manually by the developer using the async action callbacks.
Ah, I hadn’t considered 1 and 2 (which just aren’t handled currently — we’ll only be able to handle them better thanks to this refactor). We already don’t run moves that use the PRNG on the client (the randomness plugin tells the flow reducer the action shouldn’t be processed on the client if it was used), so we shouldn’t need to worry about 3.
I’m still doubtful about a separate update type. >99% of the time 1 or 2 won’t apply, so we should prefer having the common case (optimistic updates working seamlessly) be basically automatic and handle those edge cases when they arise. I’d suggest treating 1 and 2 basically as errors. We would keep the optimistic/authentic distinction private in the client implementation and leave consumers to worry about updates and errors rather than distinguishing between a variety of update types.
So just to summarize: optimistic updates stay automatic and internal to the client, and the edge cases surface as errors rather than as a separate update type.
Could we get optimistic errors as well? Seems like many errors (invalid moves, player out of turn, etc.) could reliably be returned from the client, so it would be nice for that to be optimistic too.
Consider something like this:
```js
const game = {
  moves: {
    // Move that is sometimes invalid.
    A: (G, ctx, isInvalid) => {
      if (isInvalid) return INVALID_MOVE;
    },
    // Move that will not run on the client if it uses the PRNG.
    B: (G, ctx, isInvalid) => {
      if (isInvalid) return INVALID_MOVE;
      G.value = ctx.random.D6();
    },
  },
};

// Move can be resolved optimistically.
client.moves.A()
  .then(() => console.log('resolved optimistically'));

// Move can error optimistically.
client.moves.A(true)
  .then(({ error }) => console.log('optimistic error', error));

// Move has to resolve asynchronously because of PRNG use.
client.moves.B()
  .then(() => console.log('resolved asynchronously'));

// Move can error optimistically (PRNG wasn’t used).
client.moves.B(true)
  .then(({ error }) => console.log('optimistic error', error));
```
The exact handling of optimistic updates was left unresolved in my sketches in #723, but I actually think for this API, trusting the optimistic updates makes a lot of sense. This does suggest a split in error types though: a) game errors that can either be optimistic/asynchronous depending on the specific game logic (like above) and b) infrastructure/network errors where things time out, get overridden because of out-of-sync state etc. The split seems convincing to me because b) requires resetting high-level state/retrying requests whereas a) likely requires localised error UI, so they have a different flavour. Also this would mean that in move dispatcher promises (like above) errors of type a) would be surfaced in direct response to the move. Errors of type b) could instead pass through some higher-level `onError` method. Does that sound plausible?
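A rough sketch of how that split might look at call sites, using a stubbed client so the shapes are concrete (the `onError` hook and the error `type` values are assumptions, not a finalised API):

```javascript
const client = {
  onError: null, // higher-level handler for infrastructure errors (b)
  moves: {
    // Stub: an invalid move resolves its Promise with a game error (a),
    // mirroring the optimistic-error examples above.
    A: (isInvalid) =>
      Promise.resolve(isInvalid ? { error: { type: 'invalid-move' } } : {}),
  },
  // Stub: timeouts/out-of-sync corrections bypass the call site entirely.
  _infrastructureError(error) {
    if (this.onError) this.onError(error);
  },
};

client.onError = (error) => {
  // b) reset high-level state or prompt a retry.
  console.log('global error:', error.type);
};

client.moves.A(true).then(({ error }) => {
  // a) localised error UI next to the control that triggered the move.
  if (error) console.log('call-site error:', error.type);
});

client._infrastructureError({ type: 'timeout' });
```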
In general this seems like it could work. A few caveats:
I'm a bit confused about the use of the same promise resolution syntax for both the definitely synchronous and definitely asynchronous updates. I might be missing something, but this mixing could make it hard to write UI components that can be reused between local and remote game modes.
Relatedly, optimistic errors should probably resolve through a synchronous mechanism (rather than via a promise) and not forward actions to the remote. Optimistic errors should never mutate developer-visible game state so this should be safe.
"Correction" errors are a bit different from timeout/connection issues in that it's still typically useful for the UI to correlate the error to the triggering action. In simultaneous move setups, being able to do this correlation seems critical to correct animations/effects.
The `move.then` Promise syntax is a new API as part of this proposal, so we don't need to be as careful about synchronous updates. If users want to adopt the new possibilities of the call-site API they can, and we always do it via a Promise exactly to support compatibility between local and remote games. In something like move `B` in my examples, the caller can't know if the move requires a server roundtrip, so even if sometimes the result will be processed synchronously, always getting errors by resolving the Promise makes sense.
We should be able to keep the internals synchronous I guess and only the call-site API would actually involve a Promise resolving.
I still think that the correction errors you highlighted are more global errors like “Hey, you got out of sync with your opponents” or “Hey, your connection timed out. Try again?”, so that’s why I’m thinking it might be OK for those not to resolve at the call site. (I do also see the drawbacks there, but it seemed the best way to support optimistic call-site resolution.)
Sweet. I think this puts us in a pretty good position. Whether simultaneous move corrections are "global" errors or not seems pretty minor and we can discuss those details more once this refactoring is complete.
I'll update the proposed data flow above ASAP and incorporate these changes into my dev branch in the next few days.
Context
While attempting to set up the additional client/server plumbing to enable https://github.com/boardgameio/boardgame.io/issues/723, I ran into a handful of issues with the underlying data model that make life pretty hard. These issues all seemed to stem from the divergent data flows in the various boardgame.io runtime "modes". Roughly speaking, they are as follows:
Single-player local setup runs through a Client instance that owns its own redux-managed state machine. Actions resolve synchronously and the results of dispatches are seen directly in the client. This setup is easy to augment with callback plumbing, but...
Multiplayer remote setups run in a split fashion between a client and master. While the client and master end up running the same reducer code, they maintain substantially different state machines and encounter substantially different actions during the course of a game. This is made more complex in cases where the client optimistically applies updates to its own state machine, but then may end up diverging from that state upon receiving a patch/update from the master. Correlating action results with triggering actions becomes quite complicated in this setup without adding a bunch more actions whose semantics diverge between the client and master.
In addition to divergent action semantics between the client and master, the separate client/master wrappers for the core redux reducer lead to some fairly suboptimal duplicated logic.
For example, multiplayer "local" setups should still be able to run with super-user permissions. Those permissions are currently enforced in the master wrapper layer instead of the core reducer, yet the reducer still contains a number of check-or-noop instances for these kinds of issues, presumably to avoid corrupting game state.

This issue was created to make it easier to discuss/review the upcoming PRs intending to address these items.
Proposal
Re-plumb the existing user-exposed APIs against the new setup where there is always a master. Local prototyping/debug workflows will use a local master with a local transport. We may need to be careful to mitigate breaking changes for existing users that depend on synchronous dispatches for their single-player games.
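As a very rough sketch of the target shape (`LocalMaster` and `LocalTransport` are illustrative names for this proposal, not existing exports): the reducer only ever runs in the master, and the single-player transport is an in-process shim, which is why the overhead should be negligible.

```javascript
class LocalMaster {
  constructor(reducer, initialState) {
    this.reducer = reducer;
    this.state = initialState;
  }

  // The reducer only ever runs here, never in the client.
  onAction(action) {
    this.state = this.reducer(this.state, action);
    return this.state;
  }
}

class LocalTransport {
  constructor(master, onUpdate) {
    this.master = master;
    this.onUpdate = onUpdate; // the client's state subscription
  }

  // "Sending" an action is just an in-process call, so a single-player
  // client pays essentially nothing for always going through a master.
  send(action) {
    this.onUpdate(this.master.onAction(action));
  }
}

// A trivial reducer to show the flow end to end.
const reducer = (state, action) => ({ moves: state.moves + 1 });
const master = new LocalMaster(reducer, { moves: 0 });
let latest = null;
const transport = new LocalTransport(master, (state) => {
  latest = state;
});
transport.send({ type: 'clickCell' });
// latest is now { moves: 1 }
```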