WASM examples broken if the user switches tabs

cBournhonesque commented 8 months ago

I think this PR broke the examples for some reason.

the examples work fine on native
on wasm, the connection gets timed out pretty quickly for some reason

UPDATE:

it looks like the client tasks stop running when the page is alt-tabbed, so the connection times out if the user alt-tabs for too long!
the second problem is that the server seems to be stuck in a weird state where it doesn't accept webtransport connections anymore.

On client we get:

Failed to establish a connection to https://127.0.0.1:5000/: net::ERR_QUIC_PROTOCOL_ERROR.QUIC_NETWORK_IDLE_TIMEOUT (No recent network activity after 9003077us. Timeout:9s).
failed to connect to server: Error(JsValue(WebTransportError: Opening handshake failed.Error: Opening handshake failed.))

POSSIBLE SOLUTIONS:

run in background thread using webworkers?

MOZGIII commented 8 months ago

Let me know if you figure out the solution; if this is something that's better solved at the xwt level I'm interested in adding support.

Nul-led commented 7 months ago

@cBournhonesque quick update: browser seems to pause io tasks when in ram saving mode regardless of whether or not the client runs in a webworker, so the appropriate solution should be adjusting timeout i think

Nul-led commented 7 months ago

(only tested on brave, but should be the same for other major browsers when ram saving is active)

Nul-led commented 7 months ago

when ram saving mode is disabled, this issue does not seem to occur, so im certain that this is indeed the root cause

MOZGIII commented 7 months ago

There are tricky sneaky ways to keep the tab alive if this is the reason btw - but none I'd recommend implementing at this crate level

cBournhonesque commented 6 months ago

@Nul-led so you have confirmed that, if you disable ram-saving mode, you can freely switch tabs and the game (including io tasks) will continue working in the background? i.e. the issue totally disappears if you disable ram-saving mode?

Nul-led commented 6 months ago

@cBournhonesque apparently the thread does not actually get stopped entirely but instead just throttled. Might be possible to figure out if that happens and temporarily stop sending and receiving packets.

Disabling ram saver seems to work on brave, cant say with other browsers. Requires more testing ig.

simbleau commented 6 months ago

It's very unclear what the real issue is from reading this.

Can someone confirm:

Who causes the disconnection? Does the server terminate the client, does the client terminate themselves, does the browser terminate the session, etc.?
Any confirmation of whether the server bug still happens, and why?

Regarding the "Keep Alive" strategy:

I don't think this is possible. Unlike websockets which are browser-scoped, WebTransport is tab-scoped. That means when you switch tabs, WebTransport fully stops. The KeepAlive strategy only works for websockets.
However, could we instead modify the constraint to allow messages an infinite amount of time to return a response, to prevent timeout?

Other strategies:

Would automatic re-connection be possible?
Has anyone tried a web worker to keep it alive? This seems like it would work, since the script will run independently of the main thread, allowing it to continue executing even when the user switches tabs.

MOZGIII commented 6 months ago

If you have audio playing on your tab it won't get suspended. It would actually be quite fine for a game to apply this workaround.

Automatic reconnection at the WebTransport layer and all that is not possible since the browser doesn't really give us control over those details of the connection that would enable it: specifically, timeouts. It is the browser that would terminate the WebTransport session, and this will happen regardless of whether tab is paused or not, so we can't hook into it.

It is totally possible to implement the reconnect at the app level though. Would require a certain layer of logic on top of the transport, like a custom handshake to identify the connecting party - but that is possible. The lack of control over the RTT0 in the browsers API is a bit unfortunate here - but if it was there it would be not as bad latency-wise.

simbleau commented 6 months ago

If you have audio playing on your tab it won't get suspended. It would actually be quite fine for a game to apply this workaround.

Automatic reconnection at the WebTransport layer and all that is not possible since the browser doesn't really give us control over those details of the connection that would enable it: specifically, timeouts. It is the browser that would terminate the WebTransport session, and this will happen regardless of whether tab is paused or not, so we can't hook into it.

It is totally possible to implement the reconnect at the app level though. Would require a certain layer of logic on top of the transport, like a custom handshake to identify the connecting party - but that is possible. The lack of control over the RTT0 in the browsers API is a bit unfortunate here - but if it was there it would be not as bad latency-wise.

I do think this should be brought up to W3C or a WT working group, but regardless...

Since we're blocked on W3C and browsers, it sounds like there are only 2 reasonable solutions that will be solved within the heat-death of the universe.

1) Webworkers
2) App-level reconnecting

(or Both, long term)

I think 2) is understood enough by yourselves to be fixed today.

Is there any chance we could add that logic to the simple-box example? Specifically, to spell it out to new users (myself), or have an entirely new demo just for reconnecting.

Ideally, 3 app states: Connected, Reconnecting, Disconnected. If reconnecting happens, have some text centered on screen that says "reconnecting...".
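For illustration, the three states proposed above could be sketched as a small state machine. This is a minimal sketch in plain Rust, not lightyear code: `ConnectionState`, `NetEvent`, `next_state`, `overlay_text` and the 30-second give-up window are all hypothetical names and values chosen for the example.

```rust
use std::time::{Duration, Instant};

/// Hypothetical app-level connection state (not a lightyear type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConnectionState {
    Connected,
    Reconnecting { since: Instant, attempts: u32 },
    Disconnected,
}

/// Hypothetical events that the transport layer would report to the app.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NetEvent {
    TransportLost,
    AttemptFailed,
    HandshakeOk,
}

/// Assumed value: how long to keep retrying before giving up.
pub const GIVE_UP_AFTER: Duration = Duration::from_secs(30);

/// Pure transition function: current state + event -> next state.
pub fn next_state(state: ConnectionState, event: NetEvent, now: Instant) -> ConnectionState {
    use ConnectionState::*;
    match (state, event) {
        // Losing the transport starts a reconnection window.
        (Connected, NetEvent::TransportLost) => Reconnecting { since: now, attempts: 0 },
        // A failed attempt after the window has elapsed means we give up.
        (Reconnecting { since, .. }, NetEvent::AttemptFailed)
            if now.duration_since(since) >= GIVE_UP_AFTER =>
        {
            Disconnected
        }
        // Otherwise count the failed attempt and keep trying.
        (Reconnecting { since, attempts }, NetEvent::AttemptFailed) => Reconnecting {
            since,
            attempts: attempts + 1,
        },
        // A successful handshake always brings us back to Connected.
        (Reconnecting { .. }, NetEvent::HandshakeOk) => Connected,
        (Disconnected, NetEvent::HandshakeOk) => Connected,
        // Every other combination leaves the state unchanged.
        (other, _) => other,
    }
}

/// Text to center on screen, if any.
pub fn overlay_text(state: ConnectionState) -> Option<&'static str> {
    match state {
        ConnectionState::Connected => None,
        ConnectionState::Reconnecting { .. } => Some("reconnecting..."),
        ConnectionState::Disconnected => Some("disconnected"),
    }
}
```

Keeping the transition function pure (time is passed in as `now`) makes it easy to test without a browser; wiring it into bevy states or a lightyear example is left open here.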

MOZGIII commented 6 months ago

The real solution is adding audio to the game... :D

simbleau commented 6 months ago

Maybe this is a joke, but I don't consider this a serious solution.

The real solution is adding audio to the game... :D

I'll play devil's advocate though ... what if the user mutes their browser tab? Would it still work?

cBournhonesque commented 6 months ago

@simbleau I don't really understand the problem clearly myself. My current understanding from reading the messages above is:

this is less of an issue with WebSocket because the websocket session is browser-wide, but the tab can still be throttled which could cause issues
for WebTransport the session is per-tab, and gets suspended/ended by the browser when the client switches tabs; so setting an infinite timeout on the server side doesn't fix the issue?

As for the reconnecting logic: I've started adding more networking-related state to the library, primarily so that we have more runtime-control over the networking configuration (so that a disconnected client can select a different server, etc.): https://github.com/cBournhonesque/lightyear/blob/main/lightyear/src/client/networking.rs#L279 This could be adapted to support reconnections. So the idea would be that when a user switches tabs, the server times them out; but when they reconnect, the server recognizes that it's the same ClientId and resumes their position in the game?

MOZGIII commented 6 months ago

Maybe this is a joke, but I don't consider this a serious solution.

It is though. There are npm packages that play an audio stream of barely audible noise precisely to do just this.

I'll play devil's advocate though ... what if the user mutes their browser tab? Would it still work?

No, it won't. Users have to comply with the workaround if they want to remain connected, and if not - well, it is always up to them. Browsers don't have a good way to keep a tab alive. There's https://www.w3.org/TR/screen-wake-lock/ but it is for a different purpose.

The easiest way for the user to keep the tab active is if it plays audio. The less easy way is for them to add the origin to the list of websites that never go inactive, and the most difficult way is to disable the whole Chromium / Firefox feature - which is nonetheless doable.

That said, there's also background sync, so, maybe you don't actually need the WebTransport session... This does not seem like a portable solution fit for this kind of crate though. Maybe for a more comprehensive networking solution specialized on web apps/games.

simbleau commented 6 months ago

A small, important clarification:

@MOZGIII said It is the browser that disconnects the WebTransport session when you switch tabs, not the server.

Re: web sockets- Yes, that's right. bevy_rtc doesn't have this issue because it uses WebRTC with signaling built over web sockets. Those web sockets never go idle because, regardless of whether the client app is frozen, the server continues to send KeepAlive packets to the web socket.

Re: reconnecting - it's unclear to me, too. I lean on you two to figure this out. I'm guessing when you connect there's a refresh token the client can be told about for "fast reconnecting," However I'd be fine with a total teardown/re-connect. As long as there's some way to reconnect...

simbleau commented 6 months ago

What about web workers?

simbleau commented 6 months ago

Maybe this is a joke, but I don't consider this a serious solution.

It is though. There are npm packages that play an audio stream of barely audible noise precisely to do just this.

I call that a hack, not a solution.

Perhaps we need to file a case under W3C, actually, to address this.

Because even for games, that's a shitty "solution". I mute tabs often, especially games. Communicating the technical problem and putting the onus on users to circumvent it is technically embarrassing and difficult for, e.g., children and children's games.

simbleau commented 6 months ago

Filed w3c/webtransport#600

MOZGIII commented 6 months ago

I call that a hack, not a solution.

It is absolutely a hack. As you said, W3C has to deal with it; there would probably be a new Wake Lock Web API for this. This is a lot of work however, and definitely not something that is available today - so the workarounds and hacks are still meaningful to discuss here. After all - if a hack solves the issue it is usually classified as a "good enough" solution and most people can move on to the next thing.

UPD:

Filed w3c/webtransport#600

This is great, let's see what they say! I have doubts they'll give us something, as this is a Chromium thing and is standardized afaik.

I've been going through the source to figure out where it's implemented, so far found this - might be a good place to explore for others too.

MOZGIII commented 6 months ago

Re: reconnecting - it's unclear to me, too. I lean on you two to figure this out. I'm guessing when you connect there's a refresh token the client can be told about for "fast reconnecting," However I'd be fine with a total teardown/re-connect. As long as there's some way to reconnect...

I was talking about the 0RTT QUIC handshakes - they allow establishing a new QUIC connection reusing some of the key material data from the previously-established-but-now-closed QUIC connection to save a few exchanges in the handshake. This is not resuming the old connection though - it is creating a new connection, so re-connection.

With re-connection, it all depends on how the application handles the new connection. If it has a persistent identifier for the client and correlates the context with the said identifier rather than the connection - so that the connections are context-less besides providing the reference to the said persistent identifier - it is very trivial to implement reconnections, assuming the app supports "connecting mid-game" or otherwise allowing newly connected clients in whatever is going on. This is usually done in games through the initial world-state replication on connection - but in this case an additional support for replicating updates for the previously connected persistent identity (just over a new connection) would be required.

So, the application-level support for seamless reconnection would likely be a "real" solution, as it would not rely on transient state like WebTransport session to be intact in the first place.
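To make the "context keyed by a persistent identifier, not by the connection" idea concrete, here is a minimal server-side sketch in plain Rust. It is not lightyear's API: `Sessions`, `Session`, `JoinKind` and the `ClientId` / `ConnectionId` aliases are hypothetical, and the `position` field just stands in for whatever game context would need to survive a reconnect.

```rust
use std::collections::HashMap;

/// Hypothetical persistent identity that survives reconnects.
pub type ClientId = u64;
/// Hypothetical handle for whichever transport connection is currently live.
pub type ConnectionId = u32;

#[derive(Debug, PartialEq)]
pub enum JoinKind {
    /// First time we see this client: spawn its entities.
    Fresh,
    /// Known client on a new connection: keep its entities, replicate the world again.
    Resumed,
}

#[derive(Debug)]
pub struct Session {
    /// `None` while the client has no live transport connection.
    pub connection: Option<ConnectionId>,
    /// Stand-in for game context (controlled entities, score, ...).
    pub position: (f32, f32),
}

#[derive(Default)]
pub struct Sessions {
    by_client: HashMap<ClientId, Session>,
}

impl Sessions {
    /// Called after the (app-level) handshake has identified the client.
    pub fn on_connect(&mut self, client: ClientId, conn: ConnectionId) -> JoinKind {
        if let Some(session) = self.by_client.get_mut(&client) {
            // Same identity, new connection: only the transport handle changes.
            session.connection = Some(conn);
            return JoinKind::Resumed;
        }
        let session = Session { connection: Some(conn), position: (0.0, 0.0) };
        self.by_client.insert(client, session);
        JoinKind::Fresh
    }

    /// Called when the transport times out; the game context is kept.
    pub fn on_transport_lost(&mut self, client: ClientId) {
        if let Some(session) = self.by_client.get_mut(&client) {
            session.connection = None;
        }
    }

    pub fn get_mut(&mut self, client: ClientId) -> Option<&mut Session> {
        self.by_client.get_mut(&client)
    }
}
```

A real server would also need to authenticate the identifier and eventually expire sessions that never come back; both are out of scope for this sketch.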

I would say though this is a job either for a specific application/game to implement, or a really high-level networking framework that takes opinionated control over way more things than lightyear in particular currently does.

That said, the solution would most likely have to be transport-agnostic, as this is in no way a WebTransport-specific issue - as a typical transport state is transient.

QUIC (the HTTP3/WebTransport underlying protocol) has keep-alive for idle connections as well. See https://datatracker.ietf.org/doc/html/rfc9308#name-session-resumption-versus-k

Web API for WebTransport may just expose the configuration parameters for idle connections management - but overall this is still worse than the solution above, albeit less of a hack than playing audio.

Note that this, however, would not solve the issue - well, at least maybe not entirely. If the app code is frozen, the queues won't be drained and will overflow. The browsers will either crash the WebTransport [[Session]] or evict the older datagrams from the queue - meaning the protocol will be disrupted, and the app code will have to recover from this. Either way, recovery likely means resetting the replication state and requesting either the whole world state, as for the initial connection, or a list of state updates since the last known world state. The latter only works well if you have a deterministic game, or when desyncs are not a particularly bad problem - like for online cooperative archvis apps, where there's no need for authoritative conflict resolution like in games.

MOZGIII commented 6 months ago

@MOZGIII said It is the browser that disconnects the WebTransport session when you switch tabs, not the server.

There are a number of scenarios here that could happen. I have not investigated it in practice, but it is true that the browser causes the connection to disconnect - potentially by not enabling the keep alive settings. But this is unclear, and might be that the server actually sends the goaway frame.

MOZGIII commented 6 months ago

What about web workers?

Thinking about this - if you can extract the whole networked game state maintenance loop into the Web Worker together with the WebTransport - sure, that would work (well, except WebWorkers are deactivated too at certain times, so maybe a ServiceWorker instead, but this can be determined later down the line). That way you can ensure the data the server communicates is not lost and processed to the best of the client's ability while the rendering is unavailable. But moving only WebTransport out would cause the same issue I described at https://github.com/cBournhonesque/lightyear/issues/144#issuecomment-2062607360 (second part).

MOZGIII commented 6 months ago

At the https://github.com/w3c/webtransport/issues/600 they are saying it's an implementation bug, which is what I was very much suspecting thus my attempts to find the tab deactivation code in the Chromium source. From what I recall from reading WebTransport though - it shouldn't be an issue with the tab deactivation. What is most likely the issue though is that the client and server can't agree on the idle timeouts - which may or may not be caused by Chrome side, but based on the lack of the settings to tweak the idle timeout in the spec - it could. That said, double-check your server side - you could just enable the idle connection keep alive from the server side.

Unfortunately, there is still a problem of data loss that has to be solved (world state reinit or state diff sync), because the datagrams will be dropped from the recv queue if the app can't keep up with them, and the frozen app definitely can't.
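The data-loss problem described above can be illustrated with a bounded receive queue that evicts the oldest datagram when full and remembers that it did so, so the app knows it must resync when it wakes up. This is a minimal sketch in plain Rust under the assumption that the app owns such a queue; `RecvQueue` is a hypothetical type, not the browser's internal queue and not part of xwt or lightyear.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded queue for received datagrams.
pub struct RecvQueue {
    buf: VecDeque<Vec<u8>>,
    capacity: usize,
    /// Datagrams evicted since the last drain.
    dropped: u64,
}

impl RecvQueue {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { buf: VecDeque::with_capacity(capacity), capacity, dropped: 0 }
    }

    /// Store a datagram, evicting the oldest one if the queue is full.
    pub fn push(&mut self, datagram: Vec<u8>) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
            self.dropped += 1;
        }
        self.buf.push_back(datagram);
    }

    /// Take everything that is buffered. The flag is `true` when datagrams
    /// were evicted, i.e. the app should request a world-state reinit or a
    /// state diff instead of trusting the remaining packets alone.
    pub fn drain(&mut self) -> (Vec<Vec<u8>>, bool) {
        let needs_resync = self.dropped > 0;
        self.dropped = 0;
        (self.buf.drain(..).collect(), needs_resync)
    }
}
```

A frozen app would call `drain` once on wake-up; if the flag is set, it would fall back to one of the two recovery strategies mentioned above rather than replaying the surviving datagrams.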

simbleau commented 6 months ago

Ok so, we need confirmation from a Chromium filed issue this is a bug. Otherwise we aren't sure if it's a lightyear/xwt bug. There will be people, myself included, who wouldn't experiment or adopt lightyear today if this is a design choice of WebTransport that won't be fixed.

Secondly, I'll propose we document the workaround: Disable RAM saving mode with an issue to track the Chromium bug.

Lastly, anyone want to add a reconnection example? I think it would be helpful in any case.

MOZGIII commented 6 months ago

I am very confident this is not an xwt bug, can't be sure of lightyear. I integrate xwt differently in my app, i.e. I am spawning the IO loop into a promise (i.e. wasm_bindgen_futures async loop spawn). lightyear integrates via bevy's async IO pool, which seems wrong to me. Maybe this is the reason why it hangs actually. I'll soon be working on the guidance on how to better integrate xwt - once I am finished with the research. Stay tuned.

simbleau commented 6 months ago

I am very confident this is not an xwt bug, can't be sure of lightyear. I integrate xwt differently in my app, i.e. I am spawning the IO loop into a promise (i.e. wasm_bindgen_futures async loop spawn). lightyear integrates via bevy's async IO pool, which seems wrong to me. Maybe this is the reason why it hangs actually. I'll soon be working on the guidance on how to better integrate xwt - once I am finished with the research. Stay tuned.

I'm guessing your project suffers the same issue, regardless?

The bevy IoTaskPool seems uncontroversial to me.

But... (and I'm very unsure of how Lightyear works) if all that is needed is to simply run an async future, we could easily offload the tasking to a webworker, which is good news.

Nul-led commented 6 months ago

There is still a bit of latency generated by sending the data over a webworker. Working with them in rust is a pain too, so maybe making a js prototype to confirm this is a good idea first?

MOZGIII commented 6 months ago

I'm guessing your project suffers the same issue, regardless?

Actually, I haven't noticed that - but it doesn't mean it's not happening. The web deployment is borked atm, so I can't test properly.

The bevy IoTaskPool seems uncontroversial to me.

Not sure, but bevy IO task pool might be tied to frame generation (i.e. game loop ticking), which might be tied to the request animation frame, which, on a high level reasoning, is justified to be frozen when you switch tabs...

There is still a bit of latency generated by sending the data over a webworker.

Interesting, there is definitely a ton of overhead. Recv would look like this: the native side sends the datagram to a queue, then passes it to wasm in the webworker, then wasm copies it into its own memory and decodes it; then, as we need to send it to the main game, it is copied out for postMessage to native, actually sent via postMessage across contexts, copied in from the game main's native side to the game main's wasm - and then finally handled.

A lot of steps, a lot of copying, but all of it can happen lightning fast on a modern system - well, within 10 ms if I had to estimate. But it might actually be prohibitively costly for a low-latency game. It could be 100 times quicker if there were just copies and no serialization / deserialization steps - but there are...

simbleau commented 5 months ago

It seems this must be a bug in lightyear, not Chrome. As w3c pointed out, the behavior doesn't exist with other apps.

Here's a test app, which runs fine in Chrome and Firefox: https://webrtc.internaut.com/wt/

Perhaps it's the async polling, like @MOZGIII suggested?

Nul-led commented 5 months ago

@simbleau seems likely in this case, yes. rAF does not run when the tab is "unloaded" thus stopping packet polling via the iotaskpool. a solution based on promises / web- or serviceworkers would be ideal if this is confirmed.

Nul-led commented 5 months ago

Would actually be really easy to confirm. If this behavior happens with WebSocket Transport too, then we have our culprit i think :P

simbleau commented 5 months ago

I haven't actually tried anything more than the examples for lightyear. I'm waiting on #253 to really dive into using WT. Hopefully someone can confirm who has experience with Lightyear.

MOZGIII commented 5 months ago

Would actually be really easy to confirm. If this behavior happens with WebSocket Transport too, then we have our culprit i think :P

I now have examples for xwt itself - so another way would be to run those and check if they also demonstrate the same behaviour.

Nul-led commented 5 months ago

@MOZGIII it works

Nul-led commented 5 months ago

@cBournhonesque so RAF is indeed the culprit...

simbleau commented 5 months ago

@cBournhonesque so RAF is indeed the culprit...

What is RAF?

Nul-led commented 5 months ago

@cBournhonesque so RAF is indeed the culprit...

What is RAF?

@simbleau requestAnimationFrame aka the browsers frame scheduler

simbleau commented 5 months ago

Do we have a hypothetical solution or just have identified the problem?

cBournhonesque commented 5 months ago

I tried using wasm_bindgen_futures instead of BevyIoTaskPool (https://github.com/cBournhonesque/lightyear/pull/352) since the xwt example seems to be doing that: https://github.com/MOZGIII/xwt/blob/master/examples/microapp/client/src/main.rs#L14 but I still get disconnected on tab changes..

I don't really get the RAF part, but it might be because bevy still stops running when we switch tabs, which means that we stop sending/receiving keepalive packets because the netcode logic runs inside bevy. When we come back to the tab, the bevy system with netcode runs again, sees that the last packet received was >10sec ago and triggers a timeout.
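One way to picture the client-side half of this: a timeout check that notices when it has not been called for a long time itself and treats that as a local stall rather than as server silence. This is only a sketch in plain Rust, not how lightyear's netcode works; `TimeoutWatch`, `Verdict` and the `stall_threshold` parameter are hypothetical, and times are passed in as `Duration` since app start so the logic can be tested without a clock.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Alive,
    /// The check itself was paused (e.g. tab in background), so the silence is ours.
    LocalStall,
    TimedOut,
}

/// Hypothetical client-side idle-timeout tracker.
pub struct TimeoutWatch {
    timeout: Duration,
    /// Assumed tunable: gaps between updates at least this long count as a stall.
    stall_threshold: Duration,
    last_packet: Duration,
    last_update: Duration,
}

impl TimeoutWatch {
    pub fn new(timeout: Duration, stall_threshold: Duration, now: Duration) -> Self {
        Self { timeout, stall_threshold, last_packet: now, last_update: now }
    }

    /// Record that a packet arrived at time `now`.
    pub fn on_packet(&mut self, now: Duration) {
        self.last_packet = now;
    }

    /// Call once per frame / tick with the current time.
    pub fn update(&mut self, now: Duration) -> Verdict {
        let gap = now.saturating_sub(self.last_update);
        self.last_update = now;
        if gap >= self.stall_threshold {
            // We were not running, so we could not have seen packets anyway.
            // Restart the idle clock instead of declaring a timeout.
            self.last_packet = now;
            return Verdict::LocalStall;
        }
        if now.saturating_sub(self.last_packet) >= self.timeout {
            Verdict::TimedOut
        } else {
            Verdict::Alive
        }
    }
}
```

Note that this would only stop the client from timing itself out. The server still runs its own timeout while the client sends nothing, so on `LocalStall` the client would most likely have to go straight to a reconnect.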

Potential solutions:
1) play audio (needs to be tried, still) to force bevy systems to still run?
2) put more of the netcode logic outside of the bevy systems and inside the wasm_bindgen_futures::spawn_local task. For example we would keep sending keepalives, we would keep receiving packets (that stay buffered in an unbounded channel). When the bevy task restarts, it reads all the messages that have been buffered in the channel

2) doesn't even seem to work because then the client would have to process 1000s of frames' worth of updates when we open the tab again.

simbleau commented 5 months ago

the bevy system with netcode runs again, sees that the last packet received was >10sec ago and triggers a timeout.

... So this is a software timeout? I feel like we've asked that before and the answer was less clear than it is now. That's exactly why we filed the issue under W3C/WebTransport, since it was believed the behavior was from the browser's WebTransport runtime.

This feels really silly now.

Could we just disable the timeout in the bevy system? At the very least it seems reasonable for it to be configurable.

cBournhonesque commented 5 months ago

It is already configurable: https://github.com/cBournhonesque/lightyear/blob/30fe00a204bdff1f96cb9c04db9a664359a515b6/lightyear/src/client/config.rs#L25-L25 (for when the client generates the ConnectToken in Authentication::Manual, it's a bit confusing..) and https://github.com/cBournhonesque/lightyear/blob/30fe00a204bdff1f96cb9c04db9a664359a515b6/lightyear/src/server/config.rs#L18

It's just that having a very high value (20+ seconds) doesn't seem ideal. If a client disconnects suddenly (closes the tab), you would have to wait 20 seconds before the server is aware of the disconnection.

I also created an issue on bevy to potentially make the scheduler keep running bevy systems even if the tab is in the background: https://github.com/bevyengine/bevy/issues/13368

MOZGIII commented 5 months ago

Another possibility is bevy might be doing something to actively put itself (its wasm instance) on hold on tab switches.

MOZGIII commented 5 months ago

Can anyone make a simple / minimal guide on how to reproduce this issue?

cBournhonesque commented 5 months ago

I'm trying to make a simple example (without networking): https://github.com/bevyengine/bevy/pull/13370 (instructions to run wasm examples are in the readme). You can look at the counter log to see if the systems were running when the tab was in the background or not.

Nul-led commented 5 months ago

@MOZGIII rAF just holds indefinitely while the tab is inactive. That's known behavior. So that's determined to be the issue.

MOZGIII commented 5 months ago

Well, yes, for RAF that's expected. But why does it still break when the code is run using wasm_bindgen_futures?

Was there a miscommunication or confusion here of some sort?

MOZGIII commented 5 months ago

Ah, I read the issue. I am not sure you'd want that - to run bevy systems in the background... Might be better to extract the systems that need to run while in background into their own threads (or Promises, but not bevy tasks). That's what my architectural approach to this would be, at least.

Anyhow, if you need to run bevy systems specifically it could be solved by using/compositing multiple schedulers - in a way that you run some systems on RAF and some with fixed intervals. That would also make bevy tasks function. This could be something that's offered by bevy out of the box - but I'd recommend first experimenting with this locally, as whatever bevy upstream implements might still be suboptimal for lightyear's use case...

Nul-led commented 5 months ago

@MOZGIII i generally agree with this sentiment. currently waiting for a reply on https://discord.com/channels/691052431525675048/750833140746158140/1240028202739437588

cBournhonesque commented 5 months ago

Sorry I'm a bit slow... is this a good summary?

Potential solutions: A) add audio as a quick way to get unblocked. Bevy systems will still run unthrottled.

B) set the netcode timeout to a very long time as a quick way to get unblocked. The io tasks shouldn't timeout anymore since they still run in the background when spawned via wasm_bindgen_futures, if I understand correctly? or we can put the io tasks in a WebWorker if they are still throttled.

The issue is that the bevy systems will still be throttled on the client so:

the server would keep sending updates for every entity all the time, since we keep sending updates until we receive an ack.
when the main thread is unthrottled, the client would have to process all the received packets at once. (+ the buffer would overflow so some packets would be lost, etc.) Basically what @MOZGIII said here

C)

keep bevy systems running in the main thread, which is throttled
spawn the io-related tasks in a WebWorker (which would run in the background in an unthrottled manner even if the user switches tabs). The task would:
- send packets to the server. In practice there would be no packets from the game, since the game is throttled/paused. So instead we can just try to keep sending keep-alives. This is the hard part.
- receive packets from the io (webtransport) and store them in a buffer (bounded-channel).

Same issues as in B).

D) Handle disconnection/reconnection in your game.

disconnect the client but without despawning its controlled entities
when the client joins the tab again (can be detected via winit events), reconnect the client. We replicate the entire world to the client again. The game needs to detect that the new connection corresponds to the same client (maybe by keeping track of the ClientId in the ConnectToken).

It's already possible to disconnect/reconnect; so I guess this would be the best solution?

E) have some other way to force bevy systems to still run in an unthrottled manner. Relevant issue: https://github.com/bevyengine/bevy/issues/13368 Looks like it would probably be by putting the entire bevy app inside a webworker?

MOZGIII commented 5 months ago

I am thinking currently that having a separate, non-bevy world and ECS for game logic that is network-replicated is a good idea. It is definitely an option to add to the list above, because that thing can in theory run in a WebWorker and handle not only the packet buffering, but full processing of them.

The issue with this is that WebWorker to window communications can be prohibitively slow in terms of latency - in the 10s of milliseconds just to send a message. This is not great for any game - might be ok for some, but even there users could notice easily that the game is not very responsive. For other games that would be a hard blocker, I mean waay worse than freezes on tab switches.

So, for this crate, I'd suggest either building a portable core that can be used in any way - depending on the app needs, or supporting either one of in-WebWorker or in-window ways of running the networking, or explicitly both. What I mean is this is likely an important decision to select the target setup and optimize with that in mind.

cBournhonesque / lightyear

WASM examples broken if the user switches tabs #144