bevyengine / bevy

A refreshingly simple data-driven game engine built in Rust
https://bevyengine.org
Apache License 2.0
36.3k stars · 3.58k forks

WebAssembly multithreading tracking issue #4078

Open colepoirier opened 2 years ago

colepoirier commented 2 years ago

UPDATE: This is on hold while TaskPool and Scope are reworked

Motivation

Currently, Bevy can only run single-threaded on WebAssembly. Bevy's architecture was carefully designed to enable maximal parallelism so that it can utilize all cores available on a system. As of about six months ago, the stable versions of all browsers have shipped the web platform features needed to accomplish this (SharedArrayBuffer and related CORS security functionality). I think now is a good time to attempt to make Bevy run natively in the browser like it does on the desktop: fully multithreaded.

Three distinct tasks will enable this goal:

  1. Create modified versions of task_pool::{Scope, TaskPool, TaskPoolBuilder} that run on wasm, called wasm_task_pool::{Scope, TaskPool, TaskPoolBuilder}, and use those instead of the single_threaded_task_pool::{Scope, TaskPool, TaskPoolBuilder} (TODO: create issue and link here)
  2. Modify bevy_audio so that it runs multithreaded in the background (not on the main thread) on wasm (TODO: create issue and link here)
     a. Contribute the functionality needed to do background multithreaded audio using WebAssembly via the web platform's AudioWorklet API to the upstream dependencies of bevy_audio (TODO: create issue and link here):
        i. cpal (TODO: create issue and link here)
        ii. rodio (TODO: create issue and link here)
        iii. NOTE: as outlined below in the "Insights provided by developers who have tried to make things that run multithreaded on wasm" section, it may be necessary to PR wasm-bindgen to solve this issue: https://github.com/rustwasm/wasm-bindgen/issues/2367
     b. Make the necessary changes to bevy_audio to make use of this added upstream functionality (TODO: create issue and link here)
  3. Modify bevy_ecs, bevy_render, and the rest of Bevy to run multithreaded on wasm (TODO: create issue and link here)
     a. Insights from @alice-i-cecile below: "For the ECS, there are two relevant bits:"
        i. Our multi-threaded executor
        ii. Parallel iteration over queries

NOTE: as outlined below in the "Insights provided by developers who have tried to make things that run multithreaded on wasm" section, if we need to do something where we cannot use wasm-bindgen we will need to manually set the stack pointer in our code because this is one of the things wasm-bindgen does. @kettle11 has put this functionality into a tiny crate: https://github.com/kettle11/wasm_set_stack_pointer

Background on SharedArrayBuffer

There is a good reason that Bevy, and many of the existing projects that run on wasm, only run single-threaded. Shortly after the initial introduction of the SharedArrayBuffer web API - which would allow true unix-like pthread-style multithreading using wasm in the browser - the Spectre exploit was discovered.

Due to SharedArrayBuffer being a wrapper around shared memory, it was a particularly large vector for Spectre-style exploitation. In order to maintain their strong sandboxing security model, browsers decided to disable the feature while a proper solution was developed. Unfortunately, this eliminated the necessary functionality to allow true multithreading on wasm. What existed in the interim was a much slower emulation of threads using WebWorker message passing.

Thankfully, as of about six months ago all browsers have re-enabled a redesigned and secure version of SharedArrayBuffer. According to the article "Using WebAssembly threads from C, C++ and Rust" on the Chrome developers' blog (https://web.dev/webassembly-threads/), true pthread-style multithreading is now possible on wasm in all browsers, with the small caveat that users may need to write a small specialized JavaScript stub to get it working exactly in the manner they need. Given that it has been stable for this long, and that some Chrome developers have even published a GitHub repository with an implementation of this for rayon using wasm-bindgen, I think now is a good time to investigate how to make Bevy run natively in the browser like it does on the desktop, and to try implementing this to see if it will actually work.

Insights provided by developers who have tried to make things that run multithreaded on wasm

@kettle11 provided some good insights into quirks and solutions to multithreaded wasm on discord here on 19 November 2021:

""" In the past I got AudioWorklet based audio working with multithreaded Rust on web. It's certainly possible.

When working with wasm-bindgen it requires some messy code because wasm-bindgen uses the Javascript API TextDecoder which isn't supported on AudioWorklet threads. The way I got around that is by not using wasm-bindgen on the AudioWorklet thread, but that requires a few hacks:

Scanning the Wasm module imports and importing stub functions that do nothing for every Wasm-bindgen import. This is OK because the audio thread can be made to be pretty simple and avoid doing direct wasm-bindgen calls.

Allocating a stack and thread local storage for the worker. wasm-bindgen's entry-point does this normally, but wasm-bindgen's entry point also calls the main function which we don't want for the AudioWorklet thread. So we need to use our own entry point and manually set up the stack / thread local storage.

I opened a wasm-bindgen issue about the TextDecoder thing about a year ago: https://github.com/rustwasm/wasm-bindgen/issues/2367

Also wasm-bindgen solves the "how do we set the stack pointer?" issue by preprocessing the Wasm binary and inserting the stack allocation code, but I found a way to do it without that which I put together into a tiny crate: https://github.com/kettle11/wasm_set_stack_pointer """

Resources

colepoirier commented 2 years ago

Resource: https://github.com/GoogleChromeLabs/wasm-bindgen-rayon

alice-i-cecile commented 2 years ago

For the ECS, there are two relevant bits:

  1. Our multi-threaded executor.
  2. Parallel iteration over queries.
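
The second of those two bits can be illustrated with a plain-Rust sketch: split a component buffer into chunks and run a function over each chunk on its own scoped thread. This is only a stand-in for Bevy's task pool (all names here are hypothetical), and on wasm the threads would have to be backed by web workers:

```rust
use std::thread;

// Hypothetical sketch of parallel iteration over a flat component buffer,
// using std::thread::scope in place of Bevy's TaskPool.
fn par_for_each(components: &mut [u32], threads: usize, f: impl Fn(&mut u32) + Sync) {
    let chunk = components.len().div_ceil(threads.max(1)).max(1);
    let f = &f;
    thread::scope(|s| {
        for slice in components.chunks_mut(chunk) {
            // Each scoped thread gets exclusive access to its own chunk.
            s.spawn(move || slice.iter_mut().for_each(f));
        }
    });
}

fn main() {
    let mut healths: Vec<u32> = vec![10; 8];
    par_for_each(&mut healths, 4, |h| *h += 1);
    assert!(healths.iter().all(|&h| h == 11));
}
```

The real executor also has to respect system ordering and archetype access conflicts; this only shows the data-parallel half.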
kettle11 commented 2 years ago

Another thing to be aware of is that WebAssembly memory objects that are shared cannot be resized. They must declare an "initial" and "maximum" size.

There's good discussion of some of the quirks that introduces in this thread: https://github.com/WebAssembly/design/issues/1397

colepoirier commented 2 years ago

> Another thing to be aware of is that WebAssembly memory objects that are shared cannot be resized. They must declare an "initial" and "maximum" size.
>
> There's good discussion of some of the quirks that introduces in this thread: WebAssembly/design#1397

Thanks for the heads up! Oh boy is that ever a can of worms; I think I’ll defer the wasm memory stuff, keeping this as only a wasm multithreading MVP, and hopefully let someone else come up with a strategy for dealing with that. I will definitely add it to #4279, as this is a pretty big thing to keep in mind and investigate as we work on bevy’s Web UX story.

paul-hansen commented 2 years ago

Since SharedArrayBuffer requires some CORS headers, I made a replacement for basic-http-server that allows setting headers. You might find it useful here for testing and running the examples when work on this resumes: https://crates.io/crates/http-serve-folder

Example with the headers needed for SharedArrayBuffer:

```sh
cargo install http-serve-folder
http-serve-folder --header "Cross-Origin-Opener-Policy: same-origin" --header "Cross-Origin-Embedder-Policy: require-corp" wasm/
```

colepoirier commented 2 years ago

This is really nice, thanks for sharing it here!

smessmer commented 1 year ago

What's the state on this? Has there been progress since it was put on hold half a year ago?

hymm commented 1 year ago

No real progress. No one is too motivated to do anything, since the memory model for shared array buffer is going to make it very hard to work with the ecs.

TotalKrill commented 1 year ago

Reading through the discussion from the Unity devs linked above, it seems that the issue is mainly a blocker for mobile devices.

Also hopeful: the issue had some progress only 2 weeks ago, with what from my limited understanding seems to be some kind of wasm equivalent of free(). This means it should be possible to resize memory?

From how I am reading it, there really isn't much blocking multithreaded wasm for desktop?

kettle11 commented 1 year ago

I spent the last few hours and wrote some very sloppy code that shows a few key areas Bevy needs to change to get Wasm multithreading support: https://github.com/kettle11/bevy/commit/c8c2eb51a872f18ede40bbef7055d9b45b29acb6

The code spawns web workers instead of threads and appears to almost work. Sometimes it will run for a few moments with multiple threads and successfully not crash! These issues prevent it from fully working:

I'm sure there are other issues once those two are resolved. That said, in my opinion there's no hard technical barrier preventing Bevy from being multi-threaded on web.


Snippets of the above code were taken from this blog-post: https://www.tweag.io/blog/2022-11-24-wasm-threads-and-messages/

allsey87 commented 1 year ago

The second issue that you found could be worked around by running Bevy on a web worker itself, using OffscreenCanvas for the rendering and using postMessage to forward keyboard and mouse events from the main event loop.

DanMcgraw commented 1 year ago

> The second issue that you found could be worked around by running Bevy on a web worker itself, using OffscreenCanvas for the rendering and using postMessage to forward keyboard and mouse events from the main event loop.

Would adding that overhead to input be noticeable?

allsey87 commented 1 year ago

There is a good write up here about the performance of postMessage: https://surma.dev/things/is-postmessage-slow/

It depends on the size of the payload, but it seems anything up to 10kb would take less than a millisecond.

TotalKrill commented 1 year ago

I assume that less than a millisecond is acceptable latency for input, but yes, it's another addition to the total. Input latency today is horrible compared to the days before USB.

So percentage-wise and usage-wise I doubt it will be very noticeable at all :)

TotalKrill commented 1 year ago

I tried to put the entire Bevy app in a web worker to try to solve the issue with async-executor, but then I ran into issues due to what I guess is winit trying to do things with document that aren't allowed from within a web worker.

So maybe putting only async-executor into a web worker could be a feasible next solution.

Also, since some server headers are required for web worker functionality (as in @kettle11's run script), and I could not get his devserver to work (some kind of dependency issue), I rolled my own bloated dev server based on Rocket. It can be found here: https://github.com/TotalKrill/devserv

TotalKrill commented 1 year ago

I rebased @kettle11's work to see if anything had changed with all the changes in Bevy main. It doesn't crash, but it doesn't do anything after initializing all the workers either. I could not see anything in the logs...

Heres the rebased branch if anyone is curious: https://github.com/TotalKrill/bevy/tree/wasm_multithread

cormac-ainc commented 1 year ago

I agree with https://github.com/bevyengine/bevy/issues/4078#issuecomment-1472634600. I don't know whether resizing shared memory was truly a problem in 2021, but it certainly isn't now. Shared memory has an initial size, a maximum size, and is growable up to the maximum. There are zero functional limitations compared to non-shared memory, and the API is almost identical except that shared memory must have a maximum size set. Any single-threaded Bevy/wasm app that currently exists already has a maximum size. Wasm-bindgen sets one unconditionally, with a default maximum of 1GiB and an absolute max of 4GiB for wasm32 of course. (I detail how to change the maximum below.) The only practical difference is when (not if) you see an error when you try to set the maximum very high in a 32-bit browser. More details below if you want to learn more.

The only thing missing from the story in 2021 would have been browser support. Safari implemented shared memory in late 2021, and as of that moment, all major browsers do (with the HTTP headers of course).

The things to be concerned about remain spinning up workers at all, avoiding locking the main thread, the other things listed in wasm-bindgen's section on the caveats, and any other DOM objects or Web APIs that can't be used from a web worker (like HtmlCanvasElement directly instead of via OffscreenCanvas). Regarding canvases specifically, winit 0.29 beta supports the main thread + web workers scenario (https://github.com/rust-windowing/winit/pull/2778 and https://github.com/rust-windowing/winit/pull/2834) and Bevy will be able to take advantage of that if it can ship an OffscreenCanvas to a renderer web worker.

In summary, Bevy multithreading on wasm is probably much closer than it has seemed.

I have attempted to clear up any lingering confusion in some detail below.

1. **Shared Memory instances can be resized.** You can call grow() on them like normal, and your favourite wasm-supporting allocator will do just that. If you initialise a Memory instance as shared with an initial and maximum # of pages, then browsers just mmap a bunch of contiguous address space, and only commit the initial size. This will be the case on any OS/browser combo that is capable of playing a game.

2. **Growing a shared Memory [works on all major browsers](https://caniuse.com/mdn-javascript_builtins_webassembly_memory_memory_shared)** since Safari caught up and implemented shared memory in 2021. This is despite SharedArrayBuffer.grow not being available in some browsers (Firefox)! I wouldn't be surprised if people used WebAssembly.Memory as a polyfill for SharedArrayBuffer that supports grow().

3. **There is no shrink function on either shared or non-shared Memory instances.** There is no need to wait for the ability to shrink the Memory or otherwise free up resources in memory-constrained environments. Nearly every single wasm module that has ever run in a browser, including all Bevy/wasm apps that currently exist, has this problem.

4. **The only difference with shared Memory is that the constructor needs a maximum count.** The default is 1GiB for wasm-bindgen projects that emit `new WebAssembly.Memory(...)` in the glue code. Rustc can set one via LLD's `wasm-ld` linker (the default) and `-C link-arg=--max-memory=XXX`, where XXX is in bytes and must be a multiple of the page size. wasm-bindgen has the env var `WASM_BINDGEN_THREADS_MAX_MEMORY` for the same purpose but won't override rustc.

   ```sh
   echo 'fn main() {}' | rustc --target wasm32-unknown-unknown -O - \
     -C link-arg=--max-memory=65536000 \
     -C target-feature=+atomics,+bulk-memory,+mutable-globals # enable threading
   wasm-dis ./rust_out.wasm | grep -F '(memory'
   ```

5. Here's an example of creating a shared Memory with a **maximum of 4GiB (aka the maximum addressable space on wasm32)** and growing it. Copy/paste the script from [this gist](https://gist.github.com/cormac-ainc/f09a852bdb6d63a74e22642047bfbdcc) into your dev tools console. **It works great in 64-bit browsers, because the maximum is just the amount of virtual address space to map, and virtual address space is very cheap on 64-bit systems.** It does not "use" that much memory until you grow() to full size and touch all the pages. Even on Windows, where committing is more eager, simply creating a Memory will not commit a full 4GiB of pages to that space unless you set initial = 4GiB as well. Firefox on Linux/x86_64 shows 224KiB of memory used by the underlying SharedArrayBuffer after touching those few pages. Chrome manages it in 66KiB. Note that a normal 64-bit Chrome renderer process maps about 1TB (yes, 1TB) of virtual address space, and a normal Chrome instance has a dozen renderer processes. This is very normal on 64-bit systems.

6. **Mapping large chunks of contiguous virtual address space will fail a lot in 32-bit browsers.** 4GiB mappings will always fail. **It's up to developers to decide whether they need their apps to run in 32-bit browsers.** I think it's perfectly legitimate to make a choice to avail yourself of multiple GiB, even the full 4GiB, and ditch 32-bit browsers. Supporting only 64-bit systems is a very normal limitation for games. Mobile game companies are not trampled on by Bevy allowing them to choose shared memory and prohibitively large memory mappings. Currently rustc + wasm-bindgen let you choose whether to use shared memory, and how big to make the maximum size (= the mapping size with shmem). Everything is opt-in.

7. **The problems in https://github.com/WebAssembly/design/issues/1397 do not really apply here.** In my view they are specific to 32-bit systems like (usually older builds of) Chrome on Android, and the prospect of allocations failing due to lack of address space to map. To be clear, they are valid concerns, but they are not really relevant unless you are targeting 32-bit browsers, and, contrary to what you might have gleaned from this issue, **they are less of a problem if you use shared memory**. Let me recontextualise some of the points:

   > in the wild we have reports that memory allocation success rate can be better when initially allocate K MB, versus if you first allocate less, and later try to grow [...]
   >
   > since an application will need to account for the largest memory usage it may need (or it will fail at some point of its lifetime), practically initial == maximum memory.

   The author, presumably using non-shared memory, wanted to force browsers to map all the address space at once in order to reduce failures in Memory.grow() due to 32-bit virtual address space fragmentation. That fragmentation probably came from the previous, smaller, non-shared ArrayBuffer(s) still being mapped. But SharedArrayBuffer and shared Memory _always_ map the maximum size upfront, only committing initial. So there are never any previous mappings to fragment the space! You are free to set initial as low as you like, as only the maximum has any impact on whether mapping the address space succeeds. So shared Memory is, if anything, more resizable than non-shared Memory once you've reserved the address space.

   > one cannot set a gratuitous upper bound, since that can fail the allocation

   You can. I did it in the gist above. This is a concern that's specific to 32-bit systems, which is indeed more pronounced when shared memory tries to map the maximum upfront. Not everyone cares about that, as I have said.

   > if shared memory is used, one does need to know an upper bound for the maximum memory usage.

   Not if you set the upper bound gratuitously.

8. **For people who are concerned about resource-constrained environments**, whether they are 32- or 64-bit, there are proposals floating about for a `memory.discard()` to release pages back to the OS until they're next touched, i.e. the opposite of committing them, so this reduces actual memory usage. The SpiderMonkey design would enable discarding a single 64KiB page at a time, for use in an allocator when such a 64KiB page is found to be completely free (behind a flag in FF Nightly: https://github.com/WebAssembly/memory-control/issues/6). Those pages can be reused later without any API calls, as long as the allocator doesn't think they're gone forever. This would require updating rust-dlmalloc, but that would result in solving a big issue with every Rust wasm project out there.

In all, shared memory is fine. I don't see any cans of worms. There is no need to do anything special in Bevy to handle resource limits, especially not within the ECS. Any wasm project should choose an appropriate amount of virtual address space to map, let the global allocator panic on OOM, and call it a day. I barely even think it's Bevy's job to tell people not to attempt to map 1GiB+ of memory if they want their apps to run on 32-bit systems, but you can list a caveat for that with some notes about which browsers that's likely to be. Frankly shared memory is a better choice because it lets you fail faster while choosing a maximum size.
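
One concrete detail worth calling out: the `--max-memory` value must be a multiple of the 64 KiB wasm page size. A tiny hypothetical helper (not from the thread) that rounds a desired maximum up to a valid value:

```rust
// 64 KiB per WebAssembly page; --max-memory must be a multiple of this.
const WASM_PAGE_BYTES: u64 = 64 * 1024;

// Hypothetical helper: round a desired maximum heap size up to a valid
// `-C link-arg=--max-memory=...` value (bytes, page-aligned).
fn max_memory_arg(desired_bytes: u64) -> u64 {
    desired_bytes.div_ceil(WASM_PAGE_BYTES) * WASM_PAGE_BYTES
}

fn main() {
    // 1 GiB (the wasm-bindgen default maximum) is already page-aligned.
    assert_eq!(max_memory_arg(1 << 30), 1 << 30);
    // The 65_536_000 value used in the rustc example is exactly 1000 pages.
    assert_eq!(max_memory_arg(65_536_000), 65_536_000);
    // Anything else rounds up to the next page boundary.
    assert_eq!(max_memory_arg(65_537), 2 * WASM_PAGE_BYTES);
}
```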
Elabajaba commented 1 year ago

Wgpu/Bevy's renderer isn't thread-safe on wasm (wgpu just used to lie about it before wgpu 0.17 because wasm threading wasn't really a thing; it's thread-safe on native, just not on wasm).

If you want to test whether your multithreading actually works with the renderer on wasm, you need to remove the "fragile-send-sync-non-atomic-wasm" feature here: https://github.com/bevyengine/bevy/blob/de8a6007b7df5bd961511cd321344157fb4b531f/crates/bevy_render/Cargo.toml#L63, fix all the errors (without breaking threading on native backends), then see if it works.

It sounds like it might be possible to run the renderer in a web worker and have it work (as long as you pin it to that web worker and don't reference its resources from other threads)?

edit: Tracking issue for renderer https://github.com/bevyengine/bevy/issues/9304
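
The "pin it to one worker" idea can be sketched in plain Rust as a dedicated owner thread that creates the non-shareable resource locally and only ever receives commands over a channel. All names here are hypothetical; on the web the channel would be postMessage and the thread a web worker:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a resource that must never cross threads (a hypothetical
// analogue of wgpu's wasm handles): only the owning thread ever touches it.
struct Renderer {
    frames: u32,
}

enum Command {
    Draw,
    Shutdown,
}

// Spawn an "owner" thread that creates the renderer locally and processes
// commands sent over a channel; returns the final frame count.
fn run_on_owner_thread(draw_calls: u32) -> u32 {
    let (tx, rx) = mpsc::channel();
    let owner = thread::spawn(move || {
        let mut renderer = Renderer { frames: 0 }; // created on the owner thread
        while let Ok(cmd) = rx.recv() {
            match cmd {
                Command::Draw => renderer.frames += 1,
                Command::Shutdown => break,
            }
        }
        renderer.frames
    });
    for _ in 0..draw_calls {
        tx.send(Command::Draw).unwrap();
    }
    tx.send(Command::Shutdown).unwrap();
    owner.join().unwrap()
}

fn main() {
    assert_eq!(run_on_owner_thread(3), 3);
}
```

Because the `Renderer` is constructed inside the owner thread and never sent anywhere, no `Send`/`Sync` bound on it is ever required.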

kettle11 commented 7 months ago

With https://github.com/bevyengine/bevy/pull/12205 a first step towards getting Bevy multithreaded on web has been merged. Now it is possible to build a Bevy project with multithreading enabled, even if Bevy internals are not yet multithreaded.

A short guide for how to try it out yourself:

  1. Install and use a nightly version of Rust.
  2. Use / adapt the following script to build your Wasm project:

     ```sh
     set -e
     RUSTFLAGS='-C target-feature=+atomics,+bulk-memory' \
         cargo build --example breakout --target wasm32-unknown-unknown -Z build-std=std,panic_abort --release
     wasm-bindgen --out-name wasm_example \
         --out-dir examples/wasm/target \
         --target web target/wasm32-unknown-unknown/release/examples/breakout.wasm
     ```

     RUSTFLAGS explanation: Rust's default Wasm target `wasm32-unknown-unknown` does not support multithreaded primitives out of the box. To enable them, the standard library needs to be rebuilt with the `atomics` flag enabled. Only nightly Rust supports building the standard library.
  3. Use a server to host your project with the correct header flags set. Note: using devserver specifically is not required; any server that can set these CORS headers can be used.

     ```sh
     devserver --header Cross-Origin-Opener-Policy='same-origin' --header Cross-Origin-Embedder-Policy='require-corp' --path examples/wasm
     ```

     CORS explanation: when Rust is compiled for Wasm with `atomics` enabled, it uses the JavaScript type `SharedArrayBuffer` as its backing memory. To mitigate security vulnerabilities, browsers only allow using `SharedArrayBuffer` when certain HTTP headers are set *server-side*. Those headers are known as "Cross-Origin Resource Sharing" or CORS headers. Most HTTP server libraries (like `devserver` used above) have ways to set them. Learn more here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements

Now, to run some work on another thread, you can use a crate like wasm_thread:

```rust
use std::time::Duration;

use wasm_thread as thread;

thread::spawn(|| {
    for i in 1..3 {
        log::info!("hi number {} from the spawned thread {:?}!", i, thread::current().id());
        thread::sleep(Duration::from_millis(1));
    }
});
```

Important: The browser forbids blocking on the main thread, so take care never to call code on the main thread that will block / wait on another thread. If you absolutely need a workaround, you can busy-loop instead, as Rust's memory allocator itself does.
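
A minimal native-Rust sketch of that busy-loop fallback: spin on an atomic flag with a spin hint rather than blocking in join() or a condvar. This is illustrative only; a real wasm build would spawn via wasm_thread, and busy-waiting on the main thread should still be kept very short:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Wait for a worker's result by spinning on an atomic instead of calling
// join() or Condvar::wait, which would block the waiting thread.
fn spawn_and_spin_wait() -> u32 {
    let result = Arc::new(AtomicU32::new(0));
    let worker_result = result.clone();
    thread::spawn(move || {
        // ... some work happens off the waiting thread ...
        worker_result.store(42, Ordering::Release);
    });
    while result.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop(); // burn CPU instead of blocking
    }
    result.load(Ordering::Acquire)
}

fn main() {
    assert_eq!(spawn_and_spin_wait(), 42);
}
```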

You can also use crates like rayon to automatically parallelize iterators. For that, look to the wasm-bindgen-rayon crate, then use rayon as normal.

james7132 commented 7 months ago

Note that the blocker on getting async-executor to properly initialize on multithreaded wasm should be resolved with https://github.com/smol-rs/async-executor/pull/108.

koteelok commented 5 months ago

Any progress on that?