AmbientRun / Ambient

The multiplayer game engine
https://ambient.run
Apache License 2.0

Wasm web runtime performance #1131

Closed ten3roberts closed 10 months ago

ten3roberts commented 11 months ago

wasm-bridge currently employs JCO to facilitate conversion of a WebAssembly component into a module and JavaScript bindings to call the WebAssembly interface, much like wasm-bindgen.

This allows calling a wasm function using a JavaScript object such as { width: 10, height: 16.4, pos: { x: f32, y: f32 } }. This makes it easy to call a wasm module with a WIT interface from JavaScript, and facilitates easy adoption and integration of WebAssembly in JavaScript codebases.

Motivation

At the guest/host boundary, the Rust arguments are converted to JavaScript objects, which are passed through the JCO bindings. They are then destructured and converted into the appropriate calling convention for use with the C-ABI wasm function.

This approach makes heavy use of dynamic object and list access and construction, which requires many JavaScript roundtrips and allocations per compound type, as well as increased memory bandwidth from the overhead of JavaScript objects, data duplication, and deep cloning. This is why component-query-eval and garbage collection in hot loops show up so prominently in the profiling, along with Vec constructions: each transfer of a query's results requires hundreds of JS roundtrips and calls to Reflect.set (and later Array.set) per item, followed by equivalent linear-time work when JCO destructures the objects back into pointers, traverses them, and allocates them once again into linear memory for the destination wasm.

Proposal

I propose implementing a custom transfer method which will allow almost direct transfer of data across the host <=> guest boundary. This will increase the throughput of the boundary and, most notably, will reduce string decoding and encoding, Array.set calls, and allocations, as data can be transferred as memory directly rather than going through an idiomatic JS representation. This will improve the performance of our user-side scripts, give much more headroom for processing a frame in time, and improve the frame rate.

Implementation

WebAssembly functions can only work with stack-based primitives, such as integers and floats. As such, transferring and calling functions with composite types such as structs, enums, vectors, or strings is not possible directly.

wit-bindgen is used to transform these rich guest functions into primitive C-ABI functions, which then call the user's rich version.

Structs and other composite types are flattened into separate arguments in the function itself, so a function which takes a single struct as an argument will thus need to take n arguments corresponding to each field of the struct. The original struct is then reassembled before your implementation is called.

Vectors, strings, and other heap objects are converted into an allocation in the local wasm linear memory, plus two arguments, a pointer and a length, passed like struct fields as two separate arguments. wit-bindgen will then reconstruct your vector or string from the pointer and length. The pointer points to newly allocated owned memory within the wasm memory itself.
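As an illustration, here is a minimal, hypothetical sketch of the struct flattening described above (the real shims are generated by wit-bindgen; names here are invented):

```rust
// Hypothetical sketch of how a struct argument is flattened into
// primitive C-ABI arguments and reassembled before the rich
// implementation runs. In reality wit-bindgen generates these shims.
struct Rectangle {
    width: u32,
    height: u32,
}

// The rich function the user writes against the wit interface.
fn area(r: Rectangle) -> u32 {
    r.width * r.height
}

// The flattened shim: one primitive argument per struct field.
extern "C" fn area_flat(width: u32, height: u32) -> u32 {
    // Reassemble the original struct, then call the rich version.
    area(Rectangle { width, height })
}

fn main() {
    println!("{}", area_flat(10, 3)); // prints 30
}
```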

Noteworthy is that pointers cannot be shared between wasm modules or the host due to the sandboxing. The data must first be allocated and copied into the target module as an owned memory buffer, as mentioned above.

Vectors of structs or vectors of vectors work similarly and the same rules apply recursively.

For this to work, the host implementation is thus responsible for being the all-caring parent: nicely allocating memory inside the guest wasm, copying the relevant data into it, and then handing the new pointer to the function being called. Essentially, setting the table for the guest.

This is what wasmtime does natively. On JavaScript, there is no change whatsoever on the guest side. To handle the host side of the bargain, wasm-bridge employs the help of JCO, which takes care of allocating memory and filling in the arguments, as well as the memory buffers, using primitives through the various typed arrays (Uint8Array, Uint32Array).

For example, a Vec<Rectangle> with Rectangle { width: u32, height: u32 } will be transferred by allocating a heap buffer of suitable size and then pushing two u32s for each rectangle. The pointer and the length of the buffer are then passed over the FFI boundary.

These are the steps which JCO provides by generating a JavaScript polyfill, so that you can interact with the wasm component using idiomatic JavaScript objects and arrays. This is intended for using and calling WebAssembly conveniently from JavaScript as a user, say if you are writing a JavaScript website using React and need a high-performance WebAssembly module for e.g. an interactive canvas.

This human-friendly format is not needed when the caller is a Rust crate. It only adds steps and work, as Rust types on the host need to be converted into JavaScript objects through what is essentially JSON serialization, but to an object instead of a string. These objects and arrays are then destructured by JCO to fill the byte buffers.

For example, a list of the aforementioned rectangles goes through these steps:

  1. Rust: [u32, u32, u32, u32...] (Vec<Rectangle> memory layout)
  2. Js: [ { width: Integer, height: Integer }, { width: Integer, height: Integer }, ...] Constructed incrementally using dynamic lookups and sets.
  3. Js: ArrayBuffer [ u32, u32, u32, u32 ] This is done by iterating each element in the array, destructuring each and every struct into its fields, and casting them to integers. This code is generated from the wit interface as well.
  4. Rust(guest): Vec::from_raw_parts(pointer, len) => [ u32, u32, u32, u32 ] This step is done entirely in the guest code using wit-bindgen when the module is compiled. This is agnostic to the runtime, as the host above makes sure to fill a byte buffer using the expected memory layout and endianness.
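The steps above can be sketched end to end in plain Rust (simulated; step 2 is deliberately elided, since it is precisely the JavaScript detour this proposal removes):

```rust
// Simulated sketch of steps 1, 3, and 4: flattening a Vec<Rectangle>
// into a u32 buffer on the host side and reconstructing it on the
// guest side. The JS object step (2) is omitted on purpose.
#[derive(Debug, Clone, PartialEq)]
struct Rectangle {
    width: u32,
    height: u32,
}

// Host side: lower the list into a flat buffer whose (pointer, length)
// would cross the FFI boundary.
fn lower(rects: &[Rectangle]) -> Vec<u32> {
    let mut buf = Vec::with_capacity(rects.len() * 2);
    for r in rects {
        buf.push(r.width);
        buf.push(r.height);
    }
    buf
}

// Guest side: lift the flat buffer back into a Vec<Rectangle>, as
// wit-bindgen-generated code would from (pointer, len).
fn lift(flat: &[u32]) -> Vec<Rectangle> {
    flat.chunks_exact(2)
        .map(|c| Rectangle { width: c[0], height: c[1] })
        .collect()
}

fn main() {
    let rects = vec![
        Rectangle { width: 10, height: 16 },
        Rectangle { width: 3, height: 4 },
    ];
    let flat = lower(&rects);
    assert_eq!(flat, [10, 16, 3, 4]);
    assert_eq!(lift(&flat), rects);
    println!("roundtrip ok");
}
```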

We managed to improve performance multifold (about 5x when measuring tangent) by changing the JavaScript step from [ { k:v, k:v, ... }, { ... }, ... ] to [ [v, v, ...], [...], [...] ], which drastically reduced reflection and dynamic string lookups, as we omitted the field names.

However, this required us to fork JCO to use this format instead, which breaks the public API: users can no longer interact with it using JavaScript objects with named fields through e.g. Node or JS. As such, it cannot be upstreamed. When I did this I knew very well that it was a hack, as we essentially baked an implementation detail, the field ordering, into JCO itself.

The new solution will be to omit the JavaScript polyfill entirely and instead write the ArrayBuffer from Rust directly using the appropriate JavaScript APIs. We will not be able to get around JavaScript completely, as we need to call a JavaScript function to write to an ArrayBuffer, in much the same way that we bind to JavaScript to get the current time or date, or to use WebGPU. There is unfortunately no way around that, as JS is effectively our system interface. However, this is vastly more efficient than the current solution and does not require additional allocation or encoding. Think of it like a system or kernel call.

All the APIs required to make this possible exist as of today, and are in fact already employed by wgpu: it uses an ArrayBuffer to write POD structs as raw bytes from our Rust renderer to the JavaScript WebGPU buffer, and then ultimately to the GPU.

See https://rustwasm.github.io/wasm-bindgen/examples/hello-world.html for a brief introduction of why wasm-bindgen is needed.

Workflows

In order to keep the work on track, we will need to construct a benchmark so that we can test along the way that the new solution works and gives the expected performance improvements. This will be achieved with a normal example plus a simple frametime UI.

Implementing this in wasm-bridge is not an issue, since we are already collaborators with the author and have permission to modify the crate and maintain it.

DouglasDwyer commented 11 months ago

Hey,

I had a number of problems with the wasm-bridge API (and its reliance on JCO) so I took it upon myself to create my own implementation of WASM components. The crate is called wasm_component_layer. It also tries to mirror the wasmtime API, but it is runtime agnostic - so you can use whatever WASM executor that you'd like with it. Given the problems you describe here, maybe wasm_component_layer would be a better fit for this project than wasm-bridge? It's totally up to you, I just figure it might be helpful to know about another option :)

ten3roberts commented 11 months ago

This is very interesting, as I just finished having a discussion with a colleague about using your crate instead.

What a coincidence 😄

Does wasm_component_layer already support javascript backend (web) through wasm_runtime_layer?

It doesn't look like it yet, but I'll be more than happy to provide that backend, which we can make use of, along with a wasi implementation which works for the web

DouglasDwyer commented 11 months ago

Haha that's a crazy coincidence for sure. The wasm_component_layer crate is still missing a few things, but in my opinion building a runtime-agnostic component model is the way to go. Much less work to maintain, long-term.

To answer your questions:

Does wasm_component_layer already support javascript backend

No, not yet. It does support wasmi, which is an interpreter and can run on the web. But if you want performance, then it makes sense to implement a Javascript backend. That should not be problematic.

along with a wasi implementation which works for the web

wasm_component_layer does not innately have support for WASI, but I would hope that converting an existing implementation would not be too difficult.

The one other drawback that I want to highlight is that my crate handles lifting/lowering very differently from wasmtime and wasm-bridge. I do lifting and lowering using enums at runtime rather than compile-time traits. This makes it easier to define types at runtime, but will mean API differences if you port to this crate.
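For a rough idea of the difference, here is a hypothetical sketch of enum-based (runtime) lowering; the actual types in wasm_component_layer are richer and named differently, so this is illustrative only:

```rust
// Hypothetical sketch of runtime lifting/lowering via an enum, as
// opposed to wasmtime's compile-time traits. The real value type in
// wasm_component_layer differs; names here are invented.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum Value {
    U32(u32),
    F64(f64),
    String(String),
    List(Vec<Value>),
    Record(Vec<(String, Value)>),
}

// Lowering a Rust struct into the dynamic representation at runtime.
// Field names travel as data instead of being monomorphized away.
fn lower_rectangle(width: u32, height: u32) -> Value {
    Value::Record(vec![
        ("width".to_string(), Value::U32(width)),
        ("height".to_string(), Value::U32(height)),
    ])
}

fn main() {
    let v = lower_rectangle(10, 16);
    // The enum is matched at runtime rather than resolved at compile
    // time, which costs a branch but keeps types definable at runtime.
    if let Value::Record(fields) = &v {
        println!("fields: {}", fields.len()); // prints 2
    }
}
```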

DouglasDwyer commented 11 months ago

Oh! One other thing, as it says in the docs for my crate, I do not yet have a macro for wit_bindgen. So if you are using that heavily, it will require some additional work to either implement the macro (or remove usage of it from your codebase). Hopefully, this gives you a better idea of the benefits/drawbacks of switching. Let me know if you have any other questions that I can answer :)

ten3roberts commented 11 months ago

Ok, that sounds amazing.

As for the lifting and lowering that I saw at a glance, I think it is doable.

I ran into a corner today when trying to parse and load a component manually, without JCO, inside wasm-bridge; the code was already much too coupled to JCO for doing wasm-to-wasm directly.

The primary performance bottleneck is, as said, the JavaScript object creation, which easily takes 80% of the frame's time. The enum approach will incur some overhead compared to compile-time resolution, but it will be far from the overhead of dynamic JS objects. If it proves to be an issue, a hybrid solution can be made in the future.

I'll go ahead and start working on a web backend, and then open up a PR 😊

The structure and "driver-like" traits of your crate seem very well organized and easy to work with.

@philpax do you want to provide any more input on this?

philpax commented 11 months ago

Nothing to add from me - just very glad to see we have a way forward for this :)

ten3roberts commented 10 months ago

When switching to wasm_component_layer and wasm_runtime_layer we will get the wasmtime/web abstraction for free, as well as the very gnarly component parsing, which was previously done by either wasmtime_component or jco.

However, as there is no such thing as a free lunch, there are additional things we will need to do.

In addition to this, there are further smaller improvements we can make in the new implementation to squeeze more performance:

Though before any of these are started, we need to do the above and measure to validate that they improve performance over the wasm_component_layer baseline.

Tagging you @philpax just to make sure you see this :)

ten3roberts commented 10 months ago

As you all know, this has been quite the journey.

Mithun said in the beginning of this that implementing this would likely require us to implement some sort of WebAssembly compiler and make a subset of wasmtime, and he was correct.

We expected this to be large, as Nuno rightfully put the X-large label on it, and we were correct.

We initially started by moving away from JCO within wasm-bridge. However, it soon turned out to be too complicated to separate out the JS glue and the assumptions about JCO that were deeply ingrained throughout the codebase, from the very bottom up. With the help of Mithun I stepped back and replanned my approach, as attacking this from any angle was nigh impossible.

Thankfully, through my ever-watchful eye on the Rust ecosystem, I had seen in passing a new crate that abstracted over the component model and layered it on top of the two current WebAssembly runtimes.

After thinking through my approach, and realizing the immense rewrite that removing and re-implementing a JCO equivalent would entail, along with the inability to do it incrementally or even compile it beforehand, I came to the conclusion that writing a WebAssembly runtime within the aforementioned abstraction crate would be a much better approach, one that also provides better piping and flow of data. An additional bonus is that it also solves another problem of ours: conditionally using either wasmtime or the web depending on the platform. We previously used cfg guards for this, plus a lot of traits and unsafe code trying to cram things into the right type. This crate already provides an abstraction layer between wasmtime and wasmi, so by adding the web as a third backend we get the wasmtime abstraction for free.

There was a lot of code and a lot of different specs to read, learn, and implement. As previously alluded to, we had to parse the modules ourselves to figure out imports, exports, tables, and signatures, essentially making a subset of wasmtime. Despite all this, and having to learn the binary spec for WebAssembly modules and then the component model, it is all in working order.

I have managed to get a component level communication between the host and guest.

What this means is that it is now possible to load, parse, and compile the core WebAssembly modules from the component module binary, and then instantiate them by linking them together with their sibling modules and all the external host exports.

We can then invoke any function that the guest module exports and pass arguments and receive return types such as primitive integers and floats, as well as vectors and strings.

Passing a string is one of the most involved parts of the runtime, as the C-ABI that wasm modules operate on only allows integral and floating-point arguments, and a single primitive return value.

To pass a string, or another list type, we first need to call a guest export called cabi_realloc to allocate a block of memory in the guest, and get an offset pointer to it in return. We then use this offset to write the string's or list's bytes into the guest's memory (which, importantly, is a different address space from the host's memory).
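The flow above can be sketched as follows. This is a simulation: the guest's linear memory is a plain Vec<u8>, and cabi_realloc is simplified (the real canonical-ABI export also takes old pointer, old size, and alignment):

```rust
// Simulated sketch of lowering a string into a guest: ask the guest's
// cabi_realloc export for a block of linear memory, memcpy the UTF-8
// bytes into it, and pass (pointer, length) as two integer arguments.
struct Guest {
    memory: Vec<u8>, // stand-in for the guest's linear memory
}

impl Guest {
    // Simplified stand-in for the guest's `cabi_realloc` export:
    // reserves `len` bytes and returns their offset.
    fn cabi_realloc(&mut self, len: usize) -> usize {
        let ptr = self.memory.len();
        self.memory.resize(ptr + len, 0);
        ptr
    }
}

// Host side: write the string into guest memory and return the two
// primitive values that actually cross the C-ABI boundary.
fn lower_string(guest: &mut Guest, s: &str) -> (usize, usize) {
    let bytes = s.as_bytes();
    let ptr = guest.cabi_realloc(bytes.len());
    guest.memory[ptr..ptr + bytes.len()].copy_from_slice(bytes);
    (ptr, bytes.len())
}

fn main() {
    let mut guest = Guest { memory: Vec::new() };
    let (ptr, len) = lower_string(&mut guest, "hello");
    println!("ptr={} len={}", ptr, len); // prints ptr=0 len=5
}
```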

The previous solution used JCO, a JavaScript library for loading and interfacing with WebAssembly. The library provides a very idiomatic JavaScript interface for invoking guest functions, representing arrays as JS arrays [ value ] and structs as objects { key: value }. This is incredibly useful if your application mainly executes on its own inside the wasm module and you need a nice way to invoke it from the website's JavaScript, such as in Figma.

To use this from Rust we created bindings to the JCO exports and converted all Rust types we wanted to pass into their idiomatic, recursive JavaScript equivalents, such as vectors becoming keyed objects with x, y, z. These JavaScript values were then passed to the library and decoded into the guest module. This process is exceedingly inefficient: it creates many temporary, heavy JavaScript objects and requires thousands of calls over the FFI between Rust and JavaScript just to construct the argument object for the guest call. It is only worse for a list of structs, which requires setting every field of the struct (and every field of any struct it contains) for every item of the array. This is a huge amount of back and forth and JavaScript Reflect calls, and it creates a large amount of garbage, which we have seen being collected in the middle of calls. All of this happens before we are even able to call the guest function, and then the same process needs to happen again for the return value.

This is especially bad for strings, as they also need to be encoded into UTF-16 (JavaScript's internal string representation), which is worse than linear. To add insult to injury, the JavaScript string then needs to be converted back into a UTF-8 byte array to match the WIT ABI and copied into a newly allocated region of memory in the guest.

This is intended behavior for JCO, and not a performance bug in the library; the problem is the way we use it. It is intended to be used from JavaScript itself, where construction is cheaper and you already have the values at hand in your program state, so all you need to do is pass a reference to that value, and no GC needs to happen.

This new solution within wasm_runtime_layer solves all of the aforementioned issues. We now own the wasm module and its functions, and for plain-old-data structs we can pass lists by copying the list's bytes directly, no matter their nesting or the length of the list. The same goes for strings, which are now translated into an allocation in the guest's memory and a subsequent memcpy into the allocated buffer. The string is then passed as a pointer to that guest buffer in the function argument or struct field.
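A simulated sketch of this direct transfer path for POD lists (the guest memory is again a plain Vec<u8>; in the real runtime the destination is the guest's linear memory reached through the JS boundary):

```rust
// Simulated sketch of the new transfer path: for plain-old-data
// structs, copy the list's bytes wholesale into the guest's memory
// instead of building per-field JS objects.
#[repr(C)]
#[derive(Clone, Copy)]
struct Rectangle {
    width: u32,
    height: u32,
}

fn copy_pod_list(rects: &[Rectangle], guest_memory: &mut Vec<u8>) -> (usize, usize) {
    // Viewing the slice as raw bytes is sound here because Rectangle
    // is #[repr(C)] plain-old-data with no padding.
    let bytes = unsafe {
        std::slice::from_raw_parts(
            rects.as_ptr() as *const u8,
            std::mem::size_of_val(rects),
        )
    };
    let ptr = guest_memory.len();
    guest_memory.extend_from_slice(bytes); // single memcpy, no JS objects
    (ptr, rects.len())
}

fn main() {
    let rects = [
        Rectangle { width: 1, height: 2 },
        Rectangle { width: 3, height: 4 },
    ];
    let mut memory = Vec::new();
    let (ptr, len) = copy_pod_list(&rects, &mut memory);
    println!("ptr={} len={} bytes={}", ptr, len, memory.len()); // ptr=0 len=2 bytes=16
}
```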

The runtime is very close to completion; what is missing is, in essence, some todos and unimplemented areas in the code, which will need to be addressed before it is reviewed and merged. Thankfully, the author has been super helpful and very active, which has massively aided the development. There will likely be a good few areas and bugs we discover as we fully integrate this into Ambient, and given that the author is responsive we will open another PR to get those resolved.