This seems like it should be pretty straightforward to implement. You just need some way to RPC with the tool... which could just be passing token streams to STDIN and reading output / errors from STDOUT / STDERR.
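For illustration, here is a minimal sketch of that kind of hand-off, assuming a hypothetical `watt-runtime` binary with an `expand` subcommand (neither exists today) and using the textual form of the token stream as the wire format:

```rust
use std::io::Write;
use std::process::{Command, Stdio};

// Hypothetical hand-off to an external runtime process: write the input tokens
// to its stdin as text and read the expanded tokens back from its stdout.
// The binary name and CLI here are invented for the sake of the sketch.
fn expand_externally(input_tokens: &str) -> std::io::Result<String> {
    let mut child = Command::new("watt-runtime")
        .arg("expand")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    child.stdin.as_mut().unwrap().write_all(input_tokens.as_bytes())?;
    let output = child.wait_with_output()?;
    // In a real protocol, errors would come back on stderr / a nonzero exit status.
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```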
It would also be possible to use an entirely different runtime for this, such as wasmtime, which includes a JIT written in Rust. I'm not sure how much faster / lower-latency this is compared to the watt runtime; would be worth benchmarking. That would be especially worthwhile if this eventually gets added to the rust toolchain, since then users don't need to worry about the release-mode compile time.
Oh also, the tool should have some form of version check built-in.
I might be able to poke at this next weekend.
I am on board with using a JIT runtime for the precompiled one, but we should make sure that it caches the JIT artifacts. In typical usage you might invoke the same macro many times, and we don't want the JIT to need to run on the same code more than once.
wasmtime does indeed have a code cache, fwiw. +cc @sunfishcode
First wanted to say thanks for exploring this space @dtolnay, this is all definitely super useful user-experience work for eventual stabilization in rustc/cargo themselves!
On the topic of an optimized runtime, I'd probably discourage making a watt-specific runtime, since running WebAssembly at speed over the long term can be a very difficult project to keep up with. WebAssembly is evolving (albeit somewhat slowly), and as rustc/LLVM keep up it might be a pain to have yet another runtime to keep up to date. Would you be up for having some exploration done to see if `wasmtime` could be suitable for this purpose?
The `wasmtime` runtime would indeed be maintained going forward and would get all the new features as they come into WebAssembly itself. Additionally it will have its own installation story which involves downloading precompiled binaries, so users don't even have to worry about a long compilation process for an optimized wasm runtime. I'm imagining that the build scripts of the wasm runtime support crates here would detect `wasmtime` on the host system (or something like that), skip all the code currently compiled (not even compiling the interpreted runtime), and go straight to using that.
On a technical level it should be possible, using WASI APIs, to communicate via either stdin/stdout or files. With wasi/wasmtime it's still somewhat early days, so we can add features there too as necessary!
I wouldn't mind setting aside some time to investigate all this if this all sounds reasonable to you @dtolnay?
What I would have in mind by a watt-specific runtime isn't a whole new implementation of WebAssembly from scratch, but some existing maintained runtime like wasmtime wrapped with any additional proc macro specific logic we want compiled in release mode. Maybe that additional logic is nothing and we can use a vanilla wasmtime binary -- I just want to make sure we are running as little as possible in our debug-mode shim because the performance difference is extreme.
@alexcrichton what you wrote sounds reasonable to me and I would love if you had time to investigate further. Thanks!
I think an amazing milestone would be when proc macros built for Watt running in Wasmtime are faster than natively compiled proc macros in a typical cargo build — because the performance boost from release-mode compilation of the wasm is bigger than any slowdown from the execution model. That seems like it should be within reach, right?
Question: would it be possible to just bundle platform-specific binaries with Watt? You could make a bunch of `watt-runtime-[os]-[arch]` packages with binaries in the crate, then add `#[cfg]`'d dependencies on them in watt-runtime, with a fallback of compiling from scratch. That would make installs pretty much instant for 99% of users, which fixes the main downside of using wasmtime / cranelift (compile time). I don't know if cargo allows baking binaries into crates, though.
I've settled on a strategy where my thinking is that, to at least prove this out, I'm going to attempt to dlopen `libwasmtime.so`, which has a C API. That C API would be bound in the watt runtime, and watt would dynamically select, at build time, whether it'll link to `libwasmtime.so` or whether it'll compile in the fallback interpreter runtime. It'll take me a few days I think to get all the fiddling right.
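As a sketch of that dynamic-selection idea (not the actual branch), the probing could look something like the following, assuming the `libloading` crate; `wasm_engine_new` is a symbol from the standard wasm-c-api that wasmtime's C API implements, and everything else is illustrative:

```rust
use libloading::Library;

// Try to dlopen libwasmtime.so and sanity-check one known wasm-c-api symbol.
// If anything fails, the caller falls back to the built-in interpreter.
unsafe fn try_load_wasmtime() -> Option<Library> {
    let lib = Library::new("libwasmtime.so").ok()?;
    // Probe a known symbol before committing to using this library.
    if lib
        .get::<unsafe extern "C" fn() -> *mut std::ffi::c_void>(b"wasm_engine_new")
        .is_err()
    {
        return None;
    }
    Some(lib)
}
```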
@dtolnay do you have some benchmarks in mind already to game out? Or are you thinking of "let's just compile some crates with serde things"?
A good benchmark to start off would be derive(Deserialize) on some simple struct with 6 fields, using the wasm file published in wa-serde-derive.
@alexcrichton do you know if it would be possible to bundle platform-specific `libwasmtime.so` files with Watt on crates.io?
Ok this ended up actually being a lot easier than I thought it would be! Note that I don't actually have a "fallback mode" back to the interpreter, I just changed the whole crate and figured that if this panned out we could figure out how to have source code that simultaneously supports both later.
The jit code all lives on this branch, but it's a bit of a mess. It requires using `RUSTFLAGS` to pass `-L` to find `libwasmtime_api.so`, and also requires using `LD_LIBRARY_PATH` to actually load it at runtime.
I compiled with `cargo build` (debug mode) using this code:
```rust
#![allow(dead_code)]

#[derive(wa_serde_derive::Deserialize)]
struct Foo {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

#[derive(serde_derive::Deserialize)]
struct Bar {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

fn main() {
    println!("Hello, world!");
}
```
I also instrumented a local checkout of `serde_derive` to just print the duration of a `derive(Deserialize)`. The execution time numbers look like:
Runtime | time |
---|---|
`serde_derive` | 9.36ms |
watt interpreter | 1388.82ms |
watt jit | 748.51ms |
The breakdown of the jit looks like:
Step | time |
---|---|
creation of instance | 706.11ms |
calling exported function | 24.10ms |
creating import map | 5.18ms |
creating the wasm module | 1.55ms |
Next I was curious about the compile time for the entire project. Here I just included one of the impls above and measured the compile time of `cargo build` with nothing in cache (a full debug-mode build).
Runtime | compile time |
---|---|
`serde_derive` | 10.78s |
`serde` + derive feature | 17.83s |
watt interpreter | 9.12s |
watt jit | 8.50s |
And finally, the compile time of the `watt` crate (including dependencies in the `jit` case that I've added) from scratch, for comparison:

Runtime | compile time |
---|---|
watt interpreter | 2.69s |
watt jit | 1.23s |
Some conclusions:

- wasmtime doesn't have screaming-fast startup times yet, but there is work underway to improve this. Additionally I'm almost surely not using the code cache since I didn't actually enable it; I'd need to contact other folks to see how to get that enabled. The code cache would eliminate almost all of the 700ms runtime.
- All of this is going through `libwasmtime_api.so`, which AFAIK is not at all optimized for performance yet.
- Overall seems promising! Not quite ready for prime time (but then again none of this really is per se), but I think this is a solid path forward.
@kazimuth sorry, meant to reply earlier but forgot! I do think we can certainly distribute precompiled `libwasmtime.so` crates on crates.io, but one of the downsides for proc macros (and serde) specifically is that `cargo vendor` vendors everything for every platform and would download quite a few binaries that wouldn't end up being needed (one for every platform we have a precompiled object for). For that reason I'm not sure it'd be a great idea to do so, but I think we'd still have a good story for "install `wasmtime` and your builds are now faster".
Ok dug in a bit more with the help of some wasmtime folks.
The wasmtime crate has support for a local code cache on your system, keyed off basically the checksum of the wasm module blob (afaik). That code cache vastly accelerates the instantiation phase since no compilation needs to happen. Above an instantiation on my machine took 700ms or so, but with the cache enabled it takes 45ms.
That means that with a cached module, expansion as a whole takes 65.97ms, which breaks down roughly into loading the cache (45ms), calling the exported function (16ms), creating the import map (3ms), and various small bits elsewhere.
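For reference, a minimal sketch of turning on that cache, assuming a recent `wasmtime` crate (plus `anyhow` for its error type); method names may differ across versions, and this isn't Watt's actual code:

```rust
use wasmtime::{Config, Engine, Module};

// Enable Wasmtime's on-disk code cache so repeated loads of the same macro
// blob skip recompilation. The cache is keyed (roughly) by a hash of the wasm.
fn load_cached_module(wasm: &[u8]) -> anyhow::Result<Module> {
    let mut config = Config::new();
    config.cache_config_load_default()?; // pick up the default cache config file
    let engine = Engine::new(&config)?;
    Module::new(&engine, wasm) // cache hit: ~deserialize; miss: compile + populate cache
}
```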
Looks like loading the cache isn't going to be easy to change much; its 45ms breakdown is roughly:
This also doesn't take into account execution time of the macro, which is still slower than the debug mode version, clocking in at 20-24ms vs the 9ms for serde in debug mode.
My read from this is that we'll want to heavily cache things (wasmtime's cache, caching instances in-process for lots of `derive(Deserialize)` expansions, etc.). I think the next thing to achieve is to get the macro itself executing faster than the debug mode version, for which I'll need to do some profiling.
That's awesome! I am away at Rust Belt Rust until Sunday so I won't have a lot of time to engage with this until later, but I would be happy to start landing changes in this repo where it makes sense, for example all the simplified signatures in sym.rs in 5925e6002cf002473c7925722cc7fd088341ee3b. I've added @alexcrichton as a collaborator.
Adding my two cents regarding the Wasmtime cache system:
So, the things above might slightly affect the performance. I'll take a look at the SecondaryMap serialization.
@alexcrichton when I was considering if the Wasmtime cache needs compression, the uncompressed cache had some places with really low entropy. I haven't investigated it, but my guess was that the `SecondaryMap`s were really sparse. I haven't profiled the code, but the new deserialization might be faster. You can compile wasmtime with `[patch.crates-io]` pointing to my cranelift branch (https://github.com/CraneStation/cranelift/pull/1158).
Thanks for the info @Mrowqa! It's good to know that we've got a lot of knobs to turn if necessary when playing around with the cache here, and we can definitely investigate them going forward!
One of my main worries at this point, for any viability whatsoever, is understanding why the execution of an optimized wasm procedural macro is 2x slower than the execution of the native unoptimized version.
Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though; I haven't had a chance to start.
Some of the poor performance may be caused by the shape of the wasm/native FFI boundary. For example, until #10, strings were copied into wasm byte-by-byte. As string passing is used frequently to convert things like `Ident` and `Literal` into useful values, direct memory copies should be much faster there. In a macro I was playing with, it improved runtime by seconds (although I was dealing with megabyte string literals, so ymmv...).
It might also be desirable to use a fork of `proc_macro`'s client-server code directly. It requires no unsafe code (except in the closure/buffer passing part, which we'd need to replace with wasm memory manipulation anyway), requires only a single FFI method, and is known to be fast enough.
Ok back to some benchmarking. This is based on https://github.com/dtolnay/watt/pull/11 to gather timing information so it rules out the issue of bulk-data transfers. The benchmark here is:
```rust
#[derive(Serialize)]
struct S(f32, f32, f32, /* 1000 `f32` fields in total ... */);
```
Here's the timings I'm getting:
| | debug | release |
|---|---|---|
| `serde_derive` (native) | 163.29ms | 82.26ms |
| `wa-serde-derive` | 1.02s | 753.30ms |
| time in imported functions | 912.32ms | 676.61ms |
| time in wasm | 77.87ms | 48.29ms |
| time in instantiation | 26.88ms | 24.66ms |
| time in making imports | 4.08ms | 3.35ms |
So it looks like almost all the time is spent in the imported functions. Taking a look at those with some instrumentation we get:
function | debug self time | release self time |
---|---|---|
`watt::sym::token_stream_extend` | 667.43ms | 609.87ms |
`watt::sym::token_stream_push_punct` | 48.89ms | 30.14ms |
`watt::sym::token_stream_push_ident` | 23.59ms | 10.28ms |
`watt::sym::watt_string_new` | 22.23ms | 1.43ms |
`watt::sym::ident_eq_str` | 18.69ms | 7.64ms |
`watt::sym::punct_set_span` | 10.05ms | 2.91ms |
My conclusion from this is that there's probably lower-hanging fruit than further optimizing the wasm runtime. It appears that we basically get all the bang for the buck necessary with `wasmtime`, and the remaining optimization work lies at the boundary between the `watt` runtime and the `proc-macro2` shim that's compiled to wasm and patched in.
@dtolnay or @mystor do you have ideas, perhaps looking at this profile, of ways that the watt APIs could be improved?
I should also mention that for this benchmark the interpreter takes 10.99s in debug mode and 1.15s in release mode. If the runtime API calls are themselves optimized then I think it'd definitely be apparent that (as expected) the JIT is at least one order of magnitude faster than the interpreter, if not multiple (debug ~10s in wasm vs 77ms, and release ~500ms in wasm vs 48.29ms).
Wow this is great!
Question about the "time in wasm" measurements -- how come there is a 60% difference between debug mode (78ms) and release mode (48ms)? Shouldn't everything going on inside the JIT runtime be the same between those two? Is it including some part of the overhead from the hostfunc calls?
> It appears that we basically get all the bang for the buck necessary with `wasmtime`, and the remaining optimization work lies at the boundary between the `watt` runtime and the `proc-macro2` shim that's compiled to wasm and patched in.
I agree.
My first thought for optimizing the boundary is: Right now we are routing every proc_macro API call individually out of the JIT. It would be good to experiment with how not to do that. For example we could provide a WASM compiled version of proc-macro2's fallback implementation that we hand off together with the caller's WASM into the JIT, such that the caller's macro runs against the emulated proc macro library and not real proc_macro calls. Then when their macro returns we translate the resulting emulated TokenStream into a proc_macro::TokenStream.
Basically the only tricky bit is indexing all the spans in the input and remapping each output span into which one of the input spans it corresponds to. The emulated Span type would hold just an index into our list of all input spans.
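A minimal sketch of that interning scheme, with the names invented here for illustration:

```rust
use proc_macro2::Span;

// Host-side table of every span seen in the macro input; the emulated Span on
// the wasm side is just an index into this table.
#[derive(Default)]
struct SpanInterner {
    spans: Vec<Span>,
}

impl SpanInterner {
    fn intern(&mut self, span: Span) -> u32 {
        let idx = self.spans.len() as u32;
        self.spans.push(span);
        idx
    }

    // Remap an index coming back out of wasm onto the real input span,
    // falling back to call_site() for anything we don't recognize.
    fn resolve(&self, idx: u32) -> Span {
        self.spans
            .get(idx as usize)
            .copied()
            .unwrap_or_else(Span::call_site)
    }
}
```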
I believe this would be a large performance improvement because native serde_derive executes in 163ms while wa-serde-derive spends 912ms in hostfuncs -- the effect of this redesign would be that all our hostfunc time is replaced by doing a subset of the work that native serde_derive does, so I would expect the time for the translation from emulated TokenStream to real TokenStream to be less than 163ms in debug mode.
Yeah I was sort of perplexed at that myself. I did a quick check though and nothing appears awry, so it's either normal timing differences (30ms even is basically just variance unless you run it a bunch of times) or, as you mentioned, the various surrounding "cruft". There's a few small pieces before/after the timing locations which could have attributed more to the wasm than was actually spent in wasm in debug mode; I was just sort of crudely timing things by instrumenting all calls with `Instant::now()` and `start.elapsed()` calls.
I agree with your intuition as well, that makes sense! To achieve that goal I don't think `watt` would carry anything precompiled, but rather there could be a scheme where the actual wasm blob contains this instead of what it has today:
```rust
use watt_proc_macro2::TokenStream; // not a shadow of `proc-macro2`

#[no_mangle]
pub extern "C" fn my_macro(input: TokenStream) -> TokenStream {
    // not necessary since `watt_proc_macro2` has a statically known initialization symbol
    // we call first before we call `my_macro`, and that initialization function does this.
    // proc_macro2::set_wasm_panic_hook();

    let input = input.into_proc_macro2(); // creates a real crates.io `proc_macro2::TokenStream`

    // .. do the real macro on `proc_macro2::TokenStream`, as you usually do
    let ret = ...;

    // and convert back into a watt token stream
    ret.into()
}
```
The conversion from a `watt_proc_macro2::TokenStream` to a `proc_macro2::TokenStream` would be the "serialize everything into wasm" step, and the other way would be the "deserialize out of wasm" step. These would ideally be the only two bridges; everything else would remain purely internal while the wasm is executing.
Furthermore you could actually imagine this being on steroids:
```rust
use proc_macro2::TokenStream;
use watt::prelude::*;

#[watt::proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
    // ...
}
```
Basically `watt` (or some similarly named crate) could provide all the proc-macro attributes and would do all the conversions for you. That way the changes would largely be in `Cargo.toml` and build-wise rather than in the code.
Anyway I digress. Basically my main point is that the wasm blob I think will want the translation baked into it. We could play with a few different deserialization/serialization strategies as well to see which is fastest, but it would indeed be pretty slick if everything was internal to the wasm blob until only the very edges of the wasm.
Some of this may require coordination in `proc_macro2` to have a third form of "foreign handle", so actually getting rid of the `[patch]` may not be viable.
That sounds good!
I don't mind relying on [patch] so much for now, since it's only on the part of macro authors and not consumers. I think once the performance is in good shape we can revisit everything from the tooling and integration side.
👍
Ok I'll tinker with this and see what I can come up with
I think the most important thing is improving the transparency of the API to the optimizer. Many of the specific methods where a lot of time is being spent seem like obvious easy-to-optimize places, so it may be possible to make good progress with a better API (`TokenStream::extend` is a known problem point, from https://github.com/alexcrichton/proc-macro2/issues/198, as an example).
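As one example of what "transparency to the optimizer" can mean in practice, the usual workaround for chatty `extend` calls is to accumulate tokens in a `Vec` and build the stream once at the end (a generic pattern, not something Watt does today):

```rust
use proc_macro2::{TokenStream, TokenTree};

// Build a TokenStream in one shot from an accumulated Vec of tokens rather
// than growing it through many small extend() calls across the FFI boundary.
fn build_stream(tokens: Vec<TokenTree>) -> TokenStream {
    tokens.into_iter().collect()
}
```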
My first reservation about the "send all of the data into wasm eagerly" approach was that extra data, like the text of unused `Literal` objects, may not be necessary. I suppose `syn` is very likely to `to_string` every token anyway, though, so we're probably better off sending it down eagerly.
As mentioned, one of the biggest issues there would be `Span` objects, which can't be re-created from raw binary data. We could probably intern these and use `u32` indexes to reference them from within wasm. Each item stored in the wasm heap could then start with one of these values attached, in addition to their string values.
On the wasm side, the data would probably look similar to how it looks today, but with `#[repr(C)]` types, and `u32` instead of pointers for indexes into the wasm address space. The wasm code would likely use wrapper methods to convert the u32s into pointers. We could have helper types like `WattBox` which would only drop the contained memory when in wasm memory. We'd have to ask the wasm code to allocate the memory region for us first (probably with a series of `watt_alloc(size: u32, align: u32)` calls?) and then read the final data back in before returning, but that seems quite doable.
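A rough sketch of the wasm-side shape being described, with all names here (`WattIdent`, `watt_alloc`) hypothetical:

```rust
// Token data laid out with #[repr(C)] and u32 "pointers" so the host can write
// it directly into wasm linear memory.
#[repr(C)]
struct WattIdent {
    span: u32,     // interned span index managed by the host runtime
    text_ptr: u32, // offset of the identifier's UTF-8 text in linear memory
    text_len: u32, // length of that text in bytes
}

// The host asks wasm to allocate the region it is about to fill; the wasm side
// exports an allocator entry point for that purpose.
#[no_mangle]
pub extern "C" fn watt_alloc(size: u32, align: u32) -> u32 {
    use std::alloc::{alloc, Layout};
    let layout = Layout::from_size_align(size as usize, align as usize).unwrap();
    // On wasm32 pointers are 32-bit, so this cast is lossless there.
    unsafe { alloc(layout) as u32 }
}
```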
I'm not sure how much other hidden data is associated with individual tokens beyond spans, but we'd also lose any such information with this model. I'm guessing that there is little enough of that for it to not matter.
Ok so it turns out that the API of `proc_macro` is so conservative that this is actually pretty easy to experiment with. Here's a pretty messy commit -- https://github.com/alexcrichton/watt/commit/18f23372ea45bc1bb622586f8ee2373f45af4eb6. The highlights of this commit are:

- The macro entry point still receives a `RawTokenStream` type.
- The `RawTokenStream` type is exactly a `u32` handle, as-is today.
- The first thing a macro does is convert the `RawTokenStream` to a `TokenStream`. This performs a bulk serialization to a binary format in the host runtime, then returns a `Bytes` handle. This `Bytes` handle is then copied into the wasm userspace and deserialized there.
- On the way out, the macro serializes its `TokenStream` into the same binary format as before. The binary blob is passed directly to `watt`'s native runtime for parsing. This is then deserialized into an actual `proc_macro::TokenStream`.
- `Span` is handled by always being an opaque `u32` in wasm. That way we still have a `u32`-per-token in wasm, and it's all managed in the preexisting `span` array we have in `Data` today. All-in-all it should be lossless.
- There are still some bugs: `Span::call_site()` in wasm probably returns some random span, and something about raw identifiers doesn't work b/c the serde expansion fails with `try` thinking it's a keyword. In any case these bugs don't hinder the timing and proof-of-concept of the API; they're easy to fix later.

So basically a macro looks like "serialize everything to a binary blob", which retains `Span` information. Next, "deserialize binary blob in wasm". Next, process in wasm. Next, "serialize back to binary blob" in wasm. Finally, "deserialize binary blob" in the native runtime. The goal here was to absolutely minimize the runtime of imported functions and completely maximize the time spent in wasm.
The timings are looking impressive!
| | debug | release |
|---|---|---|
| `serde_derive` (native) | 163.29ms | 82.26ms |
| `wa-serde-derive` | 334.37ms | 305.54ms |
| time in imported functions | 120.99ms | 98.81ms |
| time in wasm | 176.89ms | 170.20ms |
| time in instantiation | 34.95ms | 35.14ms |
| time in making imports | 659.723µs | 584.408µs |
And for each imported function:
function | debug self time | release self time |
---|---|---|
`watt::sym::token_stream_deserialize` | 109.4704ms | 93.14ms |
`watt::sym::token_stream_serialize` | 9.020899ms | 4.87ms |
`watt::sym::token_stream_parse` | 1.681334ms | 725.114µs |
This is dramatically faster by minimizing the time spent crossing a chatty boundary. We're spending 9x less time in imported code in debug mode and ~6x in release mode. It sort of makes sense here that the deserialization of what's probably like a megabyte of source code takes 110ms in debug mode.
The "time in wasm" should be break down as (a) deserialize the input, (b) do the processing, and (c) serialize the input. I would expect that (b) should be close to the release mode execution time within a small margin (ish), but (a) and (c) are extra done today. If my assertion about (b) is true (which it probably isn't since I think Cranelift is still working on perf) then there's room to optimize in (a) and (c). For example @mystor's idea for perhaps leaving literals as handles to the watt runtime might make sense, once you create a Literal
you almost never look at the actual textual literal.
From this I think I would conclude:
Overall this looks like a great way forward. I suspect further tweaking like @mystor mentions, in trying to keep as much string-like data on the `watt`-runtime side of things, could further improve performance. Additionally `watt::sym::token_stream_parse` is me being too lazy to implement a Rust syntax tokenizer in wasm (aka copy it from `proc-macro2`), but we could likely optimize that slightly by running that in wasm as well.
Never mind, this is how you've done it already. Great! Would it be possible for the boundary to not involve serializing to Rust-like syntax but instead some Bincode-like handrolled compact representation? That should be an obvious win for parsing time, though it may take a bit longer to compile watt itself. Still I think it's likely to be the right tradeoff.
Is it possible to allow for the user's entry point to be written directly in terms of proc_macro2::TokenStream rather than a different RawTokenStream type?
```rust
use proc_macro2::TokenStream;

#[no_mangle]
pub extern "C" fn demo(input: TokenStream) -> TokenStream {
```
I am wondering whether there is anything we can do in how we set up the call into JIT such that we put the right things in memory and stack for this to just work.
So actually, as usual, experimenting is faster than actually typing up the comment saying what we may want to experiment with. Here's timing information where `Literal` is not serialized across the boundary (and `Span` and `Ident` things are fixed). Here `Literal` is always serialized as a handle, so wasm can either use these literally (ha!) or manufacture its own. I also did a few small optimizations to remove `to_string` where I could.
| | debug | release |
|---|---|---|
| `serde_derive` (native) | 154.957733ms | 86.821315ms |
| `wa-serde-derive` | 300.809265ms | 278.288324ms |
| time in imported functions | 121.252104ms | 99.493415ms |
| time in wasm | 141.912692ms | 141.881934ms |
| time in instantiation | 35.909112ms | 35.374174ms |
| time in making imports | 871.961µs | 749.318µs |
And for each imported function:
function | debug self time | release self time |
---|---|---|
`watt::sym::token_stream_deserialize` | 111.502829ms | 94.597588ms |
`watt::sym::token_stream_serialize` | 7.348556ms | 4.056007ms |
`watt::sym::token_stream_parse` | 1.566301ms | 767.402µs |
So that was an easy 30ms win!
@dtolnay to answer your question about the signature, would you be opposed to a macro? Something like `#[watt::proc_macro]` to hide the details?
It shouldn't require an attribute macro though, right? We control exactly what argument the main entry point receives here. I am imagining something like (pseudocode):
```rust
let raw_token_stream = Val::i32(d.tokenstream.push(input) as i32);
let input_token_stream = raw_to_pm2.call(&[raw_token_stream]).unwrap()[0];
let output_token_stream = main.call(&[input_token_stream]).unwrap()[0];
let raw_token_stream = pm2_to_raw.call(&[output_token_stream]).unwrap()[0];
return d.tokenstream[raw_token_stream].clone();
```
where `main` is the user-provided no_mangle entry point and `raw_to_pm2` + `pm2_to_raw` are no_mangle functions built into our patched proc-macro2, equivalent to `RawTokenStream::into_token_stream` and `TokenStream::into_raw_token_stream`.
That's possible, but it would require specifying the ABI of `TokenStream` itself as a `u32`, whereas today it's a `Vec<TokenTree>` internally. I've generally found a macro to be useful for decoupling the API and the ABI, because we don't necessarily want users to write down the ABI but rather we have an API we want them to adhere to.
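For a sense of that decoupling, here is a hypothetical sketch of what such an attribute could expand to; every helper name here is invented:

```rust
use proc_macro2::TokenStream;

// Hypothetical bridge functions supplied by the patched proc-macro2 shim.
fn raw_to_token_stream(_handle: u32) -> TokenStream { unimplemented!() }
fn token_stream_to_raw(_tokens: TokenStream) -> u32 { unimplemented!() }

// What the user writes under #[watt::proc_macro]:
fn my_macro_impl(input: TokenStream) -> TokenStream {
    input
}

// ...and roughly what the attribute could generate: a no_mangle shim that owns
// the u32-handle ABI so the user never has to write it down by hand.
#[no_mangle]
pub extern "C" fn my_macro(raw_handle: u32) -> u32 {
    let input = raw_to_token_stream(raw_handle);
    let output = my_macro_impl(input);
    token_stream_to_raw(output)
}
```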
Ah, makes sense. Yes I would be on board with an attribute macro to hide the ABI.
FWIW I experimented a bit, a while ago, with some really hacky macros around `watt` to allow writing `proc_macro` crates inline within the module you're working with (https://github.com/mystor/ctrs if anyone's interested, though it's pretty darn hacky). I included a transformation like the one you're talking about for `#[watt::proc_macro]`. It's perhaps a bit dumber than is needed here, though.
Ok I've sent the culmination of all of this in as https://github.com/dtolnay/watt/pull/14
From some rough tests, Watt macro expansion when the runtime is compiled in release mode is about 15x faster than when the runtime is compiled in debug mode.

Maybe we can set it up such that users can run something like `cargo install watt-runtime`, and then our debug-mode runtime can detect whether that optimized runtime is installed; if it is, it hands the program off to it.
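One possible shape for that detection step, with the binary name and version convention invented here for illustration:

```rust
use std::process::Command;

// Probe for an installed optimized runtime and make sure it reports a
// compatible interface version before handing expansion off to it; otherwise
// fall back to the slow built-in interpreter.
fn optimized_runtime_available() -> bool {
    Command::new("watt-runtime")
        .arg("--version")
        .output()
        .map(|out| {
            out.status.success()
                && String::from_utf8_lossy(&out.stdout).starts_with("watt-runtime ")
        })
        .unwrap_or(false)
}
```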