dtolnay / watt

Runtime for executing procedural macros as WebAssembly
Apache License 2.0

Hand off code to a preinstalled optimized runtime if available #2

Open dtolnay opened 4 years ago

dtolnay commented 4 years ago

From some rough tests, Watt macro expansion when compiling the runtime in release mode is about 15x faster than when the runtime is compiled in debug mode.

Maybe we can set it up such that users can run something like cargo install watt-runtime and then our debug-mode runtime can detect whether that optimized runtime is installed; if it is, we hand the program off to it.

kazimuth commented 4 years ago

This seems like it should be pretty straightforward to implement. You just need some way to RPC with the tool... which could just be passing token streams to STDIN and reading output / errors from STDOUT / STDERR.
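A minimal sketch of that hand-off, assuming a hypothetical watt-runtime binary on PATH and a plain stdin/stdout protocol (none of this is watt's actual interface):

use std::io::Write;
use std::process::{Command, Stdio};

// Try to delegate expansion to an installed optimized runtime. `None` means
// "not installed or failed", and the caller falls back to the interpreter.
fn try_external_expand(wasm_path: &str, input: &str) -> Option<String> {
    let mut child = Command::new("watt-runtime") // hypothetical binary name
        .arg(wasm_path)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .ok()?;
    child.stdin.as_mut()?.write_all(input.as_bytes()).ok()?;
    let output = child.wait_with_output().ok()?;
    if output.status.success() {
        String::from_utf8(output.stdout).ok()
    } else {
        None
    }
}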

It would also be possible to use an entirely different runtime for this, such as wasmtime, which includes a JIT written in Rust. I'm not sure how much faster / lower-latency this is compared to the watt runtime; it would be worth benchmarking. That would be especially worthwhile if this eventually gets added to the rust toolchain, since then users don't need to worry about the release-mode compile time.

Oh also, the tool should have some form of version check built-in.

I might be able to poke at this next weekend.

dtolnay commented 4 years ago

I am on board with using a JIT runtime for the precompiled one, but we should make sure that it caches the JIT artifacts. In typical usage you might invoke the same macro many times, and we don't want the JIT to need to run on the same code more than once.

fitzgen commented 4 years ago

wasmtime does indeed have a code cache, fwiw. +cc @sunfishcode

alexcrichton commented 4 years ago

First wanted to say thanks for exploring this space @dtolnay, this is all definitely super useful user-experience for eventual stabilization in rustc/cargo themselves!

On the topic of an optimized runtime, I'd probably discourage making a watt-specific runtime, since running WebAssembly at speed over the long term can be a very difficult project to keep up with. WebAssembly is evolving (albeit somewhat slowly), and as rustc/LLVM keep pace it could be a pain to have yet another runtime to keep up to date. Would you be up for having some exploration done to see if wasmtime could be suitable for this purpose?

The wasmtime runtime would indeed be maintained going forward and would get all the new features as they come into WebAssembly itself. Additionally it has its own installation story involving downloading precompiled binaries, so users don't even have to worry about a long compilation process for an optimized wasm runtime. I'm imagining that the build scripts of the wasm runtime support crates here would detect wasmtime on the host system (or something like that), skip all the code currently compiled (not even compiling the interpreted runtime), and go straight to using that.

On a technical level it should be possible, using wasi APIs, to communicate either over stdin/stdout or files. With wasi/wasmtime it's still somewhat early days, so we can add features there too as necessary!

I wouldn't mind setting aside some time to investigate all this if this all sounds reasonable to you @dtolnay?

dtolnay commented 4 years ago

What I would have in mind by a watt-specific runtime isn't a whole new implementation of WebAssembly from scratch, but some existing maintained runtime like wasmtime wrapped with any additional proc macro specific logic we want compiled in release mode. Maybe that additional logic is nothing and we can use a vanilla wasmtime binary -- I just want to make sure we are running as little as possible in our debug-mode shim because the performance difference is extreme.

@alexcrichton what you wrote sounds reasonable to me and I would love if you had time to investigate further. Thanks!

dtolnay commented 4 years ago

I think an amazing milestone would be when proc macros built for Watt running in Wasmtime are faster than natively compiled proc macros in a typical cargo build — because the performance boost from release-mode compilation of the wasm is bigger than any slowdown from the execution model. That seems like it should be within reach right?

kazimuth commented 4 years ago

Question: would it be possible to just bundle platform-specific binaries with Watt? You could make a bunch of watt-runtime-[OS]-[arch] packages with binaries in the crate, then add #[cfg]'d dependencies on them in watt-runtime, with a fallback of compiling from scratch. That would make installs pretty much instant for 99% of users, which fixes the main downside of using wasmtime / cranelift (compile time). I don't know if cargo allows baking binaries into crates, though.

alexcrichton commented 4 years ago

I've settled on a strategy where my thinking is that to at least prove this out I'm going to attempt to dlopen libwasmtime.so which has a C API. That C API would be bound in the watt runtime and watt would dynamically select, at build time, whether it'll link to libwasmtime.so or whether it'll compile in the fallback interpreter runtime. It'll take me a few days I think to get all the fiddling right.
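As a rough illustration of that dynamic selection (using the libloading crate; the entry point shown is hypothetical, the real wasmtime C API exposes wasm_engine_new, wasm_module_new, and friends):

use libloading::{Library, Symbol};

// If libwasmtime_api.so can be opened at runtime, drive the JIT through it;
// otherwise the build falls back to the interpreter compiled into watt.
unsafe fn try_load_jit() -> Option<Library> {
    Library::new("libwasmtime_api.so").ok()
}

unsafe fn instantiate(lib: &Library, wasm: &[u8]) -> Option<()> {
    // Hypothetical symbol name, purely for the sake of the sketch.
    let run: Symbol<unsafe extern "C" fn(*const u8, usize) -> i32> =
        lib.get(b"watt_instantiate").ok()?;
    if run(wasm.as_ptr(), wasm.len()) == 0 {
        Some(())
    } else {
        None
    }
}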

@dtolnay do you have some benchmarks in mind already to game out? Or are you thinking of "let's just compile some crates with serde things"?

dtolnay commented 4 years ago

A good benchmark to start off would be derive(Deserialize) on some simple struct with 6 fields, using the wasm file published in wa-serde-derive.

kazimuth commented 4 years ago

@alexcrichton do you know if it would be possible to bundle platform-specific libwasmtime.so binaries with Watt on crates.io?

alexcrichton commented 4 years ago

Ok this ended up actually being a lot easier than I thought it would be! Note that I don't actually have a "fallback mode" back to the interpreter, I just changed the whole crate and figured that if this panned out we could figure out how to have source code that simultaneously supports both later.

The jit code all lives on this branch, but it's a bit of a mess. Requires using RUSTFLAGS to pass -L to find libwasmtime_api.so and also requires using LD_LIBRARY_PATH to actually load it at runtime.

I compiled with cargo build (debug mode) using this code:

#![allow(dead_code)]

#[derive(wa_serde_derive::Deserialize)]
struct Foo {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

#[derive(serde_derive::Deserialize)]
struct Bar {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

fn main() {
    println!("Hello, world!");
}

I also instrumented a local checkout of serde_derive to just print the duration of a derive(Deserialize). The execution time numbers look like:

| Runtime | time |
| --- | --- |
| serde_derive | 9.36ms |
| watt interpreter | 1388.82ms |
| watt jit | 748.51ms |

The breakdown of the jit looks like:

| Step | time |
| --- | --- |
| creation of instance | 706.11ms |
| calling exported function | 24.10ms |
| creating import map | 5.18ms |
| creating the wasm module | 1.55ms |

Next I was curious about the compile time for the entire project. Here I just included one of the impls above and measured the compile time of cargo build with nothing in cache (a full debug mode build).

| Runtime | compile time |
| --- | --- |
| serde_derive | 10.78s |
| serde + derive feature | 17.83s |
| watt interpreter | 9.12s |
| watt jit | 8.50s |

and finally, the compile time of the watt crate (including dependencies in the jit case that I've added) from scratch, for comparison:

| Runtime | compile time |
| --- | --- |
| watt interpreter | 2.69s |
| watt jit | 1.23s |

Some conclusions:

Overall seems promising! Not quite ready for prime time (but then again none of this really is per se), but I think this is a solid path forward.


@kazimuth sorry meant to reply earlier but forgot! I do think we can certainly distribute precompiled libwasmtime.so crates on crates.io, but one of the downsides for proc macros (and serde) specifically is that cargo vendor vendors everything for every platform and would download quite a few binaries that wouldn't end up being needed (one for every platform we have a precompiled object for). For that reason I'm not sure it'd be a great idea to do so, but I think we'd still have a good story for "install wasmtime and your builds are now faster".

alexcrichton commented 4 years ago

Ok dug in a bit more with the help of some wasmtime folks.

The wasmtime crate has support for a local code cache on your system, keyed off basically the checksum of the wasm module blob (afaik). That code cache vastly accelerates the instantiation phase since no compilation needs to happen. Above an instantiation on my machine took 700ms or so, but with the cache enabled it takes 45ms.
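For reference, enabling that cache looks roughly like this (method names are from the wasmtime crate; exact signatures vary between versions):

use wasmtime::{Config, Engine, Module};

// With the cache enabled, Module::new reuses previously compiled artifacts
// keyed on the wasm blob instead of recompiling on every expansion.
fn load_module(wasm: &[u8]) -> anyhow::Result<Module> {
    let mut config = Config::new();
    config.cache_config_load_default()?; // load the standard cache configuration
    let engine = Engine::new(&config)?;
    Module::new(&engine, wasm)
}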

That means with a cached module, expansion as a whole takes 65.97ms, which breaks down roughly as loading the cache (45ms), calling the exported function (16ms), creating the import map (3ms), and various small costs elsewhere.

Looks like loading the cache isn't going to be easy to change much; its 45ms breakdown is roughly:

This also doesn't take into account execution time of the macro, which is still slower than the debug mode version, clocking in at 20-24ms vs the 9ms for serde in debug mode.

My read from this is that we'll want to heavily cache things (wasmtime's cache, cache instances in-process for lots of derive(Deserialize), etc.). I think the next thing to achieve is to get the macro itself executing faster than the debug mode, for which I'll need to do some profiling.

dtolnay commented 4 years ago

That's awesome! I am away at Rust Belt Rust until Sunday so I won't have a lot of time to engage with this until later, but I would be happy to start landing changes in this repo where it makes sense, for example all the simplified signatures in sym.rs in 5925e6002cf002473c7925722cc7fd088341ee3b. I've added @alexcrichton as a collaborator.

mrowqa commented 4 years ago

Adding my two cents regarding the Wasmtime cache system:

So, the things above might slightly affect the performance. I'll take a look at the SecondaryMap serialization.

mrowqa commented 4 years ago

@alexcrichton when I was considering if Wasmtime cache needs compression, the uncompressed cache had some places with really low entropy. I haven't investigated it, but my guess was that SecondaryMaps were really sparse. I haven't profiled the code, but new deserialization might be faster. You can compile wasmtime with [patch.crates-io] pointing to my cranelift branch (https://github.com/CraneStation/cranelift/pull/1158).

alexcrichton commented 4 years ago

Thanks for the info @Mrowqa! It's good to know that we've got a lot of knobs available if necessary when playing around with the cache here, and we can definitely investigate them going forward!

My main worry at this point, for any viability whatsoever, is understanding why the execution of a wasm-optimized procedural macro is 2x slower than the execution of the native unoptimized version.

alexcrichton commented 4 years ago

Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though; I haven't had a chance to start.

mystor commented 4 years ago

> Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though; I haven't had a chance to start.

Some of the poor performance may be caused by the shape of the wasm/native ffi boundary. For example, until #10, strings were copied into wasm byte-by-byte. As string passing is used frequently to convert things like Ident and Literal into useful values, direct memory copies should be much faster there. In a macro I was playing with, it improved runtime by seconds (although I was dealing with megabyte string literals, so ymmv...).

It might also be desirable to use a fork of proc_macro's client-server code directly. It requires no unsafe code (except in the closure/buffer passing part, which we'd need to replace with wasm memory manipulation anyway), requires only a single ffi method, and is known to be fast enough.

alexcrichton commented 4 years ago

Ok back to some benchmarking. This is based on https://github.com/dtolnay/watt/pull/11 to gather timing information so it rules out the issue of bulk-data transfers. The benchmark here is:

#[derive(Serialize)]
struct S(f32, f32, f32, /* 1000 `f32` fields in total ..*/);

Here's the timings I'm getting:

| | debug | release |
| --- | --- | --- |
| serde_derive (native) | 163.29ms | 82.26ms |
| wa-serde-derive | 1.02s | 753.30ms |
| time in imported functions | 912.32ms | 676.61ms |
| time in wasm | 77.87ms | 48.29ms |
| time in instantiation | 26.88ms | 24.66ms |
| time in making imports | 4.08ms | 3.35ms |

So it looks like almost all the time is spent in the imported functions. Taking a look at those with some instrumentation we get:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_extend | 667.43ms | 609.87ms |
| watt::sym::token_stream_push_punct | 48.89ms | 30.14ms |
| watt::sym::token_stream_push_ident | 23.59ms | 10.28ms |
| watt::sym::watt_string_new | 22.23ms | 1.43ms |
| watt::sym::ident_eq_str | 18.69ms | 7.64ms |
| watt::sym::punct_set_span | 10.05ms | 2.91ms |

My conclusion from this is that there's probably lower hanging fruit than further optimizing the wasm runtime. It appears that we basically get all the bang for the buck necessary with wasmtime, and the remaining optimization work would be between the boundary of the watt runtime as well as the proc-macro2 shim that's compiled to wasm and patched in.

@dtolnay or @mystor do you have ideas perhaps looking at this profile of ways that the watt APIs could be improved?

alexcrichton commented 4 years ago

I should also mention that for this benchmark the interpreter takes 10.99s in debug mode and 1.15s in release mode. If the runtime API calls are themselves optimized then I think it'd definitely be apparent that (as expected) the JIT is at least one order of magnitude faster than the interpreter, if not multiple (debug ~10s in wasm vs 77ms, and release ~500ms in wasm vs 48.29ms).

dtolnay commented 4 years ago

Wow this is great!

Question about the "time in wasm" measurements -- how come there is a 60% difference between debug mode (78ms) and release mode (48ms)? Shouldn't everything going on inside the JIT runtime be the same between those two? Is it including some part of the overhead from the hostfunc calls?

> It appears that we basically get all the bang for the buck necessary with wasmtime, and the remaining optimization work would be between the boundary of the watt runtime as well as the proc-macro2 shim that's compiled to wasm and patched in.

I agree.

My first thought for optimizing the boundary is: Right now we are routing every proc_macro API call individually out of the JIT. It would be good to experiment with how not to do that. For example we could provide a WASM compiled version of proc-macro2's fallback implementation that we hand off together with the caller's WASM into the JIT, such that the caller's macro runs against the emulated proc macro library and not real proc_macro calls. Then when their macro returns we translate the resulting emulated TokenStream into a proc_macro::TokenStream.

Basically the only tricky bit is indexing all the spans in the input and remapping each output span into which one of the input spans it corresponds to. The emulated Span type would hold just an index into our list of all input spans.
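One way to picture that span bookkeeping (illustrative types only, not an actual watt API):

use proc_macro2::Span;

#[derive(Default)]
struct SpanInterner {
    spans: Vec<Span>,
}

impl SpanInterner {
    // Record a real span from the macro input, returning the handle the
    // emulated proc macro library carries around inside wasm.
    fn intern(&mut self, span: Span) -> u32 {
        self.spans.push(span);
        (self.spans.len() - 1) as u32
    }

    // Map a handle coming back out of wasm to the original input span,
    // falling back to call-site for spans the macro synthesized itself.
    fn resolve(&self, handle: u32) -> Span {
        self.spans
            .get(handle as usize)
            .copied()
            .unwrap_or_else(Span::call_site)
    }
}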

I believe this would be a large performance improvement because native serde_derive executes in 163ms while wa-serde-derive spends 912ms in hostfuncs -- the effect of this redesign would be that all our hostfunc time is replaced by doing a subset of the work that native serde_derive does, so I would expect the time for the translation from emulated TokenStream to real TokenStream to be less than 163ms in debug mode.

alexcrichton commented 4 years ago

Yeah, I was sort of perplexed at that myself. I did a quick check though and nothing appears awry, so it's either normal timing variance (30ms is basically just noise unless you run it a bunch of times) or, as you mentioned, the various surrounding "cruft". There are a few small pieces before/after the timing locations which could have attributed more to the wasm than was actually spent in wasm in debug mode; I was just sort of crudely timing things by instrumenting all calls with Instant::now() and start.elapsed() calls.

I agree with your intuition as well, that makes sense! To achieve that goal I don't think watt would carry anything precompiled, but rather there could be a scheme where the actual wasm blob contains this instead of what it has today:

use watt_proc_macro2::TokenStream; // not a shadow of `proc-macro2`

#[no_mangle]
pub extern "C" fn my_macro(input: TokenStream) -> TokenStream {
    // not necessary since `watt_proc_macro2` has a statically known initialization symbol
    // we call first before we call `my_macro`, and that initialization function does this.
    // proc_macro2::set_wasm_panic_hook();

    let input = input.into_proc_macro2(); // creates a real crates.io `proc_macro2::TokenStream`

    // .. do the real macro on `proc_macro2::TokenStream`, as you usually do

    let ret = ...;

    // and convert back into a watt token stream
    ret.into()
}

The conversion from a watt_proc_macro2::TokenStream to proc_macro2::TokenStream would be the "serialize everything into wasm" step and the other way would be the "deserialize out of wasm" and would ideally be the only two bridges, everything else would remain purely internal while the wasm is executing.

Furthermore you could actually imagine this being on steroids:

use proc_macro2::TokenStream;
use watt::prelude::*;

#[watt::proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
    // ...
}

Basically watt (or some similarly named crate) could provide all the proc-macro attributes and would do all the conversions for you. That way the changes would largely be in Cargo.toml and build-wise rather than in the code.

Anyway I digress. Basically my main point is that the wasm blob I think will want the translation baked into it. We could play with a few different deserialization/serialization strategies as well to see which is fastest, but it would indeed be pretty slick if everything was internal to the wasm blob until only the very edges of the wasm.

Some of this may require coordination in proc_macro2 to have a third form of "foreign handle", so actually getting rid of the [patch] may not be viable..

dtolnay commented 4 years ago

That sounds good!

I don't mind relying on [patch] so much for now, since it's only on the part of macro authors and not consumers. I think once the performance is in good shape we can revisit everything from the tooling and integration side.

alexcrichton commented 4 years ago

👍

Ok I'll tinker with this and see what I can come up with

mystor commented 4 years ago

I think the most important thing is improving the transparency of the API to the optimizer. Many of the specific methods where a lot of time is being spent seem like obvious easy-to-optimize places, so it may be possible to make good progress with a better API (TokenStream::extend is a known problem point from https://github.com/alexcrichton/proc-macro2/issues/198, as an example).

My first reservation about the "send all of the data into wasm eagerly" approach was that extra data, like the text of unused Literal objects, may not be necessary. I suppose syn is very likely to call to_string on every token anyway, though, so we're probably better off sending it down eagerly.

As mentioned, one of the biggest issues there would be Span objects, which can't be re-created from raw binary data. We could probably intern these and use u32 indexes to reference them from within wasm. Each item stored in the wasm heap could then start with one of these values attached, in addition to their string values.

On the wasm side, the data would probably look similar to how it looks today, but with #[repr(C)] types, and u32 instead of pointers for indexes into the wasm address space. The wasm code would likely use wrapper methods to convert the u32s into pointers. We could have helper types like WattBox which would only drop the contained memory when in wasm memory. We'd have to ask the wasm code to allocate the memory region for us first (probably with a series of watt_alloc(size: u32, align: u32) calls?) and then read the final data back in before returning, but that seems quite doable.
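A sketch of what that wasm-side shape might look like (names and layout are purely illustrative, not a proposed ABI):

// Token data laid out as #[repr(C)] with u32 handles and offsets instead of
// pointers, so the host can write it directly into linear memory.
#[repr(C)]
struct RawIdent {
    span: u32,      // interned span handle
    text_ptr: u32,  // offset of the UTF-8 text in wasm memory
    text_len: u32,
}

// Exported allocator the host calls to reserve a region before copying
// serialized input into the wasm instance.
#[no_mangle]
pub extern "C" fn watt_alloc(size: u32, align: u32) -> u32 {
    let layout = std::alloc::Layout::from_size_align(size as usize, align as usize)
        .expect("invalid layout");
    unsafe { std::alloc::alloc(layout) as usize as u32 }
}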

I'm not sure how much other hidden data is associated with individual tokens beyond spans, but we'd also lose any such information with this model. I'm guessing that there is little enough of that for it to not matter.

alexcrichton commented 4 years ago

Ok so it turns out that the API of proc_macro is so conservative this is actually pretty easy to experiment with. Here's a pretty messy commit -- https://github.com/alexcrichton/watt/commit/18f23372ea45bc1bb622586f8ee2373f45af4eb6. The highlights of this commit are:

So basically a macro looks like "serialize everything to a binary blob" which retains Span information. Next "deserialize binary blob in wasm". Next, process in wasm. Next "serialize back to binary blob" in wasm. Finally "deserialize binary blob" in the native runtime. The goal here was to absolutely minimize the runtime of imported functions and completely maximize the time spent in wasm.
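For concreteness, the host-side half of that scheme could look something like the following simplified encoder (this is not the format in the commit; spacing and literal suffix information are omitted for brevity):

use proc_macro2::{Span, TokenStream, TokenTree};

// Flatten a token stream into one byte blob: a span handle, a tag byte, and
// any string payload per token tree, so wasm can rebuild the whole stream
// after a single host call instead of one call per token.
fn serialize(stream: &TokenStream, spans: &mut Vec<Span>, out: &mut Vec<u8>) {
    for tree in stream.clone() {
        spans.push(tree.span());
        out.extend_from_slice(&((spans.len() - 1) as u32).to_le_bytes());
        match tree {
            TokenTree::Ident(ident) => {
                out.push(0);
                push_str(&ident.to_string(), out);
            }
            TokenTree::Punct(punct) => {
                out.push(1);
                out.push(punct.as_char() as u8);
            }
            TokenTree::Literal(lit) => {
                out.push(2);
                push_str(&lit.to_string(), out);
            }
            TokenTree::Group(group) => {
                out.push(3);
                out.push(group.delimiter() as u8);
                // Length-prefix the nested stream so the decoder knows where
                // the group ends.
                let mut inner = Vec::new();
                serialize(&group.stream(), spans, &mut inner);
                out.extend_from_slice(&(inner.len() as u32).to_le_bytes());
                out.extend_from_slice(&inner);
            }
        }
    }
}

fn push_str(text: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(&(text.len() as u32).to_le_bytes());
    out.extend_from_slice(text.as_bytes());
}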

The timings are looking impressive!

| | debug | release |
| --- | --- | --- |
| serde_derive (native) | 163.29ms | 82.26ms |
| wa-serde-derive | 334.37ms | 305.54ms |
| time in imported functions | 120.99ms | 98.81ms |
| time in wasm | 176.89ms | 170.20ms |
| time in instantiation | 34.95ms | 35.14ms |
| time in making imports | 659.723µs | 584.408µs |

And for each imported function:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_deserialize | 109.4704ms | 93.14ms |
| watt::sym::token_stream_serialize | 9.020899ms | 4.87ms |
| watt::sym::token_stream_parse | 1.681334ms | 725.114µs |

This is dramatically faster by minimizing the time spent crossing a chatty boundary. We're spending 9x less time in imported code in debug mode and ~6x less in release mode. It sort of makes sense here that the deserialization of what's probably about a megabyte of source code takes 110ms in debug mode.

The "time in wasm" should be break down as (a) deserialize the input, (b) do the processing, and (c) serialize the input. I would expect that (b) should be close to the release mode execution time within a small margin (ish), but (a) and (c) are extra done today. If my assertion about (b) is true (which it probably isn't since I think Cranelift is still working on perf) then there's room to optimize in (a) and (c). For example @mystor's idea for perhaps leaving literals as handles to the watt runtime might make sense, once you create a Literal you almost never look at the actual textual literal.

From this I think I would conclude:

Overall this looks like a great way forward. I suspect further tweaking like @mystor mentions in trying to keep as much string-like data on the watt-runtime side of things could further improve performance. Additionally watt::sym::token_stream_parse is me being too lazy to implement a Rust syntax tokenizer in wasm (aka copy it from proc-macro2), but we could likely optimize that slightly by running that in wasm as well.

dtolnay commented 4 years ago
alexcrichton commented 4 years ago

So, as usual, experimenting is faster than typing up the comment saying what we may want to experiment with. Here's timing information where Literal is not serialized across the boundary (and Span and Ident things are fixed). Here Literal is always serialized as a handle, so wasm can either use these literally (ha!) or manufacture its own. I also did a few small optimizations to remove to_string where I could.

| | debug | release |
| --- | --- | --- |
| serde_derive (native) | 154.957733ms | 86.821315ms |
| wa-serde-derive | 300.809265ms | 278.288324ms |
| time in imported functions | 121.252104ms | 99.493415ms |
| time in wasm | 141.912692ms | 141.881934ms |
| time in instantiation | 35.909112ms | 35.374174ms |
| time in making imports | 871.961µs | 749.318µs |

And for each imported function:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_deserialize | 111.502829ms | 94.597588ms |
| watt::sym::token_stream_serialize | 7.348556ms | 4.056007ms |
| watt::sym::token_stream_parse | 1.566301ms | 767.402µs |

So that was an easy 30ms win!


@dtolnay to answer your question about the signature, would you be opposed to a macro? Something like #[watt::proc_macro] to hide the details?

dtolnay commented 4 years ago

It shouldn't require an attribute macro though, right? We control exactly what argument the main entry point receives here. I am imagining something like (pseudocode):

let raw_token_stream = Val::i32(d.tokenstream.push(input) as i32);
let input_token_stream = raw_to_pm2.call(&[raw_token_stream]).unwrap()[0];
let output_token_stream = main.call(&[input_token_stream]).unwrap()[0];
let raw_token_stream = pm2_to_raw.call(&[output_token_stream]).unwrap()[0];
return d.tokenstream[raw_token_stream].clone();

where main is the user-provided no_mangle entry point and raw_to_pm2 + pm2_to_raw are no_mangle functions built into our patched proc-macro2, equivalent to RawTokenStream::into_token_stream and TokenStream::into_raw_token_stream.

alexcrichton commented 4 years ago

That's possible but would require specifying the ABI of TokenStream itself as a u32, which today it's a Vec<TokenTree> internally. I've generally found a macro to be useful for decoupling the API and the ABI because we don't necessarily want users to write down the ABI but rather we have an API we want them to adhere to.

dtolnay commented 4 years ago

Ah, makes sense. Yes I would be on board with an attribute macro to hide the ABI.

mystor commented 4 years ago

FWIW I experimented a bit, a while ago, with some really hacky macros around watt to allow writing proc_macro crates inline within the module you're working with (https://github.com/mystor/ctrs if anyone's interested, though it's pretty darn hacky). I included a transformation like the one you're talking about for #[watt::proc_macro]. It's perhaps a bit dumber than is needed here, though.

alexcrichton commented 4 years ago

Ok I've sent the culmination of all of this in as https://github.com/dtolnay/watt/pull/14