Open bjorn3 opened 3 years ago
One of the possible usecases would be to compile rustc to webassembly and then let it produce webassembly modules using cg_clif.
This would be awesome, another use case might be implementing custom compiler frontends for custom languages that could then compile to wasm!
I'm working a Java Virtual Machine implementation in Rust and one of my plans is to write a WASM backend along with a general JIT implementation using Cranelift, and if everything were unified to just be Cranelift that would be great. I might give creating a WASM backend a shot, would benefit my project as well as Cranelift
I would be interested in implementing this!
I had a look, and I'm not sure where the correct place to implement this would be. My thought was in cranelift_codegen::isa
, however the MachBackend
trait seems to reply upon the existence of registers in the target architecture (which makes sense for other architectures, but WebAssembly is sort-of a stack machine). Would it make sense to emulate registers by using WebAssembly locals, or is that a bad idea?
Wasm is a fundamentally different architecture from all other supported architectures in that it is a stack machine rather than a register machine, it uses structured control flow rather than arbitrary jumps, allows more than one memory, ... For this reason I think you should avoid using the existing backend infrastructure. One possible implementation method would be to have an implementation of the Module
trait from cranelift_module
in a separate crate and then write the lowering from clif ir to wasm bytecode in this crate.
Indeed, this would best be structured as a separate framework that consumes Cranelift's IR, I think.
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).
We'll need to decide what we do about runtime-specific implementations of Wasm memories, tables, globals, and imports as well (EDIT: when using the cranelift-wasm
frontend). Traditionally cranelift-wasm
has these bits filled in by the "environment", and our embedding (Wasmtime, Lucet, SpiderMonkey, something else) fills in some IR that uses e.g. machine-native pointers to implement Wasm memory/table accesses. If we compile back to Wasm, should a table just become a table, and a memory become a memory? Probably, but this will require some careful thought in the frontend/environment too. Possibly we add new IR operators (primitives) and just lower table.get
1-to-1 to this primitive, then back to table.get
in the backend.
FWIW, I have a little side-project (waffle) that is attempting to be a Wasm-to-CFG-of-SSA-to-Wasm framework, which might be useful to look at to see what it takes to compile SSA IR with blockparams back to Wasm. (It kind of works: it roundtrips a hello-world, but has some bugs beyond that. Side-project, no guarantees, etc!)
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with,
https://docs.rs/relooper/latest/relooper/ might be a good way to easily implement this (I believe it actually implements the stackifier algorithm even though it's called relooper)
Thank you for the advice!
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).
I'm not sure how regalloc
could be used to translate operators to stack code. Would you mind elaborating (a link to something explaining the idea would be great if you have one).
My thoughts were that this should be done by recursively inserting WebAssembly instructions to compute Value
s.
So something like:
function instruction_code()
for each operand
instruction_code(operand)
code to push result onto stack (apply operator and then push)
That scheme is a good start, and works for tree-shaped code, but not DAG-shaped code: as soon as an operator (SSA value) is used more than once, you either need to recompute it or store it in a local and then reload it at each use-site.
The use of locals to store SSA values is the register allocation problem -- with a range of solutions, from a "one local (register) per SSA value" approach up to more sophisticated approaches that reuse locals. Given Wasm engines' limits on locals, and given the way that baseline Wasm compilers allocate separate stack slots for each local (leading to huge stackframes if we are not frugal with locals), we probably want to do something better than one-local-per-value.
The reason I described this as a "combined regalloc / instruction scheduling problem" is that the two are intertwined. Once we've decided to use locals to carry values, we could do the simplest thing and emit local.get
for every operand, then the operator, then local.set
. But that's also inefficient; we want to use the stack when we can. So the order of operators matters. Your scheme (recursive value generation) is actually doing implicit code-motion in order to take advantage of stack dataflow (by sinking operators to use-sites), which is great, and what we want. But we have to be careful about it for two reasons:
In the most general form of the problem, we could also consider rematerialization: some operations (such as, e.g., i32.const 0
) are trivial to compute and so it would probably not be a good idea to compute this once at the top of the function (GVN will often dedup constants like this) and store it in a local thereafter. Rematerialization can be considered part of the regalloc problem (because a regalloc might not want to spill and remat is then an option) and some regalloc implementations can do this; one can also do it in a more ad-hoc way in the code one generates prior to allocating registers.
A scheme that might work is to "treeify" the operands before generating code. Basically each operand of an operator could then be:
SingleUse(Value)
: We are the only use of the value, and the operator that produces it is pure, so it will be generated in-place on the stack (your scheme);NormalUse(Value)
: We are one of several uses of the value; it should be generated into a local at its original location in the program, then loaded here; orRematerialize(Value)
: We replicate the operator here. Used for pure ops like iconst
only.So one would generate ops in original program location if used by a NormalUse
elsewhere, and otherwise only at the use-site. This would result in code that then refers to values, which we can subsequently do register allocation over to assign to locals.
There's not really a good document on this that I know of; it's just applications of general concepts (regalloc, remat, instruction scheduling) to a specific and somewhat odd target. We're inventing as we go here. That said it'd probably be good to understand how e.g. LLVM's Wasm backend does this; I haven't really looked at the dataflow side (just the stackifier).
Anyway hopefully this helps!
Thanks so much!
I have started to implement something in: https://github.com/teymour-aldridge/cranelift_codegen_wasm
Unfortunately there are quite a few bugs and many things (e.g. data objects, instructions) are not yet supported.
Since I chanced upon this issue again, I'll leave a note that my waffle crate now has a working Wasm compilation pipeline from an SSA-based IR that is very similar to Cranelift's IR. It could serve either as the model for, or as a codegen library for, a WebAssembly backend for Cranelift. It currently requires reducible control flow (will panic when given an irreducible CFG) but this is "just" an implementation question, as the API is based on a standard CFG. The crate has been fuzzed with roundtrip + differential execution and is in a fairly robust state (can roundtrip complex Wasms like SpiderMonkey).
The main challenge that remains is in the semantic gap between Wasm and CLIF: lowering loses information about tables, heaps, globals and the like, and we would need to "lift" CLIF's raw memory accesses back to Wasm-level concepts. That's maybe not so bad though: one could use a "null" environment when lowering the Wasm, leaving heap accesses to memory zero as-is (address computation is the identity function), and then compile accesses to raw memory in CLIF back to accesses to a single Wasm memory. Function pointers (taking the address of symbols) will require some care. Also, CLIF is not exactly 1-to-1 with Wasm opcodes, especially the SIMD ones. But, doable. This is probably a 1-2 month project for someone at this point.
(I'll note too that some of my thoughts above are specific to a "Wasm roundtripping" context, which probably is actually the least interesting application; serving as a backend for something that starts with a native-code model, like cg_clif, is more interesting, and for that the mappings are a little simpler. One raw address space goes to one Wasm memory, and all function pointers are indices into one table, with entries populated as we observe function-address operators. I guess there is just the question of "relocations" for external symbol references.)
I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate? This way everything the wasm frontend and other frontends need is forced to be public.
I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate? This way everything the wasm frontend and other frontends need is forced to be public.
Yes, for sure; the crate's API polish hasn't gotten much attention, it was just codeveloped for a single initial purpose (roundtripping with modifications). Happy to discuss further there and/or take PRs from folks :-)
A WebAssembly backend would likely be very useful for projects like Ruffle as well
Hi, thanks for the amazing work! Any plans for this? Really excited for the feature, will allow me to drop wasm-opt as a dependency in lots of my projects 👀 (+ working with cranelift is generally very pleasant!).
@bjorn3
@skyne98 we don't have anyone looking at this right now, and as it's a large engineering project, it's unlikely to be prioritized by the folks who work on Cranelift full-time. That doesn't mean we wouldn't like to have it, just that there are other things that are more important, and time is limited.
As always with an open-source project, I'd be happy to review a PR from anyone who wants this and builds it!
Of course, thanks for the reply, much appreciated! It seems like a project that requires quite a bit of expertise, based on your comments, so I hope someone can pick it up one day! (Or I get confident enough with it to help out 😅).
One quick question, if you don't mind: does running WASM in embedded wasmtime (with optimizations turned on & cranelift) already fully utilizes cranelift's optimization capability? And, also, is there potentially a way I get yonk the optimized IR out of the pipeline to print it and check it?
Wasmtime does turn on Cranelift's optimizations by default, yes (this can be controlled with -O opt-level=N
, 0 for off and 2 for all on. You can get unoptimized IR with wasmtime compile --emit-clif
; optimized IR could be printed with a similar option, but you'd need to add similar code around here (and handle the cached case somehow?) using logic similar to this. Or, as a hack, println!()
it directly at that point.
Would have been nice to be able to pass optional callbacks to hook into different parts of the pipeline 👀
Feature
Add a Cranelift backend that compiles to Webassembly.
Benefit
This would allow a compiler to run in the browser and compile code to run in the browser.
Implementation
No plan yet
Alternatives
Not adding a Webassembly backend.