Wasmachine, WebAssembly in a FPGA

piranna commented 7 years ago

I have been involved during the last weeks on the development of an implementation of WebAssembly binary format on FPGA called wasmachine. It need still some work (memory access is not implemented yet, FPU operations are only partially implemented, and needs some work on optimizations), but so far is capable to do some operations and functions calling.

While developing it I have found some issues that lead me to deviate from my initial intention to use directly the WebAssembly binary format to what I've called "WebASsembly extendeD" (wasd, to follow the joke of the keys used to control movements on videogames :-P). This changes are minimal and will be done by an auxiliar loader module (both on the chip itself, or previously generated), but probably would be discussed for alternatives or integrated in the binary format itself:

include extra inmediate field on blocks with their size, so it's possible to move program counter after them. if blocks has another inmediate field with the length of the true block to be able to move to the else one.
br_table target labels are decoded, so all of them are of the same size and label could be get in O(1).

kanaka commented 7 years ago

@piranna In wac (WebAssembly in C) I end up doing a pre-scanning phase during module load in order to create a block lookup table with the branch and end addresses for each block. This was probably the second most complex part of wac to implement after FFI thunks (import/exports). Having block end offsets and branch offsets already resolved in the binary format would certainly simplify creating interpreters for wasm (as opposed to compilers or JITs for wasm where this is a minor part of all the transformations and optimizations that will need to be done).

I'm less concerned about point 2 since that's a runtime performance issue (and we're talking about interpreters after all). However, I would expand point 1 to include all block branch/loop commands so that no runtime scanning (forward or backward) or load pre-processing is required when implementing interpreters.

On the other hand, for compilers/JITs, having the logical branch and block locations is probably simpler for validation and optimization. However, if we want the Wasm embedding ecosystem to bloom I think removing complexity from that space (i.e. moving some complexity to JITs/compilers) might be worthwhile. A compromise would be having both the logical and the bytecode offset in the branches/blocks. The downside is that it would expand the binary module size slightly.

So +1 from another Wasm non-web embedder :-)

piranna commented 7 years ago

I end up doing a pre-scanning phase during module load in order to create a block lookup table with the branch and end addresses for each block

This was my first approach, but this would need too much accesses to the memory to find the address of the next instruction, that's why I decided to inline it. The same for br_table, since I would need to parse the target_table inmediate to find the label, so by pre-decoding its values to uint32 values they can be accessed just by offset.

I would expand point 1 to include all block branch/loop commands so that no runtime scanning (forward or backward) or load pre-processing is required when implementing interpreters

+1. I understand that the current format is mostly focused on reduce size of data instead of being directly executable (that's why it uses LEB128 values everywhere instead of regular integers of fixed size, since they need less bytes), that's why it's using end opcodes as delimiter instead of the actual size of the blocks.

So +1 from another Wasm non-web embedder :-)

Thank you, yours looks nice too! :-D

piranna commented 7 years ago

Another issue I have seen is that operations are not grouped using power-of-two prefixes, so it's not easy to decode them fast if not using a big select case. Ideally, from the opcode bits, higher ones should be able to use to decide what submodule will fully decode and interpret the instruction, for example grouping them in 16x16 or 8x32 groups, using the higher 4 or 3 bits as selector. This could be easily done by adding more blank spaces on the opcodes. This also has the drawback of extending the opcodes space and probably would need two bytes for them, but as alternatives the smaller groups could be joined and/or the less used instructions would be put on the extended instructions area as indicated post-mvp.

RyanLamansky commented 7 years ago

You should see all the trouble we have to go through to get WASM to run on X86... it's almost as if direct hardware execution wasn't a design goal! 😉

Personally (as someone who's trying to write a .NET Core JIT implementation in his spare time), I think the MVP format (which is the final 1.0 format, any changes from here on will be backward compatible) is very good. All platforms are expected to make at least one pass to convert the code to something closer to native.

jfbastien commented 7 years ago

Unless I misunderstand the proposed changes are binary-incompatible. We can't really change the binary format because Firefox and Chrome already ship WebAssembly. Is that correct?

Assuming it is so, I'll close the issue. Happy to reopen if not.

piranna commented 7 years ago

Unless I misunderstand the proposed changes are binary-incompatible

Mostly yes, they are incompatible... I have tried so far to be aligned with WebAssembly spec, but at least these two points (and seems other ones related to the module structure) make it impossible or really dificult and poor performant to have direct execution, so I'm trying to write a new "decoded" spec that would allow direct execution. This post was mostly intented as a discussion point about what other minimal changes would be needed to achieve that.

As @RyanLamansky has told, WebAssembly spec is not bad, in fact the binary format is somewhat easy to understand, but it's focused to use the smallest space possible. Being that it's purposse, probably I would have done similar design decissions.

titzer commented 7 years ago

FWIW V8's interpreter for WASM is written in C++ and used for debugging. It uses a side table of jump targets that are looked up by origin (e.g. the position of a br, br_if, if, else, or br_table) opcode. This lookup table could be implemented in hardware. Another idea is instead of generating a side-table, generate a separate (much shorter) snippet of bytecode that contains only control instructions and refers back to the original code. Since most programs have far fewer control instructions than regular instructions, the special control instructions on the side will be a small fraction of the size. The processor would look up the control instructions for a function upon entry and then maintain both a program counter in the control instructions and the original instructions; all operations would be constant time.

piranna commented 7 years ago

Thanks for your input, but since I'm going to need to do some other optimizations too, I'm doing a decoded spec with inline target address, LEB128 integers decoded and an optimized opcodes structure.

WebAssembly / design

Wasmachine, WebAssembly in a FPGA #1050