copy / v86

x86 PC emulator and x86-to-wasm JIT, running in the browser
https://copy.sh/v86/
BSD 2-Clause "Simplified" License

Compile hot code blocks to JS functions to speed up CPU #122

Closed · bvibber closed this issue 1 year ago

bvibber commented 7 years ago

In a quick test with a Fibonacci sequence generator, I see about 320x-400x slowdown in v86 (Arch Linux demo in Firefox on a 64-bit Mac) compared to running on the same machine in a native VM. This is ok for old DOS games but is too slow to run a semi-modern desktop.

There are probably a lot of potential gains in the emulator's instruction decoding and interpreter loop, but there will still be constant overhead from the loop, decoding, and dispatch.

It should be possible to do dynamic translation of runs of x86 instructions into JavaScript code, similarly to how QEMU dynamically translates runs of guest instructions to the host ISA and then executes them directly -- at least for hotspots like tight loops and frequently called functions, it may be worth the overhead of creating a JS function and letting the environment compile it down to native code.

I've done a couple of manually 'compiled' tests on short functions from x86 disassembly just to see whether this is feasible in theory. Initial results are encouraging for a loop with arithmetic in it (the linear Fibonacci algorithm runs up to 40x faster than in v86), though I saw only modest gains on a heavily function-call-based test (the recursive Fibonacci algorithm runs only 2-3x faster).
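To give a concrete idea, a hand-translated version of the iterative loop looks roughly like this (a sketch with a made-up register-file layout, not the actual test code):

function compiled_fib_block(cpu) {
    // Load guest registers into JS locals (the reg32 layout is assumed)
    var eax = cpu.reg32[0], ebx = cpu.reg32[3], ecx = cpu.reg32[1];
    while (ecx !== 0) {            // loop:
        var edx = (eax + ebx) | 0; //   mov edx, eax / add edx, ebx
        eax = ebx;                 //   mov eax, ebx
        ebx = edx;                 //   mov ebx, edx
        ecx = (ecx - 1) | 0;       //   dec ecx / jnz loop
    }
    // Write the locals back to the register file on block exit
    cpu.reg32[0] = eax; cpu.reg32[3] = ebx; cpu.reg32[1] = ecx;
}

The compiled block can keep guest state in locals for the whole loop, which is exactly what per-instruction dispatch in the interpreter can't do.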

I plan to do some more experimenting with this technique and will write up some more feasibility notes.

nepx commented 7 years ago

I'm writing my own little dynamic recompiling x86 emulator right now, and I can tell you this is totally feasible, but it's no "small endeavor". One of the main challenges of dynamic recompilation in JavaScript is keeping it neat. The function-call overhead is surprisingly low, as long as you make the generated functions as long as possible.

My first attempt at creating a dynarec hard-coded each instruction. Basically, it was "instruction in, JavaScript out". Maintaining it was hard and debugging was a nightmare.

My current implementation involves some easily-parsed microcode that gets translated into JavaScript. In addition to making it easy to scale up, the front-end (the part that translates x86 into microcode) and the back-end (the part that translates microcode into JavaScript) are completely separated, which makes it easier to debug and maintain. One thing you should definitely look into is optimizing the code on the fly: tokenizing and parsing JavaScript at runtime is extremely expensive performance-wise, but with microcode, all you have to do is remove or change a few ops.
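As a rough sketch of the idea (invented op names; my real format is still in flux), the back-end can be a simple switch over the op array that concatenates JS source:

// Back-end sketch: turn an array of {op, arg} microcode entries into a
// compiled JS function. Op names here are invented for illustration.
var uop = { LOAD_REG: 0, ADD_IMM: 1, STORE_REG: 2 };

function backend(ops) {
    var js = "var t = 0;\n";
    for (var i = 0; i < ops.length; i++) {
        switch (ops[i].op) {
            case uop.LOAD_REG:  js += "t = regs[" + ops[i].arg + "];\n"; break;
            case uop.ADD_IMM:   js += "t = (t + " + ops[i].arg + ") | 0;\n"; break;
            case uop.STORE_REG: js += "regs[" + ops[i].arg + "] = t;\n"; break;
        }
    }
    return new Function("regs", js);
}

An optimization pass is then just a rewrite of the op array (e.g. dropping a redundant LOAD_REG) before the back-end ever sees it -- no JavaScript parsing involved.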

Performance numbers from a (contrived) test look promising: using a simple loop-optimization system, I was able to run over 2 billion instructions per second (admittedly, that was just one test case, but things are looking up). That was because I translated the code into regular JavaScript loops and variables. The test was taken from here and compiled with -O2 with GCC on MinGW.

I'll have a repo up as soon as I stabilize the microcode format (which needs a complete makeover) and have something working (currently, everything prints undefined, thanks to a complete rewrite I have not finished yet).

As a final note, there seems to be a dynamic translator in this file. I'm not exactly sure why it was removed, but it's worth a look; it has some cool tricks that I'm currently trying to integrate.

bvibber commented 7 years ago

@nepx awesome! Using an intermediate microcode sounds smart, similar to the micro-ops QEMU uses, and should indeed make optimization before the JS source stage easier. I'll keep an eye out for your repo...

nepx commented 7 years ago

The main problem with my implementation is the sheer number of opcodes that need to be translated. I'm trying to decide whether to use a binary encoding or a regular text-based encoding scheme. I'd prefer a binary encoding, but then the decoding gets really messy:

function dispatch(code) { // code is an Int32Array of microcode words
    var i = 0, js = "";
    while (i < code.length) {
        js += table[code[i]](i, code);
        i++; // advances past the opcode word, but not past any immediates
    }
    return js;
}
// ...
table[0x1234] = function(i, code) { // ADD r0, imm32
    // The imm32 lives at code[i + 1]. How do we advance "i" past it and
    // have that propagate to the outer loop performantly (i.e. no globals)?
};
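One scheme that would work (just a sketch, not something I've settled on) is to have each handler return the generated JS together with the number of words it consumed:

// Sketch: handlers return [jsSource, wordsConsumed], so immediates get
// skipped without any globals.
table[0x1234] = function(i, code) { // ADD r0, imm32
    return ["regs[0] = (regs[0] + " + code[i + 1] + ") | 0;\n", 2];
};

function dispatch(code) {
    var i = 0, js = "";
    while (i < code.length) {
        var r = table[code[i]](i, code);
        js += r[0];
        i += r[1]; // opcode word plus any immediates
    }
    return js;
}

Allocating a two-element array per op has its own cost, though; packing the operand count into the opcode word itself would avoid the allocation.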

The problem with text is that it takes time to parse, and using split/splice is not very performant, especially when those functions are used repeatedly.

Another problem is choosing which blocks to compile. Currently, my emulator compiles all of them, which is not a very smart idea.

nepx commented 7 years ago

A similar emulation strategy that takes less time to implement is called threaded interpretation. Instead of translating the instruction itself, the compiler generates a few function calls that do the "heavy lifting". The benefit is that the function calls can be put in an array instead of re-decoding every single instruction; most interpretation-based emulators can benefit from this "cache". The problem is that v86 doesn't separate decoding from execution, which makes it hard to keep decoded instructions in a cache. That approach happens to be a lot faster and cleaner (compared to the way Bochs decodes instructions), but it poses a challenge for dynamic instruction translation/threaded interpretation.

copy commented 7 years ago

Yes, this is an idea that has been spinning around my head for a long time. In fact, I've already written a prototype based on this emulator, but I couldn't get significant improvements out of it. It was a bit faster for benchmarks (like compression), but significantly slower for general execution (like booting), for several reasons:

  1. Creating JavaScript functions is slow: They can only be created from strings, which need to be parsed. The first few runs are in slow mode before enough type information is available for the JIT to optimize them. They also take a lot of memory (I had to limit the number of functions to something like 10000)
  2. Translating every instruction was prohibitively expensive in my experiment (does QEMU translate every instruction?), so there need to be some heuristics to check whether a piece of code is called very often, which also slows down normal execution. I have some ideas now about how to do this more efficiently
  3. Checking if code has changed (necessary on every memory write)
  4. Paging is also a pain point, because you can't translate at a page boundary
  5. We'll want to keep cached compiled blocks even if the OS schedules different processes

Overall I concluded that this approach can't feasibly provide a significant speed boost. I only spent two weeks on the prototype, though, and I certainly missed some optimizations.

A very promising technology that could help us is WebAssembly. It will be much cheaper to generate than JavaScript, and it also has planned support for implementing JITs: http://webassembly.org/docs/future-features/#platform-independent-just-in-time-jit-compilation. I'll give this idea another try when I have some free time next year, starting by rewriting instructions.js to allow plugging in a code generator instead of executing code directly.
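The rough shape I have in mind (a hypothetical interface, not a final design) is that each instruction implementation targets a backend that either executes the operation or emits code for it:

// Hypothetical backend interface: the decoder calls the same method on
// whichever backend is active instead of executing code directly.
var interpreter_backend = {
    add32: function(cpu, dst, src) {
        cpu.reg32[dst] = (cpu.reg32[dst] + cpu.reg32[src]) | 0;
    },
};
var codegen_backend = {
    add32: function(gen, dst, src) {
        gen.source += "cpu.reg32[" + dst + "] = " +
            "(cpu.reg32[" + dst + "] + cpu.reg32[" + src + "]) | 0;\n";
    },
};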

A similar emulation strategy that takes less time to implement is called threaded interpretation.

This sounds interesting; I hadn't considered it yet.

nepx commented 7 years ago

Threaded interpretation suffers the same "performance losses" (self-modifying code, etc.). The only thing it has going for it is that it doesn't need to parse code -- all it needs to do is build an array of closures like:

[
 function() { read_dword_to_reg0(addr); },
 function() { add_number_to_reg0(12345); },
 function() { write_reg0_to_eax(addr); }
]

and loop:

for (var i = 0; i < arr.length; i++) {
    arr[i]();
}

Also, JITing doesn't play nicely with minifiers.

An interesting idea for handling self-modifying code is to divide memory into pages. I'm using 4k pages. The pages are internal, meaning that the regular paging/reading functions are abstracted over them, so PAE or other page-size extensions could be implemented without changing the internal page size. The cool thing is that you can change a page's type on the fly:

0-----------1------2---------3...
| Regular   | Code | Regular |
+-----------+------+---------+

so when the guest writes to a "Regular" page, no self-modifying-code handlers are invoked, while on a "Code" page they are. This also opens up interesting ideas like read-only pages or execution-protected pages.
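In JavaScript terms, the write path might look like this (a minimal sketch; the names, the memory size, and invalidateCompiledBlocks are placeholders):

var PAGE_SHIFT = 12;                    // 4k internal pages
var PAGE_REGULAR = 0, PAGE_CODE = 1;
var memory = new Uint8Array(16 << 20);  // 16 MB of guest memory
var pageType = new Uint8Array(memory.length >>> PAGE_SHIFT);

function writeByte(addr, value) {
    var page = addr >>> PAGE_SHIFT;
    if (pageType[page] === PAGE_CODE) {
        // Compiled code lives on this page -- throw it away before
        // mutating, then demote the page back to "Regular".
        invalidateCompiledBlocks(page);
        pageType[page] = PAGE_REGULAR;
    }
    memory[addr] = value;
}

Writes to "Regular" pages then cost only one extra array lookup and branch.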

czenzel commented 7 years ago

I would like to help with this project. Can anybody give me any updates or pointers to try to help? Thanks!

nepx commented 7 years ago

@czenzel Thanks for the interest! Currently, I've been rewriting it, since previous iterations didn't work at all (they kept getting stuck on "Inconsistency detected in ld.so", and one version accidentally loaded part of a string in libc.so.6 and ran cmpsd...). As soon as I get something working (preferably bash), I'll create a repo and push all the code there. For now, the emulator halts with the ld.so help message even though it shouldn't... In the future, I'll definitely try to publish working parts (i.e. the ELF parser) in their own separate repositories.

czenzel commented 7 years ago

Thanks @nepx! I am doing some other interesting work with v86, including qemu user-mode emulation inside the image. I have been able to run a hello world program for the ARM/AARCH64 platform inside v86 with no issues. It would be nice to get a speed boost for emulating those programs inside the image; it was just a little experiment I was playing with. If you want me to take a second look at your code, let me know and I can spend some time debugging it, even if it is not ready for prime time.

nepx commented 6 years ago

@czenzel Well, after several months of hard work, I got a few programs to run, including dash, QEMU, and a little performance program I found somewhere (bash still doesn't work, unfortunately). It supports about 100 system calls, which are enough to run about 50% of applications. SSE support is patchy, and x87 is a hack.


I discovered that dynamic recompilation does speed up programs, and so does block chaining (linking compiled blocks together). perf1, which I took from this site and rewrote with Linux timing code, takes just 2 seconds to run, which indicates that the CPU by itself can run about 60 million "basic blocks" per second (one basic block = ~7 instructions). Currently, the code is a mess, largely because the FPU is a direct port of some assembly and C (it has 80-bit precision, but gcc causes a division by zero and nano causes the stack to overflow, so there's a bit of work to be done there).

I'll create a repo as soon as I clean it up and upload the filesystem it's using (apparently GitHub doesn't like 1 GB uploads, so I'll probably have to upload in chunks and/or compress files).

BelleNottelling commented 6 years ago

Could we add this to v86 with a cache feature? It would be helpful for devs: they could run the emulator, let it build the code blocks, and then save them to a file that gets loaded with the save state. That's my idea, at least. It wouldn't help regular users, but for devs it could be useful.

nepx commented 6 years ago

After several more months of development (and rewriting code), I've made a few observations about dynamic recompilation/JITting.

Overall, though, compiling hot code blocks definitely increases performance drastically -- if done right.

BelleNottelling commented 6 years ago

Hey @nepx you got a working codebase for this? I'm really curious to check it out and think it would be super cool if you posted a fork with it!

nepx commented 6 years ago

No, not for v86, but for a similar emulator that I'm writing right now (it's an application-level VM for Linux executables). I thought some of the discoveries I made would be helpful here.


BelleNottelling commented 6 years ago

Oh, that sounds cool! I've thought about something like that, but with ReactOS for Windows support 😁 I don't know how to do it, though.

copy commented 3 years ago

This is now implemented in the wasm branch. A crude benchmark is in https://github.com/copy/v86/pull/388. During regular emulation, it collects entry points (functions, indirect jump targets) on a per-page basis and also counts how many instructions are executed per page. Once a certain threshold is hit, it compiles the page and any pages it refers to into a WebAssembly module (which happens asynchronously, so the slow backend keeps running while the module is being compiled). That way, generated code can run for a long time without leaving the wasm module, and only a few modules (several hundred, with an upper limit of ~1000) need to be generated. You can find some statistics at the bottom of debug.html.
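In pseudo-JS, the profiling step amounts to something like this (a simplified sketch; the names and the threshold are made up, and the real bookkeeping lives in the wasm branch):

// Simplified sketch of per-page profiling. One counter per 4k page;
// compile_page_to_wasm stands in for the real (asynchronous) compiler.
var instruction_counter = new Uint32Array(1 << 20);
var JIT_THRESHOLD = 10000;

function profile_block(instruction_pointer, block_length) {
    var page = instruction_pointer >>> 12;
    instruction_counter[page] += block_length;
    if (instruction_counter[page] >= JIT_THRESHOLD) {
        instruction_counter[page] = 0;
        // The interpreter keeps running while the module for this page
        // (and the pages it refers to) is compiled in the background.
        compile_page_to_wasm(page);
    }
}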

Overall, it's pretty clear that this is pushing the boundaries of what wasm is able to do. Memory usage of wasm modules is very high, there is only structured control flow, locals (registers) don't live across functions, and modules can only be linked at runtime through a function table.

BelleNottelling commented 1 year ago

I believe this can now be closed as completed

copy commented 1 year ago

I believe this can now be closed as completed

Indeed!