Closed zherczeg closed 4 months ago
@clover2123 the 64 bit code is mostly working now, only one test case fails. The i32.reinterpret_f32
(and similar opcodes) are implemented as moves, which confuses the type tracking of register allocation. How could we fix this? Shall we introduce new opcodes for this, or do you have another suggestion? Thank you!
The 32 bit code will require more work, that is far more complex because of register pairs (64 bit values are stored in register pairs).
Register allocation results on x86-64:
Test case | interpreter | baseline_jit | regalloc_jit |
-----------------------+------------------+------------------+------------------+
change | 3.48s | 1.69s (2.06x) | 0.83s (4.20x) |
factorial | 3.64s | 1.92s (1.90x) | 1.32s (2.77x) |
fannkuch | 3.20s | 1.20s (2.66x) | 0.75s (4.28x) |
fibonacci | 5.21s | 3.17s (1.64x) | 2.32s (2.24x) |
gregory | 3.79s | 1.56s (2.43x) | 0.88s (4.33x) |
hanoi | 4.72s | 2.35s (2.00x) | 1.69s (2.79x) |
heapsort | 5.22s | 3.09s (1.69x) | 2.40s (2.18x) |
huffman | 4.62s | 1.59s (2.91x) | 0.86s (5.37x) |
kNucleotide | 3.98s | 1.49s (2.67x) | 0.67s (5.96x) |
mandelbrotFloat | 3.59s | 1.39s (2.58x) | 0.64s (5.57x) |
mandelbrotDouble | 3.60s | 1.37s (2.63x) | 0.83s (4.32x) |
matrixMultiply | 5.74s | 2.21s (2.60x) | 1.44s (3.99x) |
miniWalrus | 5.35s | 2.61s (2.05x) | 1.81s (2.95x) |
nbody | 4.59s | 1.74s (2.64x) | 0.82s (5.56x) |
nqueens | 3.97s | 1.54s (2.59x) | 0.98s (4.05x) |
prime | 4.29s | 1.86s (2.30x) | 0.78s (5.50x) |
quickSort | 4.41s | 1.42s (3.09x) | 0.83s (5.30x) |
redBlack | 5.35s | 2.39s (2.24x) | 1.37s (3.90x) |
rsa | 3.80s | 1.75s (2.17x) | 0.60s (6.37x) |
salesman | 4.35s | 1.78s (2.44x) | 1.08s (4.01x) |
simdMandelbrotFloat | 4.42s | 1.45s (3.04x) | 0.86s (5.11x) |
simdMandelbrotDouble | 5.62s | 2.03s (2.76x) | 1.24s (4.54x) |
simdNbody | 2.82s | 1.04s (2.71x) | 0.63s (4.46x) |
simdMatrixMultiply | 11.13s | 4.67s (2.39x) | 2.27s (4.90x) |
ticTacToe | 4.99s | 1.94s (2.57x) | 1.06s (4.73x) |
-----------------------+------------------+------------------+------------------+
Average speedup | | 2.43x | 4.38x |
x86-32 results:
Test case | interpreter | baseline_jit | regalloc_jit |
-------------------------+------------------+------------------+------------------+
change | 7.86s | 2.03s (3.87x) | 1.27s (6.21x) |
factorial | 7.66s | 2.36s (3.24x) | 1.56s (4.90x) |
fannkuch | 6.43s | 1.69s (3.81x) | 1.00s (6.45x) |
fibonacci | 13.23s | 3.77s (3.51x) | 3.18s (4.16x) |
gregory | 9.42s | 2.57s (3.67x) | 1.18s (7.98x) |
hanoi | 10.03s | 3.08s (3.26x) | 2.25s (4.45x) |
heapsort | 11.60s | 3.52s (3.30x) | 2.46s (4.71x) |
huffman | 9.85s | 1.75s (5.64x) | 0.94s (10.50x) |
kNucleotide | 7.70s | 1.56s (4.95x) | 0.88s (8.77x) |
mandelbrotFloat | 6.56s | 1.47s (4.45x) | 0.92s (7.14x) |
mandelbrotDouble | 8.06s | 1.82s (4.42x) | 1.23s (6.53x) |
matrixMultiply | 13.12s | 2.12s (6.19x) | 1.36s (9.65x) |
miniWalrus | 10.97s | 3.22s (3.41x) | 2.26s (4.86x) |
nbody | 8.81s | 1.69s (5.21x) | 0.96s (9.15x) |
nqueens | 8.56s | 1.82s (4.72x) | 0.97s (8.80x) |
prime | 12.11s | 2.47s (4.90x) | 1.48s (8.16x) |
quickSort | 8.83s | 1.63s (5.43x) | 0.82s (10.79x) |
redBlack | 9.55s | 2.46s (3.87x) | 1.74s (5.50x) |
rsa | 10.80s | 2.48s (4.34x) | 1.36s (7.95x) |
salesman | 8.31s | 1.85s (4.49x) | 1.15s (7.21x) |
simdMandelbrotFloat | 9.05s | 1.44s (6.27x) | 0.88s (10.32x) |
simdMandelbrotDouble | 13.77s | 2.40s (5.73x) | 1.35s (10.23x) |
simdNbody | 5.56s | 0.89s (6.24x) | 0.73s (7.58x) |
simdMatrixMultiply | 25.78s | 4.81s (5.36x) | 2.67s (9.66x) |
ticTacToe | 9.03s | 2.06s (4.38x) | 1.32s (6.82x) |
-------------------------+------------------+------------------+------------------+
Average speedup | | 4.59x | 7.54x |
@clover2123 may I ask your help? The only remaining issue for this patch is i32.reinterpret_f32
and similar opcodes. These are converted to moves, or omitted in some cases. This confuses the register allocator, since it cannot detect these opcodes. What should we do? I have two ideas:
1) Introducing new interpreter operations for them. Their implementation is simple.
2) Introduce an additional type descriptor bitset (vector<bool>
) for the byte code. This is only used by the jit compiler. Small memory increase.
I am open to other ideas as well. What do you think?
@zherczeg IMHO 1st option seems more proper than the other because its simple as you mentioned
Introducing new interpreter operations for them. Their implementation is simple.
The patch now passes the regression tests. I have added a --jit-no-reg-alloc
(JITFlagValue::disableRegAlloc
) to disable the register allocation at runtime.
Why did you add an option (--jit-no-reg-alloc
) for register allocation? Just for comparison?
And for debugging. It is not necessary, I can remove it. Or it can be removed later.
This branch contains an early version of a register allocator for Walrus. It is capable of compiling some WebAssembly code now. For example, our simple fibonacci code:
Code with register allocation (R0-R3 are scratch registers):
Release mode runtime on an x86-64 machine:
interp: 0m0.994s base-jit: 0m0.557s (1.78 times as fast) reg-alloc-jit: 0m0.152s (6.53 times as fast)