Samsung / walrus

WebAssembly Lightweight RUntime
Apache License 2.0
39 stars 10 forks source link

Implement register allocator in JIT #244

Closed zherczeg closed 4 months ago

zherczeg commented 5 months ago

This branch contains an early version of a register allocator for Walrus. It is capable of compiling some WebAssembly code now. For example, our simple fibonacci code:

(module
 (func $func1 (export "func1") (param $num i32) (result i32) (local i32 i32 i32)
  i32.const 0
  local.set 0
  i32.const 1
  local.set 1

  i32.const 0
  local.set 2

  loop
    local.get 0
    local.get 1
    i32.add
    local.get 1
    local.set 0
    local.set 1

    local.get 2
    i32.const 1
    i32.add
    local.tee 2

    i32.const 100000000
    i32.lt_u
    br_if 0
  end

  local.get 2
)
)

(assert_return (invoke "func1" (i32.const 10)) (i32.const 100000000))

Code with register allocation (R0-R3 are scratch registers):

[[[[[[[  Function   0  ]]]]]]]
1: Const32 (0x5567783b3540)
  Unused
  Result[0]: 18.I32 (I:0x5567783b3540)
2: Const32 (0x5567783b2ff0)
  Unused
  Result[0]: 19.I32 (I:0x5567783b2ff0)
3: Const32 (0x5567783b3730)
  Unused
  Result[0]: 20.I32 (I:0x5567783b3730)
4: MoveI32 (0x5567783b3600)
  Param[0]: 18.I32 (I:0x5567783b3540)
  Result[0]: 21.I32 (R0) [4-14]
5: MoveI32 (0x5567783b3780)
  Param[0]: 19.I32 (I:0x5567783b2ff0)
  Result[0]: 22.I32 (R1) [5-14]
6: MoveI32 (0x5567783b37e0)
  Param[0]: 18.I32 (I:0x5567783b3540)
  Result[0]: 23.I32 (R2) [6-15]
7: Label (0x5567783b2800)
  Jump from: 14
8: I32Add (0x5567783b3840)
  Param[0]: 21.I32 (R0) [4-14]
  Param[1]: 22.I32 (R1) [5-14]
  Result[0]: 24.I32 (R3) [8-10]
9: MoveI32 (0x5567783b38a0)
  Param[0]: 22.I32 (R1) [5-14]
  Result[0]: 21.I32 (R0) [4-14]
10: MoveI32 (0x5567783b3900)
  Param[0]: 24.I32 (R3) [8-10]
  Result[0]: 22.I32 (R1) [5-14]
11: I32Add (0x5567783b3960)
  Param[0]: 23.I32 (R2) [6-15]
  Param[1]: 19.I32 (I:0x5567783b2ff0)
  Result[0]: 27.I32 (R3) [11-13]
12: MoveI32 (0x5567783b39c0)
  Param[0]: 27.I32 (R3) [11-13]
  Result[0]: 23.I32 (R2) [6-15]
13: I32LtU (0x5567783b3a20)
  MergeCompare
  Param[0]: 27.I32 (R3) [11-13]
  Param[1]: 20.I32 (I:0x5567783b3730)
  Result[0]: (flag)
14: JumpIfTrue (0x5567783b3a80)
  Jump to: 7
  Param[0]: (flag)
15: End (0x5567783b3ae0)
  Param[0]: 23.I32 (R2) [6-15]

Release mode runtime on an x86-64 machine:

interp: 0m0.994s base-jit: 0m0.557s (1.78 times as fast) reg-alloc-jit: 0m0.152s (6.53 times as fast)

zherczeg commented 5 months ago

@clover2123 the 64 bit code is mostly working now, only one test case fails. The i32.reinterpret_f32 (and similar opcodes) are implemented as moves, which confuses the type tracking of register allocation. How could we fix this? Shall we introduce new opcodes for this, or do you have another suggestion? Thank you!

The 32 bit code will require more work, that is far more complex because of register pairs (64 bit values are stored in register pairs).

zherczeg commented 5 months ago

Register allocation results on x86-64:

       Test case       |     interpreter  |    baseline_jit  |    regalloc_jit  |
-----------------------+------------------+------------------+------------------+
               change  |           3.48s  |   1.69s (2.06x)  |   0.83s (4.20x)  |
            factorial  |           3.64s  |   1.92s (1.90x)  |   1.32s (2.77x)  |
             fannkuch  |           3.20s  |   1.20s (2.66x)  |   0.75s (4.28x)  |
            fibonacci  |           5.21s  |   3.17s (1.64x)  |   2.32s (2.24x)  |
              gregory  |           3.79s  |   1.56s (2.43x)  |   0.88s (4.33x)  |
                hanoi  |           4.72s  |   2.35s (2.00x)  |   1.69s (2.79x)  |
             heapsort  |           5.22s  |   3.09s (1.69x)  |   2.40s (2.18x)  |
              huffman  |           4.62s  |   1.59s (2.91x)  |   0.86s (5.37x)  |
          kNucleotide  |           3.98s  |   1.49s (2.67x)  |   0.67s (5.96x)  |
      mandelbrotFloat  |           3.59s  |   1.39s (2.58x)  |   0.64s (5.57x)  |
     mandelbrotDouble  |           3.60s  |   1.37s (2.63x)  |   0.83s (4.32x)  |
       matrixMultiply  |           5.74s  |   2.21s (2.60x)  |   1.44s (3.99x)  |
           miniWalrus  |           5.35s  |   2.61s (2.05x)  |   1.81s (2.95x)  |
                nbody  |           4.59s  |   1.74s (2.64x)  |   0.82s (5.56x)  |
              nqueens  |           3.97s  |   1.54s (2.59x)  |   0.98s (4.05x)  |
                prime  |           4.29s  |   1.86s (2.30x)  |   0.78s (5.50x)  |
            quickSort  |           4.41s  |   1.42s (3.09x)  |   0.83s (5.30x)  |
             redBlack  |           5.35s  |   2.39s (2.24x)  |   1.37s (3.90x)  |
                  rsa  |           3.80s  |   1.75s (2.17x)  |   0.60s (6.37x)  |
             salesman  |           4.35s  |   1.78s (2.44x)  |   1.08s (4.01x)  |
  simdMandelbrotFloat  |           4.42s  |   1.45s (3.04x)  |   0.86s (5.11x)  |
  simdMandelbrotDouble  |           5.62s  |   2.03s (2.76x)  |   1.24s (4.54x)  |
            simdNbody  |           2.82s  |   1.04s (2.71x)  |   0.63s (4.46x)  |
   simdMatrixMultiply  |          11.13s  |   4.67s (2.39x)  |   2.27s (4.90x)  |
            ticTacToe  |           4.99s  |   1.94s (2.57x)  |   1.06s (4.73x)  |
-----------------------+------------------+------------------+------------------+
    Average speedup    |                  |           2.43x  |           4.38x  |
zherczeg commented 5 months ago

x86-32 results:

        Test case        |     interpreter  |    baseline_jit  |    regalloc_jit  |
-------------------------+------------------+------------------+------------------+
                 change  |           7.86s  |   2.03s (3.87x)  |   1.27s (6.21x)  |
              factorial  |           7.66s  |   2.36s (3.24x)  |   1.56s (4.90x)  |
               fannkuch  |           6.43s  |   1.69s (3.81x)  |   1.00s (6.45x)  |
              fibonacci  |          13.23s  |   3.77s (3.51x)  |   3.18s (4.16x)  |
                gregory  |           9.42s  |   2.57s (3.67x)  |   1.18s (7.98x)  |
                  hanoi  |          10.03s  |   3.08s (3.26x)  |   2.25s (4.45x)  |
               heapsort  |          11.60s  |   3.52s (3.30x)  |   2.46s (4.71x)  |
                huffman  |           9.85s  |   1.75s (5.64x)  |  0.94s (10.50x)  |
            kNucleotide  |           7.70s  |   1.56s (4.95x)  |   0.88s (8.77x)  |
        mandelbrotFloat  |           6.56s  |   1.47s (4.45x)  |   0.92s (7.14x)  |
       mandelbrotDouble  |           8.06s  |   1.82s (4.42x)  |   1.23s (6.53x)  |
         matrixMultiply  |          13.12s  |   2.12s (6.19x)  |   1.36s (9.65x)  |
             miniWalrus  |          10.97s  |   3.22s (3.41x)  |   2.26s (4.86x)  |
                  nbody  |           8.81s  |   1.69s (5.21x)  |   0.96s (9.15x)  |
                nqueens  |           8.56s  |   1.82s (4.72x)  |   0.97s (8.80x)  |
                  prime  |          12.11s  |   2.47s (4.90x)  |   1.48s (8.16x)  |
              quickSort  |           8.83s  |   1.63s (5.43x)  |  0.82s (10.79x)  |
               redBlack  |           9.55s  |   2.46s (3.87x)  |   1.74s (5.50x)  |
                    rsa  |          10.80s  |   2.48s (4.34x)  |   1.36s (7.95x)  |
               salesman  |           8.31s  |   1.85s (4.49x)  |   1.15s (7.21x)  |
    simdMandelbrotFloat  |           9.05s  |   1.44s (6.27x)  |  0.88s (10.32x)  |
   simdMandelbrotDouble  |          13.77s  |   2.40s (5.73x)  |  1.35s (10.23x)  |
              simdNbody  |           5.56s  |   0.89s (6.24x)  |   0.73s (7.58x)  |
     simdMatrixMultiply  |          25.78s  |   4.81s (5.36x)  |   2.67s (9.66x)  |
              ticTacToe  |           9.03s  |   2.06s (4.38x)  |   1.32s (6.82x)  |
-------------------------+------------------+------------------+------------------+
     Average speedup     |                  |           4.59x  |           7.54x  |
zherczeg commented 4 months ago

@clover2123 may I ask your help? The only remaining issue for this patch is i32.reinterpret_f32 and similar opcodes. These are converted to moves, or omitted in some cases. This confuses the register allocator, since it cannot detect these opcodes. What should we do? I have two ideas:

1) Introducing new interpreter operations for them. Their implementation is simple. 2) Introduce an additional type descriptor bitset (vector<bool>) for the byte code. This is only used by the jit compiler. Small memory increase.

I am open to other ideas as well. What do you think?

clover2123 commented 4 months ago

@zherczeg IMHO 1st option seems more proper than the other because its simple as you mentioned

Introducing new interpreter operations for them. Their implementation is simple.

zherczeg commented 4 months ago

The patch now passes the regression tests. I have added a --jit-no-reg-alloc (JITFlagValue::disableRegAlloc) to disable the register allocation at runtime.

clover2123 commented 4 months ago

Why did you add an option (--jit-no-reg-alloc) for register allocation? Just for comparison?

zherczeg commented 4 months ago

And for debugging. It is not necessary, I can remove it. Or it can be removed later.