Samsung / walrus

WebAssembly Lightweight RUntime
Apache License 2.0

x86 slowdown compared to old jit #97

Closed zherczeg closed 11 months ago

zherczeg commented 1 year ago

We have noticed a very interesting slowdown on x86. Consider the following simple WebAssembly code (it computes Fibonacci numbers):

(module
(func $func1 (export "func1") (param $num i32) (result i32) (local i32 i32 i32)
  i32.const 0
  local.set 0
  i32.const 1
  local.set 1
  i32.const 0
  local.set 2

  loop
  local.get 0
  local.get 1
  i32.add
  local.get 1
  local.set 0
  local.set 1

  local.get 2
  i32.const 1
  i32.add
  local.tee 2

  i32.const 10000000
  i32.lt_u
  br_if 0
  end

  local.get 2
)
)
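
For readers less used to the stack-based form, here is a rough C++ equivalent of the function above (the names are illustrative, and the unused $num parameter is kept only to match the signature):

#include <cstdint>

// Rough C++ equivalent of $func1: locals 0 and 1 hold consecutive
// Fibonacci numbers, local 2 is the loop counter (and the result).
uint32_t func1(uint32_t num) {
    (void)num;                      // $num is never read in the wat either
    uint32_t a = 0, b = 1, i = 0;   // locals 0, 1, 2
    do {
        uint32_t sum = a + b;       // i32.add
        a = b;                      // local.set 0
        b = sum;                    // local.set 1
        i += 1;                     // local.tee 2
    } while (i < 10000000);         // i32.lt_u / br_if 0
    return i;                       // local.get 2
}

Note that the function returns the counter (local 2), not the Fibonacci value; the point of the test is the loop itself.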

Let me show the machine code generated by the old jit (which does not use the walrus byte code) and by the new jit:

Old JIT: 0.101s

movl $0x0,0x4(%r15)
movl $0x1,0x8(%r15)
movl $0x0,0xc(%r15)
mov 0x4(%r15),%edx
add 0x8(%r15),%edx
mov %edx,(%r15)
mov 0x8(%r15),%edx
mov %edx,0x4(%r15)
mov (%r15),%edx
mov %edx,0x8(%r15)
addl $0x1,0xc(%r15)
cmpl $0x989680,0xc(%r15)
jb 0x7ffff7a4909

New JIT: 0.158s

movl $0x0,(%r15)
movl $0x1,0x8(%r15)
movl $0x0,0x10(%r15)
mov (%r15),%edx
add 0x8(%r15),%edx
mov %edx,0x20(%r15)
mov 0x8(%r15),%rdx
mov %rdx,(%r15)
mov 0x20(%r15),%rdx
mov %rdx,0x8(%r15)
addl $0x1,0x10(%r15)
cmpl $0x989680,0x10(%r15)
jb 0x7ffff7a49094

These are basically the same, except that they use different locations for the local variables. There is another difference: the new code uses a 64-bit copy in "mov 0x8(%r15),%rdx" even though the value is only 32 bits wide. We have measured it on multiple systems, and somehow the old code is 50% (or more) faster.
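
One plausible micro-architectural explanation (an assumption, not confirmed in the thread) is that the 64-bit reload of a slot that was just written with a 32-bit store cannot be served by store-to-load forwarding on common x86 cores, so every loop iteration pays a forwarding stall. A minimal sketch that reproduces the pattern (GCC/Clang extended asm on x86-64; the names and counts are illustrative):

#include <chrono>
#include <cstdint>
#include <cstdio>

static uint64_t slot;  // stands in for one 64-bit stack slot off %r15

int main() {
    uint32_t acc = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (uint32_t i = 0; i < 10000000; i++) {
        uint32_t v;
        __asm__ volatile(
            "movl %2, %1\n\t"   // 32-bit store into the slot
            "movq %1, %q0\n\t"  // 64-bit reload, wider than the store
            : "=r"(v), "+m"(slot)
            : "r"(i));
        acc += v;
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("acc=%u time=%.3fs\n", acc,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}

Changing the reload to "movl %1, %k0" (a 32-bit load, the pattern the old jit emits) lets the store forward its data and should make the loop noticeably faster on most x86 cores.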

The byte code dump of the interpreter:

0: const_32 dstOffset: 0 value: 0
24: const_32 dstOffset: 8 value: 1
48: const_32 dstOffset: 16 value: 0
72: i32.add srcOffset[0]: 0 srcOffset[1]: 8 dstOffset: 32
96: move_64 srcOffset: 8 dstOffset: 0
120: move_64 srcOffset: 32 dstOffset: 8
144: const_32 dstOffset: 40 value: 1
168: i32.add srcOffset[0]: 16 srcOffset[1]: 40 dstOffset: 16
192: const_32 dstOffset: 40 value: 10000000
216: i32.lt_u srcOffset[0]: 16 srcOffset[1]: 40 dstOffset: 32
240: jump_if_true srcOffset: 32 dst: 72
264: end resultOffsets: 16

It uses move_64 operations, and the jit simply translates them to 64-bit movs.

ksh8281 commented 1 year ago

IMO, using move_64 in the interpreter is wrong. I will look into it.

clover2123 commented 1 year ago

(Actually not related to this issue.) I have a question about the stack area. In JIT code, where are the operands and temporary values of the bytecode located? In the native stack area, as in interpreter mode, or in a separate heap area allocated only for the JIT? Is it also right that register %r15 is reserved to point to the stack address?

zherczeg commented 1 year ago

For simplicity, it uses the area allocated by the interpreter. Basically, the interpret() call is replaced by a native call. I don't think it is worth having a different stack layout for the jit. As for optimizations, any stack area improvement would benefit the jit and the interpreter in the same way.
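
In other words, both tiers address locals and temporaries at fixed byte offsets from the same stack base (the %r15 register in the dumps above). A minimal sketch of that design, with hypothetical names rather than the actual walrus API:

#include <cstdint>

// Hypothetical illustration: jitCode is the native entry produced for a
// function. It receives the same byte-addressed value stack that
// interpret() would use, so both tiers share one stack layout.
using JitEntry = void (*)(uint8_t* stackBase);

void execute(uint8_t* stackBase, JitEntry jitCode) {
    if (jitCode)
        jitCode(stackBase);  // native code indexes stackBase directly
    // else interpret(stackBase) would walk the bytecode over the same layout
}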

ksh8281 commented 12 months ago

A recent version of the interpreter produces this result:

required stack size: 48 bytes
required stack size due to local: 24 bytes
bytecode size: 242 bytes

     0 const32 dstOffset: 32 value: 0
    16 move32 srcOffset: 32 dstOffset: 0
    32 const32 dstOffset: 32 value: 1
    48 move32 srcOffset: 32 dstOffset: 8
    64 const32 dstOffset: 32 value: 0
    80 move32 srcOffset: 32 dstOffset: 16
    96 I32Add src1: 0 src2: 8 dst: 32
   112 move32 srcOffset: 8 dstOffset: 0
   128 move32 srcOffset: 32 dstOffset: 8
   144 const32 dstOffset: 40 value: 1
   160 I32Add src1: 16 src2: 40 dst: 16
   176 const32 dstOffset: 40 value: 10000000
   192 I32LtU src1: 16 src2: 40 dst: 32
   208 jump_if_true srcOffset: 32 dst: 96
   224 end resultOffsets: 16

zherczeg commented 11 months ago

This looks better! Thank you! The simple test is twice as fast now.