Epoch performance issues

wjr-z commented 1 year ago

At present, there seem to be serious issues with the epoch mechanism and its register usage. For example, the following is a simple comparison of the native and epoch assembly generated for a nested (double) loop.

wasmtime release-13.0.0

native :

       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 39 e2                cmp    %rsp,%r10
       e:   0f 87 2a 00 00 00       ja     3e <wasm[0]::function[0]+0x3e>
      14:   31 c9                   xor    %ecx,%ecx
      16:   45 31 c9                xor    %r9d,%r9d
      19:   41 83 c1 01             add    $0x1,%r9d
      1d:   41 81 f9 00 12 7a 00    cmp    $0x7a1200,%r9d
      24:   0f 8c ef ff ff ff       jl     19 <wasm[0]::function[0]+0x19>
      2a:   83 c1 01                add    $0x1,%ecx
      2d:   81 f9 40 9c 00 00       cmp    $0x9c40,%ecx
      33:   0f 8c dd ff ff ff       jl     16 <wasm[0]::function[0]+0x16>
      39:   48 89 ec                mov    %rbp,%rsp
      3c:   5d                      pop    %rbp
      3d:   c3                      retq   
      3e:   0f 0b                   ud2

epoch :

       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 39 e2                cmp    %rsp,%r10
       e:   0f 87 04 01 00 00       ja     118 <wasm[0]::function[0]+0x118>
      14:   48 83 ec 20             sub    $0x20,%rsp
      18:   48 89 1c 24             mov    %rbx,(%rsp)
      1c:   4c 89 6c 24 08          mov    %r13,0x8(%rsp)
      21:   4c 89 7c 24 10          mov    %r15,0x10(%rsp)
      26:   48 8b 77 08             mov    0x8(%rdi),%rsi
      2a:   4c 8b 4e 10             mov    0x10(%rsi),%r9
      2e:   4c 8b 6f 18             mov    0x18(%rdi),%r13
      32:   4d 8b 55 00             mov    0x0(%r13),%r10
      36:   4d 39 ca                cmp    %r9,%r10
      39:   0f 83 56 00 00 00       jae    95 <wasm[0]::function[0]+0x95>
      3f:   45 31 ff                xor    %r15d,%r15d
      42:   4d 8b 55 00             mov    0x0(%r13),%r10
      46:   4d 39 ca                cmp    %r9,%r10
      49:   0f 83 6f 00 00 00       jae    be <wasm[0]::function[0]+0xbe>
      4f:   31 db                   xor    %ebx,%ebx
      51:   4d 8b 5d 00             mov    0x0(%r13),%r11
      55:   4d 39 cb                cmp    %r9,%r11
      58:   0f 83 8d 00 00 00       jae    eb <wasm[0]::function[0]+0xeb>
      5e:   83 c3 01                add    $0x1,%ebx
      61:   81 fb 00 12 7a 00       cmp    $0x7a1200,%ebx
      67:   0f 8c e4 ff ff ff       jl     51 <wasm[0]::function[0]+0x51>
      6d:   41 83 c7 01             add    $0x1,%r15d
      71:   41 81 ff 40 9c 00 00    cmp    $0x9c40,%r15d
      78:   0f 8c c4 ff ff ff       jl     42 <wasm[0]::function[0]+0x42>
      7e:   48 8b 1c 24             mov    (%rsp),%rbx
      82:   4c 8b 6c 24 08          mov    0x8(%rsp),%r13
      87:   4c 8b 7c 24 10          mov    0x10(%rsp),%r15
      8c:   48 83 c4 20             add    $0x20,%rsp
      90:   48 89 ec                mov    %rbp,%rsp
      93:   5d                      pop    %rbp
      94:   c3                      retq   
      95:   4d 39 ca                cmp    %r9,%r10
      98:   0f 82 a1 ff ff ff       jb     3f <wasm[0]::function[0]+0x3f>
      9e:   48 8b 47 38             mov    0x38(%rdi),%rax
      a2:   48 8b 80 b0 00 00 00    mov    0xb0(%rax),%rax
      a9:   48 83 ec 20             sub    $0x20,%rsp
      ad:   48 89 f9                mov    %rdi,%rcx
      b0:   ff d0                   callq  *%rax
      b2:   48 83 c4 20             add    $0x20,%rsp
      b6:   49 89 c1                mov    %rax,%r9
      b9:   e9 81 ff ff ff          jmpq   3f <wasm[0]::function[0]+0x3f>
      be:   4c 8b 4e 10             mov    0x10(%rsi),%r9
      c2:   4d 39 ca                cmp    %r9,%r10
      c5:   0f 82 84 ff ff ff       jb     4f <wasm[0]::function[0]+0x4f>
      cb:   48 8b 47 38             mov    0x38(%rdi),%rax
      cf:   48 8b 80 b0 00 00 00    mov    0xb0(%rax),%rax
      d6:   48 83 ec 20             sub    $0x20,%rsp
      da:   48 89 f9                mov    %rdi,%rcx
      dd:   ff d0                   callq  *%rax
      df:   48 83 c4 20             add    $0x20,%rsp
      e3:   49 89 c1                mov    %rax,%r9
      e6:   e9 64 ff ff ff          jmpq   4f <wasm[0]::function[0]+0x4f>
      eb:   4c 8b 4e 10             mov    0x10(%rsi),%r9
      ef:   4d 39 cb                cmp    %r9,%r11
      f2:   0f 82 66 ff ff ff       jb     5e <wasm[0]::function[0]+0x5e>
      f8:   48 8b 4f 38             mov    0x38(%rdi),%rcx
      fc:   48 8b 91 b0 00 00 00    mov    0xb0(%rcx),%rdx
     103:   48 83 ec 20             sub    $0x20,%rsp
     107:   48 89 f9                mov    %rdi,%rcx
     10a:   ff d2                   callq  *%rdx
     10c:   48 83 c4 20             add    $0x20,%rsp
     110:   49 89 c1                mov    %rax,%r9
     113:   e9 46 ff ff ff          jmpq   5e <wasm[0]::function[0]+0x5e>
     118:   0f 0b                   ud2

In the example above, extra registers such as rax and rcx are assigned to the epoch check blocks. This is only a simple example; more complex workloads see a significant performance impact. On box_seal.wasm, the cost has reached 25%! After trying to manually fix the issue with epoch (Unstable), the cost was only less than 7%. The inner and outer loops are especially puzzling: the outer loop loads the current epoch into r10, but the inner loop uses r11, and I cannot understand why.
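
For readers following the disassembly, here is a minimal Rust sketch of what each epoch check appears to do, judging only from the listing above: load the shared epoch counter, compare it with a deadline cached in a register, and on the cold path reload the deadline from memory and call out only if it has really passed. `EpochState`, `new_epoch`, `epoch_check`, and `run_loop` are hypothetical names for illustration, not Wasmtime's actual types or functions, and the deadline policy inside `new_epoch` is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the per-store state the generated code reads.
struct EpochState {
    /// Shared counter, bumped elsewhere (the load through `(%r13)` above).
    current_epoch: AtomicU64,
    /// Deadline kept in memory (the load from `0x10(%rsi)` above).
    deadline: u64,
    /// Number of slow-path callouts, tracked for illustration only.
    callouts: u64,
}

impl EpochState {
    /// Models the out-of-line call (`callq *%rax`); the returned value
    /// becomes the new cached deadline. The "+ 1" policy is an assumption.
    fn new_epoch(&mut self) -> u64 {
        self.callouts += 1;
        self.deadline = self.current_epoch.load(Ordering::Relaxed) + 1;
        self.deadline
    }
}

/// One epoch check; `cached_deadline` plays the role of `%r9`.
fn epoch_check(state: &mut EpochState, cached_deadline: &mut u64) {
    let now = state.current_epoch.load(Ordering::Relaxed);
    if now >= *cached_deadline {
        // Cold path: reload the deadline from memory and compare again.
        *cached_deadline = state.deadline;
        if now >= *cached_deadline {
            *cached_deadline = state.new_epoch();
        }
    }
}

/// A do-while style loop with a check at the loop header, as in the listing.
fn run_loop(state: &mut EpochState, iterations: u32) -> u32 {
    let mut cached = state.deadline;
    let mut i = 0;
    loop {
        epoch_check(state, &mut cached);
        i += 1;
        if i >= iterations {
            break;
        }
    }
    i
}
```

In this model the hot path is one load, one compare, and one branch per iteration. The callout needs argument and return registers, which is one plausible reason rax and rcx show up in the cold blocks.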

alexcrichton commented 1 year ago

Thanks for the report! Would you be able to share a wasm file or an example loop in source code to help reproduce this locally?

wjr-z commented 1 year ago

Thank you for your reply. In fact, I am actively searching for the cause. Here is a link to box_seal.wasm: https://github.com/jedisct1/webassembly-benchmarks/blob/master/2021-Q1/wasm/box_seal.wasm And here is the code for the example loop.

(module
 (export "_start" (func $_start))
 (func $_start (; 0 ;)
    (local $i i32)
    (local $i2 i32)
    i32.const 0
    local.set $i
    loop $loop
        i32.const 0
        local.set $i2
        loop $loop2
            local.get $i2
            i32.const 1
            i32.add
            local.set $i2
            local.get $i2
            i32.const 80000
            i32.lt_s 
            br_if $loop2
        end $loop2
        local.get $i
        i32.const 1
        i32.add
        local.set $i
        local.get $i
        i32.const 40000
        i32.lt_s 
        br_if $loop
    end $loop
 )
)
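
To make the cost structure of this loop concrete, the following Rust sketch counts how many epoch checks would execute if one check runs at function entry and one at each loop header. That placement is an assumption read off the epoch disassembly earlier in the thread (checks at offsets 0x32, 0x42, and 0x51). `count_epoch_checks` is a hypothetical helper, not part of Wasmtime.

```rust
/// Counts epoch checks for a nested do-while loop shaped like the WAT above,
/// assuming one check at function entry and one at each loop header.
fn count_epoch_checks(outer_bound: u64, inner_bound: u64) -> u64 {
    let mut checks = 1; // check at function entry
    let mut i = 0;
    loop {
        checks += 1; // check at the outer loop header ($loop)
        let mut i2 = 0;
        loop {
            checks += 1; // check at the inner loop header ($loop2)
            i2 += 1;
            if i2 >= inner_bound {
                break; // mirrors `br_if $loop2` not being taken
            }
        }
        i += 1;
        if i >= outer_bound {
            break; // mirrors `br_if $loop` not being taken
        }
    }
    checks
}
```

For bounds of at least 1, the count is 1 + outer + outer × inner. With the bounds in the WAT (40000 outer, 80000 inner) that comes to 3,200,040,001 checks, almost all of them in the inner loop, whose body is otherwise a single add, compare, and branch.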

alexcrichton commented 1 year ago

Thanks! Could you detail a bit more what you mean by "manually fix the issue with epoch (Unstable), the cost was only less than 7%"?

Looking at the disassembly it's not obvious to me what the issue is and how such a large win could be gained, so I'm curious how you were able to achieve it!

wjr-z commented 1 year ago

Unfortunately, the data on the server was lost. I'll try to reproduce it next week.

bytecodealliance / wasmtime

Epoch performance issues #7244