bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0

Epoch performance issues #7244

Open wjr-z opened 1 year ago

wjr-z commented 1 year ago

At present, there seem to be serious issues with the epoch mechanism and its register usage. For example, the following is a simple comparison of the native and epoch assembly for a double (nested) loop.

wasmtime release-13.0.0

native:

       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 39 e2                cmp    %rsp,%r10
       e:   0f 87 2a 00 00 00       ja     3e <wasm[0]::function[0]+0x3e>
      14:   31 c9                   xor    %ecx,%ecx
      16:   45 31 c9                xor    %r9d,%r9d
      19:   41 83 c1 01             add    $0x1,%r9d
      1d:   41 81 f9 00 12 7a 00    cmp    $0x7a1200,%r9d
      24:   0f 8c ef ff ff ff       jl     19 <wasm[0]::function[0]+0x19>
      2a:   83 c1 01                add    $0x1,%ecx
      2d:   81 f9 40 9c 00 00       cmp    $0x9c40,%ecx
      33:   0f 8c dd ff ff ff       jl     16 <wasm[0]::function[0]+0x16>
      39:   48 89 ec                mov    %rbp,%rsp
      3c:   5d                      pop    %rbp
      3d:   c3                      retq   
      3e:   0f 0b                   ud2    

epoch:

       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 39 e2                cmp    %rsp,%r10
       e:   0f 87 04 01 00 00       ja     118 <wasm[0]::function[0]+0x118>
      14:   48 83 ec 20             sub    $0x20,%rsp
      18:   48 89 1c 24             mov    %rbx,(%rsp)
      1c:   4c 89 6c 24 08          mov    %r13,0x8(%rsp)
      21:   4c 89 7c 24 10          mov    %r15,0x10(%rsp)
      26:   48 8b 77 08             mov    0x8(%rdi),%rsi
      2a:   4c 8b 4e 10             mov    0x10(%rsi),%r9
      2e:   4c 8b 6f 18             mov    0x18(%rdi),%r13
      32:   4d 8b 55 00             mov    0x0(%r13),%r10
      36:   4d 39 ca                cmp    %r9,%r10
      39:   0f 83 56 00 00 00       jae    95 <wasm[0]::function[0]+0x95>
      3f:   45 31 ff                xor    %r15d,%r15d
      42:   4d 8b 55 00             mov    0x0(%r13),%r10
      46:   4d 39 ca                cmp    %r9,%r10
      49:   0f 83 6f 00 00 00       jae    be <wasm[0]::function[0]+0xbe>
      4f:   31 db                   xor    %ebx,%ebx
      51:   4d 8b 5d 00             mov    0x0(%r13),%r11
      55:   4d 39 cb                cmp    %r9,%r11
      58:   0f 83 8d 00 00 00       jae    eb <wasm[0]::function[0]+0xeb>
      5e:   83 c3 01                add    $0x1,%ebx
      61:   81 fb 00 12 7a 00       cmp    $0x7a1200,%ebx
      67:   0f 8c e4 ff ff ff       jl     51 <wasm[0]::function[0]+0x51>
      6d:   41 83 c7 01             add    $0x1,%r15d
      71:   41 81 ff 40 9c 00 00    cmp    $0x9c40,%r15d
      78:   0f 8c c4 ff ff ff       jl     42 <wasm[0]::function[0]+0x42>
      7e:   48 8b 1c 24             mov    (%rsp),%rbx
      82:   4c 8b 6c 24 08          mov    0x8(%rsp),%r13
      87:   4c 8b 7c 24 10          mov    0x10(%rsp),%r15
      8c:   48 83 c4 20             add    $0x20,%rsp
      90:   48 89 ec                mov    %rbp,%rsp
      93:   5d                      pop    %rbp
      94:   c3                      retq   
      95:   4d 39 ca                cmp    %r9,%r10
      98:   0f 82 a1 ff ff ff       jb     3f <wasm[0]::function[0]+0x3f>
      9e:   48 8b 47 38             mov    0x38(%rdi),%rax
      a2:   48 8b 80 b0 00 00 00    mov    0xb0(%rax),%rax
      a9:   48 83 ec 20             sub    $0x20,%rsp
      ad:   48 89 f9                mov    %rdi,%rcx
      b0:   ff d0                   callq  *%rax
      b2:   48 83 c4 20             add    $0x20,%rsp
      b6:   49 89 c1                mov    %rax,%r9
      b9:   e9 81 ff ff ff          jmpq   3f <wasm[0]::function[0]+0x3f>
      be:   4c 8b 4e 10             mov    0x10(%rsi),%r9
      c2:   4d 39 ca                cmp    %r9,%r10
      c5:   0f 82 84 ff ff ff       jb     4f <wasm[0]::function[0]+0x4f>
      cb:   48 8b 47 38             mov    0x38(%rdi),%rax
      cf:   48 8b 80 b0 00 00 00    mov    0xb0(%rax),%rax
      d6:   48 83 ec 20             sub    $0x20,%rsp
      da:   48 89 f9                mov    %rdi,%rcx
      dd:   ff d0                   callq  *%rax
      df:   48 83 c4 20             add    $0x20,%rsp
      e3:   49 89 c1                mov    %rax,%r9
      e6:   e9 64 ff ff ff          jmpq   4f <wasm[0]::function[0]+0x4f>
      eb:   4c 8b 4e 10             mov    0x10(%rsi),%r9
      ef:   4d 39 cb                cmp    %r9,%r11
      f2:   0f 82 66 ff ff ff       jb     5e <wasm[0]::function[0]+0x5e>
      f8:   48 8b 4f 38             mov    0x38(%rdi),%rcx
      fc:   48 8b 91 b0 00 00 00    mov    0xb0(%rcx),%rdx
     103:   48 83 ec 20             sub    $0x20,%rsp
     107:   48 89 f9                mov    %rdi,%rcx
     10a:   ff d2                   callq  *%rdx
     10c:   48 83 c4 20             add    $0x20,%rsp
     110:   49 89 c1                mov    %rax,%r9
     113:   e9 46 ff ff ff          jmpq   5e <wasm[0]::function[0]+0x5e>
     118:   0f 0b                   ud2    

In the example above, some registers, such as rax and rcx, are assigned to the epoch check blocks. This is just a simple example; more complex workloads see a significant performance impact. On box_seal.wasm, the cost has reached 25%! And after trying to manually fix the issue with epoch (Unstable), the cost was only less than 7%. In particular, the epoch check in the outer loop uses r10 for storage, but the one in the inner loop uses r11, which I cannot understand.
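For readers who do not want to trace the disassembly, the check pattern visible in the epoch listing can be modelled roughly as follows. This is a minimal sketch, not Wasmtime's actual code generation: `EpochState`, `check`, `slow_path`, and `run` are hypothetical names, and extending the deadline by one tick in the slow path is an assumed policy chosen only to make the model terminate. It mirrors what the listing appears to do: compare the current epoch against a cached deadline at function entry and at each loop header, and on the out-of-line path reload the deadline, compare again, and only then call out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the state the generated code reads.
struct EpochState {
    current: AtomicU64,   // epoch counter (bumped elsewhere in a real runtime)
    deadline: u64,        // deadline stored in memory
    slow_path_calls: u64, // how often the model "called out"
}

impl EpochState {
    /// Models the out-of-line block: reload the deadline, compare again,
    /// and only call out if the epoch has really reached it.
    #[cold]
    fn slow_path(&mut self, seen: u64) -> u64 {
        if seen < self.deadline {
            return self.deadline; // deadline already moved; just refresh the cache
        }
        self.slow_path_calls += 1;
        self.deadline = seen + 1; // assumed policy: extend by one tick
        self.deadline
    }

    /// Models the inline part: one load, one compare, one rarely taken branch.
    #[inline(always)]
    fn check(&mut self, cached_deadline: &mut u64) {
        let now = self.current.load(Ordering::Relaxed);
        if now >= *cached_deadline {
            *cached_deadline = self.slow_path(now);
        }
    }
}

/// The nested do-while loop with a check at entry and at both loop headers.
fn run(state: &mut EpochState, outer: u32, inner: u32) -> u64 {
    let mut cached = state.deadline; // deadline kept in a register
    let mut iterations = 0u64;
    state.check(&mut cached); // function entry
    let mut i = 0;
    loop {
        state.check(&mut cached); // outer loop header
        let mut j = 0;
        loop {
            state.check(&mut cached); // inner loop header
            iterations += 1;
            j += 1;
            if j >= inner {
                break;
            }
        }
        i += 1;
        if i >= outer {
            break;
        }
    }
    iterations
}
```

In this model the inline part of each check needs only the epoch pointer, the cached deadline, and one scratch register, so any further register pressure would come from the out-of-line blocks.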

alexcrichton commented 1 year ago

Thanks for the report! Would you be able to share a wasm file or an example loop in source code to help reproduce this locally?

wjr-z commented 1 year ago

> Thanks for the report! Would you be able to share a wasm file or an example loop in source code to help reproduce this locally?

Thank you for your reply. I am still actively searching for the cause. Here is the link to box_seal.wasm: https://github.com/jedisct1/webassembly-benchmarks/blob/master/2021-Q1/wasm/box_seal.wasm And this is the code for the example loop:

(module
 (export "_start" (func $_start))
 (func $_start (; 0 ;)
    (local $i i32)
    (local $i2 i32)
    i32.const 0
    local.set $i
    loop $loop
        i32.const 0
        local.set $i2
        loop $loop2
            local.get $i2
            i32.const 1
            i32.add
            local.set $i2
            local.get $i2
            i32.const 80000
            i32.lt_s 
            br_if $loop2
        end $loop2
        local.get $i
        i32.const 1
        i32.add
        local.set $i
        local.get $i
        i32.const 40000
        i32.lt_s 
        br_if $loop
    end $loop
 )
)
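
To get a feel for how often the epoch check runs in this loop, here is a small back-of-the-envelope sketch. It assumes one check at function entry and one at each loop header, which is what the epoch disassembly earlier in the thread appears to show; `epoch_checks` is a hypothetical helper, not part of Wasmtime. The bounds are parameters because the disassembly compares the inner counter against 0x7a1200 (8,000,000), while this listing uses 80000.

```rust
/// Estimated number of epoch checks for the nested do-while loop, assuming
/// one check at function entry and one at every loop header.
fn epoch_checks(outer: u64, inner: u64) -> u64 {
    // A do-while body always runs at least once, even with a bound of 0.
    let outer = outer.max(1);
    let inner = inner.max(1);
    let entry = 1;
    let outer_headers = outer;         // outer header runs once per outer pass
    let inner_headers = outer * inner; // inner header runs once per inner pass
    entry + outer_headers + inner_headers
}
```

With the bounds from this listing, `epoch_checks(40_000, 80_000)` comes to 3,200,040,001 checks, so even a few extra instructions per check are multiplied billions of times.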
alexcrichton commented 1 year ago

Thanks! Could you detail a bit more what you mean by "manually fix the issue with epoch (Unstable), the cost was only less than 7%"?

Looking at the disassembly it's not obvious to me what the issue is and how such a large win could be gained, so I'm curious how you were able to achieve it!

wjr-z commented 1 year ago

> Thanks! Could you detail a bit more what you mean by "manually fix the issue with epoch (Unstable), the cost was only less than 7%"?
>
> Looking at the disassembly it's not obvious to me what the issue is and how such a large win could be gained, so I'm curious how you were able to achieve it!

Unfortunately, the data on the server was lost. I'll try to reproduce it next week.