Bad performance in Wasmtime

hungryzzz commented 1 year ago

Hi, I run the attached case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is 4x slower than which of WasmEdge(measured by time tool).

Wasmtime: 4.20s
WasmEdge: 1.05s

I try to use the perf tool to locate the time consuming part of code, the report is as followed.

99.66%  wasmtime  jitted-2354298-2.so  [.] wasm[0]::function[3]                                                                                                ▒
0.03%  wasmtime  jitted-2354298-1.so  [.] wasm[0]::function[2]                                                                                                ▒
0.03%  wasmtime  [kernel.kallsyms]    [k] update_min_vruntime                                                                                                 ▒
0.02%  wasmtime  [kernel.kallsyms]    [k] zap_pte_range.isra.0                                                                                                ▒
0.02%  wasmtime  libc-2.31.so         [.] __memset_avx2_erms                                                                                                  ▒
0.02%  wasmtime  [kernel.kallsyms]    [k] __vma_adjust                                                                                                        ◆
0.02%  wasmtime  [kernel.kallsyms]    [k] error_entry                                                                                                         ▒
0.02%  wasmtime  [kernel.kallsyms]    [k] filemap_map_pages                                                                                                   ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] page_add_file_rmap                                                                                                  ▒
0.01%  wasmtime  wasmtime             [.] wasmtime_cranelift_shared::compiled_function::collect_address_maps::cvt                                             ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe                                                                                      ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] mem_cgroup_from_task                                                                                                ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] charge_memcg                                                                                                        ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] kmem_cache_alloc                                                                                                    ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] ___slab_alloc                                                                                                       ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] update_cfs_rq_h_load                                                                                                ▒
0.01%  wasmtime  [kernel.kallsyms]    [k] kmem_cache_free                                                                                                     ▒
0.01%  wasmtime  libc-2.31.so         [.] realloc

Also, I try to dump the machine code generated by WasmEdge(LLVM) and I get a shorter machine code than Wasmtime.

# Machine code generated by LLVM, only  36 instructions compared with 127 instructions generated by Wasmtime.
0000000000000140 <f3>:
 140:   40 f6 c6 20             test   $0x20,%sil
 144:   75 2b                   jne    171 <f3+0x31>
 146:   4c 8b 07                mov    (%rdi),%r8
 149:   40 0f b6 c6             movzbl %sil,%eax
 14d:   4d 8b 00                mov    (%r8),%r8
 150:   42 8b 74 00 10          mov    0x10(%rax,%r8,1),%esi
 155:   42 2b 74 00 14          sub    0x14(%rax,%r8,1),%esi
 15a:   39 ce                   cmp    %ecx,%esi
 15c:   73 05                   jae    163 <f3+0x23>
 15e:   e9 00 00 00 00          jmpq   163 <f3+0x23>
 163:   66 83 fa ff             cmp    $0xffff,%dx
 167:   75 09                   jne    172 <f3+0x32>
 169:   42 83 7c 00 50 00       cmpl   $0x0,0x50(%rax,%r8,1)
 16f:   74 01                   je     172 <f3+0x32>
 171:   c3                      retq   
 172:   41 89 d1                mov    %edx,%r9d
 175:   89 c8                   mov    %ecx,%eax
 177:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 17e:   00 00 
 180:   85 c0                   test   %eax,%eax
 182:   75 10                   jne    194 <f3+0x54>
 184:   89 ce                   mov    %ecx,%esi
 186:   81 e6 ff 7f 00 00       and    $0x7fff,%esi
 18c:   81 fe ff 7f 00 00       cmp    $0x7fff,%esi
 192:   74 dd                   je     171 <f3+0x31>
 194:   66 41 83 f9 ff          cmp    $0xffff,%r9w
 199:   74 15                   je     1b0 <f3+0x70>
 19b:   ff c1                   inc    %ecx
 19d:   8d 34 02                lea    (%rdx,%rax,1),%esi
 1a0:   ff c8                   dec    %eax
 1a2:   40 0f b6 f6             movzbl %sil,%esi
 1a6:   41 ff c1                inc    %r9d
 1a9:   41 80 3c 30 0a          cmpb   $0xa,(%r8,%rsi,1)
 1ae:   75 d0                   jne    180 <f3+0x40>
 1b0:   ff c2                   inc    %edx
 1b2:   e9 00 00 00 00          jmpq   1b7 <f3+0x77>

I think maybe there are some missing optimizations in Cranelift so Wasmtime generates a suboptimal machine code and gets longer execution time than WasmEdge.

Versions and Environment

Wasmtime: e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)
WasmEdge: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870
kernel version: Linux 5.15.0-83-generic

case.zip

alexcrichton commented 1 year ago

Thanks for this and https://github.com/bytecodealliance/wasmtime/issues/7495! Would you be able to provide the original source code as well? That'd likely help in investigating this.

hungryzzz commented 1 year ago

@alexcrichton Hi, does the original source code refer to C code? If yes, I don't have the original C code file either unfortunately.

alexcrichton commented 1 year ago

Ah I was curious about whatever source it came from, not necessarily C, but if you don't have the source then no worries.

hungryzzz commented 1 year ago

The earilest original source is from C. After compiling it to wasm, we mutate it and get the above case, so it will be quite different with the original source code. Actually, I have reduced the original wasm file, so the size of the buggy case I submmit is quite small(around 120 wasm instructions, only 3 functions) compared to the original one. I hope this will help your further investigating.

fitzgen commented 1 year ago

Echoing Alex: thanks very much for reporting these perf bugs!

@hungryzzz, out of curiosity, are you fuzzing performance across wasm runtimes? Do you have any additional details you can share about what you're doing and what your goals are?

hungryzzz commented 12 months ago

@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.

alexcrichton commented 12 months ago

VTune outputs this:

which seems to point a smoking-ish gun at DSB Coverage/Misses. I just learned about this today and still know very little about it. For individual instructions this is what I'm seeing:

I have not yet dug further at this time.

alexcrichton commented 12 months ago

Sort of where I'm leaning for this and https://github.com/bytecodealliance/wasmtime/issues/7495 is that this may not be too too actionable without the original source.

In general Cranelift is not expected to be able to beat LLVM, which I believe WasmEdge is using. Cranelift is intended to be competitive with "peer JITs" such as v8 and SpiderMonkey but it won't ever be able to reach parity with LLVM on all inputs. So in that sense it's not always a "fair" comparison to compare LLVM and Cranelift.

Now that being said Wasmtime and Cranelift as a whole still want to be fast. Much of the speed of Wasmtime/Cranelift comes from pre-optimized WebAssembly modules. For example Rust/C programs originally go through LLVM's optimizer which takes care of advanced optimizations that Cranelift may not do. Afterwards Cranelift is intended to generate high-quality machine code with a high-quality instruction selector as well as optimizations around Wasmtime's injection of runtime management code (e.g. egraphs plus table bounds checks). So despite LLVM-vs-Cranelift not being "fair" it's something we're still interested in (hence thanks again for these issues!).

It sounds, though, like the pipeline to produce these wasm files was to take an origin C program and mutate it. This starts moving into the realm of the original program no longer being optimized by LLVM, meaning that the onus is on Cranelift to perform more advanced optimizations that LLVM implements. This is unlikely to happen, however, given Cranelift's use case.

So given all this that's what leads me to the conclusion that this may not be actionable. The bugs, from what I can tell, here and on https://github.com/bytecodealliance/wasmtime/issues/7495 rely on optimizations that LLVM does but Cranelift doesn't do. I am guessing, however, that if the original source were C or Rust that when optimized and then run through both wasm runtimes the performance would be similar because LLVM's optimizations would kick in when producing the wasm. So overall my current conclusion is that Cranelift is missing optimizations, but it's intentional.

Now a lot of what I'm saying above is merely conjecture. I haven't bottomed out the exact performance differences and what's going on where. It may very well be a small tweak to Cranelift fixes performance of these cases. That being said I'm personally having a difficult time analyzing this without a comparison of the original source code. So I'd ask again, I know that this is reduced and mutated from an original source program, but is it possible to get the source program and the list of mutations?

hungryzzz commented 12 months ago

@alexcrichton Hi, the codes are generated by randomizing wasm bytecodes. Thank you for your efforts. I will get back to you later once I have new clues.

alexcrichton commented 11 months ago

Ok if the original source + mutations aren't available then no worries.

I'll again restate though that what I'm saying above is currently a hypothesis and isn't something I've been able to definitively prove. In the abstract though LLVM is expected to beat Cranelift when fed arbitrary code and there's only so much that we can do about that. Cranelift should be more competitive when fed LLVM-optimized wasm code, however.

bytecodealliance / wasmtime

Bad performance in Wasmtime #7492

Versions and Environment