hungryzzz opened this issue 1 year ago
Thanks for this and https://github.com/bytecodealliance/wasmtime/issues/7495! Would you be able to provide the original source code as well? That'd likely help in investigating this.
@alexcrichton Hi, does "the original source code" refer to C code? If so, unfortunately I don't have the original C code file either.
Ah I was curious about whatever source it came from, not necessarily C, but if you don't have the source then no worries.
The earliest original source is C. After compiling it to wasm, we mutate it and get the above case, so it is quite different from the original source code. Actually, I have reduced the original wasm file, so the buggy case I submitted is quite small (around 120 wasm instructions, only 3 functions) compared to the original one. I hope this helps your further investigation.
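The thread does not say which tool was used to reduce the module. As a hedged illustration only, here is a minimal greedy, delta-debugging-style reducer over a list of items (for example, the instructions of a function). The `interesting` predicate is hypothetical: in a performance-fuzzing setup it might re-run both runtimes and check that the slowdown persists, and it would also need to reject candidates that no longer validate as wasm.

```python
def reduce_case(items, interesting):
    """Greedily drop chunks of `items` while `interesting` stays true.

    Returns a list where removing any single remaining item makes
    `interesting` false (1-minimal), assuming `interesting(items)` holds.
    """
    if not interesting(items):
        raise ValueError("the starting case must already be interesting")
    chunk = max(len(items) // 2, 1)
    while chunk >= 1:
        removed_any = False
        i = 0
        while i < len(items):
            # Try deleting the chunk that starts at position i.
            candidate = items[:i] + items[i + chunk:]
            if interesting(candidate):
                items = candidate  # keep the smaller case, retry at same i
                removed_any = True
            else:
                i += chunk  # this chunk is needed; move past it
        if not removed_any:
            chunk //= 2  # no progress at this granularity; go finer
    return items
```

For instance, `reduce_case(list(range(10)), lambda xs: 3 in xs and 7 in xs)` shrinks the list down to the two items the predicate depends on.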
Echoing Alex: thanks very much for reporting these perf bugs!
@hungryzzz, out of curiosity, are you fuzzing performance across wasm runtimes? Do you have any additional details you can share about what you're doing and what your goals are?
@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtimes. The current execution mode of most wasm runtimes is AOT (compile the whole module to machine code, then execute it), so I think their execution times should be close; if they are not, there may be something wrong (a misoptimization, etc.) in the runtime's code generation.
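The oracle described above can be sketched as a single comparison step, assuming per-runtime wall-clock timings have already been collected. The function name `flag_slowdown` and the default threshold of 2.0 are hypothetical choices for illustration, not details of the reporter's fuzzer.

```python
def flag_slowdown(timings, threshold=2.0):
    """Compare per-runtime timings (seconds) for one test case.

    Returns (slowest, fastest, ratio) when the slowest runtime takes more
    than `threshold` times as long as the fastest one, otherwise None.
    """
    if len(timings) < 2:
        raise ValueError("need timings from at least two runtimes")
    fastest = min(timings, key=timings.get)
    slowest = max(timings, key=timings.get)
    ratio = timings[slowest] / timings[fastest]
    return (slowest, fastest, ratio) if ratio > threshold else None
```

With the numbers reported in this issue, `flag_slowdown({"wasmtime": 4.20, "wasmedge": 1.05})` reports a ratio of about 4.0. A ratio rather than an absolute difference keeps the check meaningful across cases with very different run times.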
VTune outputs this:
which seems to point a smoking-ish gun at DSB Coverage/Misses. I just learned about this today and still know very little about it. For individual instructions this is what I'm seeing:
I have not yet dug further at this time.
Sort of where I'm leaning for this and https://github.com/bytecodealliance/wasmtime/issues/7495 is that this may not be very actionable without the original source.
In general Cranelift is not expected to be able to beat LLVM, which I believe WasmEdge is using. Cranelift is intended to be competitive with "peer JITs" such as v8 and SpiderMonkey but it won't ever be able to reach parity with LLVM on all inputs. So in that sense it's not always a "fair" comparison to compare LLVM and Cranelift.
Now that being said Wasmtime and Cranelift as a whole still want to be fast. Much of the speed of Wasmtime/Cranelift comes from pre-optimized WebAssembly modules. For example Rust/C programs originally go through LLVM's optimizer which takes care of advanced optimizations that Cranelift may not do. Afterwards Cranelift is intended to generate high-quality machine code with a high-quality instruction selector as well as optimizations around Wasmtime's injection of runtime management code (e.g. egraphs plus table bounds checks). So despite LLVM-vs-Cranelift not being "fair" it's something we're still interested in (hence thanks again for these issues!).
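To make "table bounds checks" concrete, here is a hedged Python model of the checks the WebAssembly spec requires for `call_indirect`, which a compiler like Cranelift must emit as machine code and then try to make cheap. The table representation (a list of `(signature, function)` pairs, with `None` for empty slots) and the trap messages are assumptions for illustration, not Wasmtime's actual data structures.

```python
class Trap(Exception):
    """Raised where a wasm engine would trap."""


def call_indirect(table, index, expected_sig, *args):
    """Model of wasm call_indirect: bounds, null, and signature checks."""
    # Table bounds check: an index at or past the table length traps.
    if not 0 <= index < len(table):
        raise Trap("undefined element")
    entry = table[index]
    # An empty (null) table slot traps.
    if entry is None:
        raise Trap("uninitialized element")
    sig, func = entry
    # The callee's signature must match the type named at the call site.
    if sig != expected_sig:
        raise Trap("indirect call type mismatch")
    return func(*args)
```

For example, with `sig = (("i32",), ("i32",))` and `table = [(sig, lambda x: x + 1), None]`, `call_indirect(table, 0, sig, 41)` returns 42, while indices 1 and 2 trap.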
It sounds, though, like the pipeline to produce these wasm files was to take an original C program and mutate it. This starts moving into the realm of the original program no longer being optimized by LLVM, meaning that the onus is on Cranelift to perform more advanced optimizations that LLVM implements. This is unlikely to happen, however, given Cranelift's use case.
So given all this, that's what leads me to the conclusion that this may not be actionable. The bugs here and on https://github.com/bytecodealliance/wasmtime/issues/7495, from what I can tell, rely on optimizations that LLVM does but Cranelift doesn't. I am guessing, however, that if the original source were C or Rust, then after optimizing it and running it through both wasm runtimes the performance would be similar, because LLVM's optimizations would kick in when producing the wasm. So overall my current conclusion is that Cranelift is missing optimizations, but intentionally so.
Now, a lot of what I'm saying above is merely conjecture. I haven't bottomed out the exact performance differences and what's going on where. It may very well be that a small tweak to Cranelift fixes the performance of these cases. That being said, I'm personally having a difficult time analyzing this without the original source code to compare against. So I'd ask again: I know that this is reduced and mutated from an original source program, but is it possible to get the source program and the list of mutations?
@alexcrichton Hi, the test cases are generated by randomly mutating wasm bytecode. Thank you for your efforts. I will get back to you later once I have new clues.
Ok if the original source + mutations aren't available then no worries.
I'll again restate though that what I'm saying above is currently a hypothesis and isn't something I've been able to definitively prove. In the abstract though LLVM is expected to beat Cranelift when fed arbitrary code and there's only so much that we can do about that. Cranelift should be more competitive when fed LLVM-optimized wasm code, however.
Hi, I ran the attached case in Wasmtime and WasmEdge respectively, and I found that Wasmtime's execution time is about 4x that of WasmEdge (measured by the time tool):

Wasmtime: 4.20s
WasmEdge: 1.05s

I tried to use the perf tool to locate the time-consuming part of the code; the report is as follows. I also dumped the machine code generated by WasmEdge (LLVM) and got shorter machine code than Wasmtime's.

I think there may be some missing optimizations in Cranelift, so Wasmtime generates suboptimal machine code and takes longer to execute than WasmEdge.

Versions and Environment

Wasmtime: e0bfa7336de20f76048edbdc0157ee637a2c5fea (release build)
WasmEdge: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870

case.zip
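The numbers above come from a single run under time. A hedged sketch of a more repeatable measurement takes the median wall-clock time over several runs of the same command. The function name and the default of 5 runs are illustrative choices. Note that wall-clock time from process start also includes module compilation, not just execution of the compiled code.

```python
import statistics
import subprocess
import time


def median_wall_time(cmd, runs=5):
    """Run `cmd` (a list of argv strings) `runs` times; return the median seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a crashing runtime is not silently reported as "fast".
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Assuming the module runs as a plain command-style module (the exact invocation used in the issue is not given), usage might look like `median_wall_time(["wasmtime", "case.wasm"])` versus `median_wall_time(["wasmedge", "case.wasm"])`. The median is less sensitive than a single sample to one-off noise such as a cold cache.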