[Fast Interpreter] Slow performance when handling complicated arithmetic expression in loop

hungryzzz commented 1 year ago

Description

Hi, I ran the attached cases (compiled with Emscripten) in several Wasm runtimes, and I found a performance difference between wamr(fast-interp) and wasm3.

The execution time in wamr(fast-interp) is about 2x that in wasm3. Times were collected with perf-tool: the probe begins when the wasm code starts executing (wasm_call_function in wamr) and ends at sched:sched_process_exit.

Runtime	flops-8	flops-5	flops-4	flops-3
wamr(fast-interp)	9597840.99 us	8475859.04 us	4882332.11 us	10700224.61 us
wasm3	4401260.85 us	4105807.93 us	2574588.03 us	5633284.86 us
wamr(AOT)	879322.56 us	880584.59 us	418496.13 us	934592.44 us
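
To make the "2x" figure easier to check, here is a small sketch that computes the per-case ratios from the table above. The array and function names (`wamr_fast_interp_us`, `slowdown`, `print_slowdowns`) are made up for this illustration; only the numbers come from the table. By my arithmetic the ratios come out to roughly 1.9x to 2.2x versus wasm3 and roughly 9.6x to 11.7x versus AOT.

```c
#include <assert.h>
#include <stdio.h>

/* Execution times in microseconds, copied from the table above.
 * Column order: flops-8, flops-5, flops-4, flops-3. */
static const double wamr_fast_interp_us[4] = {
    9597840.99, 8475859.04, 4882332.11, 10700224.61
};
static const double wasm3_us[4] = {
    4401260.85, 4105807.93, 2574588.03, 5633284.86
};
static const double wamr_aot_us[4] = {
    879322.56, 880584.59, 418496.13, 934592.44
};

/* Returns how many times longer slow_us is than fast_us. */
double slowdown(double slow_us, double fast_us)
{
    assert(fast_us > 0.0);
    return slow_us / fast_us;
}

/* Prints one line per test case with the fast-interp slowdown
 * relative to wasm3 and relative to AOT. */
void print_slowdowns(void)
{
    static const char *const names[4] = {
        "flops-8", "flops-5", "flops-4", "flops-3"
    };
    for (int i = 0; i < 4; i++) {
        printf("%s: %.2fx vs wasm3, %.2fx vs AOT\n", names[i],
               slowdown(wamr_fast_interp_us[i], wasm3_us[i]),
               slowdown(wamr_fast_interp_us[i], wamr_aot_us[i]));
    }
}
```

Calling `print_slowdowns()` from a `main` function prints the four ratio pairs, one line per case.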

I ran other test cases on these runtimes, and on average wamr(fast-interp) is 1.2x faster than wasm3. The earlier report at https://github.com/bytecodealliance/wasm-micro-runtime/wiki/Performance shows similar results, so the numbers above look a little strange.

I then looked through the cases above and found that they all involve complicated arithmetic expressions inside loops. So my guess is that wamr(fast-interp) performs poorly on such code.
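
For readers without the attachments, the kind of workload meant here can be illustrated with a minimal sketch. This is not the attached flops code: the function name `poly_integral` and the coefficients are hypothetical and chosen only for illustration. It evaluates one long floating-point expression in every loop iteration, which an interpreter has to execute as a sequence of individual arithmetic opcodes.

```c
#include <assert.h>

/* Hypothetical kernel (not taken from the attached flops cases):
 * midpoint-rule integration of a degree-5 polynomial over [0, 1].
 * Each iteration evaluates one long chain of multiplies and adds. */
double poly_integral(long n)
{
    assert(n > 0);

    /* Illustrative coefficients, chosen arbitrarily. */
    const double a0 = 1.0, a1 = -0.5, a2 = 0.25, a3 = -0.125,
                 a4 = 0.0625, a5 = -0.03125;
    const double h = 1.0 / (double)n; /* width of one subinterval */
    double sum = 0.0;

    for (long i = 0; i < n; i++) {
        double x = ((double)i + 0.5) * h; /* midpoint of subinterval i */
        /* Horner form: five multiplies and five adds per iteration. */
        sum += ((((a5 * x + a4) * x + a3) * x + a2) * x + a1) * x + a0;
    }
    return sum * h;
}
```

With a large `n`, almost all of the run time is spent in the loop body, so a kernel like this mostly measures how quickly a runtime executes straight-line floating-point arithmetic.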

Hardware & OS

Ubuntu 20.04
CPU: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Memory: 32GB

Emscripten

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.24 (68a9f990429e0bcfb63b1cde68bad792554350a5)
clang version 16.0.0 (https://github.com/llvm/llvm-project 277c382760bf9575cfa2eac73d5ad1db91466d3f)
Target: wasm32-unknown-emscripten
Thread model: posix

Wasm runtime version

wamr: iwasm 1.2.1
wasm3: Wasm3 v0.5.0 on x86_64

Reproduce

Compile the attached C case with Emscripten: emcc -sENVIRONMENT=shell -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.c -o flops.wasm
Execute the wasm file in each Wasm runtime and collect the execution time. All compilation and execution options are the defaults.

c.zip wasm.zip

TianlongLiang commented 1 year ago

Hi, maybe you are testing classic interpreter mode. But you are right about those cases: interpreters are not great for them. The fast interpreter is also about 10x slower than AOT mode. I tested the other running modes too: Fast JIT and LLVM JIT handle those cases relatively well, taking about 3x and 1.1x the AOT time, respectively.

hungryzzz commented 1 year ago

Hi, I used the fast interpreter mode to run those cases, which is why I compared the results with Wasm3, which is also a fast Wasm interpreter. I also built WAMR in classic mode (-DWAMR_BUILD_FAST_INTERP=0); the execution time in classic mode is 2x or more that of fast-interp mode.
