bytecodealliance / wasm-micro-runtime

WebAssembly Micro Runtime (WAMR)
Apache License 2.0

Very slow AOT code generation for large WASM files #2085

Open csegarragonz opened 1 year ago

csegarragonz commented 1 year ago

Hi,

I have been experiencing some very slow code generation times for large WASM files.

I include a little benchmark I have done with this WASM file: large_code.zip. I compare:

  1. wamrc from the current main tip (built in Release mode)
  2. WAVM from our fork
  3. wasmtime: using v7.0.0

For each, I include the instructions to build, the command to generate the machine code, and the time it took.

wamrc:

Build wamrc with CMAKE_BUILD_TYPE=Release from the latest commit.

time wamrc -o large_code.aot large_code.wasm
# this takes around 3 minutes on my system

WAVM:

WAVM is already built in the Docker image used below; you may find the Dockerfile here.

# Mounting `pwd` to have access to the .wasm file inside the container;
# all the build artifacts are in /build, so we can safely mount over the
# /code directory
docker run --rm -it -w /code -v $(pwd):/code csegarragonz/wavm:faasm bash
time /build/bin/wavm compile large_code.wasm large_code.aot
# this takes around 1 minute 20 seconds on my system!

wasmtime:

Install using:

curl https://wasmtime.dev/install.sh -sSf | bash
time ~/.wasmtime/bin/wasmtime compile large_code.wasm
# this takes around 6 seconds on my system (?!?!)

Admittedly, I am not very familiar with wasmtime, nor do I have any idea why it is so much faster; I suspect I am doing something wrong. That being said, wasmtime uses a different code generator, but WAVM is also LLVM-based, so how come it is more than two times faster?

NB: these results are specific to my machine but, at least for the WAMR/WAVM comparison, I have seen consistent numbers on a variety of Intel x86 CPUs.

NB2: the attached WASM file contains a lot of custom native symbols only defined in our embedder, so you cannot run it with iwasm. I assumed that did not really matter for getting the point across.

wenyongh commented 1 year ago

Hi, WAMR and WAVM are LLVM-based while wasmtime is Cranelift-based, and the LLVM-based runtimes really do take more time to compile wasm files. In addition, WAMR uses LLVM's new pass manager and may apply more optimizations than WAVM, so it may take more time to compile a wasm file. There may be some methods to reduce the compile time for wamrc.

wenyongh commented 1 year ago

@csegarragonz Recently we implemented the segue optimization for LLVM AOT/JIT, see #2230. Normally (in many cases) it can improve performance, reduce the AOT/JIT compilation time, and reduce the size of the generated AOT/JIT code. Currently it supports the linux and linux-sgx platforms on x86-64; could you have a try? The usage is:

wamrc --enable-segue or wamrc --enable-segue=<flags>
iwasm --enable-segue or iwasm --enable-segue=<flags> (iwasm built with LLVM JIT enabled)

flags can be:

    i32.load, i64.load, f32.load, f64.load, v128.load,
    i32.store, i64.store, f32.store, f64.store, v128.store

Use comma to separate them, e.g. --enable-segue=i32.load,i64.store.

csegarragonz commented 1 year ago

Hey @wenyongh thanks for pointing this out!

Just to double-check, will this optimisation benefit me if I am using x86-64 on Linux with HW bounds checks enabled?

As far as I can tell, explicit bounds checks weren't performed anyway in that mode: they were delegated to the OS by placing the linear memory at the beginning of a contiguous region of 8 GiB of virtual memory and protecting the memory pages. Please correct me if I am wrong!

(This question is for the non-SGX case; I understand the segue optimisation could benefit my SGX use cases.)

wenyongh commented 1 year ago

Yes, it may benefit you no matter whether --bounds-checks=1 is added for wamrc or not. The memory-access boundary check in the AOT code only depends on i + memarg.offset (i is popped from the stack, memarg.offset is encoded in the bytecode); it is not related to the base address of the linear memory.

Normally the compilation time and the binary size can be reduced since the optimization simplifies the LLVM IRs to load/store the linear memory and decreases the size of load/store instructions. The performance may be degraded in some cases, we found that some LLVM optimizations may not take effect if the optimization is enabled, and it depends on which flags are enabled, for example for CoreMark workload, the performance gets worse if using warmc --enable-segue while gets better if using wamrc --enable-segue=i32.store.