OkannShn opened 7 months ago
I honestly don't have a definitive answer but an opinion, which I can share. My opinion may be wrong. Please tell me if it is. And @wenyongh @XuJun2019 @TianlongLiang , if you agree or disagree, please jump in.
If this is for interpreter mode, we need to make sure architecture-specific instructions are used in the bytecode handlers. Quite simple and straightforward, right?
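To make that concrete, here is a minimal sketch (not WAMR's actual interpreter code; the handler name is made up) of what an architecture-aware bytecode handler could look like, using a compiler builtin that lowers to a native instruction when the target has one:

```c
/* Hypothetical sketch, NOT WAMR's real interpreter code: what
 * "architecture-specific instructions in bytecode handlers" could
 * look like for an i32.popcnt handler. */
#include <stdint.h>

static inline uint32_t
handle_i32_popcnt(uint32_t operand)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Lowers to a single popcnt/cnt instruction when the target has one. */
    return (uint32_t)__builtin_popcount(operand);
#else
    /* Portable fallback for toolchains without the builtin. */
    uint32_t count = 0;
    while (operand) {
        operand &= operand - 1;
        count++;
    }
    return count;
#endif
}
```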
But it starts to become complex when talking about jit mode and aot mode. In WAMR, we depend on LLVM to do optimization and code generation. In a nutshell, we first translate Wasm bytecodes to LLVM IR and let LLVM finish the show. So, about "architecture-specific instructions", I think there are a few directions:
1) Follow the latest LLVM release. Chipset vendors push their specific instructions into popular toolchains as soon as they can, so keeping an eye on the LLVM IR reference is a good start. In some cases, LLVM codegen will generate architecture-specific instructions based on the target triple; in others, we need to use specific intrinsic functions instead of general IR to make codegen emit the specific instructions. So we should keep checking the latest IR reference and pick the right IR or intrinsic functions to use during Wasm translation (a rough sketch of this follows the list).
2) Target-specific optimizations. This requires writing private passes and adding them to the pipeline. Those passes don't have to be part of the runtime and can be installed as a .so.
3) Patch LLVM codegen. This is required when a vendor's general patches can't fit a user's specific situation (especially when they hurt other situations).
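To make the first direction a bit more concrete, below is a minimal sketch, using the LLVM-C API, of emitting an intrinsic during Wasm translation so that codegen can select the target's native instruction. The function name and the choice of `llvm.ctpop.i32` are just for illustration; this is not WAMR's actual translation code:

```c
/* Sketch only: emit a call to llvm.ctpop.i32 while translating a Wasm opcode,
 * letting LLVM codegen pick the target's popcount instruction if it has one. */
#include <string.h>
#include <llvm-c/Core.h>

static LLVMValueRef
translate_i32_popcnt(LLVMModuleRef module, LLVMBuilderRef builder,
                     LLVMValueRef operand)
{
    LLVMContextRef ctx = LLVMGetModuleContext(module);
    LLVMTypeRef i32 = LLVMInt32TypeInContext(ctx);

    /* llvm.ctpop is overloaded, so the concrete type must be supplied. */
    unsigned id = LLVMLookupIntrinsicID("llvm.ctpop", strlen("llvm.ctpop"));
    LLVMTypeRef overload[1] = { i32 };
    LLVMValueRef fn = LLVMGetIntrinsicDeclaration(module, id, overload, 1);
    LLVMTypeRef fn_type = LLVMIntrinsicGetType(ctx, id, overload, 1);

    LLVMValueRef args[1] = { operand };
    return LLVMBuildCall2(builder, fn_type, fn, args, 1, "popcnt");
}
```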
I hope these thoughts are helpful. Please let me know what you think.
Yes, not sure what "WAMR generates bytecode" means here? Is it the pre-compiled bytecode generated by the wasm loader for fast-interp, or the LLVM IR (or machine code) generated by the aot compiler?
For the interpreter, I have no idea how to generate architecture-specific bytecodes, since they are general bytecodes currently. Maybe, like @lum1n0us mentioned, using architecture-specific instructions in the bytecode handlers is a good way; another way may be to implement the interpreter in assembly code to improve performance, but that is really complex.
For the aot compiler, per my understanding, there may be three ways:
1) generating architecture-specific LLVM IR, for example, registering a translation callback for each wasm opcode: if a callback is found, the aot compiler calls it to generate the LLVM IR for that opcode, otherwise it calls the common translation function (see the first sketch below)
2) adding architecture-specific LLVM passes, allowing new passes to be registered (e.g. from a wamr built-in implementation or from a .so file) and applied; note that wamrc already has the option `--enable-llvm-passes=<passes>` now
3) using architecture-specific codegen; no idea how to affect the codegen process yet, since the aot compiler currently just calls `LLVMTargetMachineEmitToMemoryBuffer` or `LLVMTargetMachineEmitToFile` to get the object file which contains the machine code (the second sketch below shows the relevant LLVM-C entry points)
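For way 1), something like the following hypothetical registration table is what I have in mind. None of these identifiers exist in WAMR today; they only illustrate the idea of preferring a registered architecture-specific translator and falling back to the common one:

```c
/* Hypothetical sketch for way 1): per-opcode translation callbacks that an
 * architecture backend can register; every identifier here is made up. */
#include <stdbool.h>
#include <stdint.h>

typedef bool (*opcode_translate_cb)(void *comp_ctx, void *func_ctx);

/* Stand-in for the existing target-independent translation path. */
bool translate_opcode_common(uint8_t opcode, void *comp_ctx, void *func_ctx);

/* One optional callback slot per core Wasm opcode (0x00..0xFF). */
static opcode_translate_cb translate_callbacks[256];

void
register_opcode_translator(uint8_t opcode, opcode_translate_cb cb)
{
    translate_callbacks[opcode] = cb;
}

bool
translate_opcode(uint8_t opcode, void *comp_ctx, void *func_ctx)
{
    /* Prefer an architecture-specific translator if one was registered ... */
    if (translate_callbacks[opcode])
        return translate_callbacks[opcode](comp_ctx, func_ctx);
    /* ... otherwise fall back to the common translation function. */
    return translate_opcode_common(opcode, comp_ctx, func_ctx);
}
```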
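For ways 2) and 3), the LLVM-C API already exposes a few levers even before writing custom passes: the CPU/feature strings passed when creating the target machine decide which architecture-specific instructions codegen may use, and `LLVMRunPasses` accepts a textual pass pipeline, conceptually similar to what the `--enable-llvm-passes` option mentioned above provides. A sketch with example triple/CPU/feature strings, error handling mostly omitted:

```c
/* Sketch: drive architecture-specific codegen through the LLVM-C API.
 * The triple/CPU/feature strings below are example values only.
 * Assumes the usual LLVMInitializeAllTargetInfos()/Targets()/TargetMCs()/
 * AsmPrinters() calls were done beforehand. */
#include <llvm-c/Core.h>
#include <llvm-c/TargetMachine.h>
#include <llvm-c/Transforms/PassBuilder.h>

static LLVMMemoryBufferRef
emit_object_for_target(LLVMModuleRef module)
{
    const char *triple = "aarch64-unknown-linux-gnu"; /* example target */
    char *err = NULL;

    LLVMTargetRef target;
    if (LLVMGetTargetFromTriple(triple, &target, &err))
        return NULL; /* real code would report `err` */

    /* The CPU and feature strings are where target-specific instructions
       (e.g. NEON on AArch64) get enabled for codegen. */
    LLVMTargetMachineRef tm = LLVMCreateTargetMachine(
        target, triple, "cortex-a72", "+neon",
        LLVMCodeGenLevelAggressive, LLVMRelocPIC, LLVMCodeModelSmall);

    /* Run a textual pass pipeline over the module. */
    LLVMPassBuilderOptionsRef opts = LLVMCreatePassBuilderOptions();
    LLVMErrorRef perr = LLVMRunPasses(module, "default<O3>", tm, opts);
    if (perr) {
        /* real code would report and consume the error */
    }
    LLVMDisposePassBuilderOptions(opts);

    /* Finally emit machine code, as the aot compiler already does. */
    LLVMMemoryBufferRef obj = NULL;
    if (LLVMTargetMachineEmitToMemoryBuffer(tm, module, LLVMObjectFile,
                                            &err, &obj))
        return NULL; /* real code would report `err` */
    return obj;
}
```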
I am exploring the potential for optimizing the WAMR runtime performance by leveraging architecture-specific instructions. The key challenge is to achieve this while maintaining the universal applicability of the bytecode, ensuring it remains architecture-agnostic and can be executed on any platform, including ARM, Xtensa, and others.
Currently, WAMR generates bytecode that is generic and can run across different architectures. This approach is excellent for portability but doesn't take full advantage of the unique capabilities and instructions specific to each architecture, which could potentially enhance performance.
I am curious if there is a way to incorporate these architecture-specific optimizations in a manner that does not require changes to the bytecode generation process. I am not sure where to start right now.
Any insights, suggestions, or discussions on how we might approach this would be greatly appreciated. I believe this could be a significant step forward in optimizing WAMR for specific hardware, without sacrificing its cross-platform utility.
Thank you for any suggestions.