WebAssembly / design

WebAssembly Design Documents

Tech spec / table about all instruction costs #1434

Open MaxGraey opened 2 years ago

MaxGraey commented 2 years ago

It would be great to have a consistent and accurate cost table for all wasm instructions in terms of code generation on the VM side. This would make life much easier for tools and compilers. For example, there are a couple of places in Binaryen where it is very difficult to determine the weight of a particular operation without good knowledge of the particular VM:

https://github.com/WebAssembly/binaryen/pull/4245#discussion_r729377538
https://github.com/WebAssembly/binaryen/blob/main/src/ir/cost.h#L589
https://github.com/WebAssembly/binaryen/blob/main/src/ir/cost.h#L593

It is clear that the costs may differ slightly between VMs, but it would be possible to settle on some average value, also averaged across x86, x64, ARM32, and AArch64. Besides, we could even give detailed recommendations for instructions where one of the arguments is a constant (for example, the divisor of i32.div_u or i64.rem_s); in such cases the computational cost would be much lower than in the general case.
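For illustration, here is a minimal C++ analogue of that strength reduction (assumed, not any particular VM's actual lowering): a divisor known at compile time lets the compiler replace an expensive hardware divide with a shift or a multiply-high sequence, which is the same kind of transformation a wasm VM can apply to i32.div_u with a constant operand.

```cpp
#include <cstdint>

// Illustrative only: why a constant divisor is much cheaper than the general case.
uint32_t div_pow2(uint32_t x)                { return x / 16u; } // lowers to a single shift
uint32_t div_const(uint32_t x)               { return x / 10u; } // multiply-high + shift, a few cheap ops
uint32_t div_general(uint32_t x, uint32_t y) { return x / y; }   // full hardware divide, typically tens of cycles
```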

In addition, after a codegen optimization for some instruction in one of the VMs, the corresponding entry in such a table could be promptly updated.

lars-t-hansen commented 2 years ago

@MaxGraey, you may be interested in the related effort (for Wasm SIMD) in https://nemequ.github.io/waspr/instructions.

SIMD is in many ways an easier domain than wasm proper, since SIMD instructions tend to map pretty cleanly to specific hardware sequences without too many optimization opportunities (there are exceptions, especially around shuffle/swizzle).

I think "consistent and accurate" is a pretty tall order:

Cost can vary with the execution environment beyond the architecture, so your "averaging" matrix may be quite large. For example, Firefox will fall back to explicit bounds checking for heap accesses if it detects that the 64-bit system it is running on has little virtual memory, and thus that using large VM reservations for the heap (to lower the cost of bounds checking) creates a risk of running out of VM. This pertains to phones, but also to virtualized desktop environments. Memory64 accesses will likely always have explicit bounds checks, but they may not look quite like the explicit Memory32 bounds checks. Multi-memory will likely impose an extra cost on some memory accesses because there will be several memory areas to handle. Cross-instance indirect calls will likely be more expensive than same-instance indirect calls. Indirect calls with complicated signatures are slightly more expensive than ones with simple signatures. And so on.
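(For concreteness, here is a rough sketch of what an "explicit" bounds check adds per memory access; this is not any engine's actual code. With a large VM reservation and guard pages the compare and branch disappear and only the load itself remains.)

```cpp
#include <cstdint>
#include <cstdlib>

// Rough sketch, not any engine's actual code: the extra work an explicit
// bounds check adds to every linear-memory load or store.
uint8_t load_u8(const uint8_t* mem_base, uint64_t mem_length, uint64_t addr) {
    if (addr >= mem_length) std::abort(); // compare + branch on every access
    return mem_base[addr];                // with guard pages, only this remains
}
```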

To add to that, the various architectures have large families of implementations that can have different performance profiles. An in-order low-power arm64 core (A53) will likely behave very differently from an out-of-order high-performance core (M1, X1).

Finally, the optimizing compilers will tend to make it hard to predict precise code sequences. Firefox eliminates bounds checks it can prove to be redundant, for example.

This is not to say it's a bad idea or shouldn't be done, but a measure that is significantly better than a simple VM-agnostic estimate must take all of the above into account.

MaxGraey commented 2 years ago

> you may be interested in the related effort (for Wasm SIMD)

This is of course a useful resource for those who implement SIMD instructions or need interactive documentation, but I meant something a little different: official documentation with computational costs for each wasm instruction.

> Cost can vary with the execution environment beyond the architecture, so your "averaging" matrix may be quite large

That's right, which is why I say it would be nice if such a summary table came out of consensus. Maybe, to simplify, we could leave out older architectures (such as x86-32 and ARM32) and include only x64, AArch64, and RISC-V, while taking the latest processor models into account (or giving them more weight).

> Finally, the optimizing compilers will tend to make it hard to predict precise code sequences. Firefox eliminates bounds checks it can prove to be redundant, for example.

That requires range analysis (a global analysis) and has nothing to do with low-level instruction scheduling and the actual costs. Most compilers, such as LLVM, Binaryen, and GCC, already have a simple table in the form of a switch-case function that takes an operation type as input and returns a cost value (see the sketch below). But right now all of this is done empirically and, importantly, is not coordinated, even though, as we know, LLVM and Binaryen often work together.
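A minimal sketch of the kind of table meant here (not Binaryen's or LLVM's actual code; the opcode names and cost values are placeholders):

```cpp
// Hypothetical per-opcode cost table; the values are illustrative, not measured.
enum class Op { I32Add, I32Mul, I32DivU, F64Sqrt, CallIndirect };

int costOf(Op op) {
    switch (op) {
        case Op::I32Add:       return 1;  // single cheap ALU instruction on most targets
        case Op::I32Mul:       return 2;
        case Op::I32DivU:      return 10; // hardware divide, considerably slower
        case Op::F64Sqrt:      return 10;
        case Op::CallIndirect: return 6;  // table lookup + signature check + call
    }
    return 1;
}
```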

titzer commented 2 years ago

I think the best that can be hoped for would be somewhat vague. First, I believe confounding factors like register allocation and calling conventions will make predicting performance very hard.

But maybe that's not what we need; I think the OP is asking for an instruction-by-instruction guide, which can be instructive in some cases. Most of the core instructions represent single, widely-supported machine instructions, which basically need no documentation. As instructions get more complicated, it might make sense to have a cheat sheet of machine operations to which they correspond. Engine implementors have those sequences kind of "in their head" but they aren't written down anywhere.
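For example (assumed entries, not an official mapping), such a cheat sheet could look roughly like this:

```cpp
// Hypothetical cheat-sheet entries: typical lowerings, not an official mapping.
struct CheatSheetEntry {
    const char* wasmOp;
    const char* x64;      // common x86-64 lowering
    const char* aarch64;  // common AArch64 lowering
};

const CheatSheetEntry kEntries[] = {
    {"i32.add",    "add",    "add"},
    {"i32.popcnt", "popcnt", "cnt + addv"},  // assumes POPCNT / NEON support respectively
    {"f64.sqrt",   "sqrtsd", "fsqrt"},
};
```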