Optimizations and their impact on the instruction set

We had a good discussion about optimizations and how they fit into the instruction set at the 2021-01-22 meeting. What I believe we are discussing is whether there are or should be instructions that are de facto in the instruction set but whose encoding (seen as an expression tree, not as a sequence of bytecodes) is quite literally a combination of other existing instructions.

For example, hardware has a dedicated right-shift-by-constant operation, and we have agreed to encode that operation in the wasm SIMD instruction set as the tree ShiftRight(expr, I32Const(n)), rather than having a single dedicated ShiftRightByConstant(expr,n) instruction, which we could have had.

For another example, the very general Shuffle operation is inefficiently implemented by hardware if translated directly, and we rely on the compiler to recognize specific shuffle patterns (and to recognize when both inputs are the same, or when one is a constant zero, say) as expressions of specific hardware instructions that are faster than the naive translation. We could instead have chosen to expose these hardware instructions directly in our instruction set as blends, rotates, shifts, packs, unpacks, and so on.

In summary, de facto, ShiftRightByConstant and Blend et al are in the instruction set, but are expressed in terms of trees of other operations, and we count on the compiler to know this. I don't think we have an explicit notion of what the criteria are for the instructions that fall into this category, but to me it looks limited to very simple classification of the inputs, eg, "lhs == rhs" and "operand is constant" and "operand is constant zero" and of course recognizing shuffle mask patterns.

As a counterexample, we have agreed to introduce i16x8.extmul_low_i8x16_s(a, b) as a dedicated wasm SIMD instruction to capture the hardware widening multiplication rather than encode it as the tree i16x8.mul(i16x8.widen_low_i8x16_s(a), i16x8.widen_low_i8x16_s(b)), which has precisely the same semantics. So the complexity of this tree is clearly over some limit.

The larger the tree, the more brittle it is, as each part of the tree is subject to other optimizations that the compiler may perform; the pattern may be broken by eg the commoning of subexpressions (a widen operation that's used several places may be commoned and may confuse the pattern matcher, or confuse the compiler as to whether to use an existing widening result or to recompute as part of a widening multiply). This becomes even trickier if the synthesized instruction has to be expressed as a combination of operations at the C++ level, and LLVM/binaryen get to perform optimizations too. And finally, encoding an instruction as a combination of other instructions does not simplify the wasm compiler at all; quite to the contrary, it makes the compiler more complicated. The only benefit the encoding has is to shrink the wasm SIMD instruction set (a little), and opcode space in a virtual instruction set such as ours is cheap.

It is possibe the outcome of this discussion ought to be some explicit ideas - written up in the spec - about what the complexity limit is for deciding that an instruction tree is sufficient to express a desirable hardware instruction. This could be phrased in terms of tree depth and operand type for example, and in terms of the contrast between the naive translation and the good translation. Given where the bar is currently, however, i think this complexity limit should be low. Having it would go some way toward justifying some of the choices in the instruction set, and toward guiding us when considering new instructions.

(And given a low complexity limit, I think signselect ought to be a dedicated instruction, the tree required to express it being relatively large: Bitselect(compare(x, v128.const(0)), a, b).)

WebAssembly / simd

Optimizations and their impact on the instruction set #430