Out-of-bounds behaviour

akirilov-arm commented 3 years ago

Hello,

I am part of a team at Arm that works on various WebAssembly runtimes. We have been reviewing the Wasm proposals, focusing on how they map to the Arm architecture (in particular the 64-bit execution state of the A profile). As a result, we have a couple of questions about the out-of-bounds behaviour of some operations in the flexible vectors proposal:

What is the expected behaviour of the extract_lane and replace_lane operations for out-of-bounds indices?
What is the expected behaviour of the left and right lane-wise shifts when the shift amount is larger than the vector length? Related question - is the shift amount meant to be unsigned?

Of course, these questions should be answered explicitly by the specification text.

The second question is probably a bit more important because the shift amounts would be unknown at compile time in the general case (they are not immediates). To give a concrete example, here's how the vec.i64.lshl operation could be expressed with Scalable Vector Extension (SVE) instructions:

whilelo p0.d, wzr, w0
mov     z1.d, #0
splice  z1.d, p0, z1.d, z0.d

This sequence uses vector length-agnostic code generation, which could be the approach chosen by an ahead-of-time Wasm compiler - probably not the expected situation, but IMHO it demonstrates the worst case scenario. Inputs are in Z0 and W0 respectively, while the output is in Z1. Coming back to the second question, the assumptions are that the shift amount is unsigned and that the result vector must be filled with zeros if the shift amount is larger than the vector length (let's ignore the possibility of setting the vector length dynamically for a moment to make the discussion simpler - it's a third tier operation anyway).

The equivalent Neon mapping could be:

  ldr   q1, .Lindex_vector
  cmp   w0, #16
  mov   w1, #16
  csel  w1, w1, w0, hi
  lsl   w1, w1, #3
  dup   v2.16b, w1
  sub   v1.16b, v1.16b, v2.16b
  tbl   v1.16b, { v0.16b }, v1.16b
...
.Lindex_vector:
  .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

If the shift amount is actually known statically, it is possible to simplify - e.g. to shift by 1:

movi v1.2d, #0
ext  v1.16b, v1.16b, v0.16b, #8

A general remark - explicit indices are awkward from the point of view of SVE because the instructions usually rely on predicates (i.e. masks) for the same functionality. This increases compiler complexity for sequences like an index-generating operation and extract_lane, which could result in performance issues due to the round trip between predicate and general-purpose registers.

lemaitre commented 3 years ago

If we don't care that much about determinism and/or browser fingerprinting, then make the out-of-bound undefined would be the most performant. Otherwise, I think wrap-around (ie: modulo the length) is the best tradeoff as it is super fast to implement in all architecture (just a bitwise and).

Concerning shifts, I proposed in #27 a more general shift that takes 2 inputs (https://github.com/lemaitre/flexible-vectors/blob/master/proposals/flexible-vectors/README.md#lane-shift). It would be somewhat equivalent to splice, but with integer.

programmerjake commented 3 years ago

If we don't care that much about determinism and/or browser fingerprinting, then make the out-of-bound undefined would be the most performant. Otherwise, I think wrap-around (ie: modulo the length) is the best tradeoff as it is super fast to implement in all architecture (just a bitwise and).

Wrap-around is not always a bitwise-and: all of RISC-V V, Arm SVE, and SimpleV support non-power-of-2 vector lengths.

lemaitre commented 3 years ago

Wrap-around is not always a bitwise-and: all of RISC-V V, Arm SVE, and SimpleV support non-power-of-2 vector lengths.

That's true that those ISAs support non-power of 2 vector lengths, but the architectural vector length is, AFAICT, always a power of 2, so restricting the native vector length of wasm registers to a power of 2 is a sensible choice that would make my assertion valid.

programmerjake commented 3 years ago

Wrap-around is not always a bitwise-and: all of RISC-V V, Arm SVE, and SimpleV support non-power-of-2 vector lengths.

That's true that those ISAs support non-power of 2 vector lengths, but the architectural vector length is, AFAICT, always a power of 2,

Last I checked, it is a valid implementation of RISC-V V (and also SVE) where the maximum vector length is not a power of 2.

so restricting the native vector length of wasm registers to a power of 2 is a sensible choice that would make my assertion valid.

akirilov-arm commented 3 years ago

... the architectural vector length is, AFAICT, always a power of 2...

That's not true for SVE - the architecture permits implementations whose vector length is 384 bits, for example, so proper vector length-agnostic (VLA) code generation can't make any assumptions about the vector length other than its being a multiple of 128 up to 2048 bits inclusive. In that case wrap-around would require the following instruction sequence, taking 16-bit elements as an example (index in W0):

cnth x1
udiv w2, w0, w1
msub w1, w1, w2, w0

Note that there is no instruction to calculate the remainder of a division.

However, it is true that one option is to force all Wasm implementations to constrain the vector length to the largest power of 2 that is less than or equal to the hardware vector length (an ability that is a requirement of the architecture). It's awkward and potentially a waste of hardware resources, but possible.

P.S. Changing the vector length in SVE is risky because a scalable vector register could be saved somewhere on the stack by a function up the call chain. That's why in practice probably nobody is going to do it, unless starting a new process. However, it could work in a strictly controlled environments such as Wasm runtimes - they tend to do weird stuff anyway (e.g. the way linear memory bounds checking is implemented).

penzn commented 3 years ago

Good point - this needs to be stated in the spec.

With operations that select lanes, I second the option of making out-of-bounds either platform-specific (see WebAssembly/relaxed-simd#22), which is a softer form of "undefined", or doing some form of truncation. Wasm SIMD tried assigning special meaning for out-of-bounds indices in swizzle, and it does not scale on x86-based platforms (WebAssembly/simd#93).

akirilov-arm commented 3 years ago

@penzn While you are at it, the specification text relating to the lane-wise shifts mentions two input vectors a and b, while the operation signatures and the pseudocode use only one, naturally. Also, square root shouldn't be a binary operation.

I just remembered that there was an alternative way to enforce a power-of-2 vector length in a SVE-based implementation - so far my assumption has been that the governing predicates that would be used by the generated instructions would be initialized with ptrue p0.s for 32-bit elements, for example, which is equivalent to ptrue p0.s, all. However, there is also the option of using ptrue p0.s, pow2 (and cntw x0, pow2 in some cases), which should achieve the same effect as changing the vector length to a power of 2, at least for the operations that are currently defined by the proposal. While arguably safer, this approach would still potentially be a waste of hardware resource.

WebAssembly / flexible-vectors

Out-of-bounds behaviour #35