WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/

Consider read-only limited constrained vector lengths #20

Open zeux opened 3 years ago

zeux commented 3 years ago

As I understand it, right now the specification:

1. allows changing vector lengths (to a value smaller than the maximum)
2. allows different / unrelated lengths for different types
3. allows unbounded lengths for vector types

I'd suggest potentially changing all three to prohibit these :)

1 seems problematic in terms of lowering to AVX2, and also seems challenging to specify and implement in general, as the length setup can be conditional. In the limit, efficient compilation with set_length might require dynamic recompilation.

2 seems problematic in terms of the programming model (if an implementation exposes a different length for i32 vs f32 vectors, it is going to be difficult to use).

3 seems problematic in terms of memory allocation, as it means one can't easily allocate temporary memory on the stack short of using the vector type as storage.

So I would suggest requiring all implementations to have one consistent bit-length of all vectors (and all other lengths can be computed by taking that bit-length and dividing by element width), not having a way to change that at runtime, and providing some bound on that length, e.g. [128, 2048] (which AFAIK matches ARM SVE limits).
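The suggested model can be sketched in a few lines: one implementation-wide vector bit-length within a fixed bound, with every per-type lane count derived from it by division (a hypothetical illustration, not proposal API):

```python
# Hypothetical sketch of the suggestion above: one consistent
# vector bit-length per implementation, bounded to [128, 2048];
# lane counts for every element width are derived from it rather
# than being set independently per type.
VEC_MIN_BITS, VEC_MAX_BITS = 128, 2048

def lane_count(vec_bits: int, elem_bits: int) -> int:
    assert VEC_MIN_BITS <= vec_bits <= VEC_MAX_BITS, "outside suggested bound"
    assert vec_bits % elem_bits == 0
    return vec_bits // elem_bits

# An AVX2-class implementation would report 256 bits:
print(lane_count(256, 32))  # 8 lanes of f32 (and, identically, of i32)
print(lane_count(256, 64))  # 4 lanes of f64/i64
```

Under this model the i32 and f32 lane counts can never diverge, which is exactly what point 2 asks for.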

lemaitre commented 3 years ago

I agree with all your points, especially the first two.

Dynamic vector length introduces global state, and can be inefficient if it changes at a high frequency (e.g. when dealing with data of varying cardinality).

Different lengths for different types of the same size are problematic on so many points. Why, AVX? Why? Joking aside, there is also an obscure Power ISA extension that has different sizes for different types. But I think it is a fair and sane choice to drop those. Concerning AVX, there are mainly two possibilities: drop 8xF32 (and 4xF64), or emulate 8xI32 (and 4xI64) with smaller registers. The latter might be preferable.

Unbounded lengths are another beast. To me, the main reason they were allowed is to accommodate the RISC-V vector extension. But I think it will be really hard to have a generic representation that is efficient on RISC-V without crippling other ISAs too much. The reason is that RISC-V's preferred way to handle varying-sized data is not masks (even though there is mask support) but dynamic vector length, while all other current ISAs do this with masks. So I think dropping support for RISC-V entirely is fair.

That being said, I don't think bounding the length of the vectors is necessary to handle stack allocation. Stack allocation would just be like a VLA, but with the added benefit that the length is actually constant under the hood (a WASM engine could optimize around this constant on fixed-length ISAs).

However, there is a reason why SVE limits the overall length to 2048 bits: it is the largest size at which lane indices for 8-bit lanes still fit within an 8-bit integer (2048 = 8 × 256). For WASM, this would keep the lowest common denominator as low as it could be.
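The arithmetic behind that limit can be checked directly (a small illustration, not proposal code):

```python
# The SVE 2048-bit reasoning: with 8-bit lanes, 2048 bits give
# 2048 / 8 = 256 lanes, so the largest lane index (255) still fits
# in an unsigned 8-bit integer; doubling the width would not.
MAX_BITS = 2048
lanes_u8 = MAX_BITS // 8
print(lanes_u8)                          # 256 lanes of 8 bits
print(lanes_u8 - 1 <= 0xFF)              # True: index 255 fits in a byte
print((2 * MAX_BITS) // 8 - 1 <= 0xFF)   # False: at 4096 bits, index 511 needs 9 bits
```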

penzn commented 3 years ago

@zeux, thank you for the input!

allows changing vector lengths (to a value smaller than the maximum)

It hasn't been officially prohibited yet, but it is considered problematic - it can be (sort of) done with masking SIMD and vector ISAs, but would be expensive otherwise. Originally, I wanted to leave this out completely, but there was some interest in this during the presentation, so I left it in.

allows different / unrelated lengths for different types

Different length instructions are provided for convenience, to avoid conversion arithmetic in the program. Since we don't have hardware with different SIMD register sizes for different lane types, the expectation is that all the lengths would be the same. However, there is nothing stopping us from mandating that explicitly.

allows unbounded lengths for vector types

Not necessarily; even with the variant where you can change the length, there is an upper limit to what it can be. It is just not known to the program.

arunetm commented 3 years ago

requiring all implementations to have one consistent bit-length of all vectors (and all other lengths can be computed by taking that bit-length and dividing by element width), not having a way to change that at runtime

All three recommendations look good to me too, and it would be nice to mention them explicitly. I can't think of any good reasons to keep the current flexibility of sizes for a given target.

providing some bound on that length, e.g. [128, 2048] (which AFAIK matches ARM SVE limits).

Thanks, @lemaitre, for sharing the reasoning behind SVE's choice of 2048 as the limit. Interesting point. I think it's beneficial to have 2048 or any reasonable static upper bound for vector sizes to be enforced by implementations rather than being limited by spec. This will leave more flexibility to go beyond 2048 if a need arises in the future.

zeux commented 3 years ago

I think it's beneficial to have 2048 or any reasonable static upper bound for vector sizes to be enforced by implementations rather than being limited by spec. This will leave more flexibility to go beyond 2048 if a need arises in the future.

This prohibits static buffer allocation. For example, when implementing algorithms where buffering of data is required, it's commonplace to allocate a static reasonably-sized buffer and use it as a chunk size for data processing - copy data into the buffer, perform necessary processing, potentially involving other buffers, copy data out.

If SIMD were to be used, this buffer must be a multiple of the SIMD width. When SIMD width is dynamic and unbounded, this makes it impossible to use static allocation. This may require significant changes in data layout. I don't see something like 2048 bits as an unreasonable limitation, but if this limitation isn't enforced in the spec, programs have to be written assuming unbounded size, so having an implementation-defined upper bound isn't really helpful.
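The chunked-processing pattern described above can be sketched as follows, assuming a spec-level bound on vector width (the suggested 2048 bits; names and the byte-wise "processing" are purely illustrative):

```python
# Sketch of the chunked-processing pattern: with a spec-level upper
# bound on vector width (2048 bits = 256 bytes, as suggested), a static
# buffer can be sized once, as a multiple of that bound, and remains
# valid on every implementation regardless of its actual vector width.
MAX_VEC_BYTES = 2048 // 8          # suggested spec-level bound
BUF_BYTES = 16 * MAX_VEC_BYTES     # static, "reasonably sized" buffer

def process_in_chunks(data: bytes, vec_bytes: int) -> bytes:
    # vec_bytes is the runtime vector width; it can never exceed the
    # bound, so the static buffer always holds a whole number of vectors.
    assert vec_bytes <= MAX_VEC_BYTES and BUF_BYTES % vec_bytes == 0
    out = bytearray()
    for start in range(0, len(data), BUF_BYTES):
        chunk = data[start:start + BUF_BYTES]   # copy data in
        out += bytes(b ^ 0xFF for b in chunk)   # stand-in for the SIMD work
    return bytes(out)                            # copy data out

print(process_in_chunks(bytes(range(8)), 32))
```

Without the spec-level bound, `BUF_BYTES` cannot be chosen at compile time, which is the objection being made.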

penzn commented 3 years ago

If SIMD were to be used, this buffer must be a multiple of the SIMD width. When SIMD width is dynamic and unbounded, this makes it impossible to use static allocation.

This reminds me - for practical reasons, the maximum length should be a power of two and at least 128 bits.

However, @zeux, what do you mean by unbounded length? What do you think about the size of SIMD registers on the platform being the upper bound?

arunetm commented 3 years ago

This prohibits static buffer allocation. For example, when implementing algorithms where buffering of data is required, it's commonplace to allocate a static reasonably-sized buffer and use it as a chunk size for data processing - copy data into the buffer, perform necessary processing, potentially involving other buffers, copy data out.

If SIMD were to be used, this buffer must be a multiple of the SIMD width. When SIMD width is dynamic and unbounded, this makes it impossible to use static allocation. This may require significant changes in data layout. I don't see something like 2048 bits as an unreasonable limitation, but if this limitation isn't enforced in the spec, programs have to be written assuming unbounded size, so having an implementation-defined upper bound isn't really helpful.

Thanks! Makes sense. 2048 is a reasonable choice if we are to specify a bound.

zeux commented 3 years ago

However, @zeux, what do you mean by unbounded length? What do you think about the size of SIMD registers on the platform being the upper bound?

I mean that if the spec doesn't impose a limit, an implementation may decide that the vector length should be 1M bits, and I would expect it to be very hard to design an algorithm that handles that for a class of use cases.

penzn commented 3 years ago

The idea is that operations in this proposal correspond to a non-repeating sequence of native ops, so the maximum vector length should be the same as the hardware register size. With this it might still be possible to allocate larger buffers, and we want to give runtimes some flexibility to do that, but we can't be responsible for everything they would ultimately do (allocating a megabyte buffer for less than a kilobyte's worth of data, for example).

I am not sure setting a strict size limit (unlike the other two points) would eliminate the issue of potentially inefficient implementations. Allocating 2048-bit buffers for, say, 512-bit values would still be wasteful.

lemaitre commented 3 years ago

The idea is that operations in this proposal corresponds to a non-repeating sequence of native ops, so the maximum vector length should be same as hardware register size.

In theory, yes. But in practice, as the hardware register size is not known prior to runtime, the maximum vector length cannot match it. Unless you were talking about the runtime constant, in which case they should match.

I am not sure setting a strict size limit (unlike the other two points) would eliminate the issue of potentially inefficient implementations. Allocating 2048 bit buffers, for, say 512 bit values would still be wasteful.

To me, vector constants fall into multiple categories:

One thing that could be done is a way to specify a vector by giving all its elements; when the WASM engine instantiates the code, the constant is truncated to the actual vector length. The encoding of such a constant could also support special formats for common patterns, like broadcasting a 128-bit element to all 128-bit lanes, or adding the lane index to each value.
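That encoding idea can be sketched as follows (all format names are hypothetical, chosen for illustration only):

```python
# Sketch of the constant-encoding idea: a constant is stored in its
# widest form and truncated to the actual vector length at
# instantiation; compact formats cover common patterns such as
# broadcasting a short block or adding the lane index to a base value.
def materialize(kind: str, payload: list[int], vec_lanes: int) -> list[int]:
    if kind == "full":         # all elements given, truncated to length
        return payload[:vec_lanes]
    if kind == "broadcast":    # repeat a short block across the vector
        reps = -(-vec_lanes // len(payload))  # ceiling division
        return (payload * reps)[:vec_lanes]
    if kind == "iota_add":     # add the lane index to a base value
        (base,) = payload
        return [base + i for i in range(vec_lanes)]
    raise ValueError(kind)

print(materialize("broadcast", [1, 2, 3, 4], 8))  # [1, 2, 3, 4, 1, 2, 3, 4]
print(materialize("iota_add", [10], 4))           # [10, 11, 12, 13]
```

The same encoded constant would materialize to different (but consistent) values on implementations with different vector lengths.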

akirilov-arm commented 10 months ago

I have a couple of remarks about setting the vector length dynamically, as currently defined by the proposal. I am not sure if this is the best place to write them down, but the suggestion at the beginning would resolve them, so I will comment here.

First of all, what is the expected behaviour if the program pushes a vector value onto the stack (I am referring to the WebAssembly operand stack, not the actual machine stack in an implementation), changes the vector length, and then pops the value? In particular, consider the case in which the vector length has been decreased below the hardware limit before, and then it is increased back to the maximum imposed by the machine. Should this sequence of operations fail validation? If not, what would the value of the expanded part of the vector be?

While technically it is orthogonal to the proposal, perhaps it is a good idea to think a bit about how setting the vector length dynamically interacts with function calls, or, to put it formally, a potential procedure call standard/ABI for functions that use this operation. Would it be the responsibility of the compiler to generate code to save the previous vector length value in the "prologue" of every function that sets it (and to restore it in the "epilogue"; equivalently, both operations could be done on the caller side)? Or would it be the responsibility of the programmer (in which case saving and restoring would be performed around a vector loop, I presume)?
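The calling-convention question can be made concrete with a small sketch (all names hypothetical; the vector length is modeled as mutable global state, as the dynamic-length variant implies):

```python
# Sketch of a "callee-saved" convention for the dynamic vector length:
# any function that sets the length saves the caller's value in its
# prologue and restores it in its epilogue, so the change never leaks.
class VectorState:
    def __init__(self, max_len: int):
        self.max_len = max_len
        self.length = max_len      # global, dynamically settable length

def callee(state: VectorState) -> None:
    saved = state.length           # "prologue": save the caller's length
    try:
        state.length = state.max_len // 2  # run a loop at a shorter length
        # ... vector loop body would go here ...
    finally:
        state.length = saved       # "epilogue": restore for the caller

state = VectorState(max_len=16)
state.length = 12                  # caller picked a non-default length
callee(state)
print(state.length)                # 12: the caller's length survived the call
```

The alternative, leaving save/restore to the programmer around each vector loop, trades compiler-inserted code for the risk of a stale length escaping a function.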