WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/

Adaptive vector lengths without global state #21

Open sunfishcode opened 3 years ago

sunfishcode commented 3 years ago

Context: #4, #20. Global state is problematic, but even if we make it explicit, the ability to have arbitrary length switches is hard to implement on many architectures. The following is a sketch of a design which avoids both of these, while preserving some key advantages of flexible vectors.

Similar to the explicit loop variant of @aardappel's post here, it has a vec_loop, but this proposal is lower-level, making state explicit and avoiding unnecessary restrictions, while still hiding information as needed by implementations. I haven't prototyped this, but in theory it should be implementable on RISC-V V, SVE, AVX-512, as well as simd128-style architectures.

Implementations with dynamic vectors or masking could use them. But implementations could also choose to emit multiple loops, to handle alignments or remainders. Such implementations could also choose to run some vector loops at twice or more the hardware length, making a register-pressure vs. speed tradeoff as they see fit.
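As an illustration of the multiple-loops strategy, here's a rough sketch (in C, with made-up names) of how an engine might lower the vector loop from the example below onto a 128-bit simd128-style target: a full-width main loop plus a scalar remainder, so the wasm-visible length would be 4 on most iterations and 1 at the end.

#include <stdint.h>
#include <emmintrin.h> // SSE2

// Hypothetical lowering of a vec_loop that adds two i32 arrays.
void add_arrays(const int32_t *A, const int32_t *B, int32_t *C, int32_t n) {
  int32_t i = 0;
  for (; i + 4 <= n; i += 4) { // main loop: every iteration runs with vl = 4
    __m128i a = _mm_loadu_si128((const __m128i *)&A[i]);
    __m128i b = _mm_loadu_si128((const __m128i *)&B[i]);
    _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(a, b));
  }
  for (; i < n; ++i) // scalar remainder: each iteration runs with vl = 1
    C[i] = A[i] + B[i];
}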

New Types:

  • vec<n> -- a vector of elements at most n bits wide; the number of elements in a given value is determined by the enclosing vec_loop iteration. (Written vec.32 etc. in the examples below.)
  • vec_len<n> -- an opaque value recording the number of elements processed by the current vec_loop iteration. (Written vec_len.32 below.) Using this local outside of a vec_loop is prohibited.

New Opcodes:

  • vec_loop n $vl ... end -- a loop in which each iteration processes an implementation-chosen number of (at most) n-bit elements, recorded in $vl.
  • vec_load $vl n / vec_store $vl n -- load or store $vl elements of n bits each.
  • vec_add $vl and other elementwise operations -- operate on $vl elements.
  • vec_step $vl k x -- compute x + k * $vl, for advancing pointers and counters.

Example

Add two arrays of length $n starting at $A and $B and store the result in an array starting at $C.

  (local $A i32) (local $B i32) (local $C i32) (local $n i32)
  (local $t0 vec.32) (local $t1 vec.32) (local $t2 vec.32)
  (local $vl vec_len.32)
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements

    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t1 (vec_load $vl 32 (local.get $B)))
    (local.set $t2 (vec_add $vl (local.get $t0) (local.get $t1)))
    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $B (vec_step $vl 4 (local.get $B)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n)) ;; pass the count back to the top
  end                ;; end vector loop

Nondeterminism

The length in each iteration of a vec_loop is nondeterministic. It's visible in vec_step, in the number of elements loaded and stored in vec_load and vec_store, and in any side effects of the loop. It's expensive to completely hide the hardware vector length with any flexible-vector proposal, so whether or not there should be nondeterminism is an independent question.

One thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program. Wasm doesn't currently have any nondeterminism that behaves like that, and it would have implications for suspending a program and resuming it on another architecture, or for distributing a program over multiple architectures.

Vector subroutines

A call within a vec_loop could make sense if the callsite passes the vec_len value to the callee. Since vec_len is a type, it'd be part of the function signature, so implementations could compile the callee to be called from within a vector loop without whole-program analysis.
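As a sketch of what that could mean on a machine without masks or a vector-length register (plain C, all names hypothetical), the calling convention could simply pass the number of active lanes for this call, and the callee handles a possibly-partial vector:

#include <stdint.h>

// Hypothetical lowering of a callee whose wasm signature includes a vec_len:
// the engine passes the iteration's active lane count as an extra argument.
typedef struct { int32_t lane[4]; } vec32x4;

static vec32x4 vec_add_callee(vec32x4 a, vec32x4 b, int active_lanes) {
  vec32x4 r = a;
  for (int i = 0; i < active_lanes; ++i)
    r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}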

Nested vec_loops

Some architectures wouldn't be able to implement arbitrary nested vector parallelism, so one possible approach would be to prohibit it. Lexical nesting is easy to detect, and dynamic nesting (a call inside a vec_loop calling a function that contains another vec_loop) could be prohibited if we require calls inside vec_loops to have exactly one vec_len argument and prohibit vec_loop inside functions with a vec_len argument.

Masks, shuffles, extracts and inserts, strided loads/stores

There are ways this design could be extended to include these, but I left them out to keep things simple to start with. It's mostly an independent question whether these can be implemented efficiently.

penzn commented 3 years ago

This is a good idea, definitely better than what we were able to come up with in #4. I think it should definitely be investigated as a potential solution. I would still want to establish a baseline with the "exposed length" style, to have something to compare to, especially since the rest of the operations (aside from the length) would not depend on this too much.

The obvious challenge from spec perspective would be introducing a new kind of non-determinism. But that should not be a deterrent, since any proposal hiding the length would have to be non-deterministic.

Another thing worth mentioning is that some work would be needed on tools producing wasm. Existing backends support vector instructions, even variable-length ones, but not the control flow we would need for this. This should not be a deterrent either - there are precedents of wasm operations diverging from what native targets do.

lemaitre commented 3 years ago

Such a design seems too simple to handle most SIMD/vector algorithms, because of the lack of inter-lane operations. The prohibition of nested parallelism is also problematic for quite a few algorithms.

Matrix multiplication is a really simple example that requires both inter-lane operations (either broadcast or reduce) and nested parallelism, and I fail to see how one would implement it with such an interface.

sunfishcode commented 3 years ago

The nested-SIMD restriction is mainly motivated by my understanding of the limitations of hardware platforms. It should be straightforward to lift, if we wish, since there's no global state. The question is, what algorithms would use nested-SIMD, and how would they map to various architectures?

Either way, nesting with regular loop isn't restricted, so eg. outer-loop vectorization is supported.

Broadcast and reduce would be straightforward -- broadcast would be an instruction which takes a scalar operand and produces a vec<n> result, and reductions would be instructions which take a vec<n> operand and return scalar results.

Even shuffles could work: for example, if you set the vec_loop immediate to 128, that says you want to process at least 128 bits each iteration, and then we could let you do 128-bit-wide shuffles (eg. xyzw/rgba kinds of things), while still allowing implementations to run the loop at wider lengths when they want to.

penzn commented 3 years ago

Since the length is tied to lane size, how would this work for widen or narrow instructions?

sunfishcode commented 3 years ago

The n immediate on vec_loop declares the widest element bitwidth that will be operated on in the loop; you can operate at that width or narrower, or both. To convert from an array of 16-bit elements to an array of 32-bit elements, for example, declare a loop with n=32, the wider bitwidth, and use eg. a vec_load with n=16, and let's say we add a vec_extend_s instruction to perform a conversion, with immediates declaring its operand and result widths:

  (local $A i32) (local $C i32) (local $n i32)
  (local $t0 vec.16) (local $t2 vec.32)
  (local $vl vec_len.32)
  ...

  local.get $n
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements

    (local.set $t0 (vec_load $vl 16 (local.get $A))) ;; Load 16-bit elements; return type vec<16>
    (local.set $t2 (vec_extend_s 16 32 $vl (local.get $t0))) ;; vec<16> -> vec<32>
    (vec_store $vl 32 (local.get $C) (local.get $t2)) ;; Store with 32-bit elements

    (local.set $A (vec_step $vl 2 (local.get $A))) ;; Step by 2 bytes for the input array
    (local.set $C (vec_step $vl 4 (local.get $C))) ;; Step by 4 bytes for the output array
    (local.set $n (vec_step $vl -1 (local.get $n))) ;; Decrement the loop counter

    (br_if 0 (local.get $n) (local.get $n)) ;; pass the count back to the top
  end

lemaitre commented 3 years ago

The issue I see here is that you lose parallelism when processing smaller elements, as your registers are not full. The common way to handle multiple element sizes is to process multiple registers of the larger type in order to completely fill the registers with the smaller type. This looks complex to implement with your design.
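For example, on SSE2 a u16 -> u32 widening loop conventionally consumes one full vector of eight 16-bit elements per iteration and produces two full vectors of four 32-bit elements, so no register is ever partially used (a sketch, independent of the proposal):

#include <stdint.h>
#include <emmintrin.h> // SSE2

void widen_u16_to_u32(const uint16_t *A, uint32_t *C, int n) {
  int i = 0;
  __m128i zero = _mm_setzero_si128();
  for (; i + 8 <= n; i += 8) {
    __m128i v = _mm_loadu_si128((const __m128i *)&A[i]);
    __m128i lo = _mm_unpacklo_epi16(v, zero); // elements 0..3, zero-extended to 32 bits
    __m128i hi = _mm_unpackhi_epi16(v, zero); // elements 4..7, zero-extended to 32 bits
    _mm_storeu_si128((__m128i *)&C[i], lo);
    _mm_storeu_si128((__m128i *)&C[i + 4], hi);
  }
  for (; i < n; ++i) // scalar remainder
    C[i] = A[i];
}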

How would you deal with functions? If you have to pass $vl at run time, I guess you would basically lose any benefit of having such a special variable, and your implementation would be, at best, as efficient as masks. If $vl is some kind of load-time constant, you would need to monomorphize all the functions for different sizes. That seems expensive.

Even shuffles could work: for example, if you set the vec_loop immediate to 128, that says you want to process at least 128 bits each iteration, and then we could let you do 128-bit-wide shuffles (eg. xyzw/rgba kinds of things), while still allowing implementations to run the loop at wider lengths when they want to.

I think it would be super useful to have intra-128-bit shuffles. But if you only deal with shuffles inside 128-bit "lanes", then there is a bunch of algorithms that would not really be implementable. Prefix sum is an example.

If you want a toy program to experiment with your design, I would suggest the following one:

#include <stdint.h>

static const uint8_t CNT_LUT[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}; // LUT can fit into a 128-bit register
void popcount_prefixsum(const uint8_t*restrict A, uint16_t*restrict B, int n) {
  uint16_t sum = 0;
  for (int i = 0; i < n; ++i) {
    uint8_t a = A[i];
    uint8_t cnt_low = CNT_LUT[a & 0xf];
    uint8_t cnt_high = CNT_LUT[a >> 4];
    uint8_t cnt = cnt_low + cnt_high; // cnt = popcount(a)
    sum += cnt;
    B[i] = sum;
  }
}

I consider this to be a good primary test for the usefulness of a SIMD ISA, as it exposes a lot of features (mixed type lengths, LUT, reduction, rotation) that people would expect from a SIMD ISA. There are a few others missing, like masks, compress/expand, gather/scatter, and conflict detection, but that is a good start.

In fact, SSE2 does not pass this test, which reflects the shortcomings of that ISA.
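For reference, the LUT step becomes straightforward once a byte shuffle is available; SSSE3's _mm_shuffle_epi8, which SSE2 lacks, is the classic way to do it (a sketch of just the per-byte popcount):

#include <stdint.h>
#include <tmmintrin.h> // SSSE3

__m128i popcount_epi8(__m128i v) {
  const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i low_nibble = _mm_set1_epi8(0x0f);
  __m128i lo = _mm_and_si128(v, low_nibble);                    // a & 0xf
  __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_nibble); // a >> 4
  return _mm_add_epi8(_mm_shuffle_epi8(lut, lo), _mm_shuffle_epi8(lut, hi));
}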

sunfishcode commented 3 years ago

The issue I see here is that you lose parallelism when processing smaller elements, as your registers are not full. The common way to handle multiple element sizes is to process multiple registers of the larger type in order to completely fill the registers with the smaller type. This looks complex to implement with your design.

One possible way to model this would be to add a vec_half opcode, which has type (vec_len<n>) -> (vec_len<n*2>), and which conceptually divides the VL in half so that you can do half-VL-but-double-width loads with it:

(vec_loop 16 $vl
    (local.set $half_vl (vec_half (local.get $vl)))
    (local.set $t0 (vec_load $half_vl 32 (local.get $A))) ;; Load 32-bit elements; return type vec<32>
    (local.set $t1 (vec_load $half_vl 32 (vec_step $half_vl 4 (local.get $A)))) ;; Load the next 32-bit elements; return type vec<32>
    (local.set $t2 (vec_wrap 32 16 $vl (local.get $t0) (local.get $t1))) ;; vec<32> -> vec<16>
    ...

How would you deal with functions? If you have to pass $vl at run time, I guess you would basically lose any benefit of having such a special variable, and your implementation would be, at best, as efficient as masks. If $vl is some kind of load-time constant, you would need to monomorphize all the functions for different sizes. That seems expensive.

The key is that $vl has a distinct type, vec_len<n>. When the engine is compiling a function with a $vl argument, it knows that function is called within vector contexts, so it can use different calling conventions, and different strategies under the covers. A simple strategy for a machine with a fixed VL might be to just have the function handle one VL-length vector per call, and use a scalar remainder loop when compiling the vec_loop. On an architecture with mask registers, you could pass a mask value as an extra argument, and use that. On an architecture with a vector-length register, the function could just use it.
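As a sketch of the mask-passing strategy (C with AVX-512 intrinsics; the calling convention itself is hypothetical), a callee compiled for vector contexts could receive the iteration's active lanes as a mask argument:

#include <immintrin.h> // AVX-512F

// Hypothetical lowering of a callee with a vec_len.32 parameter on AVX-512:
// the active lanes for this call arrive as a __mmask16.
__m512i add_in_vec_context(__m512i a, __m512i b, __mmask16 active) {
  return _mm512_maskz_add_epi32(active, a, b); // inactive lanes are zeroed
}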

But if you only deal with shuffles inside 128-bit "lanes", then there is a bunch of algorithms that would not be really implementable. Prefix sum is an example.

If you want a toy program to experiment with your design, I would suggest you the following one:

I'm not very familiar with these kinds of algorithms. Has anyone prototyped this kind of prefix-sum algorithm with the other flexible SIMD proposals? I'd be curious to see how this looks.

lemaitre commented 3 years ago

@sunfishcode

I finally found time to give it a thought, and your design might just work. The syntax looks a bit weird, but it seems to solve the issues we are trying to solve.

The problem I see with your design is the necessity to have a remainder for some ISAs (so for all WASM code in practice). As I said in another issue (most of #7), scalar remainders should be avoided because their cost grows with SIMD width, to the point that the scalar remainder might account for the majority of the processing time for not-so-small loops.

If you implement a masked remainder to alleviate this issue, what is the actual gain of your vec_loop design? Plus, you seem to propose masked operations on all iterations for targets with native masks, but generating this mask might be not so cheap for tiny loops. So even if you have masks, you still might want to avoid them for most of your loop, and keep them only for the remainder.

To nuance my comment, I think your main goal of avoiding global state is really good, and it is one goal you and I have in common. It's just that I think masked operations are pretty much necessary for other reasons, and could be used also for loops. I think we could keep your idea of a special kind of mask to encode vector length (which could be implemented more optimally on archs without native support for masks), generate the loop body with full-width vectors, and use those special masks only for the remainder.

sunfishcode commented 3 years ago

The vec_loop design avoids the need to decide which remainder strategy is best by letting wasm engines pick any remainder strategy they want.

At the wasm level, vec_loop can handle any n, so wasm code doesn't need explicit remainder loops. In the implementation, implementations can choose to lower vec_loop into one loop or multiple loops. And they can pick any remainder strategy they want, such as masking in the main loop, a separate remainder loop with masking, a separate scalar remainder loop, using a vector-length register, or anything else that makes sense on the target hardware.
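For example (just one of the possible strategies), on an AVX-512 target an implementation could lower vec_loop with masking in the main loop, so there's no separate remainder at all; a sketch in C:

#include <stdint.h>
#include <immintrin.h> // AVX-512F

void add_arrays_masked(const int32_t *A, const int32_t *B, int32_t *C, int n) {
  for (int i = 0; i < n; i += 16) {
    int rem = n - i;
    __mmask16 m = rem >= 16 ? (__mmask16)0xffff : (__mmask16)((1u << rem) - 1);
    __m512i a = _mm512_maskz_loadu_epi32(m, &A[i]); // vl = popcount(m) this iteration
    __m512i b = _mm512_maskz_loadu_epi32(m, &B[i]);
    _mm512_mask_storeu_epi32(&C[i], m, _mm512_add_epi32(a, b));
  }
}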

lemaitre commented 3 years ago

The vec_loop design avoids the need to decide which remainder strategy is best by letting wasm engines pick any remainder strategy they want.

In theory, I would agree. But in practice, it seems impractical to require WASM engines to change the meaning of the code depending on whether it is the remainder or not. This is better achieved by compilers, because they have time to reason about the code. Plus, this would require duplicating (monomorphizing) functions taking a vec_len: one for the case where the length is known to be maximal (in the loop body), and one for the generic case. Otherwise, you would slow down the loop body for no good reason (especially on ISAs without native support for masks, like SSE or Neon).

At the wasm level, vec_loop can handle any n, so wasm code doesn't need explicit remainder loops.

This is not compatible with something you said earlier about vec_len always being full on some ISAs. Or maybe I misunderstood that last point. So let me ask again to ensure I understand your point correctly: would you have dynamic vec_len for SSE or Neon?

And they can pick any remainder strategy they want, such as masking in the main loop, a separate remainder loop with masking, a separate scalar remainder loop, using a vector-length register, or anything else that makes sense on the target hardware.

This looks super complex to implement in a WASM engine that has basically no time to handle those. It would indeed need to rebuild the semantics of the code (building the AST and the SSA form) in order to change the meaning of the code and make the actual decision. Unless there is something I am not aware of, this looks very impractical.

Also, there are some hardware solutions that cannot fit this model. The one I have in mind is the global vector length, because you would not be able to handle nested loops. That being said, as long as you don't target such an architecture, that's fine. And to be fair, what I have in mind cannot handle those either. I just wanted to highlight this limitation, to remind you that, even in theory, your proposal cannot accommodate all architectures.

sunfishcode commented 3 years ago

This is better achieved by compilers because they have time to reason about the code. [...] This looks super complex to implement on a WASM engine that has basically no time to handle those.

In general in wasm, engine compile time is an important concern. For wasm SIMD, though, there's a good argument for letting engines take (somewhat) more time. On one hand, achieving portability, efficiency, and practicality in a single SIMD design is already very difficult, and the current designs are already making some major tradeoffs. Adding stringent compile-time restrictions risks making these worse. On the other, SIMD is typically a very small portion of most applications (in static code terms), so making SIMD compilation slower won't affect most code. So if taking additional compile time on SIMD code gives us better performance, portability, or practicality, it's worth considering.

The design proposed here would likely require JITs to build an IR to perform many of the optimizations we want here. However, all major JIT-based wasm engines' top-tier backends already build IRs.

Would you have dynamic vec_len for SSE or Neon?

Yes. Here's an overview of how vec_loop would work at different levels.

* Other implementation strategies are possible; I just picked some strategies to serve as examples.

Also, there are some hardware solutions that cannot fit this model. The one I have in mind is the global vector length, because you would not be able to handle nested loops.

I wrote a bit about this in the "Nested vec_loops" section above. vec_loop does support global-vector-length architectures such as RISC-V V, and in fact, they'd likely be the easiest platforms to support. Also, you can nest vec_loop inside loop, or loop inside vec_loop, for outer-loop vectorization. The thing that's unsupported is nesting vec_loop within vec_loop, which to my knowledge doesn't correspond to a common SIMD idiom, but if it does turn out to be important, it's something I expect the design could be extended to handle.

lemaitre commented 3 years ago

Ok, that is much more clear to me. So you would indeed have dynamic vec_len<> objects that can encode a length from 0 up to a constant defined by the target architecture, and you would monomorphize all the functions taking a vec_len<>: one version with a dynamic value, and one version with the maximal value. The loop body would, most likely, use the specialized version, while the remainder would always use the dynamic version.

This makes total sense, and the translation overhead looks reasonable. I can now totally see how WASM code would be translated into machine code (except maybe Risc-V V, but that's because I don't know enough about it).

About nesting: yes, I was thinking about nesting vec_loops. It is a bit weird, but some algorithms actually do nest. Some implementations of the matrix product with on-the-fly transposition, for instance. But it might be reasonable to forbid it for now, and reconsider it later once we have more experience.

Re-reading your proposal, there is something a bit odd: "Using this local outside of a vec_loop is prohibited." I see no reason to forbid the use of a vec_len<> outside a vec_loop. Is it to accommodate Risc-V V? Because I think that having to handle different type sizes would have the same implementation complexity as multiple vec_len<>s, no? One use case I see for vec_len<> outside of vec_loop is to generate loop constants from functions taking a vec_len<>.

sunfishcode commented 3 years ago

Indeed. And, an implementation which unrolls loops could also use lengths greater than the target architecture's nominal length on some loops.

Forbidding vec_len outside of vec_loop is because the value is only meaningful within a given iteration of a vec_loop. If you want to pre-generate a vector to pass into a vec_loop, doing it manually would require knowing how many elements to add to it, but that's tricky because the vector length might differ between vec_loop iterations. A possible alternative might be to add instructions to algorithmically describe vector values. A common example of this is a broadcast instruction, but in theory we could also add instructions to produce vector values from arithmetic sequences, even/odd patterns, or other things.
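There's hardware precedent for describing vectors algorithmically; for example (SVE intrinsics, as a sketch), SVE's INDEX instruction materializes an arithmetic sequence at whatever length the hardware is running, and DUP broadcasts a scalar, so wasm-level instructions of this kind could lower directly:

#include <arm_sve.h>

svuint32_t arith_sequence(void) {
  return svindex_u32(0, 3); // {0, 3, 6, 9, ...} out to the hardware VL
}

svuint32_t splat_seven(void) {
  return svdup_n_u32(7); // {7, 7, 7, ...}
}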

lemaitre commented 3 years ago

Forbidding vec_len outside of vec_loop is because the value is only meaningful within a given iteration of a vec_loop.

The thing is, if you explicitly forbid vec_len outside of vec_loop, it has to be written in the spec, and WASM compilers/engines will have to make sure it never happens. So it adds complexity at multiple levels. Adding all this complexity just because you don't see a use case is pretty pointless. That would make sense if this gave you more guarantees as a result, but I don't see it.

If you want to pre-generate a vector to pass into a vec_loop, doing it manually would require knowing how many elements to add to it, but that's tricky because the vector length might differ between vec_loop iterations.

You could want to build an in-register LUT (like for emulating popcount), and you just happen to have a function that takes a vec_len and computes popcount for all elements. If you forbid vec_len outside loops, you would not be able to call this function to build your LUT: you would have to inline it manually, or duplicate it without the vec_len.

If vec_len is allowed outside vec_loop, you could build your temporary LUT by calling the function with a maximal vec_len (and the right input vector, of course).

A possible alternative might be to add instructions to algorithmically describe vector values.

That's another problem. It is interesting and must be solved at some point, but it is orthogonal to the discussion we have now.


Let me explore something with you. It seems that forbidding vec_len outside of vec_loop is the same issue as nesting loops: you want to have a single "active" vec_len.

Having multiple "active" vec_len is no problem for most ISAs (SSE, AVX, AVX512, Neon, SVE, Altivec, VSX...) because they decorate operations and are not part of global state. If we decided to target only those, it would be simpler to allow vec_len outside of vec_loop and nested vec_loops. Simpler on the spec, simpler on the compilers, simpler on the engines.

Now, this design is problematic when the target architecture handles vec_len with a global state, and allowing a single "active" vec_len ensures we don't need to constantly switch between vec_lens. The only reasonable ISA like that we might target is Risc-V V, so let me continue by considering only Risc-V V and nothing else.

As I mentioned earlier, we need a way to process types with different sizes in the same loop, and you proposed vec_len<n> where n would be the size (in bits) of the type worked on, and you would have instructions to convert a vec_len<n1> into a vec_len<n2>. With this, you would have multiple vec_lens inside the loop, but they are somehow tied together. As far as I recall, Risc-V V needs a vsetvl each time you want to process different types (with different sizes). So if we need to handle multiple types inside the loop, the WASM engine would already need to generate multiple vsetvls for the loop, namely: before each operation that uses a different type size than the previous one.

In fact, this is not different from having multiple vec_len<n> (with the same n), and calling vsetvl before each operation that operates with a different vec_len<n> than the previous one. So to me, the machinery to handle multiple types in Risc-V V is already enough to support multiple "active" vec_lens at the same time. So we could allow it on Risc-V V, and all ISAs would benefit from it.


Also, I've looked back at your example with multiple type size:

(vec_loop 16 $vl
    (local.set $half_vl (vec_half (local.get $vl)))
    (local.set $t0 (vec_load $half_vl 32 (local.get $A))) ;; Load 32-bit elements; return type vec<32>
    (local.set $t1 (vec_load $half_vl 32 (vec_step $half_vl 4 (local.get $A)))) ;; Load the next 32-bit elements; return type vec<32>
    (local.set $t2 (vec_wrap 32 16 $vl (local.get $t0) (local.get $t1))) ;; vec<32> -> vec<16>
    ...

And this example only works when $vl is maximal (full). Indeed, if the maximal vec_len<16> is 8 and $vl is 5, you need to make the first load of $A with a vec_len<32> of 4 and the second load with a vec_len<32> of 1. In theory, (3 + 2) would also be valid, but as 5 is not even, you cannot have the loads with the same vec_len<32>. This means that we really need multiple "active" vec_len<n> if we want this code to work.

And processing multiple type sizes at the same time is really common in signal processing, and also in AI if I'm not mistaken.


In conclusion, I really believe we need to have multiple "active" vec_lens at the same time, and if we do, there is no point in forbidding either nested vec_loops or vec_len outside vec_loop. Plus, there is no implementation issue on any major platform, and even on Risc-V V it is solvable, and the solution is already needed for other reasons (multiple type sizes).

jan-wassenberg commented 3 years ago

As far as I recall, Risc-V V needs a vsetvl each time you want to process different types (with different sizes). So if we need to handle multiple types inside the loop, the WASM engine would already need to generate multiple vsetvls for the loop, namely: before each operation that uses a different type size than the previous one.

Note that RVV's LMUL (ganging together 2,4,8 registers, which are all affected by a single instruction) avoids the need for many/most? setvl. Even after promoting u16->u32, the number of lanes has not changed, only LMUL has doubled. There is also mf2..8 for half/quarter/eighths, but I have not yet seen these supported in intrinsics.
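A sketch of that with the RVV C intrinsics (the recent __riscv_-prefixed naming is assumed here; details vary between intrinsics versions): a single vsetvl at e16/m4 covers both the 16-bit load and the widened 32-bit store, because widening doubles LMUL (m4 -> m8) while the element count stays the same.

#include <stdint.h>
#include <riscv_vector.h>

void widen_add(const uint16_t *A, uint32_t *C, size_t n, uint16_t bias) {
  for (size_t vl; n > 0; n -= vl, A += vl, C += vl) {
    vl = __riscv_vsetvl_e16m4(n);                         // one vsetvl per strip
    vuint16m4_t a = __riscv_vle16_v_u16m4(A, vl);         // load u16 at LMUL=4
    vuint32m8_t w = __riscv_vwaddu_vx_u32m8(a, bias, vl); // widen to u32 at LMUL=8
    __riscv_vse32_v_u32m8(C, w, vl);                      // store u32, same vl
  }
}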

lemaitre commented 3 years ago

Note that RVV's LMUL (ganging together 2,4,8 registers, which are all affected by a single instruction) avoids the need for many/most? setvl.

You still need vsetvl to change LMUL, don't you?

In the documentation (risc-v-spec p.24), they explicitly call vsetvl to change the data type size.

That being said, it is true that with LMUL, the need for more registers after promotion vanishes. But that is basically the only ISA that I know of that can do that.

jan-wassenberg commented 3 years ago

Looks like there have been some changes. VLH from your link no longer exists in 1.0-draft, and vtype (including LMUL) can indeed only be set by vsetvli (the special form with x0 in/out that changes only vtype, not VL).

Further changes are afoot: intrinsics seem to be changing to accept an AVL parameter and the compiler (in non-wasm usage) would be in charge of emitting vsetvli when it has changed. And to add yet more unclarity: store ops encode an EEW (effective element width), so it seems we could have LMUL=4, widen implicitly to LMUL=8, then store without vsetvli in between?

That being said, it is true that with LMUL, the need for more registers after promotion vanishes. But that is basically the only ISA that I know of that can do that.

Agreed.

sunfishcode commented 3 years ago

You could want to build an in-register LUT (like for emulating popcount), and you just happen to have a function that takes a vec_len and computes popcount for all elements. If you forbid vec_len outside loops, you would not be able to call this function to build your LUT: you would have to inline it manually, or duplicate it without the vec_len.

A pre-computed vector value would need to have a particular length, which would require the implementation to use that length for all iterations of all loops which could potentially use that value.

One of the goals of vec_loop is to let the implementation adjust the length as a loop runs, or between different loops, for the purpose of handling remainders, peeling for alignment, load balancing, using specialized hardware features, making register-pressure tradeoffs, or other things. Among other things, this is what makes it possible to avoid remainder loops at the wasm level, which is what avoids having wasm code bake in a particular remainder-loop strategy (masking, scalars, etc.) and is what gives implementations the flexibility to do what's efficient on the machine.

For LUT-like use cases, some possible options which would work with vec_loop include:

  • Add actual popcount or popcount-reduction instructions, or other instructions to handle specific use cases.
  • Add a way to algorithmically generate the popcount LUT pattern or other LUT patterns in a vector.
  • Perhaps we could add a concept of a "max VL" that we guarantee no engine will ever exceed, and then you could generate a LUT at "max VL" and we could say that the LUT implicitly excludes trailing elements as needed to match the length of a vec_loop iteration.

lemaitre commented 3 years ago

A pre-computed vector value would need to have a particular length, which would require the implementation to use that length for all iterations of all loops which could potentially use that value.

You seem to want vec_len to actually change the number of elements inside a vector, discarding all the elements whose index is larger than said vec_len and making them completely inaccessible, as if they never existed and never will. This design works only for very simple loops without any loop-carried dependency.

A really simple code that should work is reduction (written in Neon for the sake of simplicity):

#include <stdint.h>
#include <arm_neon.h>

uint32_t scalar_sum(const uint32_t * A, int n) {
  uint32_t s = 0;
  for (int i = 0; i < n; ++i) {
    s += A[i];
  }
  return s;
}

uint32_t neon_sum(const uint32_t * A, int n) {
  uint32x4_t s = vdupq_n_u32(0); // s has a length of 4
  int i = 0;
  for (; i < (n & -4); i += 4) {
    uint32x4_t a = vld1q_u32(&A[i]);
    s = vaddq_u32(s, a); // s has a length of 4
  }
  if (i < n) { // remainder
    // generate mask
    uint32x4_t I = {0, 1, 2, 3};
    I = vaddq_u32(I, vdupq_n_u32(i));
    uint32x4_t mask = vcltq_u32(I, vdupq_n_u32(n));
    // masked-load
    uint32x4_t a = vld1q_u32(&A[i]); // assume A is aligned with 16
    a = vandq_u32(a, mask); // a has a "length" of 1, 2, or 3

    s = vaddq_u32(s, a); // s has a length of 4
  }
  // s has a length of 4
  return vaddvq_u32(s);
}

As you can see here, we need to define s before the loop: its length is not correlated to the length inside loop iterations. Indeed, for the remainder, the length of the iteration is 1, 2, or 3, but the length of s always remains 4. It has to keep the end elements around for the final, in-register, reduction.

This can work with your design only if vec_len decorates operations and not vectors. Meaning: vec_len acts as a glorified mask, not as an actual vector length. This issue applies to all loop-carried dependencies, not only accumulators: if you need values from the preceding iteration, you need to operate on vectors with different vec_lens, at least for the remainder.

That's why I proposed that you play with scan algorithms: they have a loop-carried dependency, and they are super common in signal and image processing. With your limitations, I fail to see how one could implement them with your design.

One of the goals of vec_loop is to let the implementation adjust the length as a loop runs, or between different loops, for the purpose of handling remainders, peeling for alignment, load balancing, using specialized hardware features, making register-pressure tradeoffs, or other things. Among other things, this is what makes it possible to avoid remainder loops at the wasm level, which is what avoids having wasm code bake in a particular remainder-loop strategy (masking, scalars, etc.) and is what gives implementations the flexibility to do what's efficient on the machine.

That is actually compatible with vec_len being glorified masks, and that is also compatible with having constants generated before the loop or loop-carried dependencies. But if you say that vec_len actually changes the intrinsic length of vectors, meaning that elements beyond current vec_len simply don't exist, this very view is incompatible with loop-carried dependencies. But as I explained, we can have a semantic of vec_len that is compatible.

For LUT-like use cases, some possible options which would work with vec_loop include:

  • Add actual popcount or popcount-reduction instructions, or other instructions to handle specific use cases.
  • Add a way to algorithmically generate the popcount LUT pattern or other LUT patterns in a vector.
  • Perhaps we could add a concept of a "max VL" that we guarantee no engine will ever exceed, and then you could generate a LUT at "max VL" and we could say that the LUT implicitly excludes trailing elements as needed to match the length of a vec_loop iteration.

First, you simply cannot have specialized instructions for all the use cases, so my example is popcount, but it could have been anything else. And no matter what kind of specialized instruction you could come up with to generate LUTs and whatnot, there will always be some use cases not handled by your specialized instructions. If you provide a way to "algorithmically" generate any kind of vector, nice: you have just invented a new language within the language, with a new level of complexity. Also, LUTs are not necessarily known at compile time, but can still be constant for the whole duration of a loop.

Also, LUTs are kind of special because the length of the LUT vector is not correlated with the length of the data it is applied to. For instance, with the popcount LUT I gave as an example, even if you only have an 8-element vector of bytes, you would still need the 16-element LUT.

Also, you seem to have disregarded half of my comment, but this other half is at least as important as the one you replied to.


All in all, your proposal looks half-baked: the starting point looks good, but you should really try to see all the implications of your model, and try to "implement" common SIMD algorithms with it to see if it can work in practice. For now, it seems that many algorithms don't fit, and your explanations don't address those algorithmic issues.

So please, take back the example algorithm I gave you about prefix sum of popcount with mixed element sizes, and see how you can make it work with your design. You could even replace the prefix sum with a reduction, that would already be a good test.

sunfishcode commented 3 years ago

When you mentioned scan algorithms earlier, I replied:

I'm not very familiar with these kinds of algorithms. Has anyone prototyped this kind of prefix-sum algorithm with the other flexible SIMD proposals? I'd be curious to see how this looks.

I'd be interested in your answer here.

You're right that my proposal here as it stands does not support loop-carried dependencies. The neon reduction example is indeed something that vec_loop as it stands would not be able to express as such. However, this example hard-codes a particular vector length. If you're proposing we just have fixed-length vectors and not flexible-length vectors, that's fine, though I'd ask you to open a separate issue for it. If we go with fixed-length only, there's no point in having a vec_loop.

Otherwise, let's talk about how these popcount/scan/mixed-type examples look in other flexible-length-vector programming models. Precomputing a vector of arbitrary values, and loop-carried dependencies, will be "interesting" in any flexible-length proposal.

I didn't reply about mixed-element-type loops yet because we still seem to be discussing the basic mechanism of vec_loop and how it achieves its goals in simple uniform-element-type, embarrassingly parallel cases. It's difficult to discuss extensions to vec_loop to handle more use cases until we have a common starting point.

lemaitre commented 3 years ago

I'm not very familiar with these kinds of algorithms. Has anyone prototyped this kind of prefix-sum algorithm with the other flexible SIMD proposals? I'd be curious to see how this looks.

I'd be interested in your answer here.

Here would be an implementation in SVE:

#include <stdint.h>
#include <arm_sve.h>

void prefix_sum(uint32_t const* A, uint32_t* B, int n) {
  svuint32_t last = svdup_n_u32(0);
  svuint32_t zero = svdup_n_u32(0);
  uint64_t vl = svcntw();
  for (int i = 0; i < n; i += vl) {
    svbool_t mask = svwhilelt_b32(i, n);
    svuint32_t a = svld1_u32(mask, &A[i]);
    // in-register prefix-sum
    for (int shift = 1; shift < vl; shift <<= 1) { // this loop could be unrolled on fixed-size ISAs
      svbool_t splice_mask = svwhilelt_b32(0, shift); // select "shift" leading zeros to splice in
      svuint32_t sa = svsplice_u32(splice_mask, zero, a); // shift elements up by "shift" amount
      a = svadd_u32_m(mask, a, sa);
    }
    // propagate previous sum
    a = svadd_u32_m(mask, a, last);
    last = svdup_n_u32(svclastb_n_u32(mask, 0, a)); // broadcast last active element to all lanes
    svst1_u32(mask, &B[i], a);
  }
}

I cannot tell you exactly how one would implement such an algorithm in the other flexible-vectors proposals, because there are operations missing. But my mental model of flexible vectors is, and has always been, close to what SVE is, and I thought that was commonly accepted.

To make my view explicit: the target architecture has a constant vector width. This width is not known at compile time, but it is known at translation time. However, you can have an "effective" vector length (possibly implemented with masks) to apply operations to fewer elements. This view is actually compatible with your proposal as stated in the first post of this issue (even if I did not notice from the start). However, it seems incompatible with your view.

You're right that my proposal here as it stands does not support loop-carried dependencies.

This is a big issue as many algorithms require loop-carried dependencies. And to me, it is essential.

The neon reduction example is indeed something that vec_loop as it stands would not be able to express as such. However, this example hard-codes a particular vector length.

This can be implemented in a vec_loop design if you say that vec_len is a special kind of mask, and that operations with a partial vec_len leave elements beyond it untouched.

If you're proposing we just have fixed-length vectors and not flexible-length vectors, that's fine, though I'd ask you to open a separate issue for it. If we go with fixed-length only, there's no point in having a vec_loop.

I'm not opposed to a runtime vec_len, quite the contrary, in fact. But I believe this should be viewed as masking operations, not as actually having fewer elements. This gives you mostly the same benefits, but enables much more. And I repeat myself: this "vec_len is a mask" view is fully compatible with your proposal as stated in your first post.

Otherwise, let's talk about how these popcount/scan/mixed-type examples look in other flexible-length-vector programming models. Precomputing a vector of arbitrary values, and loop-carried dependencies, will be "interesting" in any flexible-length proposal.

Concerning LUTs and scans, as soon as you have a single internal vector length, those are "just" a matter of adding new instructions; nothing that really needs a new conceptual view. Also, loop-carried dependencies just work when you have a single internal vector length. Mixed types are a bit more complex, but if you can have multiple "effective lengths" at the same time, they should not be an issue either.

Precomputing a vector of arbitrary values at runtime should not be an issue in any design, because it should just be calling SIMD instructions; SIMD can exist outside loops. Precomputing at compile time or translation time needs more thought, but is not tied to this proposal.

I didn't reply about mixed-element-type loops yet because we still seem to be discussing the basic mechanism of vec_loop and how it achieves its goals in simple uniform-element-type embarassingly-parallel cases. It's difficult to discuss extensions to vec_loop to handle more use cases until we have a common starting point.

Even if I think it is directly related to the core issue I mentioned, I agree to drop it for now.


I think all the issues I pointed out revolve around the single fact that you really want to have vectors with really different lengths at run time. And I think having vec_len operate as a mask is compatible with vec_loop and gives the same benefits as truly differently-sized vectors, while also solving all the issues mentioned. That's why I really insist on this point.

sunfishcode commented 3 years ago

Thanks for that example. One thing I should be more clear about is my interest in seeing if it's possible to avoid implicit state. As I mentioned at the outset, "one thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program." As I understand it, an SVE-style VL is not mutable, but it is state that needs to be available and shared between all wasm instances that a SIMD computation might span. This places constraints on suspending and resuming programs on machines with different computing resources, on distributing programs across networks, on tiering execution strategies, and other things. As an example, an implementation with a powerful but power-hungry long-vector accelerator can't decide whether to use that accelerator based on runtime power-usage conditions; it has to run all loops at the same length, which might make the fallback inefficient.

While thinking about this, it occurs to me that the SVE model could perhaps be improved in this respect by adding an instruction which allows programs to declare when they've finished a SIMD kernel, meaning that implementations could pick a new VL next time they enter a SIMD kernel.

vec_loop, on the other hand, has no implicit state. It's more limited, but it still addresses many common use cases. It's also a more natural fit for RISC-V V-style hardware. So I think it's interesting to explore to see if it could be practical.

Besides implicit state, here's an attempt to summarize the high-level differences between the SVE model and vec_loop:

  • The SVE model gives programmers more control. You can craft your own LUTs, have loop-carried dependencies, or even write completely different code for VL=4 vs VL=8 or other things. With vec_loop, novel use cases will more often require more creativity and new language constructs, and may not be possible in all cases. It may be possible to extend vec_loop to handle the prefix sum algorithms you're talking about here, though I don't yet have a design in mind.
  • The vec_loop model gives implementations more flexibility. Implementations can use any remainder strategy they wish according to what's efficient on the machine. They can even use different strategies or lengths on different loops, or change strategies or lengths over the lifetime of a program. With the SVE model, these strategies are baked into the wasm code and the length cannot be changed once it's observed.

lemaitre commented 3 years ago

One thing I should be more clear about is my interest in seeing if it's possible to avoid implicit state.

Indeed, this view was implicit up until now. In your proposal, you mentioned global state, not implicit state.

As I understand it, an SVE-style VL is not mutable, but it is state that needs to be available and shared between all wasm instances that a SIMD computation might span.

It depends on what you call VL: if it's the actual number of elements stored in a SIMD register, then yes, it is constant. But if VL means the number of elements on which instructions operate, then VL can be mutated (but still has a constant maximum value).

To me, there is no real way around it if we want to support out-of-loop SIMD initialization and loop-carried dependencies. But if you find a way around, I would be glad to hear it.

BTW, Risc-V V also has such an implicit state: "Elements in any destination vector register group with indices ≥ vl are unmodified during execution of a vector instruction. When vstart ≥ vl, no elements are updated in any destination vector register group." Risc-V V spec, sec 3.4

This places constraints on suspending and resuming programs on machines with different computing resources, on distributing programs across networks, tiering execution strategies, and other things.

This concern is interesting, but it is far beyond WASM's scope and looks like a research topic on its own. In fact, as far as I'm aware, there is no runtime in any language that can do this kind of "ISA" hot swap. Sure, there are some runtimes that have multiple versions of the same kernel for different targets and can select where to run a kernel at runtime, but this still requires having one manually specialized version per "accelerator". This would be completely different from what you propose (and might actually be much simpler to implement in the engine).

Also, how is the situation any different from native code? The goal of WASM is to be close to native. Native cannot do that, so I wonder why WASM would require it.

While thinking about this, it occurs to me that the SVE model could perhaps be improved in this respect by adding an instruction which allows programs to declare when they've finished a SIMD kernel, meaning that implementations could pick a new VL next time they enter a SIMD kernel.

I don't get how it is an improvement, but yes, it could be done. The question is: how do you deal with nesting SIMD contexts? At first you might say: no nesting, it does not make sense, and I would agree. But if you write a SIMD function in a library, you would want to open a SIMD context. But you don't know your caller, so maybe the caller is already in its own SIMD context. How do you deal with this situation?

vec_loop on the other hand has no implicit state. It's more limited, but it still addresses many common use cases. It's also a more natural fit for RISV-V V-style hardware. So I think it's interesting to explore to see if it could be practical.

While I mostly agree that vec_loop has no implicit state, it is not a more natural fit for Risc-V V: as I said, RVV has implicit state by keeping elements beyond VL untouched, and the maximal VL is constant. This is not really different from the SVE paradigm. The main difference is that SVE has explicit masks to encode VL, while RVV has global state.

  • The SVE model gives programmers more control. You can craft your own LUTs, have loop-carried dependencies, or even write completely different code for VL=4 vs VL=8 or other things. With vec_loop, novel use cases will more often require more creativity and new language constructs, and may not be possible in all cases. It may be possible to extend vec_loop to handle the prefix sum algorithms you're talking about here, though I don't yet have a design in mind.

With SVE, you cannot really have different code for different VLs, except with runtime ifs, but that would be the same for all "flexible" vector ISAs. You say that vec_loop would often require new language constructs to handle "novel cases". My question is: what would such constructs look like for reductions, scans, LUTs, and dependencies in general? How many of those constructs would be necessary to implement most current SIMD algorithms? You seem to highly underestimate how many SIMD algorithms require loop-carried dependencies. Dot product and matrix multiplication don't fit into your mental model because they require loop-carried dependencies, and those are really not "novel" use cases (neither is prefix sum).

  • The vec_loop model gives implementations more flexibility. Implementations can use any remainder strategy they wish according to what's efficient on the machine. They can even use different strategies or lengths on different loops, or change strategies or lengths over the lifetime of a program. With the SVE model, these strategies are baked into the wasm code and the length cannot be changed once it's observed.

vec_loop might be more flexible for WASM engines, but definitely not for end users. Sure, the WASM engine could optimize the translation more heavily in that case, but it is not worth it if most code cannot benefit from (wide) SIMD at all.

sunfishcode commented 3 years ago

Ok, yes, you got me, when I said loop-carried dependencies, I should have clarified that reductions and dot products and other useful cases of loop reductions can be handled. I even described how reductions would work, earlier in this thread.

And yes, I'm aware that RISC-V has state. The proposal here wouldn't expose that state to wasm directly.

And yes, I'm aware that vec_loop would be more flexible for implementations and less for programmers. That's what I just wrote.

In fact, as far as I'm aware, there is no runtime in any language that can do this kind of "ISA" hot swap

https://github.com/bytecodealliance/wizer

for example.

Also, how the situation is any different than with native code? The goal of WASM is to be close to native. Native cannot do that, so I wonder why WASM would require it.

If you want native code, use native code.

lemaitre commented 3 years ago

Ok, yes, you got me, when I said loop-carried dependencies, I should have clarified that reductions and dot products and other useful cases of loop reductions can be handled. I even described how reductions would work, earlier in this thread.

I checked the thread again just to be sure, and no, you never mentioned how reductions would work. You even said: "The neon reduction example is indeed something that vec_loop as it stands would not be able to express as such".

I am genuinely curious to know how you would handle reductions, because I have no clue, and reductions are an important part of many SIMD algorithms.

And yes, I'm aware that RISC-V has state. The proposal here wouldn't expose that state to wasm directly.

So how is vec_loop "a more natural fit for RISC-V V-style hardware"? The SVE model looks closer to the Risc-V V model than vec_loop does.

And yes, I'm aware that vec_loop would be more flexible for implementations and less for programmers. That's what I just wrote.

You missed my point: a less-flexible-for-programmers approach would be ok if said programmers could still use it. But as it stands, most will not be able to use it, because it lacks basic features whose absence makes most SIMD algorithms just impossible. I am still waiting for you to show me how one could handle even a simple reduction.

https://github.com/bytecodealliance/wizer

for example.

This one is interesting, but it is much more specific than what you explained earlier. It is not some general "ISA hot-swap"; it is basically two programs (with the same source code), one that generates constant data and runs once, and one that uses that data and runs multiple times. Making SIMD vectors not leak from one to the other is much simpler in this case than in the general case. We don't need the extra complexity of separate SIMD contexts to make that work.

Also, how the situation is any different than with native code? The goal of WASM is to be close to native. Native cannot do that, so I wonder why WASM would require it.

If you want native code, use native code.

From webassembly.org main page:

Efficient and fast

The Wasm stack machine is designed to be encoded in a size- and load-time-efficient binary format. WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.

One goal of WASM is to have a portable binary format that is as fast as native code.

Automatic and transparent offloading of code into accelerators is not a WASM goal.

penzn commented 3 years ago

Firstly, please do keep discussions productive.

Second, getting rid of global state may not be a goal on its own, but avoiding exposing the native vector length definitely is, which is what this idea strives to achieve, and so far this is our best shot at that. Whether or not a program can infer global state is very important from the upstream spec's point of view, so we have to attempt to evaluate that - in some sense, that is required for moving up the stage ladder.

lemaitre commented 3 years ago

Right, I should stop replying to every part of a comment that I want to reply to, and keep the discussion focused.

So let me expose my point of view on this in a single block.

The goal of such a thread is to explore this new model to see if it can work in practice. I exposed multiple critical flaws that have not yet been addressed: the impossibility of having reductions, the impossibility of having loop-carried dependencies, the impossibility of having useful LUTs, the impossibility of handling multiple type sizes. Their criticality comes from the vast number of SIMD algorithms that require them. Among them: matrix product, dot product, prefix sum, stencils, most image and signal processing algorithms... Leaking the target VL into the runtime might be an issue, but it is a minor one. Nothing critical here.

However, "ISA hot-swapping" and off-loading are non-issues. They try to solve a problem we don't have (at least for the very large majority of us).

So we could either keep the runtime-constant VL visible and be able to implement most SIMD algorithms, or go the "actually, no target VL is visible" route, where many SIMD algorithms won't fit. The latter also comes with a more complex WASM engine.

Therefore, the critical issues I exposed must be addressed. Otherwise, this vector model will simply not be used.

As a side note, I believed that an SVE-like model was consensual here, and I explained all my arguments in light of this. This thread proved me wrong, so I will write an SVE-like model proposal in the future, but don't expect anything from me for the next week at least.

sunfishcode commented 3 years ago

Here are some more ideas that cover several topics: nested SIMD, scalar library routines, LUTs, different forms of reductions, and a complete prefix_sum example at the end. I hope this also illustrates a general approach to problem-solving within the vec_loop conceptual framework.

vec_preamble

The problem with computing a vector in advance of a vec_loop is that we don't know how long of a vector we'll need. But even in the SVE-style model, the user's algorithm has to be flexible enough to let the last loop iteration run with fewer elements (via masking). So, what if we added a way to constrain a vec_loop to having a monotonically decreasing length? Then, we could extend the vec_loop syntax with an explicit vec_preamble construct, in which the program could do one-time initialization, in a vector context, at the maximum width for the loop, before entering the loop.

Here's an example that adds 7 to each element of A and stores the result in C, using vec_preamble to hoist out the broadcast of 7 into a vector:

  (local $A i32) (local $C i32) (local $n i32)
  (local $t0 vec.32) (local $t1 vec.32) (local $t2 vec.32)
  (local $vl vec_len.32)
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_preamble
    (local.set $t1 (broadcast.i32 (i32.const 7)))
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements

    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t2 (vec_add $vl (local.get $t0) (local.get $t1)))
    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n) (local.get $vl)) ;; pass the count *and current length* back to the top
  end                ;; end vector loop

vec_preamble would be part of the vec_loop syntax, similar to how wasm's if and else are logically part of the same syntax.

vec_preamble could be used to hoist loop-invariant broadcast vectors, as in the example here. But that's pretty limited, since not all loop-invariant vectors are broadcasts, so that brings us to the next feature...

scalar_loop

The requirement that everything in vector contexts execute "in vector mode" is limiting. If we want to call scalar library functions in an otherwise vector loop, we want to call them once per lane, not once per vector loop trip, but that's not possible in a vec_loop. So what if we introduce a scalar_loop construct, which could be nested inside a vec_loop? It would run its body code in scalar, once per lane of the containing vec_loop. It'd have two new opcodes, vec_extract and vec_insert, to extract scalars out of vector values coming into the scalar_loop and insert scalar results into vector values flowing out, in intuitive ways, except that the element index is an opaque vec_len<n> value.

Here's an example that shows the syntax:

  (local $A i32) (local $B i32) (local $C i32) (local $n i32)
  (local $t0 vec.32) (local $t1 vec.32) (local $t2 vec.32)
  (local $vl vec_len.32)
  (local $index vec_len.32)
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements
    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t1 (vec_load $vl 32 (local.get $B)))
    scalar_loop.32 $vl $index
        ;; Extract the element at index $index from $t0 and $t1, do a scalar add, and insert the
        ;; result at index $index in $t2.
        (vec_insert $t2 $index (i32.add (vec_extract $t0 $index) (vec_extract $t1 $index)))
    end
    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $B (vec_step $vl 4 (local.get $B)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n) (local.get $vl)) ;; pass the count and current length back to the top
  end                ;; end vector loop

I "inlined" some local.get and local.set instructions for the sake of clarity here, but that could be changed.

In addition to calling scalar library routines in an otherwise vector loop, scalar_loop could also be a way to allow custom reductions. As I mentioned above, the main strategy for reductions is to add instructions which compute a scalar, given a vector. In pseudo-code, this might look like s += reduce_add.i32(v), which would add all the i32 elements of v to produce the scalar sum, and then do a scalar add of the result to compute the running sum. However, we might only provide the most common operators (add, mul, and, or, xor, perhaps), so if you want a custom reduction function, a scalar_loop could be one way to do it. It could run any scalar computation -- at scalar performance, to be sure, but it would allow the rest of the loop to run in vector.
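
As a reference for that pseudo-code, here's a minimal scalar sketch of what a reduce_add.i32 could compute; reduce_add_i32 and the lanes array are hypothetical stand-ins for the opcode and the vector value, not part of the proposal:

#include <stdint.h>
#include <stddef.h>

// Hypothetical reference semantics for reduce_add.i32: sum every lane,
// wrapping modulo 2^32 like wasm's i32.add.
uint32_t reduce_add_i32(const uint32_t* lanes, size_t vl) {
    uint32_t sum = 0;
    for (size_t i = 0; i < vl; ++i)
        sum += lanes[i];
    return sum;
}

A custom reduction via scalar_loop would replace the += with any scalar operation, one lane per step.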

And, scalar_loop is a way that nested SIMD could be added. All the awkwardness of nesting a vec_loop inside another vec_loop comes from the conceptual mixing of one vector context with another vector context. But with a vec_loop inside a scalar_loop inside a vec_loop, the inner vector context would be isolated from the outer vector context. We could then continue to nest as deeply as we want, alternating between vec_loop and scalar_loop (performance implications notwithstanding, but that will apply to any nested-SIMD design).

And, scalar_loop could be used within a vec_preamble, which would make vec_preamble more general, and allow users to compute arbitrary vectors, such as LUTs, in advance of vec_loop.
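
For instance, a preamble could fill one vector's worth of lanes from an arbitrary table before the loop starts. A minimal scalar sketch of the idea; fill_lut_lanes, table, and key are hypothetical names, and table is assumed to hold 256 entries:

#include <stdint.h>
#include <stddef.h>

// Hypothetical per-lane preamble computation: build a LUT vector that the
// following vec_loop could treat as a loop invariant.
void fill_lut_lanes(uint32_t* lanes, size_t vl,
                    const uint32_t* table, uint32_t key) {
    for (size_t i = 0; i < vl; ++i)
        lanes[i] = table[(key + i) & 0xFF]; // any scalar expression per lane
}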

p2_loop

scalar_loop isn't quite enough to handle the prefix_sum example above, but it's close. scalar_loop steps through linear indices, while the inner loop in prefix_sum above steps through power-of-two indices. If we added a p2_loop construct that stepped through power-of-two indices while staying in vector mode, I think that would be enough. Here's a pseudo-code translation of the prefix_sum example above:

void prefix_sum(uint32_t const* A, uint32_t* B, int n) {
  vec_region {
    // the preamble
    vec_t last = broadcast(0);
    // the loop
    vec_loop(vl; n) {
      vec_t a = vec_load(A);
      // in-register prefix-sum
      p2_loop(shift) {
        vec_t shifted = vec_element_shift(a, shift); // shift elements by "shift" amount
        a = vec_add<u32>(a, shifted);
      }
      // propagate previous sum
      a = vec_add<u32>(a, last);
      last = broadcast_last(a); // broadcast last element of a to all lanes
      vec_store(B, a);
      A = vec_step(A, 4, vl);
      B = vec_step(B, 4, vl);
      n = vec_step(n, -1, vl);
      continue(n, vl);
    }
  }
}

I don't know how widely applicable this p2_loop construct would be. Perhaps it's too specialized to be worthwhile overall. Or perhaps there's a way we could generalize it to handle more use cases.
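
To pin down what the power-of-two stepping buys, here's a scalar reference of the in-register prefix sum that p2_loop would express (a Hillis-Steele scan over one vector's worth of lanes); prefix_sum_lanes is a hypothetical name:

#include <stdint.h>
#include <stddef.h>

// After the passes with shift = 1, 2, 4, ..., a[i] holds the sum of a[0..i].
void prefix_sum_lanes(uint32_t* a, size_t vl) {
    for (size_t shift = 1; shift < vl; shift <<= 1)
        for (size_t i = vl; i-- > shift; ) // descend so a[i - shift] is still the pre-pass value
            a[i] += a[i - shift];
}

Each pass corresponds to one p2_loop iteration, and there are only log2(vl) of them, which is why staying in vector mode matters.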

penzn commented 3 years ago

@sampsyo has added examples of benchmarks in #5. @sunfishcode and @lemaitre feel free to take a look - it would be great to see what the new operations would help with, compared to this proposal's baseline.

@lemaitre - an SVE-like proposal would be very welcome.

lemaitre commented 3 years ago

@sunfishcode I finally found some time to look at your suggestion.

vec_preamble

It looks good. I'm not sure I understand your limitation "But that's pretty limited, since not all loop-invariant vectors are broadcasts", because you should (at least) be able to generate a vector where each element is its lane index, i.e. [0, 1, 2, 3...]. This would already be super useful.
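
(For reference, a scalar sketch of that lane-index vector, often called iota; iota_u32 is a hypothetical name:)

#include <stdint.h>
#include <stddef.h>

// Each lane i gets the value i: [0, 1, 2, 3, ...].
void iota_u32(uint32_t* lanes, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        lanes[i] = (uint32_t)i;
}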

scalar_loop

I think I understand what you tried to do here, but I fail to see how it decorrelates the context from the SIMD context set by vec_loop. Indeed, in your example, you access vectors of the length defined by the outer vec_loop. And if the scalar_loop does not decorrelate its context from the parent SIMD context, couldn't it be done with regular WASM loops (you would still need explicit insert/extract)?

Would it make more sense to say that accessing vector types from within a scalar_loop would somehow automatically be translated into insert/extract, in such a way that you could never access a vector, only its elements? So your example would look like this:

  (local $A i32) (local $B i32) (local $C i32) (local $n i32)
  (local $t0 vec.32) (local $t1 vec.32) (local $t2 vec.32)
  (local $vl vec_len.32)
  (local $index i32)
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements
    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t1 (vec_load $vl 32 (local.get $B)))

    scalar_loop.32 $vl $index
        ;; Any reference to $t0, $t1, and $t2 is automatically transformed into scalar access
        ;; internally, extract and insert instructions would be generated
        ;; $index designates the lane index currently being processed. It is a plain i32 that can be used as-is in computations.
        (local.set $t2 (i32.add (local.get $t0) (local.get $t1)))
    end

    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $B (vec_step $vl 4 (local.get $B)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n) (local.get $vl)) ;; pass the count and current length back to the top
  end                ;; end vector loop

However, I don't think that scalar_loop can be used to implement efficient reductions. Horizontal operations within a vector are "slow" on all architectures, and should therefore be avoided within a loop. Moreover, the fastest way to implement horizontal operations is a tree-based reduction that takes log2(length) steps, whereas a scalar_loop would take length steps.
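
A scalar sketch of that tree-based reduction, to make the step count concrete; tree_reduce_add is a hypothetical name and vl is assumed to be a nonzero power of two:

#include <stdint.h>
#include <stddef.h>

// log2(vl) passes: each pass folds the top half of the live region into
// the bottom half, leaving the total in acc[0].
uint32_t tree_reduce_add(uint32_t* acc, size_t vl) {
    for (size_t half = vl / 2; half > 0; half >>= 1)
        for (size_t i = 0; i < half; ++i)
            acc[i] += acc[i + half];
    return acc[0];
}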

The classical way to implement reductions is to have a vector accumulator, and reduce it at the end of the loop using tree-based reduction. This is currently not possible with your design because of the monotonically decreasing vector length: What should you do with the lanes from the last iteration that are now outside vec_len, but have not been reduced yet?

This also brings to the table that you most likely need a vec_epilog after the vec_loop. I think that the vec_region from the end of your post could handle that, though.

p2_loop

p2_loop would also be useful for reductions (in order to implement the tree-based reduction). However, I don't see the benefit of having a special construct for it. Wouldn't it be simpler to use a regular WASM loop to iterate over the powers of 2 up to $vl?

Concerning the prefix sum, the broadcast at each iteration makes it simpler to implement in your model than a plain old reduction. So I think we should focus more on reductions (see next part). Also, as a side note, the shift-based algorithm is not the fastest one on AVX512; the optimal shuffle patterns are much messier than that.

Partial loop carried dependencies

This one was not a part of your post, but I think that is the most important idea of your post:

So, what if we added a way to constrain a vec_loop to having a monotonically decreasing length?

I assume the idea is that from one iteration to the next, lanes can "disappear", but never "appear", right? It proves to be enough for some algorithms like having loop constants or prefix sums.

However, it is still not enough to implement an efficient reduction. Like I said earlier, reduction is too slow to be done at each and every iteration, so the proper way to do it is to have a vector accumulator that you reduce after the end of your loop. Here is some SVE code to illustrate this:

#include <arm_sve.h> // SVE ACLE intrinsics; assumes an SVE-enabled toolchain
#include <stdint.h>

uint32_t reduce_add(uint32_t const* A, int n) {
  // vec_preamble
  svuint32_t acc = svdup_n_u32(0);
  svuint32_t zero = svdup_n_u32(0);

  // vec_loop
  uint64_t vl = svcntw();
  for (int i = 0; i < n; i += vl) {
    svbool_t mask = svwhilelt_b32(i, n);
    svuint32_t a = svld1_u32(mask, &A[i]);
    acc = svadd_u32_m(mask, acc, a);
  }

  // vec_epilog
  // hardware reduction
  //uint32_t sum = (uint32_t)svaddv_u32(svptrue_b32(), acc);

  // software tree-based reduction
  for (uint64_t shift = 1; shift < vl; shift <<= 1) { // this loop could be unrolled on fixed-size ISAs
    svbool_t splice_mask = svwhilelt_b32(0, (int32_t)shift); // first `shift` lanes active
    svuint32_t a = svsplice_u32(splice_mask, zero, acc); // shift elements up by `shift` lanes, zero-filling the bottom
    acc = svadd_u32_m(svptrue_b32(), acc, a);
  }
  uint32_t sum = svlastb_u32(svptrue_b32(), acc); // the running total accumulates in the last lane
  return sum;
}

In this case, acc needs to retain all its elements, even when the iteration has fewer active elements. Otherwise, the reduction will miss some elements.

sparker-arm commented 2 years ago

Have there been any more thoughts on this? This kind of loop construct looks like a concise stepping stone from fixed to flexible, and also flexible from a compiler-strategy POV. The first proposal here is also similar to how we handle vector loops in Arm's MVE. I've just begun trying to implement flexible types in cranelift, likely targeting NEON first, and this loop proposal looks very appealing. As a side question, I'm not sure if I just failed to read the spec properly or if it hasn't been updated... but what are the rules around what 'flexible' means? Is there a minimum width (128 bits), with the minimum being a factor of any other supported size?

penzn commented 2 years ago

It looks like we have moved in the other direction: there is a PR (which needs changes) to introduce more vector-friendly behavior, as opposed to loop instructions: #27. The idea expressed in this issue solves the same problem, but at a much higher level.

Yeah, the minimum width is 128 bits; it is (sort of) implied by backwards compatibility with 128-bit SIMD, but I think we should make it explicit.

sunfishcode commented 2 years ago

To be clear, the proposal above does not have a 128-bit minimum or required factor. It intends for VMs to insert remainder loops or use masking as needed, and gives them the flexibility to do so, in whichever manner is best for the target architecture.

If there's interest in this proposal, I believe the various objections raised above can be answered. It does require significantly more complex language features, but it provides more flexibility to VMs.

sparker-arm commented 2 years ago

Exactly, it's the flexibility (portability) that I'm really interested in here. I'm thinking about this from purely a compiler/runtime engineer perspective and generating a remainder loop, for an architecture that doesn't support efficient masking, sounds much easier than converting a predicated vector loop into a non-predicated one along with a scalar remainder. As does unrolling on a core with multiple vector pipes. Getting performance portable SIMD is already hard enough without predication, and I'm not sure I'd want to be writing the wasm cost model for LLVM's vectorizer :)
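
To make the remainder-loop strategy concrete, here is a minimal sketch of the shape a runtime might emit for a mask-less 128-bit target; add_arrays and the hard-coded lane count are illustrative assumptions, not part of the proposal:

#include <stdint.h>
#include <stddef.h>

// Main loop runs whole 128-bit vectors (4 x u32); the inner loop stands in
// for a single SIMD add. The scalar tail handles the remainder, so no
// masking is ever needed.
void add_arrays(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t n) {
    const size_t lanes = 4;
    size_t i = 0;
    for (; i + lanes <= n; i += lanes)
        for (size_t l = 0; l < lanes; ++l)
            c[i + l] = a[i + l] + b[i + l];
    for (; i < n; ++i) // scalar remainder loop
        c[i] = a[i] + b[i];
}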

What is the expectation of how users will use fixed vs flexible vectors? As an autovec target, I'm presuming flexible would be used any time the trip count is unknown at compile time and I'm concerned that would mean that fixed SIMD is barely used, potentially reducing the performance of traditional SIMD engines to that of a scalar loop. Though, feature detection and loop versioning could solve this problem.

penzn commented 2 years ago

What is the expectation of how users will use fixed vs flexible vectors? As an autovec target, I'm presuming flexible would be used any time the trip count is unknown at compile time and I'm concerned that would mean that fixed SIMD is barely used, potentially reducing the performance of traditional SIMD engines to that of a scalar loop.

What do you mean by traditional SIMD engine? Flexible vector operations are meant to lower to regular SIMD instructions at the time the module is compiled by the engine.

TBH, maybe the spec is not clear on some of those things, and those of us who have been working on Wasm SIMD may take some of the features for granted, therefore your feedback would be very valuable.

Exactly, it's the flexibility (portability) that I'm really interested in here. I'm thinking about this from purely a compiler/runtime engineer perspective and generating a remainder loop, for an architecture that doesn't support efficient masking, sounds much easier than converting a predicated vector loop into a non-predicated one along with a scalar remainder. As does unrolling on a core with multiple vector pipes. Getting performance portable SIMD is already hard enough without predication, and I'm not sure I'd want to be writing the wasm cost model for LLVM's vectorizer :)

The tradeoff is in shifting work between producer and consumer. For vector-like operations it is on producer (toolchain that produces Wasm), while for first-class vector loops it is on consumer (the runtime). Historically we have been leaning towards making the producer do more work, the motivation is that the producer runs once when the module is built, but the consumer runs every time it is used (on the Web that usually means on every page view).

sparker-arm commented 2 years ago

What do you mean by traditional SIMD engine? Flexible vector operations are meant to lower to regular SIMD instructions at the time the module is compiled by the engine.

Sorry, I meant a physical CPU implementation.

The tradeoff is in shifting work between producer and consumer. For vector-like operations it is on producer (toolchain that produces Wasm), while for first-class vector loops it is on consumer (the runtime).

And this decision makes sense; my concern is the complexity, and associated runtime cost, that it will take to lower masked operations efficiently onto an architecture which doesn't support them natively. So both the producer and the consumer will have to work hard. On the other hand, it would be relatively easy for a runtime compiler to take a vec_loop and perform predication (if it deemed it beneficial).

Even calculating whether to use masked load/stores, in the target-specific compiler, is a non-trivial task. So, my gut feeling is it will be hard to effectively support, on the majority of current CPUs, via an agnostic ISA.

The fixed width proposal introduced instructions that could be mapped efficiently to the popular vector extensions, but the same cannot be said if we introduce explicit masking into wasm. I'd say it's too early to call NEON legacy when there isn't a consumer device that supports SVE and LLVM will, in some cases, still use NEON even when SVE is supported. This proposal doesn't look as powerful as introducing SVE-like operations, but it would enable a good subset of cases where we just want to vectorize and take advantage of the varying widths of each vector extension. That's why I see it as a good stepping stone for today's cores, before we get to being able to take advantage of a full-blown vector machine.

penzn commented 2 years ago

Even calculating whether to use masked load/stores, in the target-specific compiler, is a non-trivial task. So, my gut feeling is it will be hard to effectively support, on the majority of current CPUs, via an agnostic ISA.

Can you elaborate a little bit, what calculations about using masks are you referring to and how do they apply here? Is this purely about NEON/SSE vs SVE/AVX?

I'd say it's too early to call NEON legacy when there isn't a consumer device that supports SVE and LLVM will, in some cases, still use NEON even when SVE is supported.

To provide background, the point about SVE being a valid target came from @akirilov-arm (https://github.com/WebAssembly/flexible-vectors/pull/27#issuecomment-860899060):

Just to add a bit to the SVE market availability angle - Arm has already announced processor cores with SVE support that cover almost the full range of the A architecture profile; the latest announcement was 3 weeks ago and concerned the next generation mobile cores. While it is true that it is not possible to buy anything featuring them on the market right now, if the past provides any indication about the future, then we should expect actual products in at most a year.

Some clarity on this would be greatly appreciated 😉

BTW, supporting instructions beyond 128 bits has similar challenges on x86, where SSE also has no masks, and there is still some hardware that only has SSE. That said, I feel like there are two problems discussed here: efficient support for "advanced" ISAs (AVX*, SVE) and fallbacks for less advanced 128-bit ISAs like SSE and Neon.

sparker-arm commented 2 years ago

Can you elaborate a little bit, what calculations about using masks are you referring to?

Sure.

Again, from a purely auto-vec compiler engineer perspective, it takes some effort to model the costs of different vectorization strategies - including whether using the more advanced vector operations, such as masks and gather/scatter, would be beneficial. Here is the X86 backend implementation in LLVM for masked operations, and here is the costing for gather/scatter. It should be noted that, even though AVX provides these operations, it doesn't mean a compiler would use them regardless of the situation.

The backend implementation for AArch64 is significantly different, because NEON just can't do these things efficiently and SVE code generation isn't good enough yet.

Arm supports these features in microcontrollers, but again, a lot of effort has to go into making a reasonable decision about the real costs of using them.

Some clarity on this would be greatly appreciated

Yes, there have (finally!) been product announcements, but my point is that it will probably be 5 years until the majority of people browsing the web on an Arm device have SVE support. My assumption here is that it's a goal of wasm to support a common subset of user CPU features, but currently all phones, tablets, and most (Intel and Arm) chromebooks wouldn't meet this criterion. Again, feature detection would solve this problem if we're happy with the increased binary size.

I feel like there are two problems discussed here: efficient support for "advanced" ISAs (AVX*, SVE) and fallbacks for less advanced 128-bit ISAs like SSE and Neon.

Definitely. My suggestion would be to reduce the scope of the flexible proposal, focusing on how we can support wider vectors in a performance portable manner. Another future proposal could introduce the more advanced operations of masking and gather/scatter, when these are supported by the majority of devices. Or we could make the proposal dependent upon feature detection, but it feels like we could still decouple sizeless vectors from 'advanced' support.

sparker-arm commented 2 years ago

Just to follow up on this... I've been made aware that LLVM costs gathers/scatters so high that they're almost disabled for X86, unless AVX-512 is supported.

int X86TTIImpl::getGatherOverhead() const {
  // Some CPUs have more overhead for gather. The specified overhead is relative
  // to the Load operation. "2" is the number provided by Intel architects. This
  // parameter is used for cost estimation of Gather Op and comparison with
  // other alternatives.
  // TODO: Remove the explicit hasAVX512()?, That would mean we would only
  // enable gather with a -march.
  if (ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather()))
    return 2;

  return 1024;
}

int X86TTIImpl::getScatterOverhead() const {
  if (ST->hasAVX512())
    return 2;

  return 1024;
}

penzn commented 2 years ago

Again, from purely a auto-vec compiler engineer perspective it takes some effort to model of the costs of different vectorization strategies - including whether using the more advanced vector operations, such as masks and gather/scatter, would be beneficial.

Using the vector-loop approach would move vectorization logic into runtimes; I don't think that is desirable or in line with how WebAssembly has been approaching similar problems. Some mask support is desirable for ISAs that have them, mainly because there are comparison operations that return them. This does not mean that we have to unlock the full power of masked operations right away.

My suggestion would be to reduce the scope of the flexible proposal, focusing on how we can support wider vectors in a performance portable manner. Another future proposal could introduce the more advanced operations of masking and gather/scatter, when these are supported by the majority of devices.

I am not sure what you mean by that; I don't think there are any scatters or gathers among the operations listed in the spec (keep in mind that instructions in higher tiers are not 'in' until we can test them).

sparker-arm commented 2 years ago

Sorry, I have been conflating masked memory operations and gather/scatter, but I wanted to show that just because an architecture has some instructions, it doesn't mean it's a good idea to use them. My general observation of the wasm instruction proposals is that they show how an instruction maps to each target ISA, which is probably fine most of the time, but evidently not in more complicated cases, such as vector memory ops. A DSP engineer, with a good knowledge of the target (micro)architecture, is generally going to be much better at writing a kernel using assembly/intrinsics than what a compiler can manage, because it's still frustratingly difficult to evaluate a loop as a whole. And this is with target-specific information.

Some mask support is desirable for ISAs that have them, mainly because there are comparison operations that return them. This does not mean that we have to unlock the full power of masked operations right away.

Agreed. To reiterate, I only have an issue with masked memory instructions, but how do you intend to control the 'unlocking'? I have, naively, assumed that breaking the spec up is the only way to do this... Or do you plan to formalize the current different tiers at different times?

Using vector loop approach would move vectorization logic to runtimes, I don't think that is desirable or in line with how WebAssembly has been approaching similar problems.

But the wasm code here still represents a vectorized loop - it's just up to the target backend to select a width and decide what it wants to do to handle a remainder... A high-level vec loop structure makes this easy/fast and doesn't require vectorization, but it would likely need a pass to operate on the loop.

penzn commented 2 years ago

A DSP engineer, with a good knowledge of the target (micro)architecture, is generally going to be much better at writing a kernel using assembly/intrinsics than what a compiler can manage because it's still frustratingly difficult to evaluate a loop as a whole. And this is with target-specific information.

This has been the dilemma of Wasm performance extensions from the beginning - it is true that somebody with knowledge of microarchitecture would use intrinsics or write assembly, but the approaches on different architectures end up somewhat far from one another. There is a project, Highway, that shows that compromises are possible, and performance-critical code can be written in this manner.

To reiterate, I only have an issue with masked memory instructions, but how do you intend on controlling the 'unlocking'? I have, naively, assumed that breaking the spec up is the only way to do this... Or do you plan to formalize the current different tiers at different times?

The idea is to have a good vision about non-masking operations (or at least make headway) before trying to work masking ops out. The tiers (which don't include any masking ops at all at the moment) are meant to be collapsed into one eventually, with the features not making the cut becoming "future work".

A high-level vec loop structure makes this easy/fast and doesn't require vectorization, but it would it would likely need a pass to operate on the loop.

That is precisely what I would like to avoid adding to the runtime. It won't be just a simple instruction selection pass either, particularly because loops can be nested and multiple kinds of such constructs are necessary.