WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/

Thoughts about long vectors #2

Closed lemaitre closed 4 years ago

lemaitre commented 6 years ago

Hello everyone,

Maybe it's a bit soon to start talking about long vectors while short vectors aren't specified yet, but I wanted to share my point of view on the subject.

Fixed-width SIMD would be a nightmare to handle portably, as not all architectures have the same SIMD width: for instance, AVX provides 256-bit registers, and AVX512 provides 512-bit registers. Moreover, ARM presented its new SIMD ISA a while ago: the Scalable Vector Extension (SVE).

The idea behind SVE is that the size of the registers is not known at compilation time and is hardware dependent: the same binary can run efficiently on different hardware with different SIMD widths. I think that is exactly the abstraction we need in WASM to handle long vectors. The SIMD width would be a constant that is not known at compile time (when the WASM code is generated), but is known when the VM runs the code.
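As an illustration, here is a minimal sketch of that model written with the ARM SVE C intrinsics (assuming a compiler providing arm_sve.h); the lane count svcntw() is a hardware constant that the program only learns at run time:

#include <arm_sve.h>

// a = b + c, written without knowing the register width:
// svcntw() (f32 lanes per register) is fixed by the hardware, not the binary.
void add(float *a, const float *b, const float *c, int n) {
  for (int i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);  // predicate: active while i + lane < n
    svfloat32_t vb = svld1_f32(pg, b + i);
    svfloat32_t vc = svld1_f32(pg, c + i);
    svst1_f32(pg, a + i, svadd_f32_x(pg, vb, vc));
  }
}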

What do you think?

jfbastien commented 6 years ago

See the distilled version of this discussion here: https://github.com/WebAssembly/design/blob/fc886f061e825d154aca910f9ab3b511b7fdf4ea/FutureFeatures.md#long-simd Earlier discussion here: https://github.com/WebAssembly/design/issues/41

lemaitre commented 6 years ago

I fail to see where a "dynamic" vector width has been discussed. Most of the discussions were way before SVE anyway.

So let me quickly answer the questions from the FutureFeatures document:

  • How will this model map onto popular modern SIMD hardware architectures?

Pretty well for most operations. Shuffling would not be shuffling per se but a "table lookup", yet it would be as simple/complex as the short SIMD shuffle operation. Gather/scatter should not be any different. Masking would require some thought but should be fine.

  • What is this model's relationship to other hardware parallelism features, such as GPUs and threads with shared memory?

GPUs should be pretty similar, but with very long vectors. Threads: applicable, but only for very long loops?

  • How will this model be used from higher-level programming languages? For example, the C++ committee is considering a wide variety of possible approaches; which of them might be supported by the model?

This is a hard bit. Auto-vectorization would be mostly fine (SVE compilers are able to handle this). But hand-written C (or C++) code with "dynamically" sized vectors would be difficult: the standard is not suited for that yet.

  • What is the relationship to the "short SIMD" API? "None" may be an acceptable answer, but it's something to think about.

In that case, none. But interfaces would/should be similar.

  • What nondeterminism does this model introduce into the overall platform?

None. At least no more than short SIMD. Sources of non-determinism: fast reciprocals, optionally fused add-mul, unspecified reduction orders... There might also be some alignment issues, and page-faulting accesses?

  • What happens when code uses long SIMD on a hardware platform which doesn't support it? Reasonable options may include emulating it without the benefit of hardware acceleration, or indicating a lack of support through feature tests.

Every single architecture would be able to support such a long SIMD interface, as the size could be one. Unless we set a lower bound on the size (like SVE); in that case, all the architectures supporting the minimum size.

This is my feeling; I answered quickly, without deep and long reasoning about the questions.

jfbastien commented 6 years ago

Most of the discussions were way before SVE anyway.

No, they were not.

lemaitre commented 6 years ago

Going through all the links you gave me (recursively), I only see discussions from 2015, which predate SVE, announced in August 2016. And they don't talk much about dynamic size.

Dynamic size is, IMO, the key feature a long SIMD API should have.

jfbastien commented 6 years ago

Going through all the links you gave me (recursively), I only see discussions from 2015. Which is before SVE that was announced in august 2016.

Yes, that is correct. It doesn't make my statement incorrect.

lemaitre commented 6 years ago

I don't get you, then... I said the discussions were before SVE, and you said no... Maybe a quoting issue? Could you rephrase to avoid any ambiguity?

eddyb commented 6 years ago

(drive-by mention) Before ARM SVE, there were Cray vectors, and RISC-V will probably get something in the vein of Cray/SVE vectors before (fixed-size) "packed SIMD".

Not sure this repo is the appropriate place to discuss non-fixed-size vectors, as the interesting implementations are quite different from "packed SIMD", and require some concept of "machine register of unknown/dynamic size" at compile-time (and in WASM's case, also in the bytecode).

eternaleye commented 6 years ago

@lemaitre I suspect @jfbastien is implying that, while the discussions were before the public announcement of SVE, there may have been parties to the conversation that were aware of the SVE work, and those could have/did find ways of raising the option without violating any agreements they might be bound by.

sunfishcode commented 6 years ago

Just to confirm, in discussions where I've been a participant, "long vectors" mean dynamic vector lengths. The classic example of this is the Cray family of vector processors. These ideas came to mind as we started thinking about how best to expose AVX2, AVX-512, and possibly GPUs, in a portable way, if possible. SVE is now another architecture that might benefit, but it isn't what's driving these conversations, in part because so few chips actually include it.

I see long vectors as being complementary to short vectors. That is, they both have things they can do better than the other (there is also overlap). As such, adding short vectors to wasm won't necessarily preclude adding long vectors in the future.

eddyb commented 6 years ago

I prefer "packed SIMD" vs "flexible/scalable/Cray vectors", in terms of nomenclature, because it makes the distinction clearer, IMO.

@gnzlbg and I also agreed that they could coexist in the Rust ecosystem, especially given that packed SIMD exists today in many architectures; and even though you could support a nicer programming model by implementing dynamic-length vectors on top of it, LLVM is barely beginning to get support for them, so we'd have to either write our own backend support or wait.

I was looking for a discussion on this for WASM because it "feels" like WASM could benefit from dynamic vector lengths in the bytecode, but if the tooling isn't ready to generate appropriate (e.g. SSE) native SIMD instructions, then it'd be all for naught.

gnzlbg commented 6 years ago

I was looking for a discussion on this for WASM because it "feels" like WASM could benefit from dynamic vector lengths in the bytecode, but if the tooling isn't ready to generate appropriate (e.g. SSE) native SIMD instructions, then it'd be all for naught.

There is not much that I can add here beyond that I agree with what @sunfishcode stated here (https://github.com/WebAssembly/simd/issues/29#issuecomment-414205285): variable-length vectors and packed vectors are orthogonal.

@rkruppe is implementing variable length vector support in LLVM and is aware of the similarities and differences between ARM SVE and RISC-V vectors. I don't know whether he has given some thought to WASM yet - the LLVM WASM backend is just starting to gain simd128 support.

lemaitre commented 6 years ago

I re-read some of the old topics about long vectors; while it is not clear that they are about "non-fixed" size vectors, it is true that this is mentioned (at least here for SIMD.js).

I never studied how Cray machines work, so I might be incorrect, but I have the impression they do not scale well with the complexity of the computation. Also, if such a design were used for WASM, I have the feeling that the web browser would need to be an auto-vectorizing compiler. And that's exactly what we want to avoid.

Regarding GPUs, it's true they are some kind of SIMD machine, but they definitely don't have a SIMD ISA: the parallelism is not exposed by the instructions. I don't think that's the way to go for WASM. We should have a SIMD ISA, which could then be compiled for GPUs.

I think the best way to have "long vectors" in WASM is to use the same idea as SVE: the vector width is constant on a given machine, but can change from one machine to another. So it's not really a dynamic size; the actual dynamism would come from masking.

The generation from such an ISA would be mostly trivial as it would fit directly into the target SIMD ISA. (at least for regular operations)

On the naming, I agree "long vectors" is not good. I'm not sure "scalable vectors" would be allowed, because of SVE. I'm not a big fan of "Cray vectors". But I do like "flexible vectors".

I also agree that short and long vectors are two different beasts and are actually orthogonal.


Now, let me introduce some straw-man syntax to continue the discussion further.

There is a new type: vec. It represents the biggest SIMD register available on the target platform. Its size depends on the target platform, and the target platform alone. (For a WASM->ASM compiler, a vec has a fixed size.)

We could require its size to be either a power of 2, or a multiple of 128. (The latter is probably more flexible, but the former probably makes WASM simpler to generate.)

Interpretation of the content:

Regular operations are the same as in packed SIMD: they apply lane-wise.

Extra operations:

Here is a correspondence table (only for f32, for simplicity):

WASM          SSE             AVX               AVX512            Neon           SVE
vec           __m128          __m256            __m512            float32x4_t    svfloat32_t
vec.size      128             256               512               128            8*svcntb()
vec-32.size   4               8                 16                4              svcntw()
vec-f32.add   _mm_add_ps()    _mm256_add_ps()   _mm512_add_ps()   vaddq_f32()    svadd_x()

You can see with this table how immediate the translation is for simple instructions.


But we need masking, and we need to be clever about it.

Let's introduce some mask types:

N represents the bit-width of the lanes being predicated. This part is crucial to have an efficient translation on architectures that don't have any actual type for masking (like SSE/AVX/Neon/Altivec). This is also why nomask is required as a distinct type.

The mask-xN type is also important for those architectures: masked loads/stores must honor it, but the rest of the operations can be left unmasked.

mask-mN, mask-zN and mask-xN have the exact same binary representation, but are different types in order to reduce the coding size of the instruction set (which might not be relevant; I don't know at this point).

Every single vector instruction would be masked, but the masks would live in a second stack. Masked vector instructions do not pop masks from this stack; they always use the top one. (In practice, the same mask is used for many instructions.)

Some extra instructions would be needed to take care of the mask stack.
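To see why distinguishing these mask flavors pays off on such architectures, here is a sketch of how a masked f32 add could be emulated with SSE4.1 intrinsics (the function names are mine, and each mask lane is assumed to be all-ones or all-zeros):

#include <smmintrin.h>

// mask-x ("don't care" lanes): the plain unmasked op is already valid
__m128 add_f32_x(__m128 a, __m128 b, __m128 m) { (void)m; return _mm_add_ps(a, b); }

// mask-z (inactive lanes zeroed): costs one extra AND
__m128 add_f32_z(__m128 a, __m128 b, __m128 m) { return _mm_and_ps(m, _mm_add_ps(a, b)); }

// mask-m (inactive lanes kept from the destination): costs one extra blend
__m128 add_f32_m(__m128 dst, __m128 a, __m128 b, __m128 m) {
  return _mm_blendv_ps(dst, _mm_add_ps(a, b), m);
}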


All in all, I really think that with this design, the WASM->ASM compilation for "long vectors" would be as simple as for "short vectors". Of course, generating the WASM would be far more difficult, but much effort is already being put into this anyway.

eddyb commented 6 years ago

Also, if such a design were used for WASM, I have the feeling that the web browser would need to be an autovectorizing compiler

The hard part of vectorizing code (i.e. finding work to do using vectors) would be done at compile time, when generating loops that use the WASM dynamic-length vector instructions.

The WASM runtime would "only" need to duplicate each loop body that uses such an instruction, and specialize for the "one element at a time" case when the number of elements remaining is less than what the native packed SIMD support allows. The original loop can "just" be compiled with a fixed number of elements to process at a time, based on the instructions using the vectors, and the packed SIMD constraints.


The rest of your comment goes into more detail that overlaps with what I've described above, but AFAIK it shouldn't matter to the WASM spec/generators whether the machine fixes a vector length or each vector-using loop has its own, and implementations can do either or hardcode vector length to 1.

That last part raises an interesting question: can we easily handle vectors by using the existing operations, with modified value type? Then existing integer/float ops work with vector length of 1. EDIT: wouldn't work because wasm doesn't have 8-bit and 16-bit integers.

lemaitre commented 6 years ago

The WASM runtime would "only" need to duplicate each loop body that uses such an instruction, and specialize for the "one element at a time" case when the number of elements remaining is less than what the native packed SIMD support allows.

That would be a trivial task only for trivial loops. How would you handle the WASM code equivalent to this:

int *A, *C;
#pragma simd
for (int i = 0; i < n; i++) {
  if (A[i] > 0) {
    C[A[i]] += 1;
  }
}

or this:

char *s;
int n;
#pragma simd
while (*s != 0) {
  n++;
  s++;
}

These require a vectorizing compiler. It's not just a matter of converting scalar operations into vector ones, so I don't see how having a special SIMD loop in WASM would apply to those examples.

The original loop can "just" be compiled with a fixed number of elements to process at a time, based on the instructions using the vectors, and the packed SIMD constraints.

The rest of your comment goes into more detail that overlaps with what I've described above, but AFAIK it shouldn't matter to the WASM spec/generators whether the machine fixes a vector length or each vector-using loop has its own, and implementations can do either or hardcode vector length to 1.

The problem is: I don't see how the loop approach can work on non-trivial code.

That last part raises an interesting question: can we easily handle vectors by using the existing operations, with modified value type? Then existing integer/float ops work with vector length of 1. EDIT: wouldn't work because wasm doesn't have 8-bit and 16-bit integers.

I see another problem: sizeof(float) != sizeof(double). So the scalar case would require different types for different inner types.

eddyb commented 6 years ago

These require a vectorizing compiler.

Yes, but that would only need to exist in the compilation to WASM, e.g. in LLVM.

For your second example, I imagine it'd look something like this, unvectorized:

start(n: i32, s: i32):
    v = load.i8 s
    c = icmp eq v, 0
    n1 = add n, 1
    s1 = add s, 1
    br c, exit(n), start(n1, s1)

exit(n: i32):
    ret n

At compile-time, before WASM, a vectorizing compiler could turn it into:

start(n: i32, s: i32):
    setvl MAXVL ; start as wide as possible
    v = vec.speculative.load.i8 s
    ; VL is now equal to the number of successful loads
    ; (e.g. if a fault happened at s + 23, VL = 23)

    exits = vec.icmp.i8 eq v, 0
    ; `exits` is a boolean/mask vector where the first non-0
    ; element is where the original loop would've exited.
    exit_count = vec.count_ones exits
    loop_count = vec.count_leading_zeros exits
    ; perform the effects of several loop iterations at once
    n_after = add n, loop_count
    s_after = add s, loop_count
    ; leave the loop if the original would've done so by now
    any_exit = icmp ne exit_count, 0
    br any_exit, exit(n_after), start(n_after, s_after)

exit(n: i32):
    ret n

And WASM would contain some similar operation set and equivalent semantics.

A WASM implementation can choose to:

  • have MAX_VL = 1 - that makes the vectorized version almost identical to the original
  • support the vector instructions natively through e.g. ARM SVE
  • lower them to packed SIMD, given an architecture that can speculate loads

That last option could theoretically end up lowering the WASM to this (given MAX_VL=32):

start(n: i32, s: i32):
    (v, mask) = simd.speculative.load.i8x32 s
    ; `mask` contains non-0 for successful loads

    c = simd.icmp.i8x32 eq v, 0
    exits = simd.and.boolx32 c, mask
    ; `exits` is a boolean/mask vector where the first non-0
    ; element is where the original loop would've exited.
    exit_count = simd.count_ones.boolx32 exits
    loop_count = simd.count_leading_zeros.boolx32 exits
    ; perform the effects of several loop iterations at once
    n_after = add n, loop_count
    s_after = add s, loop_count
    ; leave the loop if the original would've done so by now
    any_exit = icmp ne exit_count, 0
    br any_exit, exit(n_after), start(n_after, s_after)

exit(n: i32):
    ret n

Note that this didn't even need the "one element at a time" copy of the loop body that I was mentioning in my comment, because there are no "elements left to process" to pass to setvl.

If we take a slightly more contrived example, like strnlen, we get this at compile-time:

start(n: i32, remaining: i32, s: i32):
    c = icmp eq remaining, 0
    br c, exit(n), rest
rest:
    v = load.i8 s
    c1 = icmp eq v, 0
    n1 = add n, 1
    s1 = add s, 1
    remaining1 = sub remaining, 1
    br c1, exit(n), start(n1, remaining1, s1)

exit(n: i32):
    ret n

And some compile-time (somewhat mediocre) vectorization:

start(n: i32, remaining: i32, s: i32):
    setvl remaining ; don't go past the provided limit
    v = vec.speculative.load.i8 s
    ; VL is now equal to the number of successful loads
    ; (e.g. if 23 < remaining, and a fault happened at s + 23, VL = 23)

    exits = vec.icmp.i8 eq v, 0
    ; `exits` is a boolean/mask vector where the first non-0
    ; element is where the original loop would've exited.
    exit_count = vec.count_ones exits
    loop_count = vec.count_leading_zeros exits
    ; perform the effects of several loop iterations at once
    n_after = add n, loop_count
    s_after = add s, loop_count
    remaining_after = sub remaining, loop_count
    ; leave the loop if the original would've done so by now
    exit_count_nonzero = icmp ne exit_count, 0
    remaining_after_zero = icmp eq remaining_after, 0
    any_exit = or exit_count_nonzero, remaining_after_zero
    br any_exit, exit(n_after), start(n_after, remaining_after, s_after)

exit(n: i32):
    ret n

Now if we have that in some WASM encoding and want to replicate that last packed-SIMD-with-speculated-load experiment, but let's say we have no masked loads, we'd end up with:

start(n: i32, remaining: i32, s: i32):
    c = icmp ult remaining, 32
    br c, start_scalarized(n, remaining, s), rest
rest:
    (v, mask) = simd.speculative.load.i8x32 s
    ; `mask` contains non-0 for successful loads

    c1 = simd.icmp.i8x32 eq v, 0
    exits = simd.and.boolx32 c1, mask
    ; `exits` is a boolean/mask vector where the first non-0
    ; element is where the original loop would've exited.
    exit_count = simd.count_ones.boolx32 exits
    loop_count = simd.count_leading_zeros.boolx32 exits
    ; perform the effects of several loop iterations at once
    n_after = add n, loop_count
    s_after = add s, loop_count
    remaining_after = sub remaining, loop_count
    ; leave the loop if the original would've done so by now
    exit_count_nonzero = icmp ne exit_count, 0
    remaining_after_zero = icmp eq remaining_after, 0
    any_exit = or exit_count_nonzero, remaining_after_zero
    br any_exit, exit(n_after), start(n_after, remaining_after, s_after)

; duplicated subgraph of the original CFG, with MAXVL = 1
start_scalarized(n: i32, remaining: i32, s: i32):
    vl_0 = icmp eq remaining, 0
    br vl_0, start_scalarized_vl_0, start_scalarized_vl_1

; further duplication, for VL = 1 vs VL = 0
start_scalarized_vl_1:
    v = load.i8 s
    exits = icmp.i8 eq v, 0
    exit_count = cast exits as i32
    loop_count = sub 1, exit_count
    n_after = add n, loop_count
    s_after = add s, loop_count
    remaining_after = sub remaining, loop_count
    remaining_after_zero = icmp eq remaining_after, 0
    any_exit = or exits, remaining_after_zero
    br any_exit, exit(n_after), start_scalarized(n_after, remaining_after, s_after)

; this is weird only because of re-scalarizing previously-vectorized WASM
; (the original would go to `exit` directly, without `start_scalarized_vl_0`)
; while this can optimize to `goto exit`, I left some of it around for clarity
start_scalarized_vl_0:
    exit_count = 0
    loop_count = 0
    n_after = n
    s_after = s
    remaining_after = remaining ; 0
    exit_count_nonzero = false
    remaining_after_zero = icmp eq remaining_after, 0 ; true
    any_exit = or exit_count_nonzero, remaining_after_zero ; true
    br any_exit, exit(n_after), start_scalarized(n_after, remaining_after, s_after)

exit(n: i32):
    ret n

As you can see, it's certainly plausible, but not fun at all, and could generate inefficient code.

lemaitre commented 6 years ago

These require a vectorizing compiler.

Yes, but that would only need to exist in the compilation to WASM, e.g. in LLVM.

You didn't understand me, but that's fine as I realized I didn't understand you. When you said:

The WASM runtime would "only" need to duplicate each loop body that uses such an instruction, and specialize for the "one element at a time" case when the number of elements remaining is less than what the native packed SIMD support allows.

I had the impression you had something similar to this in mind. But apparently you didn't.

And WASM would contain some similar operation set and equivalent semantics.

A WASM implementation can choose to:

  • have MAX_VL = 1 - that makes the vectorized version almost identical to the original
  • support the vector instructions natively through e.g. ARM SVE
  • lower them to packed SIMD, given an architecture that can speculate loads

Unless vec-32 and vec-64 were actually different types (which is not the case in the packed SIMD proposal), MAX_VL = 1 would not be possible to encode, because sizeof(float) != sizeof(double).

There is no fixed-width SIMD architecture that needs speculative loads: as long as the accesses are aligned, a vector load touches at most one cache line.

Also, the translation to a fixed-width SIMD ISA would not be really different from the translation to SVE. Extra care would be needed for masking (hence my multiple mask types), but that's it.

I think setvl is a useless complication, as it can easily be emulated with masks. And I would avoid having global state that can change over time. That's why I think having the vector length be a platform-dependent constant is a good approach.
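For instance (a sketch with SVE intrinsics; my reading, not part of any proposal), the role of setvl can be played by a predicate whose first min(n - i, vector length) lanes are active:

#include <arm_sve.h>

// "setvl" without global state: a predicate bounding the current iteration.
// Every masked operation of the loop body then uses this predicate.
svbool_t remaining_lanes(int32_t i, int32_t n) {
  return svwhilelt_b32(i, n);
}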

As you can see, it's certainly plausible, but not fun at all, and could generate inefficient code.

I don't see where you are trying to go. The code you showed would definitely not be the one generated from the previous code, for any architecture. Why/how would you duplicate the loop body to do the WASM->ASM conversion? You just need to fall back on some instruction emulation if some instructions are not directly supported by the target. This is, most of the time, easier than what people expect.

baryluk commented 5 years ago

Vector types that are more semantics-oriented, allowing a loop to be specialized either to short or to wide vectors (depending on what the hardware provides), or that are flexible (i.e. hardware-dependent) like ARM SVE, would really make this nice to use, and make the code even more portable and future-proof, I think. Even if the hardware doesn't support scalable vectors, the compiler would know the widest available vector that can support all the used operations, and use it in the generated code.

The big piece of this is to abstract the width of a vector, i.e. provide reduce operations (FloatWidest.sum, min, etc.) and allow loops to be width-independent (i.e. i += Float32Widest.size(), which would evaluate to a constant determined at compile time); a sketch follows below.
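A sketch of what such a width-abstracted API could look like in C (every name here - float32widest_size, float32widest_load, and so on - is hypothetical, invented purely for illustration):

// a *= k, against a hypothetical width-abstracted API:
// float32widest_size() folds to a per-target compile-time constant.
void scale(float *a, unsigned n, float k) {
  unsigned step = float32widest_size();
  float32widest_t vk = float32widest_set1(k);
  unsigned i = 0;
  for (; i + step <= n; i += step)
    float32widest_store(a + i, float32widest_mul(float32widest_load(a + i), vk));
  for (; i < n; ++i)  // scalar remainder, or masks where available
    a[i] *= k;
}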

I think the design should really focus on AVX2/AVX512 and SVE in principle (and provide efficient fallbacks for "legacy" SIMD), even if these are not widely available yet. This will change dramatically in a few years, and it is important for the Web. So, for example, it should use masking extensively, conflict detection, etc.

I understand wide vectors and non-fixed vectors are two different things, but again, I think @lemaitre makes a lot of good points.

gnzlbg commented 5 years ago

Vector types that are more semantics-oriented, allowing a loop to be specialized either to short or to wide vectors (depending on what the hardware provides), or that are flexible (i.e. hardware-dependent) like ARM SVE, would really make this nice to use, and make the code even more portable and future-proof, I think.

Are vector types / APIs that work for both packed vector types and Cray vectors available in any programming language, hardware, etc.? AFAIK designing a proper "API" / "instruction set" for this that can be re-targeted to hardware supporting either packed or Cray (or both) vector types is an open research problem.

I think the design should really focus on AVX2/AVX512 and SVE in principle (and provide efficient fallbacks for "legacy" SIMD), even if these are not widely available yet. This will change dramatically in a few years, and it is important for the Web.

For all we know right now, ARM SVE might die next year, and RISC-V vector extensions might never become mainstream, or worse, they might release a slightly different future revision of those extensions that does become mainstream but is incompatible with anything that we can imagine right now. The simd128 proposal delivers value today for the hardware that we have today, and does not prevent us from doing something better in the future if those predictions end up becoming reality.

penzn commented 5 years ago

Looked at a bit differently, flexible-width vector operations can be thought of as a way to provide compatibility between platforms that have different SIMD widths, as long as the operations stay within the "common denominator" set. The supported vector length would vary based on the hardware you are running on, while the operations would stay the same.

I think it would also reduce the WASM instruction count for the same kernel, and probably improve performance, as memory bookkeeping code would move to the runtime. There is a relatively short post by Patterson and Waterman covering this for native code, but it applies here as well.

penzn commented 5 years ago

Disclaimer: I tried to write up a longer version, but it ended up maybe being a bit too long, so my apologies to those who try to read this.

In short, I think virtual vector operations can map to packed SIMD with some extensions, so hardware vector support is not needed to support this idea in WASM.

I'm going to try to illustrate this with element-wise addition of two vectors. There are more interesting examples, but this one is still representative of the challenges of producing efficient data-parallel code.

One can implement SIMD vector add like this (suppose simdload, simdstore, and simdfadd are the SIMD intrinsics):

// a = b + c
// LEN is the number of SIMD lanes
void add(float * a, float * b, float * c, unsigned sz) {
  unsigned i;
  for (i = 0; i < sz / LEN; ++i) {
    simdstore(a, simdfadd(simdload(b), simdload(c)));
    a += LEN;
    b += LEN;
    c += LEN;
  }
  for (i = 0; i < sz % LEN; ++i) {
    *a = *b + *c;
    ++a;
    ++b;
    ++c;
  }
}

This lowers to hardware SIMD instructions that, in WASM's case, represent a portable subset of the major architectures. Note the second, scalar, loop - it seems like not much, but it can have detrimental effects on performance if the function is called repeatedly.

Flexible-width vectors without partial-width operations would work similarly to packed SIMD, but with the vector size set at runtime. Let's say enable_vec_ops turns on vector support and returns the vector length, and vec... are the operations:

// a = b + c
// Use runtime-defined vector size, without partial-width vector operations
void add(float * a, float * b, float * c, unsigned sz) {
  unsigned i;
  unsigned len = enable_vec_ops();
  for (i = 0; i < sz / len; ++i) {
    vecstore(a, vecfadd(vecload(b), vecload(c)));
    a += len;
    b += len;
    c += len;
  }
  for (i = 0; i < sz % len; ++i) {
    *a = *b + *c;
    ++a;
    ++b;
    ++c;
  }
}

This would map to hardware operations within an "extended" portable subset - the same lane-wise operations, but the width can differ between platforms. Virtual instructions would lower to packed SIMD hardware instructions. The benefit of this approach is limited to enabling longer operations on platforms that support them without degrading performance on platforms that don't. The scalar loop is still there, for example.

What is meant by 'flexible-width' vector instructions would usually include a mechanism to eliminate the second loop by tucking it into a computation with fewer lanes. Suppose set_vec_length enables the subset and sets the vector length, and vec... are the operations:

// a = b + c
// Maximum vector size is a runtime constant,
// set vector size dynamically between 1 element and that value
void add(float * a, float * b, float * c, unsigned sz) {
  while (sz > 0) {
    unsigned count = sz;
    // Set width to min of count and supported width;
    // the result is written back to `count`
    set_vec_length(&count);
    vecstore(a, vecfadd(vecload(b), vecload(c)));
    sz -= count;
    a += count;
    b += count;
    c += count;
  }
}

Vector instructions here should lower to packed SIMD or to vector hardware instructions (if the latter are available). This approach should reduce both the hardware instruction count and the WASM instruction count, and reduce the amount of branching in WASM. I think it is the more beneficial approach, as it would allow simplifying code generation and reduce instruction count even if it failed to implement a partial-width operation in one pass. The gotcha is that some platforms support SIMD operation masking and some don't - in the worst case, masking would be emulated using a scalar loop, which is equivalent to the current situation (see the sketch below).
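For concreteness, here is a sketch of how the set_vec_length pattern above could lower to AVX2, which has masked 32-bit loads/stores but no vector-length register (8 f32 lanes; the tail_mask helper is mine, not a proposed instruction):

#include <immintrin.h>

// Mask with lanes [0, count) active (count <= 8).
static __m256i tail_mask(unsigned count) {
  const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  return _mm256_cmpgt_epi32(_mm256_set1_epi32((int)count), lane);
}

// a = b + c, with the partial last iteration handled by masking.
void add(float *a, const float *b, const float *c, unsigned sz) {
  while (sz > 0) {
    unsigned count = sz < 8 ? sz : 8;   // plays the role of set_vec_length
    __m256i m = tail_mask(count);
    __m256 vb = _mm256_maskload_ps(b, m);
    __m256 vc = _mm256_maskload_ps(c, m);
    _mm256_maskstore_ps(a, m, _mm256_add_ps(vb, vc));
    sz -= count; a += count; b += count; c += count;
  }
}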

baryluk commented 5 years ago

@penzn Yes, essentially something like this.

However, in reality the last loop will probably be done using masks, and the first iteration of the first loop will also be done using masks, to account for possible misalignment. The count or len will usually be known by the compiler only at runtime, and will affect machine-code specialization. For SVE / SVE2, it might even end up as a non-constant at runtime, fully known only by the hardware (the compiler can find it out, but doesn't need to).

Worth reading:

https://arxiv.org/pdf/1803.06185.pdf
https://www.fujitsu.com/global/Images/armv8-a-scalable-vector-extension-for-post-k.pdf
https://static.sched.com/hosted_files/bkk19/3c/BKK19-202_New-Technologies-in-Arm-Architecture.pdf (slides 13-15)

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4008.pdf

GCC does support it reasonably well. LLVM support is out of tree at the moment, pending merging.

penzn commented 4 years ago

To see what adding this to Wasm might look like, I've put together a slide deck (source).

There are a few projects providing flexible SIMD to native applications, for example Highway. It shows how runtime resolution of SIMD width may look.

penzn commented 4 years ago

@jan-wassenberg, what do you think about this?

jan-wassenberg commented 4 years ago

Thanks for reaching out! I like the direction. IMO it's important to have unknown-length types for two reasons: 1) to allow using all of SSE4/AVX2/AVX-512 without source code changes; 2) to enable use of SVE/RISC-V, as has been mentioned.

This blurs the line between "packed" and "flexible" - in both cases, the app doesn't know the width. The main difference with 2) is that width is not known until runtime.

If I understand a previous comment correctly ("flexible-width vector operations can be thought of as a way to provide compatibility between platforms that have different SIMD width"), that's also advocating 1).

AFAIK designing a proper "API" / "instruction set" for this that can be re-targeted to hardware supporting either packed or Cray (or both) vector types is an open research problem.

We're close to this now with Highway (thanks for linking it above) - the API supports packed vectors of app-unknown length. Extending to runtime-unknown lengths could be done by switching from Descriptor::N (a constant) to a function NumLanes(d) which returns svcnt*().

For 1), Highway shows that most operations can be defined so that they are efficient on both AVX2/512 and SVE. It might be surprising that shuffles/broadcasts operate independently on 128-bit parts. Unfortunately some differences are hard to bridge, e.g. u16->u32 promotion: SVE uses every other lane, whereas Intel/NEON use the lower half. Any ideas for that?

I'm a bit concerned about the use of set_vec_length/masks for handling the last lanes. This is great for SVE, but less so for packed SIMD. I agree scalar loops aren't awesome, but perhaps there is an alternative: can apps be encouraged to pad their data such that reading, and even writing, up to some maximum width is fine? That would avoid the performance cliff from masking/loops.

However, I agree it can make sense to use smaller vectors than the maximum supported, e.g. no more than 8 lanes for an 8x8 DCT. If set_vec_length is limited to powers of two and the runtime can also use smaller lengths, perhaps we're saying the same thing?

penzn commented 4 years ago

Thanks @jan-wassenberg. This is a pretty good description of the challenges going forward; closing this issue in favor of #7.