Hardware support and priorities

WebAssembly / flexible-vectors

Vector operations for WebAssembly

https://webassembly.github.io/flexible-vectors/

Other

48 stars 7 forks source link

Hardware support and priorities #7

Open penzn opened 4 years ago

penzn commented 4 years ago

Thanks for reaching out! I like the direction, IMO it's important to have unknown-length types for two reasons: 1) to allow using all of SSE4/AVX2/AVX-512 without source code changes; 2) to enable use of SVE/RiscV, as has been mentioned.

This blurs the line between "packed" and "flexible" - in both cases, the app doesn't know the width. The main difference with 2) is that width is not known until runtime.

If I understand a previous comment correctly ("flexible-width vector operations can be thought of as a way to provide compatibility between platforms that have different SIMD width"), that's also advocating 1).

AFAIK designing a proper "API" / "instruction set" for doing this that can be re-targetted for hardware supporting either packed or Cray (or both) vector types is an open research problem.

We're close to this now with Highway (thanks for linking it above) - the API supports packed vectors of app-unknown length. Extending to runtime-unknown could be done by switching from Descriptor::N (constant) to a function NumLanes(d) which returns svcnt*().

For 1), Highway shows that most operations can be defined so that they are efficient on both AVX2/512 and SVE. It might be surprising that shuffles/broadcasts operate independently on 128-bit parts. Unfortunately some differences are hard to bridge, e.g. u16->u32 promotion: SVE uses every other lane, whereas Intel/NEON use the lower half. Any ideas for that?

I'm a bit concerned about the use of set_vec_length/masks for handling the last lanes. This is great for SVE, but less so for packed SIMD. I agree scalar loops aren't awesome, but perhaps there is an alternative. Can apps be encouraged to pad their data such that reading and even writing up to some maximum width is fine? That would avoid the performance cliff from masking/loops.

However, I agree it can make sense to use smaller vectors than the maximum supported, e.g. no more than 8 lanes for 8x8 DCT. If set_vec_length is limited to powers of two and the runtime can also use smaller lengths, perhaps we're saying the same thing?

Originally posted by @jan-wassenberg in https://github.com/WebAssembly/flexible-vectors/issues/2#issuecomment-585077397

lemaitre commented 4 years ago

Thanks for reaching out! I like the direction, IMO it's important to have unknown-length types for two reasons:

to allow using all of SSE4/AVX2/AVX-512 without source code changes;

to enable use of SVE/RiscV, as has been mentioned.

First, I would like to highlight that 1 is not really about "without source code change" at the C level, but at the WASM level (which is probably not the source). Meaning: the ability for the same WASM code to execute as efficiently as possible on those architecture with different vector width.

Actually, if you look at SVE, it is also the same idea: the same machine code for machines with different vector width. AFAIU, SVE is not designed to support vector length change at runtime (at least in user code), so support for SVE actually fits within 1.

RiscV support changing the vector length at runtime, but I don't think it should be a priority. Especially because you can set the widest vector possible and just ignore that you can change back.

That being said, the mental model I have for flexible vectors (or long vector) is that the vector length is a runtime constant. Compilation to WASM should not rely on any specific vector length (apart maybe from being a multiple of 128, a power of 2 ?). But it can rely on the fact that it will not change during the complete execution of the program.

However, compilation from WASM to the executing architecture can rely on the actual vector length of the machine and do some optimizations on it (like constant folding, which is fairly easy to perform).

Unfortunately some differences are hard to bridge, e.g. u16->u32 promotion: SVE uses every other lane, whereas Intel/NEON use the lower half. Any ideas for that?

This is a tricky beast. But if you don't know what is the size of your vector, how do you know where is the first half and the second half ? My guess is that is why SVE chose to use every other lane. We can probably resort on using dual input/output and hide this implementation details. But this would still need further thinking.

I'm a bit concerned about the use of set_vec_length/masks for handling the last lanes. This is great for SVE, but less so for packed SIMD. I agree scalar loops aren't awesome, but perhaps there is an alternative.

The best way is to still have a remainder, but the remainder is with masked vectors so the impact should be minimal.

Here is some charts from my master thesis showing the difference between scalar remainder and masked remainder on a AVX2 machine: masked-remainder

You can clearly see a saw-tooth pattern, but the SIMD version (that uses masked remainder) has much lower teeth than the others that uses scalar remainder. BTW, the code is just computing the parameters of parabolas passing by some points in float (hence the 8 period).

So I think this pattern is enough in most cases, and it is implementable in legacy architectures. For this simple example, just a masked store is required.

Can apps be encouraged to pad their data such that reading and even writing up to some maximum width is fine? That would avoid the performance cliff from masking/loops.

I think that's a recommendation that is already sensible now even without WASM. However, if you don't know the vector length, how do you know by how much you need to pad ? How do you pass the information to the end compiler ? Also, it should not be mandatory for having vectorization as it might not be always possible.

However, I agree it can make sense to use smaller vectors than the maximum supported, e.g. no more than 8 lanes for 8x8 DCT. If set_vec_length is limited to powers of two and the runtime can also use smaller lengths, perhaps we're saying the same thing?

For those use cases I would much prefer sub-SIMD rather than set_vec_length. What I mean by "sub-SIMD" is having fixed-size SIMD operations executed on vectors. For instance, you could have 4x 128-bit SIMD within an AVX512 register and perform shufflles that do not cross 128-bit lanes. If your algorithm process data in blocks (like 8x8 DCT), you would just process more blocks in one go, but exactly in the same way. If you cannot process in blocks (or chunks), chances are that a fixed-size SIMD will not help you.

I think that's part of why AVX and AVX512 involve so many sub-SIMD instructions: to make it easier to port old code to it that are already written with 128-bits in mind. The other reason is because it was simpler for them to implement of course.

penzn commented 4 years ago

Thank you @jan-wassenberg for the write up. I did not mean to take this long to reply. I think this is a good start.

Can apps be encouraged to pad their data such that reading and even writing up to some maximum width is fine?

What is your vision - padding Wasm values, or having padding built into runtime's memory management?

In general, I think the biggest challenge is writing to memory - it is possible to read the entire register worth of data and ignore the lanes we don't care about, but there is not good way of writing our just part of a SIMD register (without masking support). Maybe writing via a temp would be acceptable (I suspect it might be expensive)?

@lemaitre

AFAIU, SVE is not designed to support vector length change at runtime (at least in user code)

As you point out in a different thread, it has predicates, which is provides even more flexibility than dynamic length (since you can toggle any lane on or off, not just "end" ones).Flex length might be easier to implement with ISAs that don't support any masking at all. Set length can be lowered to predicates, while the opposite direction is not quite possible.

lemaitre commented 4 years ago

I just continued the discussion about masks in #6 as I think this discussion is more closely related to that issue.

jan-wassenberg commented 4 years ago

Very interesting discussion here :D

@lemaitre

I would like to highlight that 1 is not really about "without source code change" at the C level, but at the WASM level (which is probably not the source). Meaning: the ability for the same WASM code to execute as efficiently as possible on those architecture with different vector width. Actually, if you look at SVE, it is also the same idea: the same machine code for machines with different vector width. AFAIU, SVE is not designed to support vector length change at runtime (at least in user code), so support for SVE actually fits within 1.

I understand your interpretation; please allow me to clarify. By 1 I meant: "non-hardcoded in user code" but compile-time-constant vector length. That's sufficient for SSE4 vs AVX2 but not SVE. 2 involves non-compile-time-constant vectors, but I agree it's best not to let them change during execution.

We can probably resort on using dual input/output and hide this implementation details. But this would still need further thinking.

Yes.. seems a hidden shuffle is a somewhat decent option.

Here is some charts from my master thesis showing the difference between scalar remainder and masked remainder on a AVX2 machine:

Wow, quite a difference. Maybe we should discuss in #6, but was there a requirement that only 7 floats be written? We can often allocate a bit more, write 8, and advance the pointer by the actual count.

if you don't know the vector length, how do you know by how much you need to pad ?

We can provide an API for that, right?

For those use cases I would much prefer sub-SIMD rather than set_vec_length. What I mean by "sub-SIMD" is having fixed-size SIMD operations executed on vectors. For instance, you could have 4x 128-bit SIMD within an AVX512 register and perform shufflles that do not cross 128-bit lanes.

I agree that most shuffles should not cross 128-bit lanes (to match AVX2 hardware). Unfortunately for the DCT8x8 example, our data layout is such that we can't just load multiple blocks of 8 (at least without gather). It also seems useful to allow scalar remainder handling with the same source code, i.e. a "vector" type with single lane. We also used to have a use-case for loading exactly two ints, which depended on the previous pair of ints. Does that make a reasonable case for <=128bit, power of two vector types in addition to the "max that hardware supports" type?

@penzn

What is your vision - padding Wasm values, or having padding built into runtime's memory management?

I think it could be enough to provide a special allocator for "aligned+padded array", but am not familiar enough with Wasm's memory model.

but there is not good way of writing our just part of a SIMD register (without masking support). Maybe writing via a temp would be acceptable (I suspect it might be expensive)?

Yes, tricky.. if the app really can't afford to overwrite, I suspect a "temp" would be reasonable - did you mean app loads from the intended store location, blends with the new data, and stores the full vector? That seems easier to implement than 16-bit masked stores and still efficient.

lemaitre commented 4 years ago

I understand your interpretation; please allow me to clarify. By 1 I meant: "non-hardcoded in user code" but compile-time-constant vector length. That's sufficient for SSE4 vs AVX2 but not SVE

I understand your point. But with what you have in mind, SSE4 and AVX2 should also be part of 2 because we want a single WASM code for different architectures with different vector length. It is not about the same C source code that is compiled with different compile-time constants. It is about the same WASM code where the compilation has already taken place. So as far as WASM is concerned, the vector length is unknown. It will be known only when the WASM code is translated to the target architecture in the end.

Wow, quite a difference. Maybe we should discuss in #6, but was there a requirement that only 7 floats be written? We can often allocate a bit more, write 8, and advance the pointer by the actual count.

Yes, we could in theory. But in practice, legacy and backward compatibility came in. It was simpler to just have masked writes in the end of this function.

if you don't know the vector length, how do you know by how much you need to pad ?

We can provide an API for that, right?

Yes, you're right, but I have the feeling it will reduce the range of such a solution (alignment will not be known at compile time).

I agree that most shuffles should not cross 128-bit lanes (to match AVX2 hardware).

That is not exactly what I said. I took 128-bit lanes only as an example. But I don't feel that 128-bit is any special (except from a legacy point of view). I would also propose 256-bit lanes if we add the requirement that vector length is greater or equal to 256 (which is still possible to implement in SSE4 by having 2 registers per vector, and can even be faster in some cases as it works like unroll&jam).

But that does not exclude the need for full width shuffles with variables indices. Those are "emulatable" on AVX using multiple shuffles and blends (a combination of vperm2f128, vpermilps and vblendvps), and are available on AVX2 and AVX512.

Unfortunately for the DCT8x8 example, our data layout is such that we can't just load multiple blocks of 8 (at least without gather).

Indeed, you do need a gather in this case, but this is easily done with multiple full-width loads and an in-register transposition for the 8x8 DCT. In fact, this in-register transposition might be faster than native gather as you can do full width loads. The larger the lanes, the faster the transposition (n log2(n) for transposing n n-wide vectors).

And you can use this transposition scheme to emulate more traditional gather and scatter on legacy architectures. So in the end, even those operations could be supported by legacy architectures.

Now the question about masked load/store:

Aligned masked loads smaller than cache size are trivial, just select after an unmasked load
Aligned masked loads larger than cache size are only relevant for SVE (and Risc V?) and there is a mechanism to handle faults on them
Unaligned masked loads: The problem is basically the same as for the previous point: we need a way to ignore faults when it occurs on inactive elements. This could probably be done with a signal handler that checks if the load is semantically non-faulting or not.
Masked stores: on x86, they are available since SSE2 (with no alignment requirements). On ARM Neon, it seems that no masked store exists, but we could emulate it with a bunch of stN_lane with a switch case (or better: a computed goto). Merging the last vector with the already existing data has 2 problems: it can segfault past the end (on inactive elements), and it can produce a race condition with a store on following data (inactive elements of the first).

The masked stores are then not really an issue as they should appear spuriously in the code, and are either supported (x86, SVE) or easily implementable (Neon).

Now, the unaligned masked load is a bigger problem than I thought as segfaults can occur on inactive elements. It is not supported on legacy architectures (SSE4, Neon), and there 2 way I can think of to emulate it:

scalar emulation: either some load lane or scalar load plus insert
signal handler: a custom signal handler could catch segfaults, verify if the faulting instruction is in the table of masked load, and ignore the fault if it appears after all active elements of the vector.

As segfaulting on masked load should be very rare, I think we can go for the signal handler solution. Such a signal handler would require a list the addresses of every single unaligned masked load in order to check if the load should be verified (and how). This seems to be rather complex, but we probably need this complexity anyway if we want to support some sort of first-faulting load like in SVE that is pretty much required if the vector is larger than a cache line, or unaligned.

That being said, we could also recommend to align data and use aligned masked loads. Having a slow-ish unaligned masked load might not be an issue in practice.

jan-wassenberg commented 4 years ago

It is not about the same C source code that is compiled with different compile-time constants. It is about the same WASM code where the compilation has already taken place. So as far as WASM is concerned, the vector length is unknown. It will be known only when the WASM code is translated to the target architecture in the end.

I understand and agree from the Wasm perspective. FYI my worldview is that of a library implementor that also provides a Wasm backend; for the other backends, we know at the time of compilation what the target is going to be.

I have the feeling it will reduce the range of such a solution (alignment will not be known at compile time).

I used to feel the same way - for example, it is convenient to have stack-allocated vectors. Unfortunately that isn't going to work with RiscV V - they have no meaningful upper bound on the vector size, thus all vector data needs to be dynamically allocated anyway. (I'm told that some actual hardware is aiming for 16K lanes, which will quickly overflow the stack)

But I don't feel that 128-bit is any special (except from a legacy point of view).

One might think the number 128 isn't special, but there is some significance because of the way architectures have been extended. All Intel instruction sets make it considerably more expensive to cross 128 bit blocks. SVE always provides multiples of 128 bit and guarantees exactly 128 bits are available. Is that what you mean by legacy? I do see value in being friendly to the hardware, although it's a bit of a leaky abstraction.

But that does not exclude the need for full width shuffles with variables indices.

Sure, those are sometimes necessary, but hopefully not the only primitive we'd provide.

In fact, this in-register transposition might be faster than native gather as you can do full width loads.

Good point, thanks for this suggestion, I've put this on the TODO for after we've frozen the codec.

we need a way to ignore faults when it occurs on inactive elements

Exactly, that's a concern. A wasm engine might get away with this, but for a library this is a very unattractive proposition because signals and SEH on Windows are global/owned by the app. On Windows, one misbehaving (injected!) DLL can swallow or misunderstand the 'signal' which we're relying on suppressing.

Masked stores: on x86, they are available since SSE2 (with no alignment requirements).

Unfortunately x86 is really restrictive here. Yes, there is a byte-granularity store for SSE2, but what about AVX2? There, we only have one for int32/64.

if we want to support some sort of first-faulting load like in SVE that is pretty much required if the vector is larger than a cache line, or unaligned.

I haven't yet understood why we'd want to support that. Can't we simply say "don't load/store if it's going to touch unmapped mem"? (From my perspective, any load/store at risk of faulting is by definition "remainder")

To summarize, we have established that masked store/load are actually surprisingly problematic, but can be made to work with worst-case scalar code. Let's take a step back and remember that the stated use case was handling remainders with masks. If it's anyway going to boil down to scalar, apps can do that already. And even better if they can pad and avoid all this complexity entirely?

lemaitre commented 4 years ago

All Intel instruction sets make it considerably more expensive to cross 128 bit blocks. SVE always provides multiples of 128 bit and guarantees exactly 128 bits are available. Is that what you mean by legacy? I do see value in being friendly to the hardware, although it's a bit of a leaky abstraction.

Yes, that falls into my "legacy" point of view. I do think it would be problematic if no 128-bit sub-SIMD instructions are provided. My point was: Is 128-bit the only sub-SIMD granularity we need/want?

Sure, those are sometimes necessary, but hopefully not the only primitive we'd provide.

Agreed.

In fact, this in-register transposition might be faster than native gather as you can do full width loads.

Good point, thanks for this suggestion, I've put this on the TODO for after we've frozen the codec.

Also, ldN/stN in Neon are emulatable with partial in-register transpoition and full-width loads on other platforms.

A wasm engine might get away with this, but for a library this is a very unattractive proposition because signals and SEH on Windows are global/owned by the app. On Windows, one misbehaving (injected!) DLL can swallow or misunderstand the 'signal' which we're relying on suppressing.

I don't know enough here. I just said that SEH can theoretically work in this case.

Masked stores: on x86, they are available since SSE2 (with no alignment requirements).

Unfortunately x86 is really restrictive here. Yes, there is a byte-granularity store for SSE2, but what about AVX2? There, we only have one for int32/64.

Right, I forgot that. But in that case (for int8/int16), we can just split the vector in half and do 2 masked stores. It should still be pretty fast (compared to alternatives).

I haven't yet understood why we'd want to support that. Can't we simply say "don't load/store if it's going to touch unmapped mem"? (From my perspective, any load/store at risk of faulting is by definition "remainder")

On SVE, First Faulting is useful when doing string processing where you don't know in advance where the end is. You can think of it as: for string processing, every iteration is a remainder one.

So you might try to load past the end data before you actually know it is past the end. With aligned loads, this is not problematic on most architectures as a single aligned load cannot cross a cache line boundary, and a fortiori not a page boundary. But if you have long vectors or unaligned loads, a single load might be able to cross a page boundary. This page boundary might be after the end of your actual data, so you need a way to have the load succeed if you detect a posteriori that your data ends before the fault.

So maybe we can support such semantics for aligned loads, and then it would be a noop on all architectures except SVE (and Risc V?). And completely forbids it for unaligned loads, at which point, unaligned masked loads will be as trivial as aligned masked loads. This would require to first align loads before the remainder. (or go with scalar emulation for the load)

I should probably mention that such "First Faulting Access" is a special kind of load in SVE, and regular loads (even unaligned) are not concerned by this and do segfault.

Stores are not concerned by such policy because we always know where is the end of our data before the store is performed, and can rely on proper masking.

To summarize, we have established that masked store/load are actually surprisingly problematic, but can be made to work with worst-case scalar code.

As I explained, masked stores are not really problematic. Masked loads can be problematic if we support unaligned masked load where fault on inactive elements are ignored. But this problem is solvable, and might also just be ignored.

If it's anyway going to boil down to scalar, apps can do that already.

In the worst case, we can emulate masked loads/stores in scalar, but that does not mean the whole remainder will be scalar. Only the masked loads/stores would be. To reformulate, the remainder can stay with vector types and operations even if masked/remainder loads/stores are implemented in scalar.

Also, most of those problems also exist with a set_vlen scheme. For example: a fault might appear after vlen but before max_vlen.

And even better if they can pad and avoid all this complexity entirely?

That should still be the preferred option, but the thing is: compiler are not allowed to do that (except on stack?). So we still have to provide a way if the user did not opt-in. And I strongly believe that masks are an efficient way to go.

jan-wassenberg commented 4 years ago

My point was: Is 128-bit the only sub-SIMD granularity we need/want?

Ah, thanks for clarifying. I think that's a good start but also believe vec64/32/16/8 types could be useful.

I don't know enough here. I just said that SEH can theoretically work in this case.

Agree it's possible, but I would not recommend engines get into that business :)

You can think of it as: for string processing, every iteration is a remainder one.

Ah yes. For the same reason, I'd advocate explicit string lengths instead if possible?

With aligned loads, this is not problematic on most architectures

Actually there's another consideration: we care about msan, and it will still complain about aligned loads unless the user explicitly arranges for padding (and initializes/unpoisons it beforehand).

To reformulate, the remainder can stay with vector types and operations even if masked/remainder loads/stores are implemented in scalar.

I understand and agree. The remaining difference in opinion is philosophical. Should we make remainder handling more efficient, or less efficient so that apps do less of it, leading to better overall performance?

Also, most of those problems also exist with a set_vlen scheme.

To be clear, I am also concerned about set_vlen and would prefer static types (besides the compile-time-unknown but runtime-invariant full hardware length).

That should still be the preferred option, but the thing is: compiler are not allowed to do that (except on stack?).

Yes, the compiler does it on the stack. Unfortunately it sometimes forgets alignment for its spills, leading to crashes. BTW do we have a requirement that flexible-vector code should be generated via autovectorization? If so, I suspect it's going to be suboptimal because it will have to make most stores to dynamic memory masked (just in case), which will be costly.

lemaitre commented 4 years ago

My point was: Is 128-bit the only sub-SIMD granularity we need/want?

Ah, thanks for clarifying. I think that's a good start but also believe vec64/32/16/8 types could be useful.

Just an extra clarification: here I talk about sub-SIMD and not smaller SIMD. Let me give you an example. A 512-bit vector can be viewed as 16x 32-bit integers, but it can also be seen as 4x 4x 32-bit integers. So if you have an operation that can be applied on 4x 32-bit integers, you can apply the same operations 4 times on the 4 separate and independent 4x 32 bit-integers. Intel shuffles are mostly that.

I'm not talking about considering only the low 4x 32-bit integer.

You can think of it as: for string processing, every iteration is a remainder one.

Ah yes. For the same reason, I'd advocate explicit string lengths instead if possible?

For sure, but that's not always possible: kernel interface uses null-terminated strings.

With aligned loads, this is not problematic on most architectures

Actually there's another consideration: we care about msan, and it will still complain about aligned loads unless the user explicitly arranges for padding (and initializes/unpoisons it beforehand).

What's "msan"? Some kind of memory sanitizer? At which level does it work? Because if it works at WASM level, then it will see the instruction is an aligned masked load. Only the generated code will have forgotten that the load is actually masked. If it works further down, then yes, you will have a problem.

Yes, the compiler does it on the stack. Unfortunately it sometimes forgets alignment for its spills, leading to crashes.

If that's the case, that's a compiler bug. The compiler should have all the information to use the correct alignment (even if it's unknown, it knows how to get it).

BTW do we have a requirement that flexible-vector code should be generated via autovectorization? If so, I suspect it's going to be suboptimal because it will have to make most stores to dynamic memory masked (just in case), which will be costly.

I see no reason to forbid autovectorization from outputing flexible vector code. What would be the point?

Such mask stores would mostly generate in remainders and branches (ie: conditional stores). For the loop body, there is no reason a compiler would prefer mask stores because all but the last iteration are full width.

However, and that's quite funny, compilers tend to not generate masked vector remainders but prefer scalar ones even when masked load/store are available. Why? No idea. Here is a godbolt link where you can see that only icc does generate a masked vector remainder.

jan-wassenberg commented 4 years ago

Just an extra clarification: here I talk about sub-SIMD and not smaller SIMD. Let me give you an example. A 512-bit vector can be viewed as 16x 32-bit integers, but it can also be seen as 4x 4x 32-bit integers.

I see. In that case: yes, I'm not aware of any block size imposed by hardware other than 128 bit.

For sure, but that's not always possible: kernel interface uses null-terminated strings.

Isn't that a very niche use case? It is actually problematic to use SIMD inside a kernel - registers need to be saved (2KiB for AVX-512) and possibly pre-emption disabled. I'm guessing that strings which are tied to legacy interfaces are pretty short (<200 bytes), so probably not worth that overhead?

What's "msan"? Some kind of memory sanitizer? At which level does it work?

Yes. It's apparently planned for Wasm (https://webassembly.org/docs/tooling/) and is deeply embedded into a compiler/code generator (instrumenting each load).

If that's the case, that's a compiler bug.

Yes. I have encountered over a dozen SIMD-related compiler bugs :/

I see no reason to forbid autovectorization from outputing flexible vector code.

Sure, there is no need to forbid it, but I also wouldn't pin any hopes on it or burden the API with any attempt to help autovectorizers.

For the loop body, there is no reason a compiler would prefer mask stores because all but the last iteration are full width. However, and that's quite funny, compilers tend to not generate masked vector remainders but prefer scalar ones even when masked load/store are available.

Thanks for sharing the Godbolt. Even in this simplest of cases, the GCC/ICC codegen is rather suboptimal. It's updating the mask on every iteration and llvm-mca seems to think this is slower than clang's scalar loop.

lemaitre commented 4 years ago

For sure, but that's not always possible: kernel interface uses null-terminated strings.

Isn't that a very niche use case? It is actually problematic to use SIMD inside a kernel - registers need to be saved (2KiB for AVX-512) and possibly pre-emption disabled. I'm guessing that strings which are tied to legacy interfaces are pretty short (<200 bytes), so probably not worth that overhead?

To be clear, here I am not talking about kernel code, but user code that needs to manipulate data before a kernel call. An example is working with filesystem paths.

What's "msan"? Some kind of memory sanitizer? At which level does it work?

Yes. It's apparently planned for Wasm (https://webassembly.org/docs/tooling/) and is deeply embedded into a compiler/code generator (instrumenting each load).

In that case, you're good because the tool will see the wasm instruction which will convey the mask information, even if the target instruction does not.

Yes. I have encountered over a dozen SIMD-related compiler bugs :/

That's unfortunate, but this should not be WASM concern. Only WASM compilers concern.

Sure, there is no need to forbid it, but I also wouldn't pin any hopes on it or burden the API with any attempt to help autovectorizers.

Except if this can also help developers using intrinsics.

Thanks for sharing the Godbolt. Even in this simplest of cases, the GCC/ICC codegen is rather suboptimal. It's updating the mask on every iteration and llvm-mca seems to think this is slower than clang's scalar loop.

Be careful here, llvm-mca does not handle loops. So the simpler the function, the better llvm-mca will think it is. But you can be pretty sure that ICC codegen is at least as efficient as CLANG, especially on a simple example like this one.

Here, ICC copied the loop body 3 times, 2 of them are masked and not unrolled. If I had to guess, I would say the first one ..B1.8: is the loop for unaligned data (no remainder required because it is all masked), the second ..B1.12: is the main loop body for aligned data with an unroll factor of 2 and no masking, and the third ..B1.16: is the masked remainder. The remainder in this case is a loop because the main loop body is unrolled.

By the way, GCC tries to process the remainder with 256-bit registers before falling back to scalar. So all-in-all, CLANG seems to generate the worst remainder of all three, and because of their 4-unrolling, the remainder has even more elements to process than for the other compilers.

Maratyszcza commented 4 years ago

I suggest that flexible vectors should target only AVX2, AVX512 & SVE-compatible processors. SSE and NEON already map well to SIMD128, and restricting flexible vectors extension to compatibility with these instruction sets would make many desirable features impossible or dramatically inefficient, e.g.:

Fused multiply-add
Partial loads and stores
Gather
Scatter

jan-wassenberg commented 4 years ago

@lemaitre

To be clear, here I am not talking about kernel code, but user code that needs to manipulate data before a kernel call. An example is working with filesystem paths.

Thanks for clarifying. If this is before the kernel (i.e. the interface that requires c-strings), then I believe explicit-length vectorization would still be possible e.g. via std::string(+c_str() for the kernel) or BSTR. To take a step back, I haven't yet seen any use cases where remainder handling is both unavoidable and time-critical.

In that case, you're good because the tool will see the wasm instruction which will convey the mask information

hm, it's not clear to me that the msan developers are willing to support those semantics. FYI we've run into several bugs where precisely this (either load/store of a partial vector, or propagating the poisoned status of only some of the lanes) was not correctly handled, leading to crashes.

Yes. I have encountered over a dozen SIMD-related compiler bugs :/

That's unfortunate, but this should not be WASM concern. Only WASM compilers concern.

We can also let it guide our thinking on what is more likely to be workable :)

Be careful here, llvm-mca does not handle loops. So the simpler the function, the better llvm-mca will think it is. But you can be pretty sure that ICC codegen is at least as efficient as CLANG, especially on a simple example like this one.

Oh, something interesting - I just tried to insert __asm volatile("# LLVM-MCA-BEGIN") so we analyze only the aligned part, but that prevents autovectorization. This is makes me even less inclined to trust autovectorization.

Also, I'm not sure about the "at least as efficient" unless adding truly is a bottleneck for the application. This looks like >300 bytes of code and 12 branches (which further reduce DSB capacity), versus half that for clang.

I suggest that flexible vectors should target only AVX2, AVX512 & SVE-compatible processors.

@Maratyszcza I agree that the partial vectors I was talking about don't necessarily belong here, they might be a better fit for v2 of SIMD128. Interesting question: do we want scatter even though it's not supported by AVX2?

Maratyszcza commented 4 years ago

If flexible SIMD is going to support arbitrary-length vectors, scatter is a must. Without it, much of the long vectors will go unused. Of course, on AVX2 scatter will be emulated, likely via VEXTR* instructions.

jan-wassenberg commented 4 years ago

I'm curious why scatter is essential? Highway is basically a long/flexible vector API successfully using avx-512, but without scatter. On GPU also, it was standard practice to convert scatter to gather and seemed to generally be acceptable?

lemaitre commented 4 years ago

@Maratyszcza

I suggest that flexible vectors should target only AVX2, AVX512 & SVE-compatible processors.

I agree that it would simplify many things, but if flexible vectors are not compatible with all WASM supported architectures (especially Neon for current smartphones), they will not be used much. I would have no problem with it being slower than fixed-sized SIMD on SSE4/Neon platforms, though.

@jan-wassenberg

To take a step back, I haven't yet seen any use cases where remainder handling is both unavoidable and time-critical.

Like I said in a previous post (don't remember which one), if you have a narrow matrix, like Mx6 of floats, and need to read it, every iteration is a remainder one. Here you could say you can just pad every single row to be a multiple a the SIMD cardinal, but it is super wasteful for AVX512 and larger where you would need almost 3x memory. You might want to keep the vector packed in order to stay longer in caches and maximize performance.

In that example, you would really need a masked store (or at least a storeOnlyN).

hm, it's not clear to me that the msan developers are willing to support those semantics.

If masked loads are standardized, msan will see masked load instructions. It would not need to guess that it is a masked load (even if it is implemented as an unmasked load behind the scene). Thus, it would just need to perform the check for active lanes (which can be read directly from the mask, because msan will know where the mask is). There is no need to keep track of poisoned lanes and such for that to work at the WASM level.

Or I misunderstood, and msan does not work at WASM level. In such a case, there is nothing you could really do.

We can also let it guide our thinking on what is more likely to be workable :)

In practice, though, compiler bugs appear in random places and are not more likely to affect SIMD generation than any thing else.

Oh, something interesting - I just tried to insert __asm volatile("# LLVM-MCA-BEGIN") so we analyze only the aligned part, but that prevents autovectorization. This is makes me even less inclined to trust autovectorization.

asm breaks vectorization on every single compiler because they do not know what the statement does, and a fortiori, how to vectorize it, even if the statement is a noop.

Also, I'm not sure about the "at least as efficient" unless adding truly is a bottleneck for the application. This looks like >300 bytes of code and 12 branches (which further reduce DSB capacity), versus half that for clang.

In this code, Clang call the scalar remainder for the last 63 elements in the worst case. That's huge! Branches will be more efficient in that case. If you are scared about the binary size of the remainder, the masked store can just be converted into a function call where the function will be shared with every other masked stores. So binary size here is not an issue.

Also, a storeOnlyN (either explicit, or deduced from the shape of the mask) can be implemented using a jump table (or better on ARM: a computed goto).

I'm curious why scatter is essential?

Some algorithms just cannot be implemented efficiently using gather. For instance, array packing and histograms.

On GPU also, it was standard practice to convert scatter to gather and seemed to generally be acceptable?

On most hardware, gather is faster than scatter, so it seems preferable to use gather than scatter when possible, even when scatter is available. We can make it an official recommendation (in the same way we would recommend to pad data for maximum performance).

Maratyszcza commented 4 years ago

if flexible vectors are not compatible with all WASM supported architectures (especially Neon for current smartphones), they will not be used much.

Developers would have to build two versions of the WAsm binary, for SIMD128 and for flexible vectors. I don't see it as a big problem, because developers who directly work with WAsm do it all the time anyway, as WAsm engines differ in which extensions they support. Besides, in the time it takes flexible vectors standard to get to the market (3 years for SIMD128), mobile hardware might as well get SVE support.

jan-wassenberg commented 4 years ago

@lemaitre

Here you could say you can just pad every single row to be a multiple a the SIMD cardinal, but it is super wasteful for AVX512 and larger where you would need almost 3x memory.

What if we padded only the final row? It is safe to load and ignore data from the next row. We can avoid writing more than a row via unaligned load of the previous contents, blend with the new lanes, and then store the whole vector. Would that not be a less complex/more portable way to meet your requirements?

Or I misunderstood, and msan does not work at WASM level. In such a case, there is nothing you could really do.

I do not know how msan will work, but "nothing we can do" is troubling because some projects would consider lack of msan a dealbreaker :)

asm breaks vectorization on every single compiler because they do not know what the statement does, and a fortiori, how to vectorize it, even if the statement is a noop.

Sure, it's understandable but I had quietly hoped that no-ops or at least llvm's own MCA syntax could have been tolerated.

In this code, Clang call the scalar remainder for the last 63 elements in the worst case. That's huge! Branches will be more efficient in that case.

Let's take a step back. We are talking about using long vectors, which might downclock the core for quite some time. If the number of elements is not much higher than 63, it is probably not worthwhile. If it is larger, why is 63 still huge?

Some algorithms just cannot be implemented efficiently using gather. For instance, array packing and histograms.

I'm curious to learn more about the array packing. Are we talking about something like concatenating several unaligned/non-padded arrays into one packed array? If so, I would have thought gather is sufficient. Histograms are indeed difficult but it seems not even scatter would be enough. Wouldn't we also need conflict detection, which AFAIK is not available on most platforms?

lemaitre commented 4 years ago

@Maratyszcza

Besides, in the time it takes flexible vectors standard to get to the market (3 years for SIMD128), mobile hardware might as well get SVE support.

I'm pessimistic here: I don't think there will be much customer hardware with SVE in a 3 year time frame. ^^'

@jan-wassenberg

It is safe to load and ignore data from the next row. We can avoid writing more than a row via unaligned load of the previous contents, blend with the new lanes, and then store the whole vector. Would that not be a less complex/more portable way to meet your requirements?

This can work only if the next row is not accessed by another context like another thread. The sequence "load, blend, store" is not atomic, so you can have race conditions. And you might want to have only a few contiguous rows per threads if you need load-balancing (eg: #pragma omp for schedule(dynamic)).

I do not know how msan will work, but "nothing we can do" is troubling because some projects would consider lack of msan a dealbreaker :)

The "nothing we can do" is only if msan works at a lower level than WASM. But as I don't know at which level msan works, I cannot say more.

Sure, it's understandable but I had quietly hoped that no-ops or at least llvm's own MCA syntax could have been tolerated.

Me too ;-)

We are talking about using long vectors, which might downclock the core for quite some time. If the number of elements is not much higher than 63, it is probably not worthwhile. If it is larger, why is 63 still huge?

That's where Amhdal's law bites hard. Let's assume you have to process 1023 floats with AVX512. Because the loop body is unrolled by 4, we can process 960 elements with the loop body. It remains 63 elements that are processed in scalar. 960 is much larger than 63, so you might assume that's fine, but you would still spend more than half the time processing the remainder because 960/16 = 60 < 63.

I'm curious to learn more about the array packing. Are we talking about something like concatenating several unaligned/non-padded arrays into one packed array? If so, I would have thought gather is sufficient.

Here I'm talking about removing elements from an array and making the remaining elements contiguous. Scatter would be simpler and faster than gather. Actually, I'm still not sure how to implement it with gather...

In AVX512, we could use compress instructions for that so my example might not be the best, but that's still a problem where scatter is better than gather, even though scatter instruction (or its emulation sequence) is slower.

Histograms are indeed difficult but it seems not even scatter would be enough. Wouldn't we also need conflict detection, which AFAIK is not available on most platforms?

For best performance histograms, yes, conflict detection is extra beneficial and I would love to see a WASM instruction to do it. But the thing is conflict detection can be emulated in SIMD while scatter emulation requires to go back to scalar. Also, scatter has more usages than conflict detection.

Conflict detection is available on both AVX512 and SVE.

All in all, my main point is not that problems are unsolvable without masked memory accesses, because most are actually solvable. My main point is: masked memory accesses make a lot of things easier and are not that bad performance-wise. In particular, I believe (no thoroughly tested) that they will most often be faster than full scalar emulation, even if the memory access is by itself scalar emulated. And some applications will need them anyway, so why not make their usage broader?

penzn commented 4 years ago

Alright, I started writing this yesterday, but it got really late and now the discussion has moved quite far 😄

I think there is no opposition to add length-agnostic variants of operations present in SIMD proposal - those typically have AVX* equivalents (I suspect SVE as well). I think this would make for a good first prototype. As soon as it is possible to run something measurable with the instruction set we should start measuring it - this would be the baseline for adding everything else in.

If flexible SIMD is going to support arbitrary-length vectors, scatter is a must.

Developers made do quite successfully without it on native hardware, why is it a must for Wasm?

I suggest that flexible vectors should target only AVX2, AVX512 & SVE-compatible processors.

Wasm does not support that - you can't really target one subset to one architecture and another - to the other. At the very least that is not supported for standardized operations, and there are no plans to change that. Developers build simd-enabled binaries in addition to non-enabled, because SIMD is a proposal and there are engines that might not support it.

Balancing performance is an open issue - I think we should strive for this proposal to be centered on AVX and SVE, and hopefully there would not be catastrophic performance issues on platforms without those.

lemaitre commented 4 years ago

Developers made do quite successfully without it on native hardware, why is it a must for Wasm?

IMHO, I think it is a must for flexible vectors (much less for WASM SIMD in general). If you know the size of your SIMD register, it's easier to find another route that does not use scatter and fits your algorithm (for example, in register transposition). But with flexible vectors, your options are limited, and it will be harder to find an efficient workaround (eg: full in-register transposition is not applicable).

Plus, workarounds will often decrease in efficiency with larger vectors. I think that's why scatter is part of both AVX512 and SVE.

penzn commented 4 years ago

@lemaitre, good point, thank you! I just opened #12 to track this.

jan-wassenberg commented 4 years ago

@lemaitre

This can work only if the next row is not accessed by another context like another thread. The sequence "load, blend, store" is not atomic, so you can have race conditions.

x86 ISA does not guarantee even aligned >8 byte SIMD loads are atomic (with the exception of lock cmpxchg), and explicitly mentions that crossing cache/page boundaries is not atomic. Thus this is also not a valid use case of write_only_n or masked_store :) I observe it seems very difficult to find an actual use case where it is important.

When multiple cores (and especially sockets, because their cache coherency may be more restrictive) are involved, it is even more important to align and pad. Even in this case of a narrow matrix, I'd think the application could arrange for each thread to have padding at the end of its range.

Let's assume you have to process 1023 floats with AVX512.

This seems to be several orders of magnitude too low. It takes 500 us i.e. > 1 million cycles to activate AVX-512, and several hundred thousand 'lost' cycles after the last instruction due to frequency throttling. But first we can ask - why isn't the app just padding, which would entirely avoid the remainder issue?

Here I'm talking about removing elements from an array and making the remaining elements contiguous. In AVX512, we could use compress instructions for that so my example might not be the best, but that's still a problem where scatter is better than gather

Thanks for clarifying. It is possible to emulate compress() using movmsk to index into a lookup table, and from there load a PSHUFB control mask that removes gaps, and finally increment pointers by popcnt(movmsk). I would be interested to see a benchmark that finds scatter to be faster than that :D

My main point is: masked memory accesses make a lot of things easier and are not that bad performance-wise. And some applications will need them anyway, so why not make their usage broader?

Several reasons to explicitly ban write_only_n or masked_store have been mentioned:

emulating masked loads for 16bit types or on AVX is likely to be slower than load+blend
providing these functions will lead to them being used more often than necessary
implementing for unaligned accesses may require risky and complex features such as signal handlers or SEH
risk of msan false positives
risk of exposing platform-dependent behavior or at least performance cliffs

We might agree to disagree on this. I understand you believe them to be potentially useful but IMO the downsides outweigh this.

lemaitre commented 4 years ago

x86 ISA does not guarantee even aligned >8 byte SIMD loads are atomic (with the exception of lock cmpxchg), and explicitly mentions that crossing cache/page boundaries is not atomic.

Yes, SIMD memory accesses are usually not atomic, but the architecture guarantees that masked out elements are not touch/accessed, so there cannot be race conditions on those. On all architectures (supporting masked accesses), it is valid that different threads access the same "elements" if there are active on at most one thread.

Even in this case of a narrow matrix, I'd think the application could arrange for each thread to have padding at the end of its range.

Yes, you could, but then your matrix would not have a fixed stride between rows, so you would need an extra mechanism to handle that (like an array of pointers, which would be wasteful for short rows).

Like I said, most problems are solvable without mask accesses, but they usually require more complexity for the developer, and depends strongly on the context of the application, whereas mask accesses are more like a solution to fit them all.

This seems to be several orders of magnitude too low. It takes 500 us i.e. > 1 million cycles to activate AVX-512, and several hundred thousand 'lost' cycles after the last instruction due to frequency throttling.

This is only relevant if you consider that your function is the only compute intensive one and is not called often. But most applications using SIMD for not so large data will most likely chain calls to functions that size for the whole computation to be done, and there would not be any warmup delay between calls.

But first we can ask - why isn't the app just padding, which would entirely avoid the remainder issue?

Because that would impose extra burden on the developer, where a simpler solution for them could have existed. And it might not always be possible to add this extra padding because of external constraint (legacy code is the strongest one).

It is possible to emulate compress() using movmsk to index into a lookup table, and from there load a PSHUFB control mask that removes gaps, and finally increment pointers by popcnt(movmsk). I would be interested to see a benchmark that finds scatter to be faster than that :D

I know that is the way to go on SSE, but there are 3 problems here:

not all architectures have a movemask equivalent
the LUT size increases exponentially with the number of elements per register
how do you define an interface for a movemask operation when you do not know the size of your registers? (and vectors might be wider than 64 elements)

Of course, here, the best would probably be to define such an instruction at WASM level, and let the engine choose the right way to implement it according to the target architecture, but you cannot do that for all operations, so there will always be some that need to be implemented in WASM, and where you cannot rely on usual implementations that are specific for a fixed SIMD width.

emulating masked loads for 16bit types or on AVX is likely to be slower than load+blend

I think there is a misunderstanding here. If you define at WASM level a masked load instruction, the final engine will still be able to generate an unmasked load with blend for you behind the scene. As long as the load is aligned (and smaller than a page), you will not have any issue with those, even on AVX and 8-bit elements, or Neon.

providing these functions will lead to them being used more often than necessary

This problem exist for all complex instructions even on native architectures. For instance, SSE3 define the hadd function to do a partial additive reduction of a vector. Many people uses it for the full reduction, and even worse, many people recommend it, even though it is slower than just shuffling.

Is this a problem? Not really because people falling into this trap usually do not require the fastest possible code and might live with a small slowdown. If they want a faster code, they will need to learn a lot more than just this trap and will discover it by themselves when they start digging into latency and throughput of instructions.

I think we could have the same here: provide the instructions, warn about their speed, and give some alternatives for those who really wants the fastest code possible and are willing to pay for the extra complexity.

implementing for unaligned accesses may require risky and complex features such as signal handlers or SEH

Signal handlers are just the most efficient way to implement it, not the only one. Emulation would still be pretty acceptable in most cases. Don't forget that on AVX, it is possible to split the vector in 2 and use SSE mask instructions for small types.

Also, I have the impression that WASM does not give the final application the possibility to set signal handlers, but already rely on them internally. Here it would just be an extra internal usage of them.

risk of msan false positives

Only if msan does not see the WASM instruction, which, I'm pretty sure, it does. If msan does not see WASM instructions, why is it a WASM tool?

risk of exposing platform-dependent behavior or at least performance cliffs

The performance cliff is a valid concern, but for me, it is acceptable to have a performance cliff if there would be no way for you to implement it faster on your own. And that would be the case for masked accesses.

Of course you would be able to be faster in some (most?) cases with padding, but your code would then not be functionally equivalent at the low level. We should still recommend developers to pad their data for maximum performance when possible, even if masked accesses are provided.

We might agree to disagree on this. I understand you believe them to be potentially useful but IMO the downsides outweigh this.

I would be fine with this.

Just to sum up my thoughts: I'm on the opposite side from you. All you workarounds to the "masking" problem would work in practice, but I really have the impression that altogether, they have a much bigger complexity than masked accesses. Also, I propose complexity on the engine whereas you propose complexity on the end-user code.

However, don't get me wrong, I do agree with your recommendations for high performance code. I just think that the choice should be in the hands of the end-developer, and not us. We should just provide tools (instructions) for them to make this choice.

lemaitre commented 4 years ago

I propose to move the part of the discussion about how to implement masked memory accesses to #13. We would keep in this thread the discussion about workarounds and if we actually need them.

jan-wassenberg commented 4 years ago

the architecture guarantees that masked out elements are not touch/accessed, so there cannot be race conditions on those.

I believe tools such as tsan will see this differently (and raise errors).

valid that different threads access the same "elements" if there are active on at most one thread.

It would be interesting to see how many RFO transactions there are as cores fight over the cache line :)

If they want a faster code, they will need to learn a lot more than just this trap and will discover it by themselves when they start digging into latency and throughput of instructions.

That sounds like unnecessary burden on the developer, which you seem to want to avoid. Would it be simpler to just not define inefficient operations that no one uses after digging into latency?

it is acceptable to have a performance cliff if there would be no way for you to implement it faster on your own.

As I understand it, the wasm128 development put great emphasis on avoiding performance cliffs. In #13, the current proposal is to implement masked_store as a scalar loop.

lemaitre commented 4 years ago

I believe tools such as tsan will see this differently (and raise errors).

For me, if such a tool does not have access to the mask (or ignores it), then the tool is flawed. And the only that they can have access to the mask is to provide a WASM instruction for it.

That sounds like unnecessary burden on the developer, which you seem to want to avoid. Would it be simpler to just not define inefficient operations that no one uses after digging into latency?

That would just impose this burden on everybody, even the ones that are not interested in maximal performance. Maybe I missed something in your view.

As I understand it, the wasm128 development put great emphasis on avoiding performance cliffs.

Performance cliff is a valid concern. Personally, I don't care much about performance cliff. What I care about is slowdown. If the use of an instruction slows down an application on some architecture, that's a problem. But the use of such instruction have a huge speedup on some platform, but on the other it is not particularly faster without being slower, then I'm fine, even though there is a performance cliff.

Also, that's why I opened #13, to explore what would be this performance cliff (if any) when we try to make as efficient as possible.

In #13, the current proposal is to implement masked_store as a scalar loop.

There is no proposal in #13. Only leads. And scalar emulation is only one of them.

Correct me if I'm wrong, but I have the impression that you tend to forget that WASM compilation is a 2-step compilation: C->WASM where the target architecture is not known, and then WASM->ASM where the target architecture is known. When I talk about scalar emulation, I talk about WASM->ASM translation where the target architecture is known, and where the engine can choose more efficient than scalar emulation if there is hardware support for it. I never said that scalar emulation would be for all architectures.

jan-wassenberg commented 4 years ago

That would just impose this burden on everybody, even the ones that are not interested in maximal performance.

I suppose this is a matter of philosophy, but isn't any user of flexible vectors (above and beyond wasm SIMD128) interested in maximal performance by definition?

Correct me if I'm wrong, but I have the impression that you tend to forget that WASM compilation is a 2-step compilation: C->WASM where the target architecture is not known, and then WASM->ASM where the target architecture is known.

It's true this is not the way I use SIMD most of them time, but in case of masked_store, I do not see how the wasm->asm step can safely emit load+blend+store if the application did not set aside enough room first (i.e. pad). As mentioned, the risks include hitting an unmapped/guard page, which is super expensive even if handled by signal/SEH, and triggering MSAN complaints. If the application did set aside enough room, then load+blend+store is safe and I'd be surprised to see a benchmark indicating it is slower on avx2 than "native" codegen.

lemaitre commented 4 years ago

isn't any user of flexible vectors (above and beyond wasm SIMD128) interested in maximal performance by definition?

High performance is not maximal performance. Besides, people always do trade-offs: for certain applications the development burden might not be worth the few extra percents in performance, while for others, the gain will be more substantial and might be considered worth the effort. What you say here is everybody will be willing to pay this price even for a few percents, while my position is to let the devs chose according to their needs.

I do not see how the wasm->asm step can safely emit load+blend+store if the application did not set aside enough room first (i.e. pad).

If we require that there must be enough room to emit a load+blend+store (ie: no SEH), then it would be up to the C->WASM compiler (or the dev) to ensure there is actually enough room in order to emit the masked store in the first place. The WASM->ASM engine would just assume there is enough room and blindly convert the masked store into a load+blend+store. It would be part of the contract of the masked store that all elements, even inactive, must be a valid memory location.

The thing is, WASM->ASM can emit a native mask store if the target architecture supports it. This is not possible if the emulation is done during C->WASM phase (whatever the emulation is).

As mentioned, the risks include hitting an unmapped/guard page, which is super expensive even if handled by signal/SEH, and triggering MSAN complaints.

Let's put MSAN aside for now. The cost of touching an unmapped page will be roughly as high as the cost of touching a mapped but unallocated page. But this cost is considered ok, otherwise, we would not have on-the-fly page allocation.

Moreover, crossing an allocation boundary will be very rare in practice, so paying a couple thousand cycles here would be acceptable.

The really cool thing about signals is that it does not have any overhead when it is not used. It means that if we rely on signal handling to deal with allocation boundary, and the user does pad their data to ensure all memory accesses are valid, their program will go at exactly the same speed as if there were no signal handler at all.

If the application did set aside enough room, then load+blend+store is safe and I'd be surprised to see a benchmark indicating it is slower on avx2 than "native" codegen.

If we have a SEH (or the access is aligned), we don't even have to know there is enough room. The price of the SEH would be pay only if there were actually not enough room.

nickik commented 4 years ago

I not expert enought technically but I would make two small points

I'm pessimistic here: I don't think there will be much customer hardware with SVE in a 3 year time frame. ^^'

I know I am taking this a little bit out of context, but the sentiment is still important, we should remember that things like C, POSIX and so on are 50 years old at this point. WebAssembly should be a 100 to 1000 year standard if it achieves is goal of being being fast and universal.

RISC-V is young, like WebAssembly, but its also the only truly open standard of all the things discussed here. There is a very strong alignment in values. Considering other Open Standards should always be a big concern for Open Standard developments, rather then making sure property standards are not suffering from minor issues.

I also think its important to consider, that even if RISC-V Vectors will not be in many consumer devices, like phones. RISC-V in IoT is looking to be really big, and vectors will be used. Vectors are often used to handle data streaming in form sensors.

Thanks all for the great work on WebAssembly!

programmerjake commented 3 years ago

One other ISA to consider is SimpleV, a WIP extension for OpenPower. It's designed to be a CPU/GPU hybrid ISA so it supports all the things like gather/scatter/per-element-predication/etc.

penzn commented 3 years ago

@programmerjake I have some experience with Power (though that was a little while ago), however this is my first introduction to SimpleV. Can you share a bit more about it? I am curious how does "GPU/CPU hybrid" aspect work. Is it an ISA to target both GPU and CPU on the same machine or it means there would be two versions of the core?

programmerjake commented 3 years ago

@programmerjake I have some experience with Power (though that was a little while ago), however this is my first introduction to SimpleV. Can you share a bit more about it? I am curious how does "GPU/CPU hybrid" aspect work. Is it an ISA to target both GPU and CPU on the same machine or it means there would be two versions of the core?

We (Libre-SOC) are building an open-source/libre SOC where each CPU core is also simultaneously a GPU core that supports Vulkan, by adding SimpleV and other ISA extensions for things like texture decoding, triangle rasterization, etc. The design has just one kind of core (rather than separate CPU and GPU cores), where that core is good at both CPU and GPU workloads, and where running GPU shaders is as simple as just calling the JIT-compiled code from Linux threads in the same process.

SimpleV works by adding a prefix that can be added to all scalar instructions to convert them to vector instructions, e.g.: The PowerPC instruction:

add %r5, %r10, %r15

can be prefixed (where a new 32-bit chunk gets inserted before the existing 32-bit machine code instruction) in order to get a predicated vector add instruction:

// (exact assembly syntax TBD)
sv.add pred=%r3, elwidth=16, %r5.v, %r10.s, %r15.v

which will add the vector of 16-bit integers stored in the 64-bit registers r15, r16, r17, and so on to the 16-bit scalar integer in r10 and the vector of 16-bit integer results will be stored in 64-bit registers r5, r6, r7, and so on with each 16-bit element being written if the corresponding bit (counting from the LSB) in the predicate in r3 is set. The length of the vectors is taken from the VL register, which can be set with the setvl or setvli instructions (like RVV, except that the MVL/MAXVL is an immediate set by the compiler instead of chosen by the cpu designer).

Simplified pseudo-code for the above sv.add instruction:

union IntRegs {
    // we have 128 (instead of 32) 64-bit integer registers (128 64-bit fp regs too, not shown here)
    uint64_t u64[128];
    // view regs as an array of 32-bit words
    uint32_t u32[128][2];
    // view regs as an array of 16-bit half-words
    uint16_t u16[128][4];
    // view regs as an array of bytes
    uint8_t u8[128][8];
} regs;
int VL; // limited to 0 <= VL <= 64

void sv_add() {
    // sv.add pred=%r3, elwidth=16, %r5.v, %r10.s, %r15.v
    const int SRC1 = 5, SRC2 = 10, DEST = 15, PRED = 3;
    for(int element = 0; element < VL; element++) {
        auto pred_mask = 1ULL << element;
        if(pred_mask & regs.u64[PRED]) {
            uint16_t src1 = regs.u16[SRC1][0]; // scalar -- don't index by element
            // can intentionally spill over into succeeding registers
            uint16_t src2 = regs.u16[SRC2][element]; // vector -- index by element
            uint16_t dest = src1 + src2;
            regs.u16[DEST][element] = dest;
        }
    }
}

This prefixing means any new scalar instructions can also be vectorized by simply adding a SimpleV prefix to their machine code encoding, rather than like most other ISAs that require a whole separate vector instruction set, doubling design effort and increasing complexity.

See the SimpleV overview and the draft spec. for the prefix encoding for more details.

Feel free to say hi on our IRC #libre-soc on freenode, or our mailing list libre-soc-dev if your interested.

colepoirier commented 3 years ago

The CPU is the GPU (as well as the video processing unit).

Luke gave an overview of the Libre-SOC project and SimpleV at FOSDEM 2021 that explains them much better than I can. As Jacob said, both are being developed for the OpenPower ISA.