WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

wasm_simd128.h intrinsic naming #342

Open tlively opened 4 years ago

tlively commented 4 years ago

Currently the intrinsic header uses a different naming convention than the underlying SIMD instructions. The intrinsic name can be determined from the instruction name by prepending wasm_, turning the . into an _, and removing any signedness suffix in favor of representing unsigned interpretations by using a u instead of an i in the initial interpretation specifier.

For example i32x4.lt_u becomes wasm_u32x4_lt and i32x4.lt_s becomes wasm_i32x4_lt.

I don't think we ever discussed this naming convention as a group, so with the recent instruction renamings, now is a good time to make sure there are no better alternatives before everything stabilizes.

One possibility, for example, would be to get rid of the u in the lane interpretations (e.g. u32x4) and use the same _s and _u suffixes as the instructions do. What other ideas do we have? Or do we like the current scheme?

cc @zeux.

Maratyszcza commented 4 years ago

I like the current naming conventions, and have plenty of code written against the current version of wasm_simd128.h. So if you do change them, please keep the old names (at least for a while) for backward compatibility.

zeux commented 4 years ago

I also like the current conventions just fine, but would also be happy with an explicit _s suffix. However, note that the intrinsics also include prefixes like v8x16 that are being changed in the instruction naming. So it would be great to understand the full set of differences between the current names and the algorithm outlined in this issue once the renames take place, because I think it's now more complicated than just "remove signed suffix".

tlively commented 4 years ago

Oh yeah, I guess I was implicitly assuming that we should update the v8x16 prefixes. I can keep the old names around for a bit for those as well, as @Maratyszcza suggested.

nemequ commented 3 years ago

Is the "flexible" SIMD stuff a vector-length-agnostic proposal (similar to SVE or the RISC-V V spec)?

If that's the case and there will never be explicit 256-bit (or anything else) vectors in WASM SIMD (which I think is the right decision), one possibility would be to drop the lane count. For example, i32x4.lt_u would become wasm_u32_lt and i32x4.lt_s would become wasm_i32_lt. I've been thinking about switching WAV to do that, but held off mainly because I was worried about future 256-bit vectors.

One possible wrinkle here is that everything is in the wasm_ namespace instead of something dedicated to WASM SIMD, which could confuse people. Personally, I'd like to see everything moved into a different namespace (maybe wasmv or wv)…

penzn commented 3 years ago

Yes, so far there is no intention to provide 256-bit fixed-width operations.

This is a good question - really it depends on interop between the two and the requirements of defining user-facing intrinsics. Native intrinsics typically have different versions for different ISA extensions, even if they do the same thing; I am not sure what path Wasm would take here.

BTW, if you are curious about how the flexible vectors proposal would handle some of this, we should definitely talk about it in that repository.

nemequ commented 3 years ago

I hadn't really considered the possibility of using the same API for flexible vectors, but I think there are some obstacles which would make that difficult. For example, you probably want an additional parameter on most functions for the predicate (unless you want to get rid of the predicate in the API). I'd prefer to see a separate namespace for flexible vectors, but having (for example) wasm_i32x4_add for 128-bit and wasm_i32_add for flexible could work.

My concern was more about the possibility of future functions like v256_t wasm_i32x8_add(v256_t, v256_t). IMHO that would be a mistake; only x86 has fixed 256-bit vectors, everyone else seems to be moving directly from 128-bit to flexible. I think it would make sense for WASM to do the same unless it really wants what would effectively be an x86-only API.

If the flexible SIMD would use a different namespace and there isn't going to be a fixed 256-bit SIMD extension, I think dropping the lane counts from the API would make code using it a lot more readable (and easier to write, but I care less about that). Otherwise it could still be done by overloading the functions, which I would be good with but others may find distasteful…

BTW, if you are curious about how the flexible vectors proposal would handle some of this, we should definitely talk about it in that repository.

I am very interested in the C/C++ side, yes. I'll watch the repo.

penzn commented 3 years ago

I'd say flexible SIMD is highly likely to use a different namespace - that is how things usually go with the toolchain.

My concern was more about the possibility of future functions like v256_t wasm_i32x8_add(v256_t, v256_t). IMHO that would be a mistake; only x86 has fixed 256-bit vectors, everyone else seems to be moving directly from 128-bit to flexible. I think it would make sense for WASM to do the same unless it really wants what would effectively be an x86-only API.

If the flexible SIMD would use a different namespace and there isn't going to be a fixed 256-bit SIMD extension, I think dropping the lane counts from the API would make code using it a lot more readable (and easier to write, but I care less about that).

You are right - adding them would be beneficial for x86 at the expense of everybody else, which does not make a lot of sense. To the best of my knowledge there are no proposals to introduce 256-bit ops as of today.

BTW, flexible vectors take an approach somewhat in the same direction as WAV - there are types for lanes, instead of a generic 'vector' type. Also, manipulating the lane count (which a few people would like to see gone) is done via a separate instruction, rather than through an extra parameter, which means that the core API would not need a lane count argument.

nemequ commented 3 years ago

I just added a lot of tests to the tests branch of WAV to verify that everything compiles down to exactly what we expect. Functions which don't generate the expected result have been annotated in wav.h (in the main branch) as requiring unimplemented-simd128 (grep for "_UNIMPLEMENTED"). I think this should serve as a pretty good list for you, though it doesn't incorporate your change from yesterday.

The exception is the const functions, which generate splat + replace_lane.

One other thing I noticed is that there are 4 different builtins for any_true, but the spec only has a single variant (which makes sense).

Also, if you mask the count on the shift functions in C (i.e., vec << (count & 15)) you'll see the mask operation in the output even though the shifts (theoretically) do this internally.
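The redundancy can be sketched with a scalar model in plain C (an illustration of the spec's shift semantics, not the actual intrinsic): the wasm shift instructions already take the count modulo the lane width, so an explicit source-level mask repeats work the instruction does anyway.

```c
#include <stdint.h>

/* Scalar model of i16x8.shl semantics: the instruction takes the shift
   count modulo the lane width (16), so writing `vec << (count & 15)`
   in source masks a value the instruction would have masked anyway. */
static uint16_t shl16_lane_model(uint16_t lane, int count) {
    return (uint16_t)(lane << (count & 15));
}
```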

tlively commented 3 years ago

@nemequ, I landed a patch in LLVM yesterday to remove the unimplemented-simd128 target feature and the implicit definitions of the corresponding feature macro, but of course that shouldn't stop you from using the same macro for your own purposes. That same patch should enable v128.const by default.

I expect to remove the current anytrue* intrinsics and replace them with a single any_true_v128 intrinsic to mirror the change we made to the instructions.

I submitted https://bugs.llvm.org/show_bug.cgi?id=49655 to track the suboptimal shift issue.

tlively commented 3 years ago

I just uploaded https://reviews.llvm.org/D101112 to add the missing intrinsics and update the existing intrinsics to match the new instruction names. The old names are kept as deprecated functions to ease transitioning to the new names. Reviews welcome!

Maratyszcza commented 3 years ago

@tlively The updated intrinsics look good. I'd like to have const_splat versions though.

tlively commented 3 years ago

@Maratyszcza, if you do e.g. wasm_i8x16_splat(42), it will actually give you a v128.const i8x16 42, 42, 42, .... Is that sufficient or would having a more explicit intrinsic for achieving that be worth it?

Maratyszcza commented 3 years ago

It is enough for me, but is probably a bad idea overall: polymorphic macros/functions are not portable to older C codebases.

tlively commented 3 years ago

This isn't polymorphism in the clang front-end; it's the backend instruction selection detecting that a splat of a constant is better represented with a v128.const than with a splat instruction.

Maratyszcza commented 3 years ago

Then it is not enough: other compilers may not have the same optimization. E.g. last time I checked, the Microsoft compiler happily generated a multi-instruction sequence for _mm_set1_epi32(const) on x86.

nemequ commented 3 years ago

The first parameter to __builtin_wasm_load*_lane should be const, and wasm_v128_load*_lane shouldn't cast away the const.

__builtin_wasm_abs_i64x2 doesn't generate i64x2.abs https://godbolt.org/z/jY1hf3Pch

Other than that, my biggest concern is that __builtin_wasm_load*_lane are basically unusable for me. If the lane parameter isn't an integer constant expression the compiler will emit an error, so the API has to use a macro. In C I can use a generic selection to choose the right function, but that's not an option in C++. Handling the obvious pattern would fix the issue.
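The C-only workaround mentioned above can be sketched like this (all names here are invented for illustration): a _Generic selection picks a function based on the pointed-to type at compile time, a construct C++ does not have.

```c
#include <stdint.h>

/* Hypothetical model of dispatching by pointer type with C11 generic
   selection; there is no equivalent construct in C++, so a header
   relying on this trick cannot serve C++ the same way. */
static int32_t load16_model(const int16_t *p) { return *p; }
static int32_t load32_model(const int32_t *p) { return *p; }

#define LOAD_MODEL(p) _Generic((p),        \
    int16_t *: load16_model,               \
    const int16_t *: load16_model,         \
    int32_t *: load32_model,               \
    const int32_t *: load32_model)(p)
```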

penzn commented 3 years ago

@tlively let me know if you want me to look into any of this.

tlively commented 3 years ago

Thanks for the feedback @Maratyszcza and @nemequ, and thanks for the offer of help @penzn! I'm currently working on adding a test upstream in clang to directly test the codegen for all of the intrinsics to catch the i64x2.abs issue @nemequ identified and other similar issues.

@penzn, if you want to look into replacing the macro intrinsics with functions using the pattern @nemequ shared, that would be really helpful and would not clash with my current work. If you're interested, I would also appreciate help fixing codegen issues after my test lands.

nemequ commented 3 years ago

Another one: wasm_load32_zero and wasm_load64_zero don't generate the right instructions. __builtin_wasm_load32_zero and __builtin_wasm_load64_zero work, which is why I didn't catch this earlier, but wasm_simd128.h doesn't use them. Of course I'd rather LLVM recognized the existing implementations instead of requiring the intrinsics…

Also, the inputs to __builtin_wasm_load32_zero and __builtin_wasm_load64_zero should be const.

FWIW, I should have a bunch of examples where codegen can be improved soon. My plan is to update the WASM SIMD128 implementation in SIMDe to match the current wasm_simd128.h over the weekend, then port the (verified correct) implementations to WAV so I can use the tests I have there to easily find implementations for which LLVM doesn't generate optimal code (i.e., a single instruction). I'll probably file them in the LLVM bugzilla next week. Sorry, it's going to be a lot of work on the LLVM side, but I think it should be pretty helpful for improving autovectorization.

tlively commented 3 years ago

Excellent, I look forward to the reports. Thanks for your work on this!

tlively commented 3 years ago

I have the codegen tests up for review here: https://reviews.llvm.org/D101684. @nemequ, could you take a look through the FIXMEs there and make sure they line up with the issues you've seen?

nemequ commented 3 years ago

LGTM. They include all the issues I've seen; your list has a few more because WAV uses the builtins directly (unless I can avoid them altogether and still get the right instructions) instead of going through wasm_simd128.h, and most of these seem to be bugs in the header.

I did just notice that the pointer parameters for __builtin_wasm_load*_lane aren't const but should be. Since wasm_v128_load*_lane are macros the diagnostic is emitted when using wasm_simd128.h: https://godbolt.org/z/e5c84vb17. I'm working around it in WAV by disabling the warning, which wasm_simd128.h could do, but fixing the builtin seems like a better solution.

nemequ commented 3 years ago

Argh, one last thing, sorry: the inputs to __builtin_wasm_narrow_u_i8x16_i16x8 and __builtin_wasm_narrow_u_i16x8_i32x4 should be signed.

Maratyszcza commented 3 years ago

@tlively One more thing: the wasm_v128_load*_lane/wasm_v128_store*_lane APIs should use const void*/void* rather than strongly typed pointers.

penzn commented 3 years ago

Looks like the tests need to be re-done in terms of C->IR->Wasm. @tlively, since this is more work, let me know if you need help with that as well; as a baseline it should be possible to take the tests you have written and turn them into two files for the two stages.

tlively commented 3 years ago

Thanks, @penzn! I've already landed the C->IR test here: https://reviews.llvm.org/rGf3b769e82ff3340e96c2e50ac2bd6361fbc3d797. I don't think we'll want a specific IR->Wasm test file that matches the intrinsics, though. I still have my end-to-end test in a separate branch to help me identify specific backend issues.

For example, https://reviews.llvm.org/rG14ca2e5e22e7806c77ca3e5b126e888c9b1c4041 should fix the bug with i64x2.abs not being generated.

tlively commented 3 years ago

How would folks feel about changing the return type of the extract_lane functions and parameters to replace_lane functions to be the specific element types, e.g. int8_t rather than just int? The original motivation for returning int is that the underlying instructions consume and produce i32 values, but that's not consistent with our use of more specific types in other places like the arguments to the splat functions.

lars-t-hansen commented 3 years ago

How would folks feel about changing the return type of the extract_lane functions and parameters to replace_lane functions to be the specific element types, e.g. int8_t rather than just int?

I'm a little worried about the C++ compiler inserting redundant extend instructions in both cases, since the underlying instructions consume and return int. I expect this is mostly not a problem, though it is possibly ABI-dependent.

(Consistency is good, but could it be that it's the arguments to the splat functions that should be changed to int instead?)

tlively commented 3 years ago

Yes, it would also be possible to change the arguments to splats to be ints, but IMO the more specific types give a better developer experience. I can add tests that check that no extra extend instructions are emitted, if that sounds good.

lars-t-hansen commented 3 years ago

Yes, it would also be possible to change the arguments to splats to be ints, but IMO the more specific types give a better developer experience. I can add tests that check that no extra extend instructions are emitted, if that sounds good.

Tests are good... Still wondering if this might be an issue in corner cases, but I'm having a hard time constructing concrete examples for it; it depends on compiler internals. In principle though, something as simple as (i32.add (i8x16.extract_lane n ...) ...) (transliterated to the obvious C++) should tell us something, and I'd settle for some obvious tests like that.

nemequ commented 3 years ago

How would folks feel about changing the return type of the extract_lane functions and parameters to replace_lane functions to be the specific element types, e.g. int8_t rather than just int?

I'm strongly in favor of this, but isn't it basically #257? I don't see why it would make sense for the builtin to return an int8_t but the wrapper return an int; if anything I would think that should be reversed (though IMHO it would be much better to just use int8_t for both).

This brings up another question I've been wondering about: is there a reason for these builtins to exist at all? Why not just drop them and implement the extract_lane and replace_lane functions using the subscript operator? The only benefit of the builtin I can think of is that it verifies that the index is a constant, but you can do that in the wrapper instead (which is what I do in WAV).
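A minimal sketch of the subscript approach, assuming the GCC/Clang vector extensions that wasm_simd128.h itself builds on (the function and type names are mine; this compiles as ordinary host C, so the wasm lowering described in the comments is the intended outcome, not something this snippet demonstrates):

```c
#include <stdint.h>

/* i32x4 mirrors how v128_t is defined via the vector_size attribute;
   the name is illustrative, not the header's. */
typedef int32_t i32x4 __attribute__((vector_size(16)));

/* Reading a lane with the subscript operator; when targeting wasm
   this is intended to select to i32x4.extract_lane. */
static int32_t extract_lane_i32x4(i32x4 v, int lane) {
    return v[lane];
}

/* Writing a lane; intended to select to i32x4.replace_lane. */
static i32x4 replace_lane_i32x4(i32x4 v, int lane, int32_t value) {
    v[lane] = value;
    return v;
}
```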

It's not just these functions; a lot of the current builtins can already be replaced by normal code (I can very easily put together a fairly complete list if there is interest). There are a lot of functions in wasm_simd128.h which already work this way, so obviously it's not forbidden, but what are the criteria for determining whether something needs a builtin or not?

tlively commented 3 years ago

Ah yes, I had forgotten about #257. @Maratyszcza, on that issue you were arguing against using the smaller types; has your thinking on that evolved? I know my own focus has shifted from trying to represent the underlying instructions precisely to trying to present the best developer experience (with no difference in the produced code either way).

This brings up another question I've been wondering about: is there a reason for these builtins to exist at all?

Generally, many of the builtins exist because getting end-to-end codegen for particular instructions was simpler using builtin functions and LLVM intrinsics than proper ISel patterns. My plan is to continue replacing those builtins with normal code patterns.

The extract and replace builtins predate all that rapid prototyping work, though. I was very new to this project when I added them, and I agree that they are not useful today ;)

what are the criteria for determining whether something needs a builtin or not?

If normal code patterns can reliably give us good results via reasonable ISel patterns, no builtin or LLVM intrinsic should be necessary. If the underlying instruction has potentially useful semantic differences from similar C, a builtin is probably still useful to expose those semantics at the source level, though. (For example, the builtins for the trapping and non-trapping float-to-int instructions are useful because out-of-bounds float-to-int conversions are UB in C.) Builtins are also useful when the operation cannot be simply expressed in C or LLVM IR, for example for the saturating, rounding q15 mul instruction.
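The float-to-int point can be sketched in scalar C (the model name is mine; this illustrates the semantics the spec gives the non-trapping instruction, not the intrinsic itself):

```c
#include <stdint.h>

/* Scalar sketch of why the saturating conversions keep a builtin: a
   plain C cast from float to int32_t is undefined for NaN and
   out-of-range inputs, while i32x4.trunc_sat_f32x4_s defines them
   (NaN -> 0, out-of-range values clamp to the integer bounds). */
static int32_t trunc_sat_f32_model(float x) {
    if (x != x) return 0;                      /* NaN */
    if (x >= 2147483648.0f) return INT32_MAX;  /* saturate high */
    if (x < -2147483648.0f) return INT32_MIN;  /* saturate low */
    return (int32_t)x;                         /* in range: truncate */
}
```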

nemequ commented 3 years ago

I know my own focus has shifted from trying to represent the underlying instructions precisely to trying to present the best developer experience (with no difference in the produced code either way).

That's great, though IMHO it would need some major, fundamental changes to present a good experience; that's exactly why I started working on WAV.

Generally, many of the builtins exist because getting end-to-end codegen for particular instructions was simpler using builtin functions and LLVM intrinsics than proper ISel patterns. My plan is to continue replacing those builtins with normal code patterns.

Ok, it seems like we're on the same page here. I'm trying to do the same.

By "replacing", do you mean that you plan to remove the builtins, or just stop using them in wasm_simd128.h? Right now I'm using them directly in WAV (when normal code won't work), but since WAV isn't distributed with LLVM like wasm_simd128.h is I'm eventually going to need a stable target…

I have a bunch of functions in WAV (besides extract/replace_lane) which don't call builtins but do generate the correct instructions, including many which wasm_simd128.h does use the builtins for. Once you have the tests you're working on updated and merged into LLVM do you want me to put together a patch to merge those implementations into wasm_simd128.h?

tlively commented 3 years ago

I know my own focus has shifted from trying to represent the underlying instructions precisely to trying to present the best developer experience (with no difference in the produced code either way).

That's great, though IMHO it would need some major, fundamental changes to present a good experience; that's exactly why I started working on WAV.

Yes, I'm just trying to get the best experience possible without major changes ;) I'm very excited about your WAV work as well!

Generally, many of the builtins exist because getting end-to-end codegen for particular instructions was simpler using builtin functions and LLVM intrinsics than proper ISel patterns. My plan is to continue replacing those builtins with normal code patterns.

Ok, it seems like we're on the same page here. I'm trying to do the same.

By "replacing", do you mean that you plan to remove the builtins, or just stop using them in wasm_simd128.h? Right now I'm using them directly in WAV (when normal code won't work), but since WAV isn't distributed with LLVM like wasm_simd128.h is I'm eventually going to need a stable target…

For now I would like to remove them entirely whenever possible, but I totally agree about a stable target. LLVM 13 will branch on July 27; let's finish all the cleanup by then and call whatever builtins make it into LLVM 13 stable. Does that seem reasonable to you?

I have a bunch of functions in WAV (besides extract/replace_lane) which don't call builtins but do generate the correct instructions, including many which wasm_simd128.h does use the builtins for. Once you have the tests you're working on updated and merged into LLVM do you want me to put together a patch to merge those implementations into wasm_simd128.h?

Yes, the tests are merged upstream and patches to wasm_simd128.h would be very welcome!

nemequ commented 3 years ago

For now I would like to remove them entirely whenever possible, but I totally agree about a stable target. LLVM 13 will branch on July 27; let's finish all the cleanup by then and call whatever builtins make it into LLVM 13 stable. Does that seem reasonable to you?

Yes. Are you open to similar wasm_simd128.h changes for roughly the same window? Here are some ideas off the top of my head:

  • wasm_v128_load/wasm_v128_store are for unaligned data; there should probably be a similar function for 16-byte aligned data. In WAV I have wav_*_loadu (unaligned), wav_*_loada (16-byte aligned), and wav_*_load (aligned to _Alignof(element_type)). The third variant doesn't really make sense for wasm_v128_* functions, but wasm_v128_loada seems reasonable to me if you want to keep the default as unaligned (which may also be worth discussing).

  • Related to the normal code vs. intrinsics issue, IMHO it would be better to switch macros to functions when possible. They give much more useful error messages when you do something wrong, play nicer with tooling, are more self-documenting, etc.

  • One thing I just started doing in WAV (but haven't finished) which I think would make sense in wasm_simd128.h as well is improving the parameter names so functions / macros are a bit more self-documenting. For example, __mask on wasm_v128_bitselect is already a huge improvement over __c (NEON does this), but does the value come from __a or __b when the corresponding bit in __mask is set? Or for wasm_v128_andnot, is it a & ~b or ~a & b? I would assume the former, but anyone who has used _mm_andnot_si128 will likely think twice. This doesn't break API so it's not really time-sensitive.

Also, I think it's worth considering how aggressively uni-typed wasm_simd128.h should be. For example:

  • If you do add aligned load/store, I'd prefer typed loads (with conformant array parameters). This makes it clearer what the API expects.

  • Right now there is no wasm_f32x4_shuffle, wasm_u32x4_eq, or wasm_u32x4_mul, but I find having users mix "types" like that in their code quite unpleasant. v128 as a catch-all for things like bitwise operations is one thing, but IMHO it would be better to have all variants in wasm_simd128.h even when the assembly is the same instead of using the signed integer names as a generic version.

I can see an argument for leaving things as they are for consistency, but since wasm_simd128.h is going to be the official API, short of deprecating the whole thing and switching to a typed API (which I'd be completely comfortable with ;)) I think filling in some of the gaps like this would be the least-bad option… you still don't get type safety, but at least there are fewer weird holes in the API.

Yes, the tests are merged upstream and patches to wasm_simd128.h would be very welcome!

Oh, that was quick. I'll try to get on that soon.

tlively commented 3 years ago

For now I would like to remove them entirely whenever possible, but I totally agree about a stable target. LLVM 13 will branch on July 27; let's finish all the cleanup by then and call whatever builtins make it into LLVM 13 stable. Does that seem reasonable to you?

Yes. Are you open to similar wasm_simd128.h changes for roughly the same window?

Generally yes, although breaking changes would be a harder sell and would need to allow time for users to update their code.

Here are some ideas off the top of my head:

  • wasm_v128_load/wasm_v128_store are for unaligned data, there should probably be a similar function for 16-byte aligned data. In WAV I have wav_*_loadu (unaligned), wav_*_loada (16-byte aligned), and wav_*_load (aligned to _Alignof(element_type)). The third variant doesn't really make sense for wasm_v128_* functions, but wasm_v128_loada seems reasonable to me if you want to keep the default as unaligned (which may also be worth discussing).

I was thinking that for aligned loads and stores, users could just dereference v128_t pointers, but you're right that it would probably be better to have explicit functions if we want to support that. I'd be open to discussion here. @Maratyszcza, you've been a proponent of the unaligned, maximally general lowering. Do you have an opinion on this?
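One way the split could look, sketched with the GCC/Clang vector extensions rather than the real header (the loadu/loada naming is borrowed from the WAV description above; everything else here is an assumption):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t i32x4_model __attribute__((vector_size(16)));

/* Unaligned load: memcpy makes no alignment assumption, matching the
   current wasm_v128_load contract. */
static i32x4_model loadu_model(const void *mem) {
    i32x4_model v;
    memcpy(&v, mem, sizeof v);
    return v;
}

/* Aligned load: dereferencing a vector-typed pointer lets the
   compiler assume 16-byte alignment; the assert documents the
   caller's promise. */
static i32x4_model loada_model(const void *mem) {
    assert(((uintptr_t)mem & 15) == 0);
    return *(const i32x4_model *)mem;
}
```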

  • Related to the normal code vs. intrinsics issue, IMHO it would be better to switch macros to functions when possible. They give much more useful error messages when you do something wrong, play nicer with tooling, are more self-documenting, etc.

I agree, and have started moving in that direction with https://reviews.llvm.org/D102018.

  • One thing I just started doing in WAV (but haven't finished) which I think would make sense in wasm_simd128.h as well is improving the parameter names so functions / macros are a bit more self-documenting. For example, __mask on wasm_v128_bitselect is already a huge improvement over __c (NEON does this), but does the value come from __a or __b when the corresponding bit in __mask is set? Or for wasm_v128_andnot, is it a & ~b or ~a & b? I would assume the former, but anyone who has used _mm_andnot_si128 will likely think twice. This doesn't break API so it's not really time-sensitive.

Sounds like a good idea!
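The andnot ambiguity above can be made concrete with scalar models (function names are mine): per the SIMD proposal, v128.andnot(a, b) computes a & ~b, while x86's _mm_andnot_si128(a, b) computes ~a & b, and the bare name "andnot" says nothing about which convention applies.

```c
#include <stdint.h>

/* Scalar models of the two "andnot" conventions: WebAssembly's
   v128.andnot(a, b) is a & ~b, while SSE's _mm_andnot_si128(a, b)
   is ~a & b.  Better parameter names would make the difference
   visible at the call site. */
static uint32_t wasm_andnot_model(uint32_t a, uint32_t b) { return a & ~b; }
static uint32_t sse_andnot_model(uint32_t a, uint32_t b)  { return ~a & b; }
```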

Also, I think it's worth considering how aggressively uni-typed wasm_simd128.h should be. For example

  • If you do add aligned load/store, I'd prefer typed loads (with conformant array parameters). This makes it clearer what the API expects.

Interesting! I assume C++ supports that as well? Seems worth considering.

  • Right now there is no wasm_f32x4_shuffle, wasm_u32x4_eq, or wasm_u32x4_mul, but I find having users mix "types" like that in their code quite unpleasant. v128 as a catch-all for things like bitwise operations is one thing, but IMHO it would be better to have all variants in wasm_simd128.h even when the assembly is the same instead of using the signed integer names as a generic version.

I get where you're coming from, but I think this would make more sense if we actually exposed different types. I slightly prefer leaning into the uni-typed design in which the function/instruction names give the semantics, but the type itself is not interpreted. I guess we could have multiple names for the same operation, but I'm not sure it would be worth it. I'm open to further discussion here.

tlively commented 3 years ago

A quick list of recent changes I have landed upstream:

nemequ commented 3 years ago

I was thinking that for aligned loads and stores, users could just dereference v128_t pointers, but you're right that it would probably be better to have explicit functions if we want to support that. I'd be open to discussion here. @Maratyszcza, you've been a proponent of the unaligned, maximally general lowering. Do you have an opinion on this?

One problem with just casting to v128_t * is that it triggers a -Wcast-qual diagnostic: https://godbolt.org/z/MrTn1bx76

  • If you do add aligned load/store, I'd prefer typed loads (with conformant array parameters). This makes it clearer what the API expects.

Interesting! I assume C++ supports that as well? Seems worth considering.

Conformant array parameters? No, C++ doesn't support them. They're based on the syntax C uses for VLAs, which C++ (wisely) never added. They don't technically create VLAs, since array arguments decay to pointers, but clang and GCC will both emit a -Wvla diagnostic (if it's enabled) when they are encountered, and MSVC and IAR (which don't support VLAs) error out. However, that's easy enough to work around with a macro, which is what I do in WAV. Even if it gets compiled away to nothing, I still find it valuable as documentation. To be clear, empty square brackets there are totally fine for C++, so compiling the size away to nothing is valid in C++ and pre-C99 C.

Also, the C2x charter adds an additional principle (the only new one) for this:

  1. Application Programming Interfaces (APIs) should be self-documenting when possible. In particular, the order of parameters in function declarations should be arranged such that the size of an array appears before the array. The purpose is to allow Variable-Length Array (VLA) notation to be used. This not only makes the code's purpose clearer to human readers, but also makes static analysis easier. Any new APIs added to the Standard should take this into consideration.

So I expect this to be more common going forward, hopefully with more tooling picking up support.
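A small runnable illustration of the notation (the function and its names are hypothetical, not from wasm_simd128.h): the size parameter precedes the array so it can appear in the declarator, and the [n] documents the expected extent even though the parameter still decays to a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical API using a conformant array parameter: `n` comes
   first so it can appear in the array declarator.  The compiler still
   passes a plain pointer; the [n] is documentation plus fodder for
   static analysis. */
static int32_t sum_i32(size_t n, const int32_t values[n]) {
    int32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}
```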

  • Right now there is no wasm_f32x4_shuffle, wasm_u32x4_eq, or wasm_u32x4_mul, but I find having users mix "types" like that in their code quite unpleasant. v128 as a catch-all for things like bitwise operations is one thing, but IMHO it would be better to have all variants in wasm_simd128.h even when the assembly is the same instead of using the signed integer names as a generic version.

I get where you're coming from, but I think this would make more sense if we actually exposed different types.

Completely agree, but IMHO it's still sensible with a single type.

I slightly prefer leaning into the uni-typed design in which the function/instruction names give the semantics, but the type itself is not interpreted. I guess we could have multiple names for the same operation, but I'm not sure it would be worth it. I'm open to further discussion here.

Okay, so here's my pitch:

For stuff like shuffle where it truly is generic, I'd be happy with v8x16/v16x8/v32x4/v64x2; obviously stuff like shuffle and swizzle (and load/store, if typed load/store happens) would fall into that category. But for mul it's not necessarily obvious that you can use signed multiplication safely; I think a lot of people would just think "oh, I guess WASM SIMD128 doesn't support unsigned multiplication".

There is also the readability issue. For example, consider something like:

v128_t madd_u32(v128_t a, v128_t b, v128_t c) {
  return wasm_u32x4_add(wasm_i32x4_mul(a, b), c);
}

To me, that just looks like a bug. I'd definitely trip over something like that if I was skimming some code and have to think about it for a second. IMHO it's important enough to make sure an API is readable and unsurprising to justify adding aliases even when, if you think about it, under the hood they do exactly the same thing.

Then there is the consistency issue. The above snippet isn't a bug, but this is:

v128_t madd_f32(v128_t a, v128_t b, v128_t c) {
  return wasm_f32x4_add(wasm_i32x4_mul(a, b), c);
}

In some ways, I think it's actually more important to do stuff like this in a uni-typed API. That first example doesn't just look like a bug, it looks like a class of bug which is probably going to be very common since there is no type safety to prevent you from making that mistake. I'd expect people to be more sensitive to issues like that when reviewing code using this API, so you're going to end up with a lot of false-positive diagnostics from your brain for stuff like this.

Basically, the aliases would make code easier to read and write, and the only cost is a few extra functions in a system include.

Maratyszcza commented 3 years ago

@Maratyszcza, you've been a proponent of the unaligned, maximally general lowering. Do you have an opinion on this?

IIRC, V8 & SM ignore alignment hints, and enforcing alignment hints in practice would cost performance on platforms where they are not natively enforced by hardware (including x86 AVX and ARM64 NEON). Thus, in practice, alignment hints are of no use in Wasm SIMD.

Maratyszcza commented 3 years ago

But for mul it's not necessarily obvious that you can use signed multiplication safely; I think a lot of people would just think "oh, I guess WASM SIMD128 doesn't support unsigned multiplication".

For these types of people we have autovectorization in Clang. IMO it is reasonable to expect that people who program directly in low-level intrinsics understand which instructions those map to, and what operations they do.

penzn commented 3 years ago

Also, if you mask the count on the shift functions in C (i.e., vec << (count & 15)) you'll see the mask operation in the output even though the shifts (theoretically) do this internally.

The compiler does not normally carry full information about the extra operations or constraints intrinsics have, for use in optimizations. I would not consider this a bug, in the spirit of @Maratyszcza's point right above this message :)

Posted in Bugzilla (LLVM is migrating to GitHub issues some time soon): https://bugs.llvm.org/show_bug.cgi?id=49655#c1