What is the intended workflow of actually leveraging any implementation defined instructions?

juj commented 1 year ago

Reading the spec documentation, it for example defines laneselect as:

Select lanes from a or b based on masks in m.
If each lane-sized mask in m has all bits set or all bits unset, these instructions behave the same as v128.bitselect.
Otherwise, the result is implementation defined.

What kind of workflow would a developer follow in order to manage this "implementation defined" aspect?

In other words, where should a developer turn to in order to find documentation on the different implementation defined behavior? I.e. how does a developer turn the "implementation defined" part into something that they can work with?

Is the intention that each browser/VM vendor will produce documentation on what their implementation defined means in all cases? Does Chrome have such documentation available somewhere?

Or is the intention that each hardware vendor will produce this documentation on how it works on their hardware?

How can a project such as Emscripten leverage relaxed SIMD instructions to produce better SSE and NEON cross-porting headers? In such a scenario, Emscripten developers would be looking to find out what the guarantees are, and to project these via the Emscripten compiler and its documentation to Emscripten users.

penzn commented 12 months ago

Thank you for filing this. Is this an issue specifically with laneselect or other operations as well?

Something along these lines has been raised w.r.t. relaxed laneselect in #125. To answer what it is "supposed to do", the intent is that it would do "either" of the existing hardware behaviors: on Arm the instruction is identical to the existing bitselect, and on x86 there are three different blend instructions that operate off the top bit of every lane, with the exception of 16-bit lanes. The intended workflow for this operation, from https://github.com/WebAssembly/relaxed-simd/issues/125#issuecomment-1579998014, is that the user would do a compare, followed by relaxed laneselect, to get either one of the two potential values (think of it as a vectorized version of x compare y ? a : b for various comparison operators). In that form it is quite cross-platform: comparison ops return 11..1 or 00..0 in every lane, so it won't matter whether laneselect checks only the sign bit or every bit.
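To illustrate why the compare-then-select pattern is portable, here is a minimal scalar sketch of one 32-bit lane in C. It is a model, not engine code: the helper names (lane_bitselect, lane_signselect, lane_gt_mask, lane_select_after_compare) are hypothetical and only mirror the two hardware behaviors described above.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a single 32-bit lane. All names here are hypothetical. */

/* Arm-style behavior: each mask bit picks the matching bit of a or b,
   which is what v128.bitselect does. */
static uint32_t lane_bitselect(uint32_t a, uint32_t b, uint32_t m) {
    return (a & m) | (b & ~m);
}

/* x86 blend-style behavior: only the top bit of the mask lane is consulted. */
static uint32_t lane_signselect(uint32_t a, uint32_t b, uint32_t m) {
    return (m & 0x80000000u) ? a : b;
}

/* SIMD comparisons yield all ones or all zeros per lane. */
static uint32_t lane_gt_mask(int32_t x, int32_t y) {
    return (x > y) ? 0xFFFFFFFFu : 0u;
}

/* Lane-wise "x > y ? a : b". Because the mask comes from a compare,
   both models must agree, which the assert checks. */
static uint32_t lane_select_after_compare(int32_t x, int32_t y,
                                          uint32_t a, uint32_t b) {
    uint32_t m = lane_gt_mask(x, y);
    uint32_t r = lane_bitselect(a, b, m);
    assert(r == lane_signselect(a, b, m));
    return r;
}
```

With a mask that is neither all ones nor all zeros, such as 0x80000000, the two models diverge: lane_bitselect mixes bits of a and b, while lane_signselect returns a unchanged.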

One additional caveat with this pattern is that it can be relatively trivially detected when using regular bitselect. I have a CL to implement this in V8; the machinery to detect the case and produce a blend already exists in the engine, see https://github.com/WebAssembly/relaxed-simd/issues/125#issuecomment-1749277177

On the other hand, I completely understand the confusion and frustration with it: there is some amount of tribal knowledge (the fact that I am describing how the op would work by providing a link to a comment), it is unclear whether you should detect which flavor you are going to get or ensure that the data is in a shape where that won't matter, etc.

juj commented 12 months ago

Is this an issue specifically with laneselect or other operations as well?

When I read the laneselect instruction, I was more than a bit puzzled, because what the whole instruction does is implementation defined, and not just some corner cases that would be intended to not be on the critical computational path (e.g. NaN or Inf semantics might not be important to some instruction for some uses).

the intent that it would do "either" of the existing hardware behavior

Gotcha, that would make sense. Would it be possible for the spec to strongly guarantee this, e.g. say "on SSE archs this is how it should work, and on NEON archs this is how it should work", and then provide a means for developers to detect whether they are running on a NEON or an SSE arch? (iiuc such a detection mechanism doesn't yet exist?)

If this intent is only conveyed in an explanatory GitHub conversation, then it does not yet provide a guarantee that developers can ship software with. If the implementation defined part is subject to VM-specific decisions, then for each VM that provides specifications on "this is how we implement instructions a,b,c", there should be related guidance on "how do you detect that you are running on our VM that implements it like this". Developers cannot reliably feature test these types of function computations, since that could end up in false positives, or require testing across a lot of possible function inputs to gain confidence? (if I understood the situation correctly)

In general, instead of providing functions that change their behavior depending on SSE vs NEON, I would recommend providing separate functions that have fixed SSE and NEON behavior, and then requiring VMs to implement both semantics. That way there would be more compilation strategies available to developers and determinism would be preserved.

For example, a single operation with complexity similar to a matrix-vector multiply would be too fine-grained to include boolean checks on every invocation for what kind of laneselect is available, but if particular code always needed an SSE blend or a NEON laneselect, then it could be written to directly depend on it. Targeting these kinds of "what this function does depends on what CPU you are on" instructions limits the strategies by which the instruction can be deployed. And, for example, we could not utilize any of these to improve the performance and portability of compiling existing SSE/NEON code to the web.

A bit more about this at https://github.com/emscripten-core/emscripten/pull/20391#discussion_r1348029508

juj commented 12 months ago

I wonder if the intent of the wording "implementation defined" is rather more to mean "unspecified"?

As in:

i32x4.laneselect(a: v128, b: v128, m: v128) -> v128

for i in 0...3:
  if (m[i] == 0xFFFFFFFF) ret[i] = a[i];
  else if (m[i] == 0) ret[i] = b[i];
  else ret[i] = unspecified;

to allow implementations to provide a function where users are only expected to call laneselect with m either all ones or all zeroes in a lane?
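Under that "unspecified" reading, the burden is on the caller to pass only all-ones or all-zeros mask lanes. A minimal sketch in C of how a caller might check or enforce that precondition follows. The function names are hypothetical, and keying the canonicalization off the top bit of each lane is an assumption chosen for illustration, not something the spec prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when every 32-bit lane of the mask is 0x00000000 or 0xFFFFFFFF,
   i.e. the only inputs for which the result is fully specified. */
static bool mask_is_canonical_i32x4(const uint32_t m[4]) {
    for (int i = 0; i < 4; ++i) {
        if (m[i] != 0u && m[i] != 0xFFFFFFFFu) return false;
    }
    return true;
}

/* Debug-time guard a caller could place before a relaxed laneselect. */
static void require_canonical_i32x4(const uint32_t m[4]) {
    assert(mask_is_canonical_i32x4(m));
}

/* Forces a mask into canonical form. Assumption: the top bit of each
   lane decides, which matches the blend-style interpretation. */
static void mask_canonicalize_i32x4(uint32_t m[4]) {
    for (int i = 0; i < 4; ++i) {
        m[i] = (m[i] & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    }
}
```

Masks produced directly by SIMD comparisons already satisfy the check, so the guard mainly matters for masks built by hand or loaded from memory.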

In such a case, I think it would be cool to provide both SSE and NEON semantic functions, e.g. laneselect_sse and laneselect_neon, and then specify that the function laneselect would be an alias to one of those two functions (whichever is more performant), with the expectation that when users are calling laneselect, they are not meant to call it with anything other than all zeros or all ones in the masks?

Alternatively, if the intent is not to say unspecified, but to specifically define the result to be implementation defined, then I think this would call for accompanying documentation/specifications from VMs that explain how to find what the implementation defined behavior will then be, and how to test for that behavior?

juj commented 12 months ago

When I read the laneselect instruction, I was more than a bit puzzled, because what the whole instruction does is implementation defined, and not just some corner cases

Yeah, now I realize with more careful reading that the above was not quite correct; my initial understanding was off.

I still think having both variants well defined, plus an alias to the more performant one, would be a nice design for approaching these (laneselect_sse, laneselect_neon, and laneselect_fast to alias to one or the other?).
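A rough scalar sketch in C of what that three-function design could look like. None of these functions exist in the proposal: v128_model is a hypothetical stand-in for v128, and MODEL_PREFER_SSE_BLEND is a made-up compile-time switch standing in for whatever mechanism an engine would use to pick the faster variant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a v128 holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v128_model;

/* Fixed NEON-like semantics: bitwise select, same as v128.bitselect. */
static v128_model laneselect_neon(v128_model a, v128_model b, v128_model m) {
    v128_model r;
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = (a.lane[i] & m.lane[i]) | (b.lane[i] & ~m.lane[i]);
    }
    return r;
}

/* Fixed SSE-like semantics: the top bit of each mask lane decides. */
static v128_model laneselect_sse(v128_model a, v128_model b, v128_model m) {
    v128_model r;
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = (m.lane[i] & 0x80000000u) ? a.lane[i] : b.lane[i];
    }
    return r;
}

/* Alias to whichever variant is assumed cheaper on the target. */
static v128_model laneselect_fast(v128_model a, v128_model b, v128_model m) {
    for (int i = 0; i < 4; ++i) {
        /* Contract from the thread: only all-ones or all-zeros lanes. */
        assert(m.lane[i] == 0u || m.lane[i] == 0xFFFFFFFFu);
    }
#ifdef MODEL_PREFER_SSE_BLEND
    return laneselect_sse(a, b, m);
#else
    return laneselect_neon(a, b, m);
#endif
}
```

Code that needs one specific behavior would call laneselect_sse or laneselect_neon directly and stay deterministic everywhere, while code that only ever passes comparison masks could use laneselect_fast.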

penzn commented 11 months ago

Something along these lines has been voiced before, for example by @titzer. I don't know how much we can do this late in the proposal process though.

A bit more about this at emscripten-core/emscripten#20391 (comment)

This is actually pretty reasonable. Some other people who tried porting existing SIMD code ran into the problems you are describing. In fact, some of these issues start with the original SIMD proposal, not even relaxed SIMD. There was talk about adding a separate operation for signselect (laneselect_sse), but that got voted down. laneselect_neon is bitselect, BTW.

WebAssembly / relaxed-simd

What is the intended workflow of actually leveraging any implementation defined instructions? #152