Open IJzerbaard opened 5 years ago
Thanks for logging this API proposal @IJzerbaard.
I'll give this a more thorough look over when I get into the office tomorrow and will likely leave it marked as api-suggestion for 1-2 weeks so that the community can provide any applicable feedback.
This or some variant would be nice.
CC. @CarolEidt
Do you have an opinion on how best we could support a scenario like this? Namely, there are some APIs (like mask extracting) which are good cross-platform candidates and which are generally more useful, but for which just returning an int or long may not work since the size of Vector<T> is not strictly defined.
Forcing users to always go through a Span<T> seems undesirable, but we may also need to eventually support things like Vector2048 (ARM SVE extensions support this, for example).
Potentially, something like this could be supported via more explicit APIs that operate directly on Vector128<T> and Vector256<T> in a cross-platform manner...
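To make the size concern concrete, here is a minimal scalar sketch of what a mask extraction over Vector<T> would have to deal with. The name ExtractMask and its signature are assumptions for illustration only, not an existing System.Numerics API, and the loop is a reference for the semantics rather than how it would actually be implemented.

```csharp
using System;
using System.Numerics;

static class MaskExtractionSketch
{
    // Hypothetical helper (not part of System.Numerics).
    // Packs "element is non-zero" into one bit per element, element 0 in bit 0.
    public static ulong ExtractMask(Vector<byte> mask)
    {
        // Vector<byte>.Count depends on the hardware. Once it exceeds 64,
        // the result no longer fits in a long/ulong return value.
        if (Vector<byte>.Count > 64)
            throw new NotSupportedException("Mask does not fit in 64 bits.");

        ulong bits = 0;
        for (int i = 0; i < Vector<byte>.Count; i++)
        {
            if (mask[i] != 0)
                bits |= 1UL << i;
        }
        return bits;
    }
}
```

With byte elements, a 512-bit vector already uses all 64 bits of the result, so anything wider (such as the Vector2048 case) would need a different return shape.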
This is definitely the sort of thing that requires some careful API design (and deeper thinking). This PR provides a good start in that it outlines a few of the key use cases. Further, I think we can now start thinking about Vector<T> as a higher-level abstraction for which we need not constrain the APIs to map to a single machine instruction, especially as we reduce the friction of utilizing the HW intrinsics to implement operations on Vector<T>.
I think we should consider adding cross-platform APIs that operate on fixed-size vectors (Vector128<T>, Vector256<T> and perhaps even Vector64<T>), but I think we should also continue to look at expanding the Vector<T> APIs.
It would be good to write realistic prototype code to design an API like this. It can be a bit hard to foresee what API shape is what's needed in practical code.
Mask-extraction instructions, such as MOVMSKPS/D and PMOVMSKB, are commonly used to
EqualsAll and EqualsAny, which are already supported
All such uses could be supported by adding a low-level API such as:
But this has some issues,
Instead I propose supporting some narrower specific use cases, for example:
CompressedCopyTo stores the elements selected by the mask and returns the number of elements written. Admittedly this has a problem: it cannot be used safely near the end of the destination array, even if the number of elements selected by the mask plus the startIndex would be less than the length of the destination, because an entire vector is always stored; it just has the selected elements packed at the start. Using the masked store instruction is not a solution because it has a non-temporal hint, which makes it unsuitable for general use. AVX512 compress-store (e.g. VCOMPRESSPS with a memory destination) does not have this problem, but is not widely supported.
There are some other issues,
switch over the mask and 16 separate shuffle-by-immediate.. not very nice, but still worth having IMO. As far as I know, compressing 16 bit and 8 bit elements is a lost cause with SSE2, ending up in scalar fall-back. SSSE3 market penetration is at 97.71% on the Steam Hardware Survey.
This is a tough nut to crack, but compression of filtered results is broadly applicable and currently impossible with the System.Numerics.Vector API.
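As a rough illustration of the intended CompressedCopyTo semantics, here is a scalar reference sketch together with a filter loop that uses it. The signature (values, mask, destination, startIndex) is an assumption, since the exact proposed shape may differ, and FilterGreaterThan is a hypothetical usage helper. A real implementation would use a shuffle instead of the inner loop.

```csharp
using System;
using System.Numerics;

static class CompressSketch
{
    // Hypothetical reference semantics; the signature is assumed.
    // Copies the elements of `values` whose mask element is non-zero to
    // destination[startIndex...] in order and returns how many were written.
    public static int CompressedCopyTo(Vector<int> values, Vector<int> mask,
                                       int[] destination, int startIndex)
    {
        // A SIMD implementation would store a whole vector, so require room
        // for one even though this scalar version writes fewer elements.
        if (startIndex < 0 || destination.Length - startIndex < Vector<int>.Count)
            throw new ArgumentException("Need room for a full vector.");

        int written = 0;
        for (int i = 0; i < Vector<int>.Count; i++)
        {
            if (mask[i] != 0)
                destination[startIndex + written++] = values[i];
        }
        return written;
    }

    // Hypothetical usage: keep every element greater than `threshold`.
    public static int FilterGreaterThan(int[] source, int threshold, int[] destination)
    {
        if (destination.Length < source.Length)
            throw new ArgumentException("Destination too small.");

        var limit = new Vector<int>(threshold);
        int written = 0, i = 0;
        for (; i <= source.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            var v = new Vector<int>(source, i);
            // written <= i, so a full vector always fits at `written`.
            written += CompressedCopyTo(v, Vector.GreaterThan(v, limit), destination, written);
        }
        for (; i < source.Length; i++)   // scalar tail
        {
            if (source[i] > threshold)
                destination[written++] = source[i];
        }
        return written;
    }
}
```

Because the output position never runs ahead of the input position, a destination at least as long as the source sidesteps the end-of-array problem in this particular loop. The problem remains for callers that size the destination tightly.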
To support getting the index of the first or last match, I propose:
With the usual semantics of returning the first or last index of a match if there is one, and -1 otherwise. Possible applications include:
GreaterThanOrEqual and FirstIndexOfNonZero)
Issue: should floating point vectors be supported? They could be, but they raise questions about the precise semantics, e.g. does NaN equal itself for the purpose of finding its index, do 0.0 and -0.0 equal each other, etc.
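The index semantics and the floating point question can be sketched with scalar reference code. The method names and signatures below are assumptions for illustration, not existing System.Numerics members. The float overload simply uses the C# == operator, which is one possible answer to the question above rather than a settled choice.

```csharp
using System;
using System.Numerics;

static class IndexOfSketch
{
    // Hypothetical reference semantics: index of the first element equal
    // to `value`, or -1 if there is none.
    public static int FirstIndexOf(Vector<int> vector, int value)
    {
        for (int i = 0; i < Vector<int>.Count; i++)
            if (vector[i] == value) return i;
        return -1;
    }

    // Same, but scanning from the highest index down.
    public static int LastIndexOf(Vector<int> vector, int value)
    {
        for (int i = Vector<int>.Count - 1; i >= 0; i--)
            if (vector[i] == value) return i;
        return -1;
    }

    // Float variant using IEEE comparison (==): NaN never matches, not even
    // itself, and 0.0f matches -0.0f. Other definitions are possible.
    public static int FirstIndexOf(Vector<float> vector, float value)
    {
        for (int i = 0; i < Vector<float>.Count; i++)
            if (vector[i] == value) return i;
        return -1;
    }
}
```

Under this choice, searching for NaN always returns -1. A bitwise comparison would instead find NaN (with the same payload) but would treat 0.0 and -0.0 as different, so the two candidate semantics disagree on exactly the cases raised above.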