API Proposal: Support the use-cases of mask-extraction on Vector<T>

IJzerbaard commented 5 years ago

Mask-extraction instructions, such as MOVMSKPS/D and PMOVMSKB, are commonly used for:

- implementing EqualsAll and EqualsAny, which are already supported
- compression after filtering (in conjunction with PSHUFB and optionally POPCNT)
- getting the index of the first or last match (in conjunction with BSF or BSR)
- various ad-hoc uses

All such uses could be supported by adding a low-level API such as:

public static ulong ExtractMask(Vector<T> x);

But this has some issues:

- The real number of output bits varies with vector size, which may lead to code that assumes the top bits are zero, an assumption that may be invalidated in the future.
- The number of output bits may exceed 64 in the future; AVX512 is already at 64.
- The implementation for T=ushort is tricky without AVX512.
- Without PSHUFB/VPERMD or a bit scan, the result is not very useful anyway.
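To make the first two issues concrete, here is a minimal scalar sketch of what such an ExtractMask could compute. The class name MaskSketch and the "lane is non-zero" rule are assumptions made for illustration (MOVMSKPS/PMOVMSKB read the top bit of each lane instead); nothing here is an existing System.Numerics API.

```csharp
using System;
using System.Numerics;

public static class MaskSketch
{
    // Hypothetical scalar reference: bit i of the result is set when lane i
    // of a comparison mask is non-zero. Assumes lanes are all-ones or all-zeros,
    // as produced by Vector.Equals, Vector.GreaterThan, and similar.
    public static ulong ExtractMask<T>(Vector<T> mask) where T : struct
    {
        // The number of meaningful bits is Vector<T>.Count, which depends on
        // both T and the hardware; a ulong cannot hold more than 64 of them.
        if (Vector<T>.Count > 64)
            throw new NotSupportedException("More lanes than bits in a ulong.");

        ulong bits = 0;
        for (int i = 0; i < Vector<T>.Count; i++)
        {
            if (!mask[i].Equals(default(T)))
                bits |= 1UL << i;
        }
        return bits;
    }
}
```

For example, comparing a Vector<int> whose lane 1 holds 7 against new Vector<int>(7) and passing the result to MaskSketch.ExtractMask sets only bit 1. How many bits above that are guaranteed to be zero depends on Vector<int>.Count, which is exactly the portability hazard described in the list.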

Instead I propose supporting some narrower specific use cases, for example:

struct Vector<T> {
    public int CompressedCopyTo(Vector<T> mask, T[] destination, int startIndex);
}

CompressedCopyTo stores the elements selected by the mask and returns the number of elements written. Admittedly this has a problem: it cannot be used safely near the end of the destination array, even if startIndex plus the number of selected elements stays within the length of the destination, because an entire vector is always stored; it just has the selected elements packed at the start. Using the masked store instruction is not a solution because it has a non-temporal hint, which makes it unsuitable for general use. AVX512 compress-store (e.g. VCOMPRESSPS with a memory destination) does not have this problem, but is not widely supported.
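A scalar model may help pin down the intended semantics, including the end-of-array restriction. This is only a sketch: the class name CompressSketch is made up, the source vector is passed explicitly because the method is static here, and filling the unselected tail with default(T) is an assumption (a SIMD implementation might leave other values there).

```csharp
using System;
using System.Numerics;

public static class CompressSketch
{
    // Hypothetical scalar model of the proposed CompressedCopyTo.
    // Packs the lanes of 'source' whose mask lane is non-zero to the front,
    // stores a full vector at destination[startIndex], and returns how many
    // of the stored elements are meaningful.
    public static int CompressedCopyTo<T>(
        Vector<T> source, Vector<T> mask, T[] destination, int startIndex)
        where T : struct
    {
        int width = Vector<T>.Count;

        // Mirrors the restriction discussed above: a whole vector is written,
        // so a whole vector's worth of room is required.
        if (startIndex < 0 || destination.Length - startIndex < width)
            throw new ArgumentException("Need room for a full vector at startIndex.");

        T[] packed = new T[width]; // unselected tail stays default(T) in this sketch
        int written = 0;
        for (int i = 0; i < width; i++)
        {
            if (!mask[i].Equals(default(T)))
                packed[written++] = source[i];
        }

        new Vector<T>(packed).CopyTo(destination, startIndex); // full-width store
        return written;
    }
}
```

A filtering loop would advance its output position by the returned count after each call, so the padding written by one call is overwritten by the next; only the final call needs the extra slack at the end of the destination.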

There are some other issues:

- Elements smaller than 32 bits are tough to support without AVX512_VBMI2, but it is possible and it does not look too bad.
- For pre-SSSE3 targets, 32-bit elements could be handled with a switch over the mask and 16 separate shuffle-by-immediate instructions. Not very nice, but still worth having IMO. As far as I know, compressing 16-bit and 8-bit elements is a lost cause with SSE2, ending up in a scalar fallback. SSSE3 market penetration is at 97.71% on the Steam Hardware Survey.
- The AVX2 version would need a lookup table 8KB in size.

This is a tough nut to crack but compression of filtered results is broadly applicable and currently impossible with the System.Numerics.Vector API.

To support getting the index of the first or last match, I propose:

public static int FirstIndexOf<T>(Vector<T> vector, T value);
public static int LastIndexOf<T>(Vector<T> vector, T value);
public static int FirstIndexOfNonZero<T>(Vector<T> vector); (optional)
public static int LastIndexOfNonZero<T>(Vector<T> vector); (optional)

These have the usual semantics: they return the first or last index of a match if there is one, and -1 otherwise. Possible applications include:

- a replacement for the compression of filtered results, assuming matches are rare: search for the first match, handle it in scalar code, then pick up searching after that match
- implementing IndexOf on a bigger array
- string comparison/search
- bucketized hash tables (like Swiss Tables)
- a replacement for the last iterations in a binary search (using GreaterThanOrEqual and FirstIndexOfNonZero)
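As a rough illustration of the "IndexOf on a bigger array" application, the sketch below pairs scalar stand-ins for the proposed FirstIndexOf and LastIndexOf with a block-wise search. The class name IndexSketch and the IndexOf helper are hypothetical; Vector.EqualsAny and the Vector<T> constructors are existing APIs. The stand-ins use EqualityComparer<T>, which is only one possible choice of equality.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class IndexSketch
{
    // Scalar stand-in for the proposed FirstIndexOf: lowest matching lane, or -1.
    public static int FirstIndexOf<T>(Vector<T> vector, T value) where T : struct
    {
        for (int i = 0; i < Vector<T>.Count; i++)
            if (EqualityComparer<T>.Default.Equals(vector[i], value)) return i;
        return -1;
    }

    // Scalar stand-in for the proposed LastIndexOf: highest matching lane, or -1.
    public static int LastIndexOf<T>(Vector<T> vector, T value) where T : struct
    {
        for (int i = Vector<T>.Count - 1; i >= 0; i--)
            if (EqualityComparer<T>.Default.Equals(vector[i], value)) return i;
        return -1;
    }

    // Hypothetical helper: IndexOf over a whole array, one vector-sized block at a time.
    public static int IndexOf(int[] array, int value)
    {
        int width = Vector<int>.Count;
        var needle = new Vector<int>(value);

        int i = 0;
        for (; i <= array.Length - width; i += width)
        {
            var block = new Vector<int>(array, i);
            if (Vector.EqualsAny(block, needle))
                return i + FirstIndexOf(block, value); // locate the lane inside the block
        }

        for (; i < array.Length; i++) // scalar tail for the leftover elements
            if (array[i] == value) return i;
        return -1;
    }
}
```

With a built-in FirstIndexOf, the lane lookup inside the loop would no longer need a scalar pass over the block, which is the part that mask extraction plus a bit scan normally provides.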

Issue: should floating-point vectors be supported? They could be, but they raise questions about the precise semantics, e.g. does NaN equal itself for the purpose of finding its index, do 0.0 and -0.0 equal each other, etc.
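The ambiguity already exists in today's APIs, which the small sketch below tries to show. The class and method names are made up for illustration; Vector.EqualsAny and double.Equals are existing APIs. Lane-wise vector comparison follows IEEE 754 equality, while double.Equals treats NaN as equal to itself, so the two candidates for "equal" disagree on NaN.

```csharp
using System;
using System.Numerics;

public static class FloatSemanticsSketch
{
    // True when at least one lane compares equal under the lane-wise
    // (IEEE 754) comparison used by Vector.Equals / Vector.EqualsAny.
    public static bool AnyLaneEqual(double a, double b) =>
        Vector.EqualsAny(new Vector<double>(a), new Vector<double>(b));

    // True under the object-style equality of double.Equals.
    public static bool ObjectEqual(double a, double b) => a.Equals(b);
}
```

Here AnyLaneEqual(double.NaN, double.NaN) is false while ObjectEqual(double.NaN, double.NaN) is true, and both report 0.0 and -0.0 as equal. A FirstIndexOf for floating-point vectors would have to pick one of these behaviours and document it.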

tannergooding commented 5 years ago

Thanks for logging this API proposal @IJzerbaard.

I'll give this a more thorough look over when I get into the office tomorrow and will likely leave it marked as api-suggestion for 1-2 weeks so that the community can provide any applicable feedback.

scalablecory commented 5 years ago

This or some variant would be nice.

tannergooding commented 5 years ago

CC. @CarolEidt

Do you have an opinion on how best we could support a scenario like this? Namely, there are some APIs (like mask extraction) which are good cross-platform candidates and generally more useful, but for which just returning an int or long may not work, since the size of Vector<T> is not strictly defined.

Forcing users to always go through a Span<T> seems undesirable, but we may also need to eventually support things like Vector2048 (ARM SVE extensions support this, for example).

Potentially, something like this could be supported via more explicit APIs that operate directly on Vector128<T> and Vector256<T> in a cross-platform manner...

CarolEidt commented 5 years ago

This is definitely the sort of thing that requires some careful API design (and deeper thinking). This proposal provides a good start in that it outlines a few of the key use cases. Further, I think we can now start thinking about Vector<T> as a higher-level abstraction for which we need not constrain the APIs to map to a single machine instruction, especially as we reduce the friction of utilizing the HW intrinsics to implement operations on Vector<T>.

I think we should consider adding cross-platform APIs that operate on fixed-size vectors (Vector128<T>, Vector256<T> and perhaps even Vector64<T>), but I think we should also continue to look at expanding the Vector<T> APIs.

GSPP commented 5 years ago

It would be good to write realistic prototype code to design an API like this. It can be a bit hard to foresee which API shape practical code actually needs.

dotnet / runtime

API Proposal: Support the use-cases of mask-extraction on Vector<T> #30569