Open saucecontrol opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.
| Author | saucecontrol |
|---|---|
| Assignees | - |
| Labels | `api-suggestion`, `area-System.Runtime.Intrinsics`, `untriaged` |
| Milestone | - |
Missing these instructions made implementing Roaring Bitmaps in .NET more difficult and less performant.
Any progress on that? That would allow me to implement a far more efficient method for searching multiple options at once.
This is still pending someone coming up with good signatures. It remains very low priority as it is rarely the most efficient way on modern CPUs.
These instructions were very explicitly not brought forward (by the hardware vendors) to support modern CPU features such as YMM, ZMM, or EVEX; that includes being excluded from the new converged AVX10.1 ISA, where they remain capable of only 128-bit operation in the legacy and VEX encodings. These instructions have a number of known pessimizations that make them difficult to work with overall, and they may have very poor performance on a wide range of available CPUs. Even on Intel CPUs, their performance has typically gotten worse over time, not better: purely register-based operations started at around 9-14 cycles and have since risen to 16-20 cycles on the latest E-cores and 12-16 cycles on the latest P-cores (often requiring 9-12 micro-ops and occupying a broad range of ports, which can hinder more general throughput).
As such, newer algorithms have tended to shy away from these instructions and have instead started taking explicit advantage of alternatives to emulate or entirely replace the functionality, particularly on EVEX-capable hardware. This has become even more true as adoption of Arm64, WASM, and other SIMD-capable platforms has increased, as these platforms do not have direct equivalents either and are often better suited to algorithms oriented around more regular SIMD processing techniques.
I would suggest that if someone is still wanting such functionality, they should try to provide some concrete suggestions to improve the naming and ensure all the desired functionality is available/functioning.
Ideally that would also come with some samples of how these instructions in particular would be used and how they are not replaceable with alternative sequences on 256-bit or 512-bit capable hardware.
Can you give some examples of such alternatives? In particular, I'm looking at doing a parallel search of all bytes in one vector against all bytes in another vector (EQUALS_ANY).
It really depends on what exactly you're searching for.
If it's a general search, then something like `vpconflict` (`Avx512CD.DetectConflicts`) is often usable, but it is similarly xarch-unique functionality. Otherwise, something like `MemoryExtensions.IndexOfAny(this ReadOnlySpan<T> span, SearchValues<T> values)` can be a good approach, as it factors in the handling of fixed sets of search values and may fall back to a probabilistic map otherwise (still vector accelerated). If the values are constant and known, you may benefit from doing some form of saturation or other comparison to readily identify the potential scenarios and save perf. There are also opportunities for table-based lookups that drive off these other optimizations, bloom filters, and other things that can lead to significant perf increases.
Hello, we need some of those particular instructions as well. Just to let you know, there are more people that need them. We will try using C++ imports or finding other workarounds, but it would be nice if these were made available eventually...
@la83lynx As with the above discussion, could you provide concrete examples of what you're doing and why they're a strict requirement?
These still remain low priority for the reasons above (the instructions are effectively deprecated, often have better alternative sequences available, and continue to get slower over time).
Hello. I never said they were a strict requirement. We are not writing code from scratch; we are actually porting code from a high-perf C library to .NET, which basically handles the set-difference operation. We are also considering using the C library wrapper; however, we were able to successfully port some other parts directly before, which produced some organizational codebase benefits. If you wish to see the code lines in question, here they are:
```c
const __m128i x_from_Y = _mm_cmpistrm(vecY, vecX, _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
mask_X_items_from_Y = _mm_or_si128(mask_X_items_from_Y, x_from_Y);
const int contained_in_difference = _mm_extract_epi32(mask_X_items_from_Y, 0) ^ 255;
```
the code then uses a lookup table and some more tricks to pick necessary items.
That is a case where the optimal approach depends in part on what `vecX` or `vecY` is. If you're doing a full 8x8 intersection, then the naive equivalent of `_mm_cmpistrm(vecY, vecX, _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK)` is:
```csharp
Vector128<ushort> xFromY = Vector128<ushort>.Zero;

// Broadcast each element of vecY and compare it against all of vecX at once.
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(0)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(1)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(2)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(3)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(4)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(5)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(6)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(7)));

// Collapse the per-lane results into one bit per lane, matching _SIDD_BIT_MASK.
xFromY = Vector128.CreateScalar((ushort)xFromY.ExtractMostSignificantBits());
```
Even though this looks like a lot of code, each of those compare-equal operations is a 1-cycle instruction with 0.5-cycle reciprocal throughput, so it can take as few as 4 cycles to process them all. The `vpshufb` that'd be required to broadcast is similarly timed, so that's another 4-8 cycles. `pcmpistrm` is often 8-11 cycles with a 3-cycle latency (sometimes slower); plus it's often microcoded (not proper uops), so you're already hurting throughput in other ways by using it.
In the absolute worst-case scenario, you might end up a bit slower. But in practice it's often faster for other reasons, which is particularly true if you can expand the algorithm up to AVX2 (operating on it like 2x128, not necessarily 1x256) and definitely if you can expand to AVX-512 (where `vpconflict` and `vp2intersect` can handle this more nicely). If you know either `vecX` or `vecY` is constant, then you can often simplify this further and definitively get faster code 100% of the time; it really just depends on what you're searching for (based on the small snippet you gave, the last `Vector128.CreateScalar(xFromY.ExtractMostSignificantBits())` is likely unneeded, for example).
Beyond that, there are also alternatives depending on what the actual overall work you're doing is. Things like using the new `SearchValues<T>` functionality with `IndexOfAny` often handle the complexities internally and will beat the performance here as well.
-- Part of the reason this often ends up faster is simply that the `pcmp*str*` instructions are often implemented in microcode and often expand to roughly the exact sequence given above. So explicitly coding it as above gives the CPU, the compiler, and the user more opportunities to optimize things. There are cases where it's not actually faster, but they tend to be rare or not overall impactful to the actual hot path, especially if you're able to fully take advantage of modern alternatives and larger vector widths.
Background and motivation
The SSE4.2 string comparison instructions were omitted from the System.Runtime.Intrinsics surface area when the rest of SSE4.2 was implemented, mostly due to an inability to agree on an acceptable API.
This omission hasn't caused too much pain, because most people looking to use these instructions are implementing something along the lines of `IndexOf(char/byte)` or `Contains(char/byte)`, which can be accomplished more efficiently with `CompareEqual` and `MoveMask`/`TestZ`.

There are, however, places where the string instructions are useful. For example, `pcmpestri` can be faster than the above workaround when implementing something like `IndexOfAny` with a longer search list (4+ bytes/chars), especially over a shorter search length.

Further, the string comparison instructions can be used for general-purpose vector intersection, doing up to 64 16-bit or 256 8-bit comparisons at once. One such use is https://github.com/RoaringBitmap/CRoaring, which was brought up in https://github.com/dotnet/runtime/discussions/63332
Related issues: https://github.com/dotnet/runtime/issues/957 https://github.com/dotnet/runtime/issues/31914 https://github.com/dotnet/runtime/issues/41332
API Proposal
The initial attempt at defining the API made the control byte and flag selection into enums, which might have been too much of an abstraction. This proposal matches up more closely with the C intrinsics.
Names of the flag-returning variants are based on those used for `ptest`.

API Usage
Alternative Designs
The control byte for these instructions is quite complicated. A flags enum might be helpful but probably doesn't clear much up. The docs are a necessity in either case.
The flag-returning methods could have more descriptive names. I've used the raw flag names, as is done with `ptest` (e.g. `Sse41.TestZ`), but they could be English-ized.

Should we have overloads for sbyte/short/ushort? The element size is included in the control byte, but at least that part of the control could be inferred from the vector type.
Risks
None, other than the temptation to use these instructions when there is a cheaper non-string alternative.