Open saucecontrol opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.
| Author | saucecontrol |
|---|---|
| Assignees | - |
| Labels | `api-suggestion`, `area-System.Runtime.Intrinsics`, `untriaged` |
| Milestone | - |
Missing these instructions made implementing Roaring Bitmaps in .NET more difficult and less performant.
Any progress on that? That would allow me to implement a far more efficient method for searching multiple options at once.
This is still pending someone coming up with good signatures. It remains very low priority as it is rarely the most efficient way on modern CPUs.
These instructions were very explicitly not brought forward (by the hardware vendors) to support modern CPU features such as YMM, ZMM, or EVEX; that includes being excluded from the new converged AVX10.1 ISA, where they remain capable of only 128-bit operation in the legacy and VEX encodings. These instructions have a number of known pessimizations that make them difficult to work with overall, and they may have very poor performance on a wide range of available CPUs. Even on Intel CPUs, their performance has typically gotten worse over time, not better: purely register-based operations started at around 9-14 cycles and have since risen to 16-20 cycles on the latest E-cores and 12-16 cycles on the latest P-cores (often requiring 9-12 micro-ops and occupying a broad range of ports, which can hinder more general throughput).
As such, newer algorithms have tended to shy away from these instructions and have instead started taking explicit advantage of alternatives to emulate or entirely replace the functionality, particularly on EVEX-capable hardware. This has become even more true as adoption of Arm64, WASM, and other SIMD-capable platforms has increased, as these platforms do not have direct equivalents either and are often better suited to algorithms oriented around more regular SIMD processing techniques.
I would suggest that if someone is still wanting such functionality, they should try to provide some concrete suggestions to improve the naming and ensure all the desired functionality is available/functioning.
Ideally that would also come with some samples of how these instructions in particular would be used and how they are not replaceable with alternative sequences on 256-bit or 512-bit capable hardware.
Can you give some examples of such alternatives? In particular, I'm looking at doing a parallel search of all bytes in one vector against all bytes in another vector (EQUALS_ANY).
It really depends on what exactly you're searching for.
If it's a general search, then something like `vpconflict` (`Avx512CD.DetectConflicts`) is often usable, but it is similarly xarch-unique functionality. Otherwise, something like `MemoryExtensions.IndexOfAny(this ReadOnlySpan<T> span, SearchValues<T> values)` can be a good approach, as it factors in the handling of fixed sets of search values and may fall back to a probabilistic map otherwise (still vector accelerated). If the values are constant and known, you may benefit from doing some form of saturation or other comparison to readily identify the potential scenarios and save perf. There are also opportunities for table-based lookups that drive off these other optimizations, bloom filters, and other things that can lead to significant perf increases.
Hello, we need some of those particular instructions as well. Just to let you know, there are more people that need them. We will try using C++ imports or finding other workarounds, but it would be nice if these were made available eventually...
@la83lynx As with the above discussion, could you provide concrete examples of what you're doing and why they're a strict requirement?
These still remain low priority for the reasons above (the instructions are effectively deprecated, often have better alternative sequences available, and continue to get slower over time).
Hello. I never said they were a strict requirement. We are not writing code from scratch; we are actually porting code from a high-perf C library to .NET, which basically handles the set-difference operation. We are also considering using the C library wrapper; however, we were able to successfully port some other parts directly before, which produced some organizational codebase benefits. If you wish to see the code lines in question, here they are:
```c
const __m128i x_from_Y = _mm_cmpistrm(vecY, vecX, _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
mask_X_items_from_Y = _mm_or_si128(mask_X_items_from_Y, x_from_Y);
const int contained_in_difference = _mm_extract_epi32(mask_X_items_from_Y, 0) ^ 255;
```
the code then uses a lookup table and some more tricks to pick necessary items.
That is a case where the optimal approach depends in part on what `vecX` or `vecY` is. If you're doing a full 8x8 intersection, then the naive equivalent of `_mm_cmpistrm(vecY, vecX, _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK)` is:
```csharp
Vector128<ushort> xFromY = Vector128<ushort>.Zero;

// Broadcast each element of vecY and compare it against all of vecX at once.
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(0)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(1)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(2)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(3)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(4)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(5)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(6)));
xFromY |= Vector128.Equals(vecX, Vector128.Create(vecY.GetElement(7)));

// Collapse the per-lane results into one bit per lane, matching _SIDD_BIT_MASK.
xFromY = Vector128.CreateScalar((ushort)xFromY.ExtractMostSignificantBits());
```
Even though this looks like a lot of code, each of those compare-equal operations is a 1-cycle instruction with 0.5-cycle reciprocal throughput, so it can take as few as 4 cycles to process them all. The `vpshufb` that'd be required to broadcast is similarly timed, so that's another 4-8 cycles. `pcmpistrm` is often 8-11 cycles with a 3-cycle latency (sometimes slower); plus it's often microcoded (not proper uops), so you're already hurting throughput in other ways by using it.
In the absolute worst-case scenario, you might end up a bit slower. But in practice it's often faster for other reasons, which is particularly true if you can expand the algorithm up to AVX2 (operating on it like 2x128, not necessarily 1x256) and definitely if you can expand to AVX-512 (where `vpconflict` and `vp2intersect` can handle this more nicely). If you know either `vecX` or `vecY` is constant, then you can often simplify this further and definitively get faster code 100% of the time; it really just depends on what you're searching for (based on the small snippet you gave, the last `Vector128.CreateScalar(xFromY.ExtractMostSignificantBits())` is likely unneeded, for example).
Beyond that, there are also alternatives depending on what the actual overall work you're doing is. Things like using the new `SearchValues<T>` functionality with `IndexOfAny` often handle the complexities internally and will beat the performance here as well.
-- Part of the reason this often ends up faster is simply that the `pcmp*str*` instructions are often implemented in microcode and often expand to roughly the exact sequence given above. So explicitly coding it as above gives the CPU, the compiler, and the user more opportunities to optimize things. There are cases where it's not actually faster, but they tend to be rare or not overall impactful to the actual hot path, especially if you're able to fully take advantage of modern alternatives and larger vector widths.
Background and motivation
The SSE4.2 string comparison instructions were omitted from the System.Runtime.Intrinsics surface area when the rest of SSE4.2 was implemented, mostly due to an inability to agree on an acceptable API.
This omission hasn't caused too much pain, because most people looking to use these instructions are implementing something along the lines of `IndexOf(char/byte)` or `Contains(char/byte)`, which can be accomplished more efficiently with `CompareEqual` and `MoveMask`/`TestZ`.

There are, however, places where the string instructions are useful. For example, `pcmpestri` can be faster than the above workaround when implementing something like `IndexOfAny` with a longer search list (4+ bytes/chars), especially over a shorter search length.

Further, the string comparison instructions can be used for general-purpose vector intersection, doing up to 64 16-bit or 256 8-bit comparisons at once. One such use is https://github.com/RoaringBitmap/CRoaring, which was brought up in https://github.com/dotnet/runtime/discussions/63332
Related issues: https://github.com/dotnet/runtime/issues/957 https://github.com/dotnet/runtime/issues/31914 https://github.com/dotnet/runtime/issues/41332
API Proposal
The initial attempt at defining the API made the control byte and flag selection into enums, which might have been too much of an abstraction. This proposal matches up more closely with the C intrinsics.
Names of the flag-returning variants are based on those used for `ptest`.

API Usage
Alternative Designs
The control byte for these instructions is quite complicated. A flags enum might be helpful but probably doesn't clear much up. The docs are a necessity in either case.
The flag-returning methods could have more descriptive names. I've used the raw flag names, as is done with `ptest` (e.g. `Sse41.TestZ`), but they could be English-ized.

Should we have overloads for sbyte/short/ushort? The element size is included in the control byte, but at least that part of the control could be inferred from the vector type.
Risks
None, other than the temptation to use these instructions when there is a cheaper non-string alternative.