GrabYourPitchforks closed this issue 2 years ago.
Do we need signed overloads?
Probably. The instruction does not specify if it operates on signed or unsigned data and our convention has been to expose both in this scenario.
Do we also need unsafe overloads?
Also probably. Exposing overloads that take `T*` rather than `ref T` has been the convention so far. Additionally, having `ref T` overloads for these would probably be worthwhile, given the use-case.
I think, based on the naming we've been giving the other intrinsics, these should be `Load` and `Store` rather than `Read` and `Write`.
I'm also not sure `BigEndian` is the right term to use here. The instruction just swaps the bytes on load/store and thus works for both big- and little-endian data:
> The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory; thus providing support for converting little-endian values to big-endian format and vice versa.
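For illustration, here is a C sketch of what a `movbe` load is semantically equivalent to on a little-endian host: a plain load followed by a byte swap. The `__builtin_bswap32` builtin and the `loadbe_u32` helper name are assumptions for this sketch (GCC/Clang builtins, not part of the proposal); on `movbe`-capable hardware a compiler can fuse this pair into a single instruction.

```c
#include <stdint.h>
#include <string.h>

/* Semantics of a MOVBE load: read 4 bytes from memory, then reverse them.
   On a little-endian host this yields the big-endian interpretation of
   the bytes; on movbe-capable hardware this can be one instruction. */
static uint32_t loadbe_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);     /* plain (unaligned-safe) load */
    return __builtin_bswap32(v); /* byte swap, i.e. bswap */
}
```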
Also CC: @fiigii, @CarolEidt
Can we just light up the existing BinaryPrimitives methods using the intrinsics? Who would ever want to use these intrinsics directly instead of the methods on BinaryPrimitives?
@jkotas I imagine most people would continue to go through `BinaryPrimitives`, since it would either become an intrinsic recognized by the runtime (like we did with `ReverseEndianness`), or the implementation would itself query the proposed `Movbe.IsSupported` API and dispatch accordingly.
My understanding is that there's a desire to expose most of the non-base instruction sets as we have already done for SSE, POPCNT, BMI1/2, etc. This could be useful for people writing very high-performance code. If I'm reading Agner's instruction tables correctly, on Skylake and later architectures a 64-bit MOVBE load instruction can be dispatched to a wider range of execution ports at a lower latency than a 64-bit MOV load followed by a 64-bit BSWAP. A developer who has custom unrolled a hot loop may wish to use a different loop unrolling technique based on the availability of this instruction set. (Somebody more familiar with CPU architecture will have to check me on this.)
The hardware intrinsics made complete sense for SIMD instructions. I think they have diminishing value for regular bit operations. Looking at patterns that are developing around Bmi1 - I am not sure whether it is a good idea to have that one as an intrinsics either.
I understand that you can always find somebody who wants to have a super low-level control. .NET is not the right environment for these needs and it does not make sense to try to turn it into it.
@jkotas, I'm not sure what the concern here is.
Whether it is making the relevant `BinaryPrimitives` methods intrinsic themselves, with a software fallback, or exposing new explicit HWIntrinsics which `BinaryPrimitives` would then call, the code that we need to add to the VM/JIT is roughly the same.
The only real drawback, that I can see, of doing the latter (exposing new HWIntrinsics) is that we have a larger public API surface area that is exposed. However, I can see multiple benefits; including (but not limited to):
- The fact that `BinaryPrimitives` will emit the more optimized intrinsic code is documented in the source code, with an easy way to determine under exactly what conditions the intrinsics will be used
- Without that, you have to first find the method (marked with the `Intrinsic` attribute), then find the corresponding code in the VM/JIT that handles this type of intrinsic, and finally investigate all the places that the intrinsic is referenced in the JIT to determine exactly what support is there (as it varies from intrinsic to intrinsic). You still have roughly the same steps for HWIntrinsics if you want to dig in, but they can generally be summed up as "I emit this instruction".

My concern is that the new intrinsic APIs like this have diminishing value. If I put the new public type and methods on one side of the scale and all the benefits you have mentioned on the other side of the scale, I think it is not worth adding the new intrinsic APIs like this.
If we take this approach to the extreme, we would have a hardware intrinsic for the `add` instruction, because somebody may want to emit the exact `add` instruction in some situations instead of the `lea` that the JIT likes to optimize the `add` into in some cases.
> My concern is that the new intrinsic APIs like this have diminishing value.
I do agree that some of these intrinsics, including `movbe`, have diminishing value. Unlike the vector instructions (which are often combined in a near limitless number of ways to create interesting algorithms), many of the bit manipulation instructions have a single stable implementation and only operate on primitive types (however, often having explicit support for types smaller than `int32`). This means that they are much more suitable to be exposed indirectly as intrinsics and users will likely end up with the same codegen in the end.
> If we take this approach to extreme, we would have a hardware intrinsic for `add` instruction - because of somebody may want to emit the exact `add` instruction in some situations instead of `lea` that the JIT likes to optimize the add into in some cases.
I would think the primary difference here is that `add` is considered a baseline instruction and has both a direct IL equivalent (`add`) and a direct higher-level equivalent (`+`); in both these cases, this applies to just primitive types. The intrinsics exposed thus far (and the ones being proposed here) do not have direct IL or higher-level equivalents. They require either pattern-based recognition (which tends to be iffy at best) or an explicit method that is special-cased by the runtime.
Looking at native code and the types of intrinsics they expose, it doesn't include things like `add`. It does, however, include things like `movbe` (`_storebe_u16`, `_loadbe_u16`, etc), `adc` (`addcarry_u16`, etc), `mul` (generally just for 128-bit or upper xx-bit multiply: `_mul128`, `_mulh`, etc), `rol`/`ror` (`_rotl8`, `_rotr32`, etc), and more. All of these fit into the category of having explicit CPU support but requiring either explicit pattern- or method-based recognition to support.
Things like `DivRem` do not generally have intrinsics because the support for recognizing that pattern is almost always trivial.

So my thoughts are basically that, in the cases where we have a need to do pattern-based recognition, it is probably better to just have an explicitly handled method (at the very least, it makes the resulting code more readable, and we tend to wrap such things in helper methods ourselves anyway); and since we will then have explicitly handled methods, it would be beneficial to have them centralized, exposed, and handled in a consistent manner. The `System.Runtime.Intrinsics` namespace is a very good place for that to happen and provides all the benefits listed above at the cost of some minimal additional API surface area.
I do also think we want to be careful about just exposing intrinsics for anything, and that we should ensure that new intrinsics are properly prototyped and tested to show that they are worthwhile supporting and exposing. I think, on top of what we have already exposed, there are probably a handful of instructions that are both commonly supported and frequently used enough to be worthwhile (e.g. AddWithCarry, BorrowWithSubtract, 128-bit Multiply/Divide, RotateLeft/Right). However, the good news is that the existing `HWIntrinsic` infrastructure makes it fairly easy to prototype these and get perf numbers to determine whether or not a given proposal is actually worthwhile.
Is a public hardware specific intrinsic api necessary though? Rather than a platform neutral api method?
To implement it efficiently it may require an internal intrinsic + software fallback; but that's different.
To me the question is; would you write different code if you had access to the intrinsic vs having an api that will software fallback?
Vector-type stuff, yes, as the available intrinsics determine the algorithm; Popcnt and TrailingZeroCount meet this bar more loosely, as they impact loop unrolling choices https://github.com/aspnet/AspNetCore/pull/5715#issuecomment-448399157 if used in sequence, due to instruction latency.
I'll take a `cmov` intrinsic if we are doing them 😉 (though maybe we'll get pattern matching for it https://github.com/dotnet/coreclr/pull/16156)
> Is a public hardware specific intrinsic api necessary though? Rather than a platform neutral api method? To implement it efficiently it may require an internal intrinsic + software fallback; but that's different.
Fair, having the same intrinsic but having it be `internal` provides all the benefits I listed above without the drawback of the larger public surface area. The argument for it then becomes "why not, given it's so trivial" (but that isn't a very good argument 😄).
> To me the question is; would you write different code if you had access to the intrinsic vs having an api that will software fallback? Popcnt and TrailingZeroCount meet this bar as more loosely as they impact loop unrolling choices aspnet/AspNetCore#5715 (comment) if used in sequence due to instruction latency.
I think almost all the `HWIntrinsics` would actually hit this category. In the software fallback case you either have a function call or larger inlined code. In the intrinsic case, you generally end up with a single (fairly low latency) instruction that can impact pipelining, loop unrolling, etc. In the case of `movbe`, you can also combine a load/store with the byte swapping, which saves you some additional size as well (allowing tighter loops and the like).
> I think almost all the HWIntrinsics would actually hit this category. ...
That's more why you'd want intrinsic to be used in preference to the software fallback; which I'd agree with 😄
The PopCnt example https://github.com/aspnet/AspNetCore/pull/5715#issuecomment-448399157 is because the intrinsic is a higher latency instruction (3 cycles); but with a high throughput (1 cycle); so to get maximum throughput and hide the latency you'd want to change your loop to unroll it 3 times; whereas unconditionally unrolling (including the software fallback) would cause unnecessary code bloat.
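That unrolling idea can be sketched in C, with `__builtin_popcountll` (a GCC/Clang builtin) standing in for the POPCNT intrinsic; the function name and 3-way split are just illustrative assumptions. Three independent accumulators let consecutive popcnts overlap in the pipeline despite the 3-cycle latency.

```c
#include <stddef.h>
#include <stdint.h>

/* 3-way unrolled popcount over an array: three independent accumulators
   hide popcnt's ~3-cycle latency while sustaining 1-per-cycle throughput. */
static uint64_t popcount_array(const uint64_t *p, size_t n)
{
    uint64_t a = 0, b = 0, c = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        a += (uint64_t)__builtin_popcountll(p[i]);     /* independent chains */
        b += (uint64_t)__builtin_popcountll(p[i + 1]);
        c += (uint64_t)__builtin_popcountll(p[i + 2]);
    }
    for (; i < n; i++)  /* remainder */
        a += (uint64_t)__builtin_popcountll(p[i]);
    return a + b + c;
}
```

A software fallback, by contrast, would gain nothing from this unrolling, which is why unconditionally unrolling would just bloat the fallback path.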
`MOVBE` doesn't look to have any extra latency; so it's unlikely you'd change your loop to try and hide it?
> MOVBE doesn't look to have any latency; so its unlikely you'd change your loop to try and hide it?
Right, but it does support execution on multiple ports, which makes it a possible candidate for pipelining optimizations.
For my particular scenario, I'd be fine with an enlightened `BinaryPrimitives` implementation (though we'd want to add one that took `ref TInt` as a parameter) or a JIT that recognized when it was about to emit a `bswap` (or `rol`/`ror` for 16-bit) / `mov` pair and instead turned it into `movbe` on capable hardware. I don't need an intrinsic to accomplish my particular scenario.
Consider the below line (from the UTF-8 transcoding prototype), which occurs on a hot path:
```csharp
return BinaryPrimitives.ReverseEndianness((ushort)(temp + value + 0xC080u)); // [ 00000000 00000000 10xxxxxx 110yyyyy ]
```
This eventually makes its way into an `Unsafe.WriteUnaligned<ushort>(...)`, which means that it's essentially four instructions: `lea`, `movzx`, `ror`, `mov`. If the JIT can turn this into two instructions (`lea`, `movbe`) without further action on my part, that's goodness.
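For the 16-bit case, the byte swap really is just a rotate by 8 bits, which is why a `ror` shows up in the emitted code. A minimal C sketch (the helper name is made up for illustration):

```c
#include <stdint.h>

/* 16-bit byte swap: a rotate by 8 bits (what shows up as ror/rol). */
static uint16_t reverse_endianness_u16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}
```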
However, let me also share my experience with implementing `bswap`. When I did the original UTF-8 prototype work, I avoided `BinaryPrimitives.ReverseEndianness` as much as I could since it was implemented in software rather than as an intrinsic. This meant that I ended up performing contorted bit shifting manually within the transcoding logic to try to squeeze as many operations as possible into each cycle. It had a measurable effect on performance and on the type of code I wrote. The only reason I switched the logic to using `BinaryPrimitives.ReverseEndianness` is because I know for a fact that it's now implemented as a single-instruction hardware intrinsic. If I didn't have that guarantee, I'd probably have kept the weird convoluted logic in the transcoding source.
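The kind of software fallback being described looks roughly like the following C sketch. This is only the generic shift-and-mask shape (the actual fallback is C#, and the helper name here is illustrative), which is what you get when no single-instruction byte swap is guaranteed:

```c
#include <stdint.h>

/* Portable 32-bit byte swap via shifts and masks: the shape of a
   software fallback when no hardware bswap is guaranteed. */
static uint32_t reverse_endianness_u32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}
```

An intrinsic-backed implementation collapses all of this to one `bswap` (or fuses it with the adjacent load/store via `movbe`), which is exactly the guarantee being described.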
An enlightened `BinaryPrimitives` implementation would also mean you wrote the code with a single path; whereas raw intrinsics would mean you'd need an `if`/`else` switch for each architecture's intrinsic*.

*Which may be what's needed inside the `BinaryPrimitives` implementation; but at least it's only done in one place and the messiness is hidden.
> or a JIT that recognized when it was about to emit a bswap (or rol/ror for 16-bit) / mov pair and instead turned it into movbe on capable hardware
That would make more sense than adding yet another intrinsic.
If we want to do this purely at the JIT pattern recognition level I'll support that. But my fear is that if we go this route this will turn into yet another issue that stays open for years, where a new API (even an internal one used as an implementation detail of an existing public helper) could allow us to get this up and running in much less time.
We'd make use of such an API in networking; socket addresses, HTTP/2, and HTTP/3 all use big-endian integers.
We should just improve the BinaryPrimitives implementation now. Whoever improves BinaryPrimitives can do so in whatever way makes the most sense, be it an internal intrinsic that's used or a JIT optimization.
> We'd make use of such an API in networking. socket addresses, HTTP/2, HTTP/3 all use big-endian integers.
Would we actually do anything differently than we currently do? We use / should use BinaryPrimitives, and that just gets better.
> Whoever improves BinaryPrimitives can do so in whatever way makes the most sense, be it an internal intrinsic that's used or a JIT optimization.
JIT optimization could help with users of `IPAddress.HostToNetworkOrder`, but I'd take `BinaryPrimitives` too.
To clarify, if this were implemented wholly in the JIT (no public API surface), then under this proposal the patterns the JIT would recognize would be:

- a call to `BinaryPrimitives.ReverseEndianness` followed by a memory store; or
- a memory load followed by a call to `BinaryPrimitives.ReverseEndianness`.

Helper APIs like `BinaryPrimitives.ReadInt32BigEndian` are already implemented using this pattern, so they should presumably get the benefits for free. APIs like `IPAddress.HostToNetworkOrder` are already implemented in terms of `BinaryPrimitives`, so presumably they'd get the benefits for free as well.
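The store-side half of that pattern looks like this in C; `__builtin_bswap32` is a GCC/Clang builtin and the helper name is made up for illustration. With MOVBE available, compilers can fuse the swap-then-store pair into a single `movbe` store, which is the same fusion the JIT would perform:

```c
#include <stdint.h>
#include <string.h>

/* Big-endian store: byte-swap then a plain store. With MOVBE available,
   this pair can be emitted as a single movbe store instruction. */
static void storebe_u32(void *p, uint32_t v)
{
    uint32_t swapped = __builtin_bswap32(v);
    memcpy(p, &swapped, sizeof swapped); /* unaligned-safe store */
}
```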
> But my fear is that if we go this route this will turn into yet another issue that stays open for years
Unfortunately, it's hard to see how this proposed JIT optimization will be prioritized higher than many, many others.
This might actually be something where having a small doc indicating how to add new instructions for x86/ARM/ARM64 is worthwhile.
The process here for x86 is basically:
1. Add the instruction to `instrsxarch.h`
2. Add the intrinsic to `hwintrinsiclistxarch.h`
3. Mark the method in `S.P.Corelib` as `[Intrinsic]` (in this case `BinaryPrimitives.ReverseEndianness`)
4. Recognize the method in the importer, checking `compSupports(InstructionSet_Name)` and returning a `gtNewScalarHWIntrinsicNode` if it is supported
5. Handle the node in `lowerxarch.cpp` and `lsraxarch.cpp`
6. Handle codegen in `hwintrinsiccodegenxarch.cpp`
7. Add emitter support in `emitxarch.cpp`

The last three steps are generally not necessary unless there is something new/unique about the instruction/intrinsic that isn't already covered or expressible by the flags.
The `hwintrinsic` support is generally reusable and isn't strictly tied to `System.Runtime.Intrinsics`; we do similar replacements for things like `System.Math.FusedMultiplyAdd`.
Reusing the `hwintrinsic` support is a bit simpler than adding a new node type (which we've also done in the past) and will get other optimizations (like value numbering support) as that comes online.
Looks like it's basically described in https://github.com/dotnet/runtime/blob/master/docs/design/features/hw-intrinsics.md, but if you have suggested improvements, please make them.
This issue, though, seems to have morphed from an API design to a specific JIT pattern-based optimization request.
`movbe` is an instruction that allows you to perform a big-endian read from or a big-endian write to main memory. It's approximately equivalent to a `mov` followed by a `bswap` (or vice versa) but is more efficient. It was introduced in the Intel Atom line and eventually made its way into later generations.

Proposal:
Open questions:
/cc @tannergooding
Labels: category:cq, theme:optimization, skill-level:intermediate, cost:medium