dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

JIT should emit "movbe" instead of "mov / bswap" on compatible hardware #953

Closed GrabYourPitchforks closed 2 years ago

GrabYourPitchforks commented 5 years ago

movbe is an instruction that performs a big-endian read from or a big-endian write to main memory. It is approximately equivalent to a mov followed by a bswap (or vice versa) but is more efficient. It was introduced in the Intel Atom line and eventually made its way into later generations.

Proposal:

namespace System.Runtime.Intrinsics.X86 {

    [Intrinsic]
    public abstract class Movbe {
        internal Movbe() { /* nobody can construct me */ }
        public static bool IsSupported { get; }
        public static void WriteBigEndian(ref ushort address, ushort value);
        public static void WriteBigEndian(ref uint address, uint value);
        public static ushort ReadBigEndianUInt16(ref ushort address);
        public static uint ReadBigEndianUInt32(ref uint address);

        [Intrinsic]
        public abstract class X64 {
            /* IsSupported and 64-bit overloads */
        }
    }

}

Open questions:

/cc @tannergooding

category:cq theme:optimization skill-level:intermediate cost:medium

tannergooding commented 5 years ago

Do we need signed overloads?

Probably. The instruction does not specify if it operates on signed or unsigned data and our convention has been to expose both in this scenario.

Do we also need unsafe overloads?

Also probably. Exposing overloads that take T* rather than ref T has been the convention so far. Additionally having ref T overloads for these would probably be worthwhile, given the use-case.

tannergooding commented 5 years ago

I think, based on the naming we've been giving the other intrinsics, these should be Load and Store rather than Read and Write.

I'm also not sure BigEndian is the right term to use here. The instruction just swaps the bytes on load/store and thus works for both big and little endian data:

The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory; thus providing support for converting little-endian values to big-endian format and vice versa.

tannergooding commented 5 years ago

Also CC @fiigii, @CarolEidt

jkotas commented 5 years ago

Can we just light up the existing BinaryPrimitives methods using the intrinsics? Who would ever want to use these intrinsics directly instead of the methods on BinaryPrimitives?

GrabYourPitchforks commented 5 years ago

@jkotas I imagine most people would continue to go through BinaryPrimitives, since it would either become an intrinsic recognized by the runtime (like we did with ReverseEndianness), or the implementation would itself query the proposed Movbe.IsSupported API and dispatch accordingly.

My understanding is that there's a desire to expose most of the non-base instruction sets as we have already done for SSE, POPCNT, BMI1/2, etc. This could be useful for people writing very high-performance code. If I'm reading Agner's instruction tables correctly, on Skylake and later architectures a 64-bit MOVBE load instruction can be dispatched to a wider range of execution ports at a lower latency than a 64-bit MOV load followed by a 64-bit BSWAP. A developer who has custom unrolled a hot loop may wish to use a different loop unrolling technique based on the availability of this instruction set. (Somebody more familiar with CPU architecture will have to check me on this.)

jkotas commented 5 years ago

The hardware intrinsics made complete sense for SIMD instructions. I think they have diminishing value for regular bit operations. Looking at the patterns that are developing around Bmi1, I am not sure whether it is a good idea to have that one as an intrinsic either.

I understand that you can always find somebody who wants to have super low-level control. .NET is not the right environment for those needs, and it does not make sense to try to turn it into one.

tannergooding commented 5 years ago

@jkotas, I'm not sure what the concern here is.

Whether it is making the relevant BinaryPrimitives methods intrinsic themselves, with a software fallback, or exposing new explicit HWIntrinsics which BinaryPrimitives would then call, the code that we need to add to the VM/JIT is roughly the same.

The only real drawback, that I can see, of doing the latter (exposing new HWIntrinsics) is that we have a larger public API surface area that is exposed. However, I can see multiple benefits; including (but not limited to):

jkotas commented 5 years ago

My concern is that the new intrinsic APIs like this have diminishing value. If I put the new public type and methods on one side of the scale and all the benefits you have mentioned on the other side of the scale, I think it is not worth adding the new intrinsic APIs like this.

If we take this approach to the extreme, we would have a hardware intrinsic for the add instruction - because somebody may want to emit the exact add instruction in some situations instead of the lea that the JIT likes to optimize the add into in some cases.

tannergooding commented 5 years ago

My concern is that the new intrinsic APIs like this have diminishing value.

I do agree that some of these intrinsics, including movbe, have diminishing value. Unlike the vector instructions (which are often combined in a near-limitless number of ways to create interesting algorithms), many of the bit manipulation instructions have a single, stable implementation and only operate on primitive types (however, often having explicit support for types smaller than int32). This means that they are much more suitable to be exposed indirectly as intrinsics, and users will likely end up with the same codegen in the end.

If we take this approach to extreme, we would have a hardware intrinsic for add instruction - because of somebody may want to emit the exact add instruction in some situations instead of lea that the JIT likes to optimize the add into in some cases.

I would think the primary difference here is that add is considered a baseline instruction and has both a direct IL equivalent (add) and a direct higher-level equivalent (+) -- in both cases, this applies to just primitive types. The intrinsics exposed thus far (and the ones being proposed here) do not have direct IL or higher-level equivalents. It requires either pattern-based recognition (which tends to be iffy at best) or an explicit method that is special-cased by the runtime.

Looking at native compilers and the types of intrinsics they expose, they don't include things like add. They do, however, include things like movbe (_storebe_u16, _loadbe_u16, etc.), adc (_addcarry_u16, etc.), mul (generally just for 128-bit or upper-half multiply: _mul128, _mulh, etc.), rol/ror (_rotl8, _rotr32, etc.), and more. All of these fit into the category of having explicit CPU support but requiring either explicit pattern- or method-based recognition to support.

So my thoughts are basically that, in the cases where we would otherwise need pattern-based recognition, it is probably better to just have an explicitly handled method (at the very least, it makes the resulting code more readable, and we tend to wrap such things in helper methods ourselves anyway). And since we will then have explicitly handled methods, it would be beneficial to have them centralized, exposed, and handled in a consistent manner. The System.Runtime.Intrinsics namespace is a very good place for that to happen and provides all the benefits listed above at the cost of some minimal additional API surface area.

I do also think we want to be careful about exposing intrinsics for just anything, and we should ensure that new intrinsics are properly prototyped and tested to show that they are worth supporting and exposing. I think, on top of what we have already exposed, there are probably a handful of instructions that are both commonly supported and frequently used enough to be worthwhile (e.g. AddWithCarry, SubtractWithBorrow, 128-bit Multiply/Divide, RotateLeft/Right). However, the good news is that the existing HWIntrinsic infrastructure makes it fairly easy to prototype these and get perf numbers to determine whether or not a given proposal is actually worthwhile.

benaadams commented 5 years ago

Is a public hardware specific intrinsic api necessary though? Rather than a platform neutral api method?

To implement it efficiently it may require an internal intrinsic + software fallback; but that's different.

To me the question is: would you write different code if you had access to the intrinsic, versus an API that falls back to software?

Vector-type stuff, yes, as the available intrinsics determine the algorithm; Popcnt and TrailingZeroCount meet this bar, if more loosely, as they impact loop unrolling choices (https://github.com/aspnet/AspNetCore/pull/5715#issuecomment-448399157) when used in sequence, due to instruction latency.

I'll take a cmov intrinsic if we are doing them 😉 (though maybe we'll get pattern matching for it https://github.com/dotnet/coreclr/pull/16156)

tannergooding commented 5 years ago

Is a public hardware specific intrinsic api necessary though? Rather than a platform neutral api method? To implement it efficiently it may require an internal intrinsic + software fallback; but that's different.

Fair, having the same intrinsic but having it be internal provides all the benefits I listed above without the drawback of the larger public surface area. The argument for it then becomes "why not, given it's so trivial" (but that isn't a very good argument 😄).

To me the question is; would you write different code if you had access to the intrinsic vs having an api that will software fallback? Popcnt and TrailingZeroCount meet this bar as more loosely as they impact loop unrolling choices aspnet/AspNetCore#5715 (comment) if used in sequence due to instruction latency.

I think almost all the HWIntrinsics would actually hit this category. In the software fallback case you either have a function call or larger inlined code. In the intrinsic case, you generally end up with a single (fairly low latency) instruction that can impact pipelining, loop unrolling, etc. In the case of movbe, you can also combine a load/store with the byte swapping, which saves you some additional code size as well (allowing tighter loops and the like).

benaadams commented 5 years ago

I think almost all the HWIntrinsics would actually hit this category. ...

That's more why you'd want the intrinsic to be used in preference to the software fallback; which I'd agree with 😄

The PopCnt example (https://github.com/aspnet/AspNetCore/pull/5715#issuecomment-448399157) is because the instruction has a higher latency (3 cycles) but a high throughput (1 per cycle); so to get maximum throughput and hide the latency you'd want to unroll your loop 3 times, whereas unconditionally unrolling (including in the software fallback) would cause unnecessary code bloat.

MOVBE doesn't look to have any extra latency, so it's unlikely you'd change your loop to try to hide it?

tannergooding commented 5 years ago

MOVBE doesn't look to have any latency; so its unlikely you'd change your loop to try and hide it?

Right, but it does support execution on multiple ports, which makes it a possible candidate for pipelining optimizations.

GrabYourPitchforks commented 5 years ago

For my particular scenario, I'd be fine with an enlightened BinaryPrimitives implementation (though we'd want to add one that took ref TInt as a parameter) or a JIT that recognized when it was about to emit a bswap (or rol/ror for 16-bit) / mov pair and instead turned it into movbe on capable hardware. I don't need an intrinsic to accomplish my particular scenario.

Consider the below line (from the UTF-8 transcoding prototype), which occurs on a hot path:

return BinaryPrimitives.ReverseEndianness((ushort)(temp + value + 0xC080u)); // [ 00000000 00000000 10xxxxxx 110yyyyy ]

This eventually makes its way into an Unsafe.WriteUnaligned<ushort>(...), which means that it's essentially four instructions: lea, movzx, ror, mov. If the JIT can turn this into two instructions (lea, movbe) without further action on my part, that's goodness.

However, let me also share my experience with implementing bswap. When I did the original UTF-8 prototype work, I avoided BinaryPrimitives.ReverseEndianness as much as I could since it was implemented as software rather than an intrinsic. This meant that I ended up performing contorted bit shifting manually within the transcoding logic to try to squeeze as many operations as possible into each cycle. It had a measurable effect on performance and on the type of code I wrote. The only reason I switched the logic to using BinaryPrimitives.ReverseEndianness is because I know for a fact that it's now implemented as a single-instruction hardware intrinsic. If I didn't have that guarantee, I'd probably keep the weird convoluted logic in the transcoding source.

benaadams commented 5 years ago

An enlightened BinaryPrimitives implementation would also mean you write the code with a single path; whereas raw intrinsics would mean you'd need an if/else branch for each architecture's intrinsic*.

*Which may be what's needed inside the BinaryPrimitives implementation; but at least it's only done in one place and the messiness is hidden.

mikedn commented 5 years ago

or a JIT that recognized when it was about to emit a bswap (or rol/ror for 16-bit) / mov pair and instead turned it into movbe on capable hardware

That would make more sense than adding yet another intrinsic.

GrabYourPitchforks commented 5 years ago

If we want to do this purely at the JIT pattern-recognition level, I'll support that. But my fear is that if we go this route, this will turn into yet another issue that stays open for years, whereas a new API (even an internal one used as an implementation detail of an existing public helper) could allow us to get this up and running in much less time.

scalablecory commented 4 years ago

We'd make use of such an API in networking: socket addresses, HTTP/2, and HTTP/3 all use big-endian integers.

stephentoub commented 4 years ago

We should just improve the BinaryPrimitives implementation now. Whoever improves BinaryPrimitives can do so in whatever way makes the most sense, be it an internal intrinsic that's used or a JIT optimization.

We'd make use of such an API in networking. socket addresses, HTTP/2, HTTP/3 all use big-endian integers.

Would we actually do anything differently than we currently do? We use / should use BinaryPrimitives, and that just gets better.

scalablecory commented 4 years ago

Whoever improves BinaryPrimitives can do so in whatever way makes the most sense, be it an internal intrinsic that's used or a JIT optimization.

JIT optimization could help with users of IPAddress.HostToNetworkOrder, but I'd take BinaryPrimitives too.

GrabYourPitchforks commented 4 years ago

To clarify, if this were implemented wholly in the JIT (no public API surface), then under this proposal the patterns the JIT would recognize would be:

Helper APIs like BinaryPrimitives.ReadInt32BigEndian are already implemented using this pattern, so they should presumably get the benefits for free. APIs like IPAddress.HostToNetworkOrder are already implemented in terms of BinaryPrimitives, so presumably they'd get the benefits for free as well.

BruceForstall commented 4 years ago

But my fear is that if we go this route this will turn into yet another issue that stays open for years

Unfortunately, it's hard to see how this proposed JIT optimization will be prioritized higher than many, many others.

tannergooding commented 4 years ago

This might actually be something where having a small doc indicating how to add new instructions for x86/ARM/ARM64 is worthwhile.

The process here for x86 is basically:

  1. Add the new instruction to instrsxarch.h
  2. Add a new entry to hwintrinsiclistxarch.h
  3. Mark the API in S.P.Corelib as [Intrinsic] (in this case BinaryPrimitives.ReverseEndianness)
  4. In importation, handle the named intrinsic by checking if compSupports(InstructionSet_Name) and returning a gtNewScalarHWIntrinsicNode if it is supported
  5. Do any special fixups like containment in lowerxarch.cpp and lsraxarch.cpp
  6. Do any special codegen requirements in hwintrinsiccodegenxarch.cpp
  7. Do any special emit requirements in emitxarch.cpp

The last three steps are generally not necessary unless there is something new/unique about the instruction/intrinsic that isn't already covered or expressible by the flags. The hwintrinsic support is generally reusable and isn't strictly tied to System.Runtime.Intrinsics; we do similar replacements for things like System.Math.FusedMultiplyAdd. Reusing the hwintrinsic support is a bit simpler than adding a new node type (which we've also done in the past) and will get other optimizations (like value numbering support) as that comes online.

BruceForstall commented 4 years ago

Looks like it's basically described in https://github.com/dotnet/runtime/blob/master/docs/design/features/hw-intrinsics.md, but if you have suggested improvements, please make them.

This issue, though, seems to have morphed from an API design proposal into a specific JIT pattern-based optimization request.