dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Consider exposing a HWIntrinsic that allows efficient loading, regardless of encoding. #954

Closed tannergooding closed 4 years ago

tannergooding commented 5 years ago

Rationale

On x86 hardware, for SIMD instructions, there are effectively two encoding formats: "legacy" and "vex".

There are some minor differences between these two encodings, including the number of parameters they take and whether the memory operand has aligned or unaligned semantics.

As a brief summary:

Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics). Users will often want to generate the most efficient code possible, but the JIT, in order to preserve semantics (that is, not silently drop an exception that would otherwise be thrown), changes which method it will "fold" depending on the encoding it is currently emitting. It may therefore be beneficial to expose an intrinsic which allows the user to say "This address is aligned, do whatever load is most efficient". This would ensure that the load can be folded regardless of the current encoding.

Proposed API

namespace System.Runtime.Intrinsics.X86
{
    public abstract partial class Sse
    {
        // New APIs:
        public static unsafe Vector128<float> LoadVector128Unsafe(float* address);

        // Existing APIs:
        public static unsafe Vector128<float> LoadVector128(float* address);
        public static unsafe Vector128<float> LoadAlignedVector128(float* address);
    }

    public abstract partial class Sse2
    {
        // New APIs:
        public static unsafe Vector128<double> LoadVector128Unsafe(double* address);
        public static unsafe Vector128<byte> LoadVector128Unsafe(byte* address);
        public static unsafe Vector128<sbyte> LoadVector128Unsafe(sbyte* address);
        public static unsafe Vector128<short> LoadVector128Unsafe(short* address);
        public static unsafe Vector128<ushort> LoadVector128Unsafe(ushort* address);
        public static unsafe Vector128<int> LoadVector128Unsafe(int* address);
        public static unsafe Vector128<uint> LoadVector128Unsafe(uint* address);
        public static unsafe Vector128<long> LoadVector128Unsafe(long* address);
        public static unsafe Vector128<ulong> LoadVector128Unsafe(ulong* address);

        // Existing APIs:
        public static unsafe Vector128<double> LoadVector128(double* address);
        public static unsafe Vector128<byte> LoadVector128(byte* address);
        public static unsafe Vector128<sbyte> LoadVector128(sbyte* address);
        public static unsafe Vector128<short> LoadVector128(short* address);
        public static unsafe Vector128<ushort> LoadVector128(ushort* address);
        public static unsafe Vector128<int> LoadVector128(int* address);
        public static unsafe Vector128<uint> LoadVector128(uint* address);
        public static unsafe Vector128<long> LoadVector128(long* address);
        public static unsafe Vector128<ulong> LoadVector128(ulong* address);

        public static unsafe Vector128<double> LoadAlignedVector128(double* address);
        public static unsafe Vector128<byte> LoadAlignedVector128(byte* address);
        public static unsafe Vector128<sbyte> LoadAlignedVector128(sbyte* address);
        public static unsafe Vector128<short> LoadAlignedVector128(short* address);
        public static unsafe Vector128<ushort> LoadAlignedVector128(ushort* address);
        public static unsafe Vector128<int> LoadAlignedVector128(int* address);
        public static unsafe Vector128<uint> LoadAlignedVector128(uint* address);
        public static unsafe Vector128<long> LoadAlignedVector128(long* address);
        public static unsafe Vector128<ulong> LoadAlignedVector128(ulong* address);
    }
}

API Semantics

This shows the semantics of each API under the legacy and VEX encoding.

Most instructions support having the last operand be either a register or a memory operand: ins reg, reg/[mem]

When folding does not happen, the load is an explicit, separate instruction:

For example, Sse.Add(input, Sse.Load(address)) would generate
* movups xmm1, [address]
* addps xmm0, xmm1

Folding, by contrast, allows the load to become part of the consuming instruction:

For example, Sse.Add(input, Sse.Load(address)) would generate
* addps xmm0, [address]

On legacy hardware, the folded form of an instruction asserts that the address is aligned (generally this means (address % 16) == 0). On VEX hardware, the folded form does no such validation and allows any input.

Load (mov unaligned)

            Unaligned Input         Aligned Input
  Legacy    No Folding              No Folding
  VEX       Folds                   Folds

LoadAligned (mov aligned)

            Unaligned Input         Aligned Input
  Legacy    Folds, Throws           Folds
  VEX       No Folding, Throws      No Folding

LoadUnsafe

            Unaligned Input         Aligned Input
  Legacy    Folds, Throws           Folds
  VEX       Folds, Throws in Debug  Folds

Additional Info

Some open-ended questions:

* Should we also expose StoreAlignedOrUnaligned methods?
* Should we also expose equivalents on Avx/Avx2?
* Should these instructions have a Debug.Assert((address % 16) == 0)?

tannergooding commented 5 years ago

CC. @eerhardt, @fiigii, @CarolEidt for thoughts

gfoidl commented 5 years ago

Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics).

I would rename the unaligned load to what it actually is, LoadUnaligned; then this proposed method can just be called Load, as the proposed name is quite long.

"This address is aligned, do whatever load is most efficient".

The name Load would match that.

Should we also expose StoreAlignedOrUnaligned methods?

If Load is exposed this way, the same should be done for Store: rename Store to StoreUnaligned and use the name Store for the method analogous to this proposal.

Should we also expose equivalents on Avx/Avx2?

With my proposed method names this wouldn't be symmetric with Avx/Avx2 anymore, so yes, the same methods should exist there -- even if they just forward to the unaligned methods.

eerhardt commented 5 years ago

I would rename the unaligned load to what it actually is, LoadUnaligned; then this proposed method can just be called Load,

I was thinking the same thing.

tannergooding commented 5 years ago

Updated to be LoadVector128, LoadAlignedVector128, and LoadUnalignedVector128.

fiigii commented 5 years ago

Sorry for the delay in replying. I don't quite understand this proposal. Currently, LoadAligned can be folded into the consuming instruction with both the SSE and VEX encodings, but Load (LoadUnaligned in this proposal) can only be folded with the VEX encoding. This behavior also matches C++ compilers. So, what is the difference between the proposed Load intrinsic and the existing LoadAligned?

Meanwhile, guaranteeing memory alignment should be the user's responsibility, and the dynamic assertion ("Asserts (address % 16) == 0 when optimizations are disabled") does not always work.

fiigii commented 5 years ago

Additionally, on modern Intel CPUs, unaligned loads/stores are no slower than aligned loads/stores (if the memory access hits within one cache line), so we suggest always using unaligned loads/stores. That is why the Load intrinsic maps to the unaligned load instruction.

tannergooding commented 5 years ago

I don't quite understand this proposal.

The JIT has to preserve the semantics of the underlying instruction. Currently LoadAligned will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned (presently named just Load) currently has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.

When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned. When using the VEX encoding, the reverse is true, and the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.

This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)), we can emit addps xmm0, [pRight]. However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)), we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1.

And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)), we have to emit vmovaps xmm1, [pRight]; vaddps xmm0, xmm0, xmm1. However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)), we can just emit vaddps xmm0, xmm0, [pRight].

We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).

So, what is the difference between the proposed Load intrinsic and the existing LoadAligned

We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors when working with 128-bit data).

This leads users to writing algorithms that do the following (where possible):

// Process first few elements to become aligned
// Process data as aligned
// Process trailing elements that don't fit in a single `VectorXXX<T>`

They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).

However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Games" which run on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:

public static unsafe Vector128<float> EfficientLoad(float* address)
{
    if (Avx.IsSupported)
    {
        // Must assert, since this would fail (via LoadAligned) on non-AVX hardware otherwise
        Debug.Assert(((ulong)address % 16) == 0);
        return Sse.LoadUnaligned(address);
    }
    else
    {
        return Sse.LoadAligned(address);
    }
}

This also depends on the JIT being able to successfully recognize and fold this pattern.

So, this proposal suggests providing this helper method built in: rename the existing Load method to LoadUnaligned and expose a new method Load (I had originally suggested something like LoadAlignedOrUnaligned, but it was suggested I change it to the current name) which does effectively the above. That is, on hardware with VEX encoding support it uses LoadUnaligned, and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned. This ensures that the code can always be folded efficiently and validates that the semantics remain the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned and LoadAligned methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).

fiigii commented 5 years ago

And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)), we have to emit vmovaps xmm1, [pRight]; vaddps xmm0, xmm0, xmm1. However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)), we can just emit vaddps xmm0, xmm0, [pRight].

We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).

Ah, I see the problem now. But we should fold LoadAligned with the VEX encoding, as it does not throw an exception.

They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).

This assert is useless. address could be either aligned or unaligned in different runs.

tannergooding commented 5 years ago

as it does not throw an exception.

That is the problem and why we can't fold it. LoadAligned emits movaps which always throws if the address is unaligned. If you fold it into a VEX-encoded instruction, it will no longer throw. This means that if you passed in an unaligned address, the exception would silently disappear whenever the JIT folded it.

This assert is useless. address could be either aligned or unaligned in different runs.

The point of the assert is to help catch places where the user is calling Load without pinning the data and ensuring that it is properly aligned. It will still be possible for a user to do this incorrectly, but it would be an explicit opt-in. We could potentially use another name, such as LoadUnsafe, or something that conveys that the data must be aligned but we won't always validate that it is.

CarolEidt commented 5 years ago

I like this proposal, and agree that the assert, while not guaranteeing to catch all potential dynamic data configurations, is very likely to catch nearly all cases. Given that there is a great deal of expertise required to use the hardware intrinsics effectively, I don't think it's necessary to encumber this with the Unsafe suffix. Rather, documenting its behavior and asserting in the unoptimized case should be sufficient.

fiigii commented 5 years ago

Ah, I looked at the manual again -- yes, vmovaps can still throw exceptions with the VEX encoding, but other instructions won't. Thanks.

fiigii commented 5 years ago

I don't think it's necessary to encumber this with the Unsafe suffix

Agreed -- actually, the other Load* methods are not safe either 😄

mikedn commented 5 years ago

and this is more a nicety for if people are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0)

Are there enough such people to justify adding more intrinsics?

Should we also expose equivalents on Avx/Avx2?

It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic. But that would imply adding more intrinsics. Or perhaps AVX Load should simply be renamed to LoadUnaligned, though the longer name sucks. Same for stores.

But to quote @CarolEidt "Given that there is a great deal of expertise required to use the hardware intrinsics effectively", I don't think such difference in semantics justifies adding more intrinsics.

Should these instructions have a Debug.Assert((address % 16) == 0)?

Where does this assert go exactly? Is Load a true intrinsic or a helper? Who's using debug builds of corelib, besides the CoreCLR team?

fiigii commented 5 years ago

I don't think such difference in semantics justifies adding more intrinsic.

Agreed. I suggest that we fold more of the existing load intrinsics rather than adding new intrinsics.

The above codegen strategy is already adopted by all the mainstream C/C++ compilers.

tannergooding commented 5 years ago

Are there enough such people to justify adding more intrinsics?

Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.

It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic

Yes, if this API is approved, we would (at a minimum) rename the AVX intrinsic to LoadUnaligned. I already have a note for the API review about whether we should have Store mirror these names.

Where does this assert go exactly? Is Load a true intrinsic or a helper?

A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
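To make that concrete, here is a hedged sketch (my own illustration, not committed JIT behavior; the method name is hypothetical) of what a non-optimized compilation of the proposed load would be morally equivalent to:

static unsafe Vector128<float> LoadVector128UnsafeDebugSemantics(float* address)
{
    // The alignment check the JIT would insert when optimizations are disabled; it mirrors
    // the fault movaps raises on legacy hardware (surfaced as an AccessViolationException).
    if (((ulong)address % 16) != 0)
        throw new AccessViolationException();

    // Otherwise an ordinary unaligned load, which remains foldable under the VEX encoding.
    return Sse.LoadVector128(address);
}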

and folded LoadAligned just throws exceptions with 0% possibility

I don't think this is acceptable without an explicit user opt-in. The runtime has had a very strict policy on not silently removing side-effects.

The above codegen strategy is already adopted by all the mainstream C/C++ compilers.

C/C++ compilers make a number of optimizations that we have, at this point, decided not to do, and to instead address via Analyzers or look at in more depth later. We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).

fiigii commented 5 years ago

The runtime has had a very strict policy on not silently removing side-effects

It is not removing side-effects -- it has no side-effects (hardware exceptions) at all.

tannergooding commented 5 years ago

It is not removing side-effects -- it has no side-effects (hardware exceptions) at all.

All exceptions, including HardwareExceptions are considered side-effects.

movaps raises the hardware #GP exception if its input is not properly aligned, and the JIT intercepts that and converts it into a System.AccessViolationException.
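As a small illustration of that behavior with today's API (my own sketch, assuming an unsafe context and using directives for System.Runtime.Intrinsics and System.Runtime.Intrinsics.X86):

byte* buffer = stackalloc byte[16 + 15];
float* aligned = (float*)(((ulong)buffer + 15) & ~15UL);   // round up to a 16-byte boundary

Vector128<float> ok = Sse.LoadAlignedVector128(aligned);   // movaps on an aligned address: loads fine

// Offsetting the address by 4 bytes makes movaps fault, which the runtime surfaces as
// a System.AccessViolationException:
// Vector128<float> boom = Sse.LoadAlignedVector128((float*)((byte*)aligned + 4));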

mikedn commented 5 years ago

Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.

I didn't ask about cases where you tried to do that. I asked if there are enough people who are using .NET Core on old or gimped CPUs and expect things to work perfectly rather than just reasonably.

A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.

Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned. Whether it's a good idea to do that in debug builds is debatable. Some people may end up disabling optimizations to work around a JIT bug, or to be able to get a correct stack trace for a hard-to-track-down bug. Now they risk getting exceptions that with optimizations enabled would not occur.

The runtime has had a very strict policy on not silently removing side-effects

Please do not mislead people. The runtime does not impose any semantics on intrinsics, it's the other way around. LoadAligned may be defined as "throws an exception if the address is unaligned" or as "may throw an exception if the address is unaligned". That may or may not be a good idea. But it doesn't have anything to do with any runtime policy.

We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).

Indeed. Squabbles over names resulted in string intrinsic being removed from .NET 3. Yet adding non-deterministic intrinsics to support old CPUs is doable.

fiigii commented 5 years ago

Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned.

I believe the result of this proposal is just forcing most people to change Load to LoadUnaligned in their programs.

Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.

I want to say it again: Debug.Assert((address % 16) == 0); is useless, or rather, relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two ways:

for (; IsUnaligned(addr); addr++)
{
    // processing data until the memory gets aligned
}
for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadAlignedVector128(addr);
    // or LoadUnalignedVector128(addr); whatever, we recommend always using LoadUnalignedVector128
    ......
}

Or

for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadUnalignedVector128(addr); 
    ......
}

Which approach to use is determined by profiling, and that choice is much more important than folding loads for the SSE encoding.

tannergooding commented 5 years ago

I didn't ask about cases where you tried to do that

I wasn't saying that with regard to personal attempts.

Now they risk getting exceptions that with optimizations enabled would not occur

Correct, but the API is explicitly documented to do as such and they have the option to enforce the desired behavior by explicitly using one of the other two APIs (LoadAligned or LoadUnaligned).

Without all three, users have to write their own wrappers for something that we could (and in my opinion should) reasonably handle ourselves.

Additionally, if we only expose two APIs and make one of them "sometimes validate alignment", then we can't provide a guarantee to the end user that they can enforce the semantics where required.

Please do not mislead people. The runtime does not impose any semantics on intrinsics

I don't think this is misleading at all. We have had a very clear contract and design for the platform-specific hardware intrinsics that each is tied to a very particular instruction.

LoadAligned is documented to emit movaps and LoadUnaligned (currently just called Load) will emit movups. This gives a very clear story and guarantees that, if someone uses this API, they will get exactly that instruction.

Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.

It is worth noting, however, that we have also tried to determine common patterns, use-cases, and limitations with the decision to tie a given hardware intrinsic to a particular instruction. When we have found such limitations, we have discussed what we could do about it (both the "ideal" and realistic scenarios) and determined an appropriate solution. So far, out of all the scenarios that have been brought up, we have determined to deviate only for CreateScalarUnsafe and ToVector256Unsafe (where we say that the upper bits will be non-deterministic, rather than explicitly zeroed; but for which we also expose "safe" versions that do explicitly zero the upper bits). This may be another case, that is more trivial to resolve via an additional API and that is worthwhile to do.

Squabbles over names resulted in string intrinsic being removed from .NET 3

They have not necessarily been removed from netcoreapp3.0; instead we decided that we would pull them until we can determine the appropriate way to expose these APIs, as there were concerns over the usability and understandability of said APIs. The underlying instructions behave quite differently from most other intrinsics we've exposed, and so they require some additional thought. This is no different from any other API we decide to expose, in that we don't want to ship something that has a bad surface area, and we are still working towards making sure these APIs ship in a good and usable state.

Yet adding non-deterministic intrinsics to support old CPUs is doable.

As mentioned above, both cases where we exposed an additional non-deterministic API were discussed at length.

tannergooding commented 5 years ago

I want to say it again: Debug.Assert((address % 16) == 0); is useless, or rather, relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two ways:

Once you have pinned the memory, it is guaranteed not to move and you can rely on the assert. Anyone dealing with perf-sensitive scenarios and larger blocks of memory (for example, ML.NET) should be pinning their data anyway, as the GC could otherwise move the data, and that can mess with the cache, alignment, etc.
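For illustration, a minimal sketch (my own, not from this thread; `data` is an assumed float[]) of the pinning being described:

fixed (float* pData = data)
{
    // While inside the fixed block the GC cannot move `data`, so an alignment
    // established once (after handling any leading elements) remains valid.
    float* pCurrent = pData;

    // ... process leading elements one at a time until pCurrent is 16-byte aligned ...

    Debug.Assert(((ulong)pCurrent % 16) == 0);
    // vectorized, aligned processing of the remaining elements goes here
}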

Which approach to use is determined by profiling, and that choice is much more important than folding loads.

The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" has multiple pages/sections all discussing this. The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:

All this additional API does is give the user the option to choose between:

* I can't align my data (they use LoadUnaligned)
* My data is aligned and should never be unaligned (they use LoadAligned)
* My data is aligned, but you don't have to check (they use Load)

A good number of vectorizable algorithms (especially the ones used in cases like ML.NET) can be handled with a worst case of two unaligned loads, regardless of the alignment of the input data. For such algorithms, the question of whether to align really comes down to how much data is being processed in total, and the cases where you don't want to align the data can generally be special-cased.
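As a concrete illustration of that worst case, here is a sketch of an aligned-body loop with at most two unaligned accesses (my own example, not code from this issue; it assumes a pinned or native float* with count >= Vector128<float>.Count, and uses a max reduction because reprocessing the overlapped elements is harmless there):

static unsafe float MaxOf(float* p, int count)
{
    // Unaligned load for the head (worst-case unaligned access #1).
    Vector128<float> max = Sse.LoadVector128(p);

    float* aligned = (float*)(((ulong)p + 15) & ~15UL);   // next 16-byte boundary at or after p
    float* last = p + count - Vector128<float>.Count;     // start of the final (possibly overlapping) block

    // Aligned body; these loads can be folded even under the legacy encoding.
    for (; aligned < last; aligned += Vector128<float>.Count)
    {
        max = Sse.Max(max, Sse.LoadAlignedVector128(aligned));
    }

    // Unaligned load that overlaps the tail (worst-case unaligned access #2).
    max = Sse.Max(max, Sse.LoadVector128(last));

    // Scalar horizontal reduction, kept simple for clarity.
    float result = max.GetElement(0);
    for (int i = 1; i < Vector128<float>.Count; i++)
    {
        result = MathF.Max(result, max.GetElement(i));
    }
    return result;
}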

fiigii commented 5 years ago

The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:

* Always align data when possible
* Generally prefer aligned stores over aligned loads, if you have to choose
* Consider just using two 128-bit unaligned loads if you can't align your 256-bit data
* etc

Yes, that is what I meant. And aligned loads != movap*.

* I can't align my data (they use LoadUnaligned)
* My data is aligned and should never be unaligned (they use LoadAligned)
* My data is aligned, but you don't have to check (they use Load)

All three of these situations should just use LoadUnaligned.

mikedn commented 5 years ago

Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.

For the forty-second time: the runtime has nothing to do with this. I give up. We're probably speaking different languages or something.

CarolEidt commented 5 years ago

@mikedn, I think the issue is a runtime issue when we start talking about folding the loads into the operations, in ways that might cause a runtime failure to be masked (i.e. if the load would throw on an unaligned address, but the operation into which it is folded would not). And I think what @tannergooding is after here is the ability to have a single load semantic that supports folding.

All that said, I'm still not sure where I fall on this issue.

fiigii commented 5 years ago

Importing @CarolEidt's comment from https://github.com/dotnet/coreclr/issues/21308:

I agree that we don't want to silently remove an exception. It is not just that it violates the "strict" behavior that developers expect from the CLR, but could also cause sudden unexpected failures if something changed in the user code (or the optimization settings used) that caused the JIT to no longer fold the instruction.

If users write LoadAligned, they should be responsible for handling the possible exceptions, whether they come from JIT optimization or from actual memory misalignment.

tannergooding commented 5 years ago

And aligned loads != movap*.

I would disagree here. We have two instructions that perform explicit loads (including movupd and movapd here):

* movups, which loads the data with unaligned semantics
* movaps, which loads the data with aligned semantics and raises a hardware exception if the address is not aligned

We additionally have the ability to support load semantics in most instructions, via the folded form (ins reg, reg/[mem]), whose alignment semantics depend on the encoding (aligned under the legacy encoding, unaligned under VEX).

I would say that the unaligned load could just be called a load, as it loads data with no special semantics (and it is still just a load even if the data happens to be aligned). I would say that an aligned load must also validate that the data is aligned (which only movaps does).

All three of these situations should just use LoadUnaligned.

LoadUnaligned does not successfully fulfill all three scenarios by itself. It definitively fills the first scenario, but the second scenario requires an additional validation step (and must fail if the requirements are not met) and the third scenario has an optional validation step (that would fail if the requirements are not met and optimizations are disabled; otherwise, when optimizations are enabled, it would skip the validation check).

mikedn commented 5 years ago

I think the issue is a runtime issue when we start talking about folding the loads into the operations

@CarolEidt It most definitely isn't. As far as the runtime/ECMA spec is concerned, intrinsics are just method calls. There's nothing anywhere in the ECMA spec that says that LoadAligned must always throw an exception if the data is not aligned. That's solely in the intrinsic's spec courtyard.

Also, I don't claim that we should change the current LoadAligned semantic. I just want to see reasonable argumentation for adding a bunch of new intrinsics. And I'm not seeing it.

fiigii commented 5 years ago

And aligned loads != movap*.

I would disagree here.

LoadUnaligned does not successfully fulfill all three scenarios by itself.

movups and movaps have the same performance, and movups is always safe. Why not use it?

The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.

tannergooding commented 5 years ago

movups and movaps have the same performance, and movups is always safe. Why not use it?

Because we can't always safely fold a movups. Some examples: being on older/legacy hardware, working around a JIT bug (so you have disabled VEX-encoding support), using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...

The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.

Yes, and the purpose of the proposed intrinsic is to:

  1. Provide some debug-mode validation that you are not causing cache-line splits or page faults
  2. Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding

It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults; and they will not make any attempt to align the input data. As such, you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?

fiigii commented 5 years ago

using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...

It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults;

No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned). Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.

  1. Provide some debug-mode validation that you are not causing cache-line splits or page faults
  2. Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding

If users do not guarantee the alignment, the compiler-generated validation (1) is not reliable, which makes (2) unacceptable on older CPUs. If users already guarantee the alignment, the compiler-generated validation (1) is useless; they can then either i) use LoadUnaligned normally, or ii) use LoadAligned if code size on older CPUs is critical (needs https://github.com/dotnet/coreclr/issues/21308).

you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?

Right.

tannergooding commented 5 years ago

the compiler-generated validation (1) is not reliable

Given the low-level nature of the code, that the intrinsic is explicitly documented to do the validation, and that the GC currently only guarantees 8-byte alignment; a JIT inserted check for debug-builds should catch essentially every case (especially across multiple runs).

(needs dotnet/coreclr#21308)

I think that is a non-starter; not only would it silently remove the exception, but it also removes the ability for users to guarantee the generation of a movaps (which will always validate the alignment). The current proposal allows users to continue having explicit aligned or unaligned semantics, if desired, but also gives them the ability to do the efficient thing.

No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned). Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.

I believe that for "most" workloads, devs will be concerned with cache-line splits and page faults, so they will attempt to align their data if the input is unaligned (these intrinsics are geared towards high-perf and hotspot scenarios, after all). They will additionally want to validate that they will not accidentally cause a cache-line split in production (their alignment logic is correct, they pinned the data, etc.), and they will want to generate efficient code.

Therefore, it is my belief that they would write something like (this is using today's APIs and ignoring this proposal):

int misalignment = (int)(pCurrent) % 16;

// note: may need extra handling for unnaturally aligned data, e.g. a `float*`
//       for which `(pCurrent % 4) == 0` does not hold.

if (misalignment != 0)
{
    // Convert the byte misalignment into the number of leading elements to process.
    // Vector128<T>.Count is a power of two and a constant, so this should optimize to a right shift.
    misalignment /= Vector128<T>.Count;
    misalignment = Vector128<T>.Count - misalignment;

    for (int i = 0; i < misalignment; i++)
    {
       // Process the individual elements, non-vectorized
    }

    count -= misalignment;
}

int remainder;

if (count >= Vector128<T>.Count)
{
    remainder = count % Vector128<T>.Count;
    count -= remainder;

    for (int i = 0; i < count; i += Vector128<T>.Count)
    {
        // Assert we are still aligned
        Debug.Assert(((int)(pCurrent) % 16) == 0);

        // Process elements using vectorization and the appropriate Load(pCurrent)
    }
}
else
{
    remainder = count;
}

for (int i = 0; i < remainder; i++)
{
    // Process the individual elements, non-vectorized
}

This proposal does a couple of things:

The code the user writes remains the same, and if they want explicit Unaligned semantics or explicit Aligned semantics, they still have the option to use one of the other two overloads and get the correct/expected behavior for those.

fiigii commented 5 years ago

We just repeat our words again and again. I would like to stop the discussion and leave the proposal to the API review.

My original proposal was to call this something like LoadAlignedOrUnaligned to explicitly call attention to the semantics, but it was suggested I change that to the current name.

At least, do not name the new APIs to Load.

gfoidl commented 5 years ago

At least, do not name the new APIs to Load.

Right now I also think this shouldn't be named Load, as it hides something. Maybe (just thinking out loud): LoadSafe (as opposed to Unsafe), LoadAutomatic, LoadFast, ...

On recent x86 hardware there is no need for this method -- I think we have consensus here, because always using unaligned reads has no penalty and is recommended by Intel, too.

But as @tannergooding points out, there is hardware where unaligned reads do have a penalty, so aligned reads should be used, and the same code should run optimally there without any changes -- depending also on whether the JIT emits VEX or not. This case could be handled with the new method, therefore I think it is reasonable to add it.

But the name should be simple and obvious, so as not to overwhelm users and leave them with open questions about which Load method to use.

saucecontrol commented 5 years ago

If these methods have to exist, I agree Load is probably a bad name for them. On the other hand, it's better than the self-documenting LoadAlignedOnOldCpusOrUnalignedOnNewCpusOrBetterYetJustFoldTheLoadIfPossible.

I think @mikedn is right here, though. This is a new API, and if the documentation says LoadAligned may throw an exception on unaligned data, that's good enough. Is there really ever a scenario where you need to guarantee a movaps will be emitted for the intrinsic method? I like the idea of a 1:1 mapping of intrinsics to instructions, but since we can't access registers directly or specify the reg/mem targets of the instructions directly, there's a limit to how far that philosophy can go.

For high-performance scenarios, people will be ensuring the data is aligned anyway, and if that's the only way to get the load folded, that just further encourages the best practice.

saucecontrol commented 5 years ago

Thinking about this more, I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:

Vector128<float> v;
if (Avx.IsSupported) {
    v = Sse.LoadAligned(...); // Data isn't really aligned, but this allows folding the load
} else {
    v = Sse.LoadUnaligned(...);
}
...

Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method. What if we had Sse.LoadUnaligned, Sse.LoadAligned, and Avx.LoadVector128?

Sse.LoadAligned could be folded in both encoding forms (again with the caveat that the docs don't guarantee that unaligned addresses would throw on all CPUs), and Avx.LoadVector128 could always be folded because it would only be available when VEX encoding is supported.

fiigii commented 5 years ago

I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:

No, always use LoadUnaligned. But if you worry about the SSE-encoding codegen, use Sse.LoadAligned. Selecting a different load based on if (Avx.IsSupported) is unnecessary, because you have to align your data for LoadAligned anyway (like https://github.com/dotnet/corefx/issues/33566#issuecomment-443362604), so the Sse.LoadUnaligned(...) branch is meaningless and just complicates your code.

saucecontrol commented 5 years ago

Right, that was in reference to my previous suggestion that LoadAligned be required for folding in the API even though it's not required in all encodings. But people would end up calling the aligned method to get the folding even with unaligned data when they know it's supported by the CPU. I was just shooting down my first suggestion and then offering another.

fiigii commented 5 years ago

Oh, I see, thanks for the explanation.

tannergooding commented 5 years ago

people writing code like this:

The code is backwards. When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address. If the user opts-in, we could additionally fold some method that sometimes checks alignment (whether that is via a 3rd, separate API; or via some other mechanism).

When not using the VEX encoding, we can only fold LoadAligned because the instructions themselves will assert alignment and will fail if the address is not aligned (there is no way to workaround this).

Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method

That is what this proposal is trying to do. It is trying to keep the two existing LoadUnaligned (currently just called Load) and LoadAligned methods, so that users who need to can explicitly use the required behavior.

It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
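For illustration, a hedged sketch of what that opt-in would look like in user code (LoadVector128Unsafe is the proposed, not-yet-existing API from the top of this issue; pCurrent and acc are assumed to come from surrounding pinning/alignment logic):

// pCurrent is assumed to have been pinned and brought to a 16-byte boundary by earlier code.
// The proposed API would assert alignment when optimizations are disabled, and otherwise let
// the JIT fold the load into the consumer under either encoding.
Vector128<float> v = Sse.LoadVector128Unsafe(pCurrent);   // proposed API, not available today
acc = Sse.Add(acc, v);                                    // foldable: addps xmm, [mem] or vaddps xmm, xmm, [mem]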

tannergooding commented 5 years ago

I think doing anything without an explicit opt-in is a non-starter and the general rule has been that we don't expose a non-deterministic API without also exposing a deterministic one; so I don't think we can or should just relax LoadAligned.

I am not convinced that just calling the new API Load is right, as I don't think it conveys all the necessary semantics (which is why I originally had it named differently).

I am convinced that exposing an additional API for this behavior would be beneficial. We already know of cases where users will be running certain types of ML.NET code on low-power/IoT devices (such as the Raspberry Pi). Not everyone is buying the latest CPUs, and sometimes when they do they go for budget (which may mean they purchase a modern CPU without AVX support; there are several of those released each generation).

gfoidl commented 5 years ago

such as the Raspberry Pi

is running on ARM, so it's out of the game here.

tannergooding commented 5 years ago

is running on ARM, so it's out of the game here.

Right, but there are also modern low-power/budget x86 computers without AVX support (from both AMD and Intel). I am unsure if we have any specific numbers on usage, but given that there are other low-power/IoT devices targeting ML.NET, I feel they shouldn't be excluded.

gfoidl commented 5 years ago

That's my opinion too.

saucecontrol commented 5 years ago

The code is backwards.

My sample refers to my proposed implementation, where LoadAligned doesn't guarantee an exception on unaligned data and is the only load type that is considered for folding by the JIT, because it would work in both encodings. My point was that if the developer knows that method is the only one that can be folded, and knows that on VEX processors the data doesn't actually have to be aligned, they'd end up calling LoadAligned specifically on unaligned data, but only on VEX processors -- which is even more confusing than your proposal. I talked myself out of that.

When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address

I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data. Intel guarantees the fault in the movaps instruction, but then they recommend against using that instruction on newer processors. I'm saying that if you never guarantee that movaps will actually be emitted, then you're not breaking anything by not throwing.

It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").

This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned. An Avx.LoadVector128 could be safely folded in all cases, whereas the too-clever Load would work on VEX processors and crash on legacy with unaligned data. The proposed assert in minopts mitigates that possibility but then creates an unnecessary restriction on the VEX processors.

Not everyone is buying the latest CPUs

I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths. Not being able to completely optimize the AVX path because of a restriction on the older/lower-end processors is bad too.

tannergooding commented 5 years ago

I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths.

Libraries will likely have both 128-bit and 256-bit optimized code paths. For the 128-bit code path, they may consider both the VEX-encoding and the legacy encoding scenarios.

I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data.

An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.

This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned.

Correct. And if your data is not going to be aligned or if it is not beneficial for you to handle leading/trailing elements to make the rest of the handling aligned; you would call the explicit LoadUnaligned method.

However, as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned" (so that they don't have the cache-line penalty every 4 reads/writes or a page fault every X reads/writes). They just won't use LoadAligned because the JIT won't emit the efficient thing on modern CPUs. They will, however, likely want to have validation in Debug mode that their input is actually aligned.

but then they recommend against using that instruction on newer processors

The optimization manual makes no recommendations about using or not using movaps; rather they recommend that you align your data when possible and have a number of recommendations on what to do if you can't guarantee data alignment (such as preferring aligned stores over aligned reads). They also recommend using the more efficient encoding (rather than an explicit load) when possible.

The only reason not to use LoadAligned is that it will not be folded (emit the more efficient encoding) on VEX-enabled hardware. Yes, this is currently an API limitation, but no, I don't think it will be relaxed. The reason to use LoadUnaligned is that it will be folded (emit the more efficient encoding) on VEX-enabled hardware. The downside is that, when you are doing performance optimizations and your data is aligned, you get no validation of that fact and you must add explicit Debug.Asserts to your own code.

This proposal came from the belief that needing to use LoadUnaligned and additionally assert the alignment is a common enough case. It has the additional benefit that it will be folded under either the legacy or VEX encoding, while still providing explicit APIs for the Aligned vs Unaligned case.

saucecontrol commented 5 years ago

An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.

I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.

as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned"

This was my initial thought as well, but then I thought back to the challenge of re-implementing System.Numerics using S.R.I. On a granular method like operator+, you can't make any kind of assumption about the alignment, but the folding would still be beneficial. I don't know if some of the plans around first class structs and containment will address that through some other trickery, but limiting reg/mem encoding to only aligned memory scenarios is bound to cause problems elsewhere.

They also recommend using the more efficient encoding (rather than an explicit load) when possible.

This is the part that's most important to me. I can figure out whether I've aligned my pointers, but I want to get the most efficient encoding in every possible case.

tannergooding commented 5 years ago

I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.

Right. I expect the API review team will lean towards another name (like LoadUnsafe, due to its non-deterministic nature).

you can't make any kind of assumption about the alignment, but the folding would still be beneficial.

Right, and in this case (where you are doing one-off operations) you would have to use LoadUnaligned. It will do the efficient encoding on modern hardware, but will involve an additional instruction for older hardware. -- However, you also won't likely be doing explicit loads here. I would imagine you would just use Unsafe.As to go from Vector4 to Vector128 and the JIT would do the efficient conversion (as we will often already be in register).
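For reference, a minimal sketch (my own illustration) of that Unsafe.As reinterpretation, assuming using directives for System.Numerics, System.Runtime.CompilerServices, and System.Runtime.Intrinsics:

Vector4 value = new Vector4(1, 2, 3, 4);
Vector128<float> v = Unsafe.As<Vector4, Vector128<float>>(ref value); // reinterprets the same 16 bytes; typically stays in register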

saucecontrol commented 5 years ago

Ah, ok. As long as LoadUnaligned can still be folded on VEX hardware, that's fine. I was under the impression you were proposing that LoadAligned always emitted movaps, LoadUnaligned always emitted movups, and that only Load (or LoadUnsafe or whatever) could be re-encoded into another instruction.

tannergooding commented 5 years ago

I was under the impression you were proposing

Ah, definitely not. LoadAligned and LoadUnaligned would continue on with the exact same semantics as we have today (and would get folded in the same scenarios). The new API would behave as a LoadAligned and would validate the address under minopts, but would be folded under both VEX and legacy encodings when optimizations are enabled.

tannergooding commented 5 years ago

It's also worth noting that this proposal matches basically what the native compilers do.

For example (see https://godbolt.org/z/oITxgp), clang will emit exclusively movaps for _mm_load_ps and movups for _mm_loadu_ps in debug mode. However, in Release mode it will fold either for the VEX encoding and only _mm_load_ps for the legacy encoding.

The proposed API here does the same, but still exposes an explicit LoadAligned for when users explicitly want movaps.

-- cc @fiigii, since he suggested matching native behavior in the folding