Closed tannergooding closed 4 years ago
CC. @eerhardt, @fiigii, @CarolEidt for thoughts
Today, we expose both
Load
(which has unaligned semantics) andLoadAligned
(which has aligned semantics).
I would rename the unaligned load to that what it is LoadUnaligned
, then this proposed method can be called just Load
, as the proposed name is quite long.
"This address is aligned, do whatever load is most efficient".
For the name Load
would match.
Should we also expose
StoreAlignedOrUnaligned
methods?
If load is exposed, the same should be done for store. Hence rename Store
to StoreUnaligned
and use the name Store
for the method analogous to this proposal.
Should we also expose equivalents on
Avx
/Avx2
?
In my proposal for method-names this wouldn't be in symmetry with Avx
/Avx2
anymore, so yes there should be the same methods -- even if they just forward to the unaligned methods.
I would rename the unaligned load to that what it is LoadUnaligned, then this proposed method can be called just Load,
I was thinking the same thing.
Updated to be LoadVector128
, LoadAlignedVector128
, and LoadUnalignedVector128
.
Sorry for the delay to reply.
I don't quite understand this proposal. Currently, LoadAligned
can be folded into the consumer instruction with SSE and VEX encoding, but Load
(LoadUnaligned
in this proposal) only can be folded with VEX encoding. This behavior also matches C++ compilers.
So, what is the difference between the proposed Load
intrinsic and the existing LoadAligned
?
Meanwhile, guarantee memory alignment should be user's responsibility, and the dynamic assertion ("Asserts (address % 16) == 0
when optimizations are disabled") does not always work.
Additionly, on modern Intel CPUs, the unaligned load/store won't be slower than aligned load/store (if the memory access hits in one cache line), so we suggest always using unaligned load/store. That is why Load
intrinsic maps to the unaligned load instruction.
I don't quite understand this proposal.
The JIT has to preserve the semantics of the underlying instruction. Currently LoadAligned
will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned
(presently named just Load
) currently has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.
When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned. When using the VEX encoding, the reverse is true, and the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.
This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned
. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)
we can emit: addps xmm0, [pRight]
; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)
we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1
And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned
. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)
we have to emit: movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1
; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)
we can just emit addps xmm0, xmm0, [pRight]
We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned
when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned
when emitting the legacy encoding).
So, what is the difference between the proposed Load intrinsic and the existing LoadAligned
We can expect that users will want to emit the most efficient code possible; However, we can also expect that users,,where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data).
This leads users to writing algorithms that do the following (where possible):
// Process first few elements to become aligned
// Process data as aligned
// Process trailing elements that don't fit in a single `VectorXXX<T>`
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned
section) likely have a Debug.Assert((address % 16) == 0)
and will be using LoadUnaligned
(presently just called Load
).
However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Games" which run on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:
public static Vector128<float> EfficientLoad(float* address)
{
if (Avx.IsSupported)
{
// Must assert, since this would fail for non-AVX otherwise
Debug.Assert((address % 16) == 0);
return Sse.LoadUnaligned(address);
}
else
{
return Sse.LoadAligned(address);
}
}
Which also depends on the JIT being able to successfully recognize and fold this code.
So, this proposal suggests providing this helper method built-in by renaming the existing Load
method to be LoadUnaligned
and to expose a new method Load
(I had originally suggested something like LoadAlignedOrUnaligned
, but it was suggested I change it to the current) which does effectively the above. That is, on hardware with VEX encoding support, it uses LoadUnaligned
and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned
. This ensures that the code can always be folded efficiently and it validates the semantics remains the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned
and LoadAligned
methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).
And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight) we have to emit: movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight) we can just emit addps xmm0, xmm0, [pRight]
We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).
Ah, I see the problem now. But we should fold LoadAligned
with VEX encoding, as it does not throw exception.
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).
This assert is useless. address
could be either aligned or unaligned in different runs.
as it does not throw exception.
That is the problem and why we can't fold it. LoadAligned
emits movaps
which always throws if the address is unaligned. If you fold it into a VEX-encoded
instruction, it will no longer throw. This means that if you passed in an unaligned address, the exception would silently disappear whenever the JIT folded it.
This assert is useless. address could be either aligned or unaligned in different runs.
The point of the assert is to help catch places where the user is calling Load
without pinning the data and ensuring that it is properly aligned. It will still be possible for a user to do this incorrectly, but it would be an explicit opt-in.
We could potentially use another name, such as LoadUnsafe
, or something that conveys that the data must be aligned but we won't always validate that it is.
I like this proposal, and agree that the assert, while not guaranteeing to catch all potential dynamic data configurations, is very likely to catch nearly all cases.
Given that there is a great deal of expertise required to use the hardware intrinsics effectively, I don't think it's necessary to encumber this with the Unsafe
suffix. Rather, documenting its behavior and asserting in the unoptimized case should be sufficient.
Ah, looked the manual again, yes, vmovaps
can still throw exceptions with VEX-encoding, but other instructions won't. Thanks.
I don't think it's necessary to encumber this with the Unsafe suffix
Agree, actually other Load*
are not safe neither 😄
and this is more a nicety for if people are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0)
Are there enough such people to justify adding more intrinsics?
Should we also expose equivalents on Avx/Avx2?
It's perhaps weird for SSE Load
to have one semantic and AVX Load
to have a different semantic. But that would imply adding more intrinsics. Or perhaps AVX Load
should simply be renamed to LoadUnaligned
, though the longer name sucks. Same for stores.
But to quote @CarolEidt "Given that there is a great deal of expertise required to use the hardware intrinsics effectively", I don't think such difference in semantics justifies adding more intrinsics.
Should these instructions have a Debug.Assert((address % 16) == 0)?
Where does this assert go exactly? Is Load
a true intrinsic or a helper? Who's using debug builds of corelib, besides the CoreCLR team?
I don't think such difference in semantics justifies adding more intrinsic.
Agree, I suggest that we can fold more existing load intrinsic rather than adding more intrinsics.
LoadAligned
because folding LoadUnaligned
would throw exceptions that the users did not expect.LoadAligned
and LoadUnaligned
both. The individual LoadAligned
may throw exceptions but folded ones won't, but I think this is fine. We assume that users "expect" exceptions from LoadAligned
and handle in their own code, and folded LoadAligned
just throws exceptions with 0% possibility.The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
Are there enough such people to justify adding more intrinsics?
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic
Yes, if this API is approved, we would (at a minimum rename) the AVX intrinsic to be LoadUnaligned
. I already have a note for the API review about whether we should have Store
mirror these names.
Where does this assert go exactly? Is Load a true intrinsic or a helper?
A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
and folded LoadAligned just throws exceptions with 0% possibility
I don't think this is acceptable without an explicit user opt-in. The runtime has had a very strict policy on not silently removing side-effects
The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
C/C++ makes a number of optimizations that we have, at this point, decided not to do and to instead address via Analyzers or look at more in depth later.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe
and ToVector256Unsafe
are both good examples of this).
The runtime has had a very strict policy on not silently removing side-effects
It is not removing side-effects, it has no any side-effects (hardware exceptions).
It is not removing side-effects, it has no any side-effects (hardware exceptions).
All exceptions, including HardwareExceptions are considered side-effects.
The movaps
raises the hardware #GP
exception if its input is not properly aligned and the JIT interecepts that and converts it into a System.AccessViolationException
.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
I didn't ask about cases where you tried to do that. I asked if there are enough people who are using .NET Core on old or gimped CPU and expect things to work perfectly rather than just reasonably.
A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned. Whether it's a good idea to do that in debug builds it's debatable. Some people may end up disabling optimizations to workaround a JIT bug or to be able to get a correct stacktrace for a hard to track down bug. Now they risk getting exceptions that with optimizations enabled would not occur.
The runtime has had a very strict policy on not silently removing side-effects
Please do not mislead people. The runtime does not impose any semantics on intrinsics, it's the other way around. LoadAligned
may be defined as "throws an exception if the address is unaligned" or as "may throw an exception if the address is unaligned". That may or may not be a good idea. But it doesn't have anything to do with any runtime policy.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).
Indeed. Squabbles over names resulted in string intrinsic being removed from .NET 3. Yet adding non-deterministic intrinsics to support old CPUs is doable.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned.
I believe the result of this proposal is just forcing most people to change Load
to LoadUnaligned
in their programs.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
I want to say again Debug.Assert((address % 16) == 0);
is useless or relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in two approaches
for (;IsUnaligned(addr); addr++)
{
// processing data until the memory get aligned
}
for (; addr < end; addr += Vector128<int>.Count)
{
var v = LoadAlignedVector128(addr);
// or LoadUnalignedVector128(addr), whatever, we recommand always using LoadUnalignedVector128
......
}
Or
for (; addr < end; addr += Vector128<int>.Count)
{
var v = LoadUnalignedVector128(addr);
......
}
Using which approach is based on profiling and it is much more important than folding loads for SSE encoding.
I didn't ask about cases where you tried to do that
I wasn't saying that with regards to personal attempts.
Now they risk getting exceptions that with optimizations enabled would not occur
Correct, but the API is explicitly documented to do as such and they have the option to enforce the desired behavior by explicitly using one of the other two APIs (LoadAligned
or LoadUnaligned
).
By not having all three, users either have to write their own wrappers for something that we could (and in my opinion should) reasonably handle.
Additionally, if we only expose two APIs and make one of them "sometimes validate alignment", then we can't provide a guarantee to the end user that they can enforce the semantics where required.
Please do not mislead people. The runtime does not impose any semantics on intrinsics
I don't think this is misleading at all. We have had a very clear contract and design for the platform-specific hardware intrinsics that each is tied to a very particular instruction.
LoadAligned
is documented to emit movaps
and LoadUnaligned
(currently just called Load
) will emit movups
. This gives a very clear story and guarantees that, if someone uses this API, they will get exactly that instruction.
Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned
under the VEX-encoding.
It is worth noting, however, that we have also tried to determine common patterns, use-cases, and limitations with the decision to tie a given hardware intrinsic to a particular instruction. When we have found such limitations, we have discussed what we could do about it (both the "ideal" and realistic scenarios) and determined an appropriate solution. So far, out of all the scenarios that have been brought up, we have determined to deviate only for CreateScalarUnsafe
and ToVector256Unsafe
(where we say that the upper bits will be non-deterministic, rather than explicitly zeroed; but for which we also expose "safe" versions that do explicitly zero the upper bits). This may be another case, that is more trivial to resolve via an additional API and that is worthwhile to do.
Squabbles over names resulted in string intrinsic being removed from .NET 3
They have not necessarily been removed from netcoreapp3.0, instead we decided that we would pull them until we can determine the appropriate way to expose these APIs as there were concerns over the useability and understandability of said APIs. The underlying instructions behave quite differently from most other intrinsics we've exposed and so they require some additional thought. This is no different from any other API that we decide to expose in that we don't want to ship something that has a bad surface area and we are still working towards making sure these APIs ship in a good and useable state.
Yet adding non-deterministic intrinsics to support old CPUs is doable.
As mentioned above, both cases where we exposed an additional non-deterministic API were discussed in length.
I want to say again Debug.Assert((address % 16) == 0); is useless or relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in two approaches
Once you have pinned the memory, it is guaranteed not to move and you can rely on the assert. Anyone dealing with perf-sensitive scenarios and larger blocks of memory (for example, ML.NET) should be pinning their data anyways, as the GC could move the data otherwise and that can mess with the cache, alignment, etc.
Using which approach is based on profiling and it is much more important than folding loads.
The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" has multiple pages/sections all discussing this. The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
All this additional API does is give the user the option to choose between:
LoadUnaligned
)LoadAligned
)Load
)A good number of vectorizable algorithms (especially the ones used in cases like ML.NET) can be handled with a worst-case scenario of two-unaligned loads maximum, regardless of the alignment of the input data. For such algorithms, the only question of whether to align or not really comes down to how much data is being processed in total and the cases where you don't want to make the data aligned can generally be special-cased.
The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
Always align data when possible Generally prefer aligned stores over aligned loads, if you have to choose Consider just using two 128-bit unaligned loads if you can't align your 256-bit data etc
Yes, that is what I meant. And aligned loads != movap*.
I can't align my data (they use LoadUnaligned) My data is aligned and should never be unaligned (they use LoadAligned) My data is aligned, but you don't have to check (they use Load)
All these three situations are suggested to use LoadUnaligned
.
Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.
For the fortieth and two time: the runtime has nothing to do with this. I give up. We're probably speaking different languages or something.
@mikedn, I think the issue is a runtime issue when we start talking about folding the loads into the operations, in ways that might cause a runtime failure to be masked (i.e. if the load would throw on an unaligned address, but the operation into which it is folded would not). And I think what @tannergooding is after here is the ability to have a single load semantic that supports folding.
All that said, I'm still not sure where I fall on this issue.
Import @CarolEidt 's comment from https://github.com/dotnet/coreclr/issues/21308
I agree that we don't want to silently remove an exception. It is not just that it violates the "strict" behavior that developers expect from the CLR, but could also cause sudden unexpected failures if something changed in the user code (or the optimization settings used) that caused the JIT to no longer fold the instruction.
If users write LoadAligned
, they should be responsible to handle the possible exceptions, whatever it is from JIT optimization or actual memory misalignment.
And aligned loads != movap*.
I would disagree here. We have two instructions that perform explicit loads (including movupd
and movapd
here):
movups
, which works for any datamovaps
, which only works for aligned dataWe additionally have the ability to support load semantics in most instructions:
I would say that the unaligned
load could just be called a load
, as it loads data with no special semantics (and it is still just a load
even if the data happens to be aligned). I would say that an aligned load
must also validate that the data is aligned (which only movaps
does).
All these three situations are suggested to use LoadUnaligned.
LoadUnaligned
does not successfully fulfill all three scenarios by itself. It definitively fills the first scenario, but the second scenario requires an additional validation step (and must fail if the requirements are not met) and the third scenario has an optional validation step (that would fail if the requirements are not met and optimizations are disabled; otherwise, when optimizations are enabled, it would skip the validation check).
I think the issue is a runtime issue when we start talking about folding the loads into the operations
@CarolEidt It most definitely isn't. As far as the runtime/ECMA spec is concerned, intrinsics are just method calls. There's nothing anywhere in the ECMA spec that says that LoadAligned
must always throw an exception if the data is not aligned. That's solely in the intrinsic's spec courtyard.
Also, I don't claim that we should change the current LoadAligned
semantic. I just want to see reasonable argumentation for adding a bunch of new intrinsics. And I'm not seeing it.
And aligned loads != movap*.
I would disagree here.
LoadUnaligned does not successfully fulfill all three scenarios by itself.
movups
and movaps
has the same performance and movups
is always safe. Why not use it?
The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups
itself.
movups and movaps has the same performance and movups is always safe. Why not use it
Because we can't always safely fold a movups
. Some examples are being on older/legacy hardware, working around a JIT bug (so you have disabled VEX-encoding support), using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...
The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.
Yes, and the purpose of the proposed intrinsic is to:
It sounds like your concern is that "most" workloads will just use movups
and will not care about cache-line splits or page-faults; and they will not make any attempt to align the input data. As such, you think that calling this intrinsic just Load
is undesirable because most people will actually use/want LoadUnaligned
. Is that about right?
using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...
It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults;
No, my concern is that "most" workloads will just use movups
and will not care about folding on older CPUs (or other low-power devices you mentioned).
Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps
.
- Provide some debug-mode validation that you are not causing cache-line splits or page faults
- Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding
If users do not guarantee the alignment, the compiler-generated validation (1) is not reliable, so that makes (2) unacceptable on older CPUs.
If users already guarantee the alignment, the compiler-generated validation (1) is useless, then i) using LoadUnaligned
normally, or ii) using LoadAligned
if the code-size on older CPUs is critical (needs https://github.com/dotnet/coreclr/issues/21308)
you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?
Right.
the compiler-generated validation (1) is not reliable
Given the low-level nature of the code, that the intrinsic is explicitly documented to do the validation, and that the GC currently only guarantees 8-byte alignment; a JIT inserted check for debug-builds should catch essentially every case (especially across multiple runs).
LoadAlignedOrUnaligned
to explicitly call attention to the semantics, but it was suggested I change that to the current name).(needs dotnet/coreclr#21308)
I think that is a non-starter; not only would it silently remove the exception; but it removes the ability for users to guarantee the generation of a movaps
(which will always validate the alignment). The current proposal allows users to continue having explicit aligned or unaligned semantics, if desired; but also gives them the ability to do the efficient thing.
No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned). Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.
I believe that for "most" workloads, dev's will be concerned with cache-line splits and page faults, so they will attempt to align their data if the input is unaligned (these intrinsics are geared towards high-perf and hotspot scenarios, after all). They will want to additionally validate that they will not accidentally cause a cache-line split in production (their alignment logic is correct, they pinned the data, etc) and they will want to generate efficient code.
Therefore, it is my belief that they would write something like (this is using today's APIs and ignoring this proposal):
int misalignment = (int)(pCurrent) % 16;
// note: may need to handle unnaturally aligned data, such as if `float`
// which should be `(pCurrent % 4) == 0` is something else.
if (misalignment != 0)
{
// Vector128<T>.Count is a power of a two and a constant, this should optimize to a right shift
misalignment /= Vector128<T>.Count;
misalignment = Vector128<T>.Count - misalignment;
for (int i = 0; i < misalignment; i++)
{
// Process the individual elements, non-vectorized
}
count -= misalignment;
}
if (remainder >= Vector128<T>.Count)
{
remainder = count % Vector128<T>.Count;
count -= remainder;
for (int i = 0; i < count; i++)
{
// Assert we are still aligned
Debug.Assert(((int)(pCurrent) % 16) == 0);
// Process elements using vectorization and the appropriate Load(pCurrent)
}
}
else
{
remainder = count;
}
for (int i = 0; i < remainder; i++)
{
// Process the individual elements, non-vectorized
}
This proposal does a couple of things:
Debug.Assert((address % 16) == 0)
, as we will do the equivalent automaticallyThe code the user writes remains the same and if they want to have explicit Unaligned
semantics or explicit Aligned
semantics; they still have the option to use one of the other two overloads and have the correct/expected behavior for those
We just repeat our words again and again. I would like to stop the discussion and leave the proposal to the API review.
My original proposal was to call this something like LoadAlignedOrUnaligned to explicitly call attention to the semantics, but it was suggested I change that to the current name).
At least, do not name the new APIs to Load
.
At least, do not name the new APIs to
Load
.
Right now I also think this shouldn't be named Load
, as it hides something.
Maybe (just thinking loud): LoadSafe
(as opposed to Unsafe), LoadAutomatic
, LoadFast
, ...
On recent x86-hardware there is no need for this method -- I think here we have consensus, because always using unaligned reads has no penalty, and is recommended by intel, too.
But as @tannergooding tries to point out there exists hardware, on which the same code should run optimal without any need to change, where unaligned reads do have a penalty, so aligned reads should be used, also depending on the JIT if he will emit VEX or not. This case could be handled with this new method. Therefore I think it is reasonable to add it.
But the name should be simple and obvious, not to overwhelm users and leave them with open questions about which Load-method to use.
If these methods have to exist, I agree Load
is probably a bad name for them. On the other hand, it's better than the self-documenting LoadAlignedOnOldCpusOrUnalignedOnNewCpusOrBetterYetJustFoldTheLoadIfPossible
.
I think @mikedn is right here, though. This is a new API, and if the documentation says LoadAligned
may throw an exception on unaligned data, that's good enough. Is there really ever a scenario where you need to guarantee a movaps
will be emitted for the intrinsic method? I like the idea of a 1:1 mapping of intrinsics to instructions, but since we can't access registers directly or specify the reg/mem targets of the instructions directly, there's a limit to how far that philosophy can go.
For high-performance scenarios, people will ensuring the data is aligned anyway, and if that's the only way to get the load folded, that just further encourages the best practice.
Thinking about this more, I reckon that if LoadAligned
were required for folding, you'd end up with people writing code like this:
Vector128<float> v;
if (Avx.IsSupported) {
v = Sse.LoadAligned(...) // Data isn't really aligned but this allows folding the load
} else {
v = Sse.LoadUnaligned(...)
}
...
Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method. What if we had Sse.LoadUnaligned
, Sse.LoadAligned
, and Avx.LoadVector128
?
Sse.LoadAligned
could be folded in both encoding forms (again with the caveat that the docs don't guarantee that unaligned addresses would throw on all CPUs), and Avx.LoadVector128
could be always be folded because it would only be available when VEX encoding is supported.
I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:
No, always using LoadUnaligned
. But if you worry about the SSE encoding codegen, use Sse.LoadAligned
.
Selecting different load based onif (Avx.IsSupported)
is unnecessary because you have to align your data for LoadAligned
like https://github.com/dotnet/corefx/issues/33566#issuecomment-443362604, so the Sse.LoadUnaligned(...)
branch is meaningless but complicating your code.
Right, that was in reference to my previous suggestion that LoadAligned
be required for folding in the API even though it's not required in all encodings. But people would end up calling the aligned method to get the folding even with unaligned data when they know it's supported by the CPU. I was just shooting down my first suggestion and then offering another.
Oh, I see, thanks for the explanation.
people writing code like this:
The code is backwards. When using the VEX encoding (Avx.IsSupported
is true
), we can only fold LoadUnaligned
because the instructions do not validate the alignment of the address. If the user opts-in, we could additionally fold some method that sometimes checks alignment (whether that is via a 3rd, separate API; or via some other mechanism).
When not using the VEX encoding, we can only fold LoadAligned
because the instructions themselves will assert alignment and will fail if the address is not aligned (there is no way to workaround this).
Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method
That is what this proposal is trying to do. It is trying to keep the two existing LoadUnaligned
(currently just called Load
) and LoadAligned
methods, so that users who need to can explicitly use the required behavior.
It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
I think doing anything without an explicit opt-in is a non-starter and the general rule has been that we don't expose a non-deterministic API without also exposing a deterministic one; so I don't think we can or should just relax LoadAligned
.
I am not convinced that just calling the new API Load
is right, as I don't think it conveys all the necessary semantics (which is why I originally had it named differently).
I am convinced that exposing an additional API for this behavior would be beneficial. We already know of cases where users will be running certain types of ML.NET code on low-power/IOT devices (such as Raspberry PI). Not everyone is buying the latest CPU's; and sometimes when they do they go for budget (which may mean they purchase a modern CPU without AVX support, and there are several of them that get released each generation).
such as Raspberry PI
is running on ARM, so it's out of the game here.
is running on ARM, so it's out of the game here.
Right, but there are also modern low-power/budget x86 computers without AVX support (from both AMD and Intel). I am unsure if we have any specific numbers on usage, but given that there are other low-power/IOT devices targeting ML.NET, I feel they shouldn't be excluded.
That's my opinion too.
The code is backwards.
My sample refers to my proposed implementation where LoadAligned
doesn't guarantee an exception on unaligned data and is the only load type that is considered for folding by the JIT because it would work in both encodings. My point was that if the developer knows that method is the only one that can be folded and knows that on VEX processors, the data doesn't actually have to be aligned, they'd end up calling LoadAligned
specifically on unaligned data but only on VEX processors, which is even more confusing than your proposal. I talked myself out of that.
When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address
I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data. Intel guarantees the fault in the movaps
instruction, but then they recommend against using that instruction on newer processors. I'm saying that if you never guarantee that movaps
will actually be emitted, then you're not breaking anything by not throwing.
It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned. An Avx.LoadVector128
could be safely folded in all cases, whereas the too-clever Load
would work on VEX processors and crash on legacy with unaligned data. The proposed assert in minopts mitigates that possibility but then creates an unnecessary restriction on the VEX processors.
Not everyone is buying the latest CPU's
I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths. Not being able to completely optimize the AVX path because of a restriction on the older/lower-end processors is bad too.
I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths.
Libraries will likely have both 128-bit and 256-bit optimized code paths. For the 128-bit code path, they may consider both the VEX-encoding and the legacy encoding scenarios.
I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data.
An API called LoadAligned
that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.
This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned.
Correct. And if your data is not going to be aligned or if it is not beneficial for you to handle leading/trailing elements to make the rest of the handling aligned; you would call the explicit LoadUnaligned
method.
However, as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned" (so that they don't have the cache-line penalty every 4 reads/writes or a page fault every X reads/writes). They just won't use LoadAligned
because the JIT won't emit the efficient thing on modern CPUs. They will, however, likely want to have validation in Debug mode that their input is actually aligned.
but then they recommend against using that instruction on newer processors
The optimization manual makes no recommendations about using or not using movaps
; rather they recommend that you align your data when possible and have a number of recommendations on what to do if you can't guarantee data alignment (such as preferring aligned stores over aligned reads). They also recommend using the more efficient encoding (rather than an explicit load) when possible.
The only reason not to use LoadAligned
is because it will not be folded (emit the more efficient encoding) on VEX-enabled hardware. Yes, this is currently an API limitation, but no, I don't think it will be relaxed.
The reason to use LoadUnaligned
is because it will be folded (emit the more efficient encoding) on VEX-enabled hardware. The downside is that, when you are doing performance optimizations and your data it aligned, you get no validation of that fact and you must add explicit Debug.Asserts
to your own code.
This proposal came from the belief that needing to use LoadUnaligned
and additionally assert the alignment was a common enough case. It has the additional benefit that it will folded on either legacy or VEX-encoding while still providing explicit APIs for the Aligned vs Unaligned case.
An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.
I see your point, but a method called Load
that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.
as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned"
This was my initial thought as well, but then I thought back to the challenge of re-implementing System.Numerics
using S.R.I. On a granular method like operator+
, you can't make any kind of assumption about the alignment, but the folding would still be beneficial. I don't know if some of the plans around first class structs and containment will address that through some other trickery, but limiting reg/mem encoding to only aligned memory scenarios is bound to cause problems elsewhere.
They also recommend using the more efficient encoding (rather than an explicit load) when possible.
This is the part that's most important to me. I can figure out whether I've aligned my pointers, but I want to get the most efficient encoding in every possible case.
I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.
Right. I expect the API review team will learn towards another name (like LoadUnsafe
, due to its non-deterministic nature).
you can't make any kind of assumption about the alignment, but the folding would still be beneficial.
Right, and in this case (where you are doing one off operations) you would have to use LoadUnaligned
. It will do the efficient encoding on modern hardware, but will involve an additional instruction for older hardware. -- However, you also wont likely be doing explicit loads here. I would imagine you would just use Unsafe.As
to go from Vector4 to Vector128 and the JIT would do the efficient conversion (as we will often already be in register).
Ah, ok. As long as LoadUnaligned
can still be folded on VEX hardware, that's fine. I was under the impression you were proposing LoadAligned
always emitted movaps
, LoadUnaligned
always emitted movups
, and that only Load
(or LoadUnsafe
or whatever) could be re-encoded into to another instruction.
I was under the impression you were proposing
Ah, definitely not. LoadAligned
and LoadUnaligned
would continue on with the exact same semantics as we have today (and would get folded in the same scenarios). The new API would behave as a LoadAligned
and would validate the address under minopts, but would be folded under both VEX and legacy encodings when optimizations are enabled.
It's also worth noting that this proposal matches basically what the native compilers do.
For example (see https://godbolt.org/z/oITxgp), clang will emit exclusively movaps
for _mm_load_ps
and movups
for _mm_loadu_ps
in debug mode. However, in Release mode it will fold either for the VEX encoding and only _mm_load_ps
for the legacy encoding.
The proposed API here does the same, but still exposes an explicit LoadAligned
for when users explicitly want movaps
.
-- cc. @fiigii l, since he suggested matching native behavior in the folding
Rationale
On x86 hardware, for SIMD instructions, there are effectively two encoding formats: "legacy" and "vex".
There are some minor differences between these two encodings, including the number of parameters they take and whether the memory operand has aligned or unaligned semantics.
As a brief summary:
ins reg, reg/[mem]
), the first serves as both a source and destination, and the last as a source which can generally be a register or memory address. The memory address form has "aligned" semantics and will cause an exception to be thrown if the address is not "naturally aligned" (generally 16-bytes).ins reg, reg, reg/[mem]
), the first being the destination, the second being a source, and the last as a source which can generally be a register or memory address. The memory address form has "unaligned" semantics and will not cause an exception to be thrown, regardless of the input.Today, we expose both
Load
(which has unaligned semantics) andLoadAligned
(which has aligned semantics). Given that a user will often want to generate the "most efficient" code possible and that the JIT, in order to preserve semantics (not silently get rid of an exception that would otherwise be thrown), changes which method it will "fold" depending on the encoding it is currently emitting, it may be beneficial to expose an intrinsic which allows the user to specify that "This address is aligned, do whatever load is most efficient". This would ensure that it can be folded regardless of the current encoding.Proposed API
API Semantics
This shows the semantics of each API under the legacy and VEX encoding.
Most insrtuctions support having the last operand be either a register or memory operand:
ins reg, reg/[mem]
When folding does not happen the load is an explicit separate instruction:
While, folding allows the load to be folded into the calling instruction:
On
Legacy
hardware, the folded-form of the instructions assert thataddress
is aligned (generally this means(address % 16) == 0
). OnVEX
hardware, the folded-form does no such validation and allows any input.Load (mov unaligned)
LoadAligned (mov aligned)
LoadUnsafe
Additional Info
Some open ended questions:
COMPlus_EnableAVX=0
)Store
,StoreUnaligned
, andStoreAligned
counterparts or just keepStore
andStoreAligned
?Avx
/Avx2
?Avx
/Avx2
instructions, they require the VEX encoding and therefore always have "unaligned" semantics, but that Avx/Avx2 define and exposeLoadAlignedVector256
/StoreAligned
instructions anyways