tannergooding opened 6 years ago
This would change the calling convention for these types. In general, calling convention changes invalidate precompiled code out there. I think it would be a good idea to disable loading of these types during crossgen so that they are not baked into any precompiled code and we do not have our hands tied when changing the calling convention.
@jkotas, is there a good example of existing types being disabled during crossgen?
I would like to include that change in https://github.com/dotnet/coreclr/pull/15942, if possible.
Take a look at how `Vector<T>` is handled in vm\methodtablebuilder.cpp. Look for `IDS_EE_SIMD_NGEN_DISALLOWED`.
@jkotas, @CarolEidt. Part of the ABI work for these types is respecting their larger packing sizes (8 for `__m64`, 16 for `__m128`, 32 for `__m256`).
Do you think it is reasonable to have the packing sizes respected for v1 (it looks like it only needs a relatively small update in the VM layer)?
> Do you think it is reasonable to have the packing sizes respected for v1
I think it is reasonable.
Should this be in 2.1 (not 2.0.x)?
> Should this be in 2.1 (not 2.0.x)?
Yes, I believe so.
An explicit example of where the current ABI is wrong is for x64 Windows with SIMD returns.
The default calling convention for x64 Windows specifies that `__m128`, `__m128i`, and `__m128d` are returned in XMM0: https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#return-values
These types correspond to the System.Runtime.Intrinsics.Vector128<T> types.
dotnet/coreclr#23899 adds support for passing `Vector128<T>`.
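For concreteness, here is a minimal sketch of the interop shape affected (the library name and export are hypothetical):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class Native
{
    // Hypothetical native export: __m128 add_ps(__m128 a, __m128 b);
    // Per the x64 Windows convention, the __m128 result comes back in XMM0,
    // which is the behavior this issue tracks honoring on the managed side.
    [DllImport("mynative")]
    internal static extern Vector128<float> add_ps(Vector128<float> a, Vector128<float> b);
}
```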
@tannergooding is this still relevant?
Yes. We still need to ensure that these types are being correctly handled at the ABI level so we can enable them for interop scenarios and so we know we have good/correct/performant codegen even in managed-only land.
This won't make 7.0.
I came up with a hacky workaround for `Vector128<T>` value returns on Windows x64 in interop scenarios, which I'm posting here in case it's helpful for others: https://gist.github.com/tgjones/2e8ea8e365837e67c1f2d8d67d92105e
I use a custom marshaller for the return type of a `[LibraryImport]` declaration:
```csharp
[LibraryImport("MyNativeLibrary")]
[return: MarshalUsing(typeof(Vector128ReturnMarshaller<>))]
public static unsafe partial Vector128<float> MyNativeFunction(Vector128<float>* input);
```
And in the custom marshaller, I create a little bit of assembly code to copy the XMM0 register into an instance field in the marshaller. I write this assembly code to some native memory that is marked as executable, and then create a function pointer for it. Then I call that function pointer, and the instance field is returned from the marshaller's `ToManaged` implementation.
This is just for Windows on x64 - I haven't looked at other platforms / architectures.
As I say, this is just a hacky workaround, and there may well be better ways to do it, but this is just what I came up with.
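For reference, here is a rough reconstruction of the thunk idea (a sketch with hypothetical names, not the gist's actual code; Windows x64 only). It is fragile by design: XMM0 must still hold the native return value at the moment the thunk is invoked.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static unsafe class Xmm0Capture
{
    // movups xmmword ptr [rcx], xmm0 ; ret
    // Stores whatever is currently in XMM0 into the buffer passed in RCX.
    private static readonly byte[] Code = { 0x0F, 0x11, 0x01, 0xC3 };

    [DllImport("kernel32")]
    private static extern void* VirtualAlloc(void* lpAddress, nuint dwSize, uint flAllocationType, uint flProtect);

    public static readonly delegate* unmanaged<Vector128<float>*, void> Capture;

    static Xmm0Capture()
    {
        const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_EXECUTE_READWRITE = 0x40;
        void* mem = VirtualAlloc(null, (nuint)Code.Length, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        Code.AsSpan().CopyTo(new Span<byte>(mem, Code.Length));
        Capture = (delegate* unmanaged<Vector128<float>*, void>)mem;
    }
}
```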
I also ran into a similar case recently where I had to call an interop function taking/returning a `Vector128<float>` and couldn't make it work without hacking things together like @tgjones did.
One thing I also discovered is that even function pointers over managed functions that take/return `Vector128<float>` (e.g. `delegate*<Vector128<float>, Vector128<float>>`) generate invalid code as well (they assume the vector is a struct and pass it by pointer on the stack).
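For reference, the shape being described is roughly this (method names are illustrative):

```csharp
using System.Runtime.Intrinsics;

internal static unsafe class FnPtrRepro
{
    private static Vector128<float> Identity(Vector128<float> v) => v;

    internal static Vector128<float> Roundtrip()
    {
        // Per the report above, this call passes and returns the vector via
        // a stack pointer rather than in a SIMD register.
        delegate*<Vector128<float>, Vector128<float>> fn = &Identity;
        return fn(Vector128.Create(1f));
    }
}
```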
I'm also wondering what happened to #32278; why was it never merged?
> they assume the Vector is a struct and pass it by pointer on the stack
This is correct for Windows. The default calling convention for x64 Windows does not pass any SIMD values in register; it only returns them in register (and we aren't even doing the latter today).
To be passed in register, you must use `__vectorcall`, which is on the backlog to eventually support.
> To be passed in register, you must use `__vectorcall`, which is on the backlog to eventually support.
But for function pointers to regular static managed functions, would that require this? Shouldn't it be part of the built-in support for managed functions? (Or should we make a Roslyn error when trying to take the address of a static managed function that uses e.g. `Vector128<float>`?)
> But for function pointers to regular static managed functions, would that require this?
The managed calling convention currently defaults to the "default calling convention" for the operating system. That is `ms_abi` for Windows and `sysv_abi` on Unix. We could certainly change this in the future to be `__vectorcall` on Windows instead, and that would be a transparent change for end users.
For `unmanaged`, you must annotate your function as `UnmanagedCallersOnly`; this also defaults to the "default calling convention" for the operating system. When we eventually get `__vectorcall` support added, there will be a corresponding `CallConvVectorcall` member added, and you would annotate your `UnmanagedCallersOnly` with it, as well as ensure your fnptr is marked as `delegate* unmanaged[Vectorcall]<...>`.
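That would look something like the following hypothetical shape, sketched purely for illustration (`CallConvVectorcall` does not exist yet, so this does not compile today):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static unsafe class VectorcallShape
{
    // Hypothetical: CallConvVectorcall is the not-yet-added convention type.
    [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvVectorcall) })]
    private static Vector128<float> Negate(Vector128<float> v) => Vector128.Negate(v);

    internal static void Use()
    {
        delegate* unmanaged[Vectorcall]<Vector128<float>, Vector128<float>> fn = &Negate;
        Vector128<float> r = fn(Vector128.Create(1f));
    }
}
```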
> (Or should we make a Roslyn error when trying to take the address of a static managed function that uses e.g. `Vector128<float>`?)
I don't see the need or benefit; it would simply block a scenario that already works and works correctly. The scenario that doesn't work today is interop with a native function that uses `__vectorcall` (we also block `Vector128<T>` in general due to the bug in how vector returns work, but there are workarounds you can do to make it "work").
> I don't see the need or benefit; it would simply block a scenario that already works and works correctly.
Oh right, I mixed things up: managed code is indeed still passing `VectorXXX<T>` arguments through the stack, but I thought it was not and was somehow expecting a vectorcall calling convention. So that's indeed only an issue for the unmanaged case.
Sadly won't get to this one in 8.0 either.
> The managed calling convention currently defaults to the "default calling convention" for the operating system. That is `ms_abi` for Windows and `sysv_abi` on Unix. We could certainly change this in the future to be `__vectorcall` on Windows instead, and that would be a transparent change for end users.
Out of curiosity, what's the roadblock in executing on this? I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math - originally wrapping them in a `readonly struct` for ease of use but eventually just using the native types and methods directly - and determined that it was wildly unsuitable for that. (I'm aware that, despite their name, vector math is not the primary use case for SIMD vector instructions.) The SIMD intrinsic methods (`Vector128.AndNot` and such, and also the operators) don't allow themselves to be inlined - they have to appear by name in the source of the method where they're used, or they'll emit a function call into the asm.
That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just `vaddps xmm0,xmm0,xmm1; ret` - I'd expect the CPU cache to be able to inline that away itself, making it nearly as good as a compiler inline, but all these stack operations make that an impossibility.
It's obviously and sadly too late to do anything about this on current desktop systems, but is there anything that can be done to help move this along so that maybe in another five years or so, it might be possible to use SIMD vectors as vector math primitives in C# in an end-user-facing application?
Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway? I'd have thought that the CLR JIT would make better use of system resources, given that - unlike basically every other language platform in the world - it knows what hardware it will run on as it compiles, and it has full executive control over both sides of every function call (at least, the ones that are managed-to-managed code). I don't think any other platform could say "we could change the calling convention to `__vectorcall` and it would be transparent to end users", and I don't understand why that hasn't been done, given the potential performance benefits. I'm sure there's a very good reason for the internal implementation of the code to respect the ABI even in cases where no unmanaged code is being used, but I have to admit I have not the slightest clue what it could be.
> Out of curiosity, what's the roadblock in executing on this?
Mostly just being low priority. Interop with native code that directly takes SIMD parameters is rare, especially when most functionality can be trivially written in C#/F# instead and the places where it isn't trivial typically take arrays/pointers as input, not SIMD types.
> I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math
Why not just use `System.Numerics.Vector4`, which is already optimized to do this? It doesn't handle double today, but using double for this is rare in the first place.
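For illustration, a small sketch of what Vector4 already gives you (the method is illustrative, not a framework API):

```csharp
using System.Numerics;

internal static class Vector4Sketch
{
    // Vector4 is float-only, but it is already a JIT intrinsic, so
    // arithmetic like this compiles down to SIMD instructions.
    internal static Vector4 Reflect(Vector4 v, Vector4 n)
        => v - 2f * Vector4.Dot(v, n) * n;
}
```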
> The SIMD intrinsic methods (Vector128.AndNot and such, and also the operators) don't allow themselves to be inlined - they have to appear by name in the source of the method where they're used, or they'll emit a function call into the asm.
Not sure what you mean? `Vector128.AndNot(x, y)` emits `andnps`, `andnpd`, or `pandn` on x86/x64; there is no inlining, as it's directly intrinsic. The same is true for most of the vector APIs.
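For instance, a minimal sketch (the wrapper method name is illustrative):

```csharp
using System.Runtime.Intrinsics;

internal static class AndNotSketch
{
    // Vector128.AndNot(left, right) computes left & ~right; on x64 it is
    // emitted directly as a single andnps/andnpd/pandn, with no call.
    internal static Vector128<float> ClearMaskedBits(Vector128<float> value, Vector128<float> mask)
        => Vector128.AndNot(value, mask);
}
```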
> That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just vaddps xmm0,xmm0,xmm1; ret - I'd expect the CPU cache to be able to inline that away itself, making it nearly as good as a compiler inline, but all these stack operations make that an impossibility.
Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.
> Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway?
They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach.
That is, if we deviate from the standard, then every context switch now has to account for that and do additional save/restore work, which often doesn't pay off as compared to simply matching the underlying default ABI.
Additionally, things like `__vectorcall` aren't that common; you typically have the calls inlined, and that avoids the need to pass things across a method boundary in register altogether.
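A small sketch of that point (names illustrative):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class InliningSketch
{
    // Once this helper is inlined into its caller, the vector never crosses
    // a call boundary, so the calling convention never comes into play.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static Vector128<float> DoubleIt(Vector128<float> v) => v + v;
}
```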
> Why not just use System.Numerics.Vector4, which is already optimized to do this? It doesn't handle double today, but using double for this is rare in the first place.
I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin. (Also, ideally, it should be possible to use an `int`-based vector for doing block/tile math, without having to go through FP conversion.) The existing code uses unaligned structs to pass around vectors of all three types, and I'd really been hoping I could make a drop-in replacement for them based on the SIMD vectors.
> They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach.
Ah, okay, that makes sense! Yeah, if the managed codegen isn't using the system ABI, you'd have to keep two copies of the jitted code in memory or dynamically check and recompile on encountering an unmanaged ⇒ managed transition - and yeah, that's extra processor time spent on something that doesn't generally provide a benefit. Makes sense. It's good to know that isn't a hard requirement of the CLR, though.
> Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.
I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, which I've reproduced below:
There are two declarations of the `flipSigns` variable, one commented out. When the code is in the state as I've listed it here, using all the SIMD intrinsics directly in the UpdateMotion method, the generated code shows up as follows in the Disassembly window:
```asm
;SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
00007FFA6CF2DFE8 vmovupd ymm0,ymmword ptr [rbp-130h]
00007FFA6CF2DFF0 vmovupd ymm1,ymmword ptr [rbp-130h]
00007FFA6CF2DFF8 vandpd ymm0,ymm0,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07D0h (07FFA6CF2E340h)]
00007FFA6CF2E000 vandnpd ymm1,ymm1,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07F0h (07FFA6CF2E360h)]
00007FFA6CF2E008 vorpd ymm0,ymm0,ymm1
00007FFA6CF2E00C mov rdx,qword ptr [rbp-40h]
00007FFA6CF2E010 vmovupd ymmword ptr [rdx+28h],ymm0
;//SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
```
When I switch to the other declaration, which uses convenience static/extension methods which simply call and return the relevant intrinsic, the codegen is as follows:
```asm
;//SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
;SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
00007FFA6FCFA518 lea rcx,[rbp-290h]
00007FFA6FCFA51F vmovsd xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0800h (07FFA6FCFA8B0h)]
00007FFA6FCFA527 call qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)] ; pointer to SIMDVec.Double
00007FFA6FCFA52D lea rcx,[rbp-2B0h]
00007FFA6FCFA534 vmovsd xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0808h (07FFA6FCFA8B8h)]
00007FFA6FCFA53C call qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)] ; pointer to SIMDVec.Double
00007FFA6FCFA542 mov rcx,qword ptr [rbp-40h]
00007FFA6FCFA546 cmp byte ptr [rcx],cl
00007FFA6FCFA548 mov rcx,qword ptr [rbp-40h]
00007FFA6FCFA54C add rcx,28h
00007FFA6FCFA550 mov qword ptr [rbp-538h],rcx
00007FFA6FCFA557 vmovupd ymm0,ymmword ptr [rbp-130h]
00007FFA6FCFA55F vmovupd ymmword ptr [rbp-4F0h],ymm0
00007FFA6FCFA567 vmovupd ymm0,ymmword ptr [rbp-290h]
00007FFA6FCFA56F vmovupd ymmword ptr [rbp-510h],ymm0
00007FFA6FCFA577 vmovupd ymm0,ymmword ptr [rbp-2B0h]
00007FFA6FCFA57F vmovupd ymmword ptr [rbp-530h],ymm0
00007FFA6FCFA587 mov rcx,qword ptr [rbp-538h]
00007FFA6FCFA58E lea rdx,[rbp-4F0h]
00007FFA6FCFA595 lea r8,[rbp-510h]
00007FFA6FCFA59C lea r9,[rbp-530h]
00007FFA6FCFA5A3 call qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FE959C0 (07FFA6FE959C0h)] ;pointer to SIMDVec.ConditionalSelect
;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
```
As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more. This pattern - that the intrinsic instructions only ever get emitted into the method that mentions the VectorNNN class by name (or, in this case, by type alias) - held no matter what `MethodImpl` attributes I applied and what qualifiers I added to the various parameters (`in`, `out`, `ref`, `scoped`, etc.). When I was using a `readonly struct` to hold the VectorNNN value (in order to take advantage of custom operator overloads, type conversions, and so forth), the result was the same, but with an additional layer of indirection (despite my efforts, I never managed to convince the runtime to reinterpret a VectorNNN as a wrapper struct, or a wrapper struct as a VectorNNN, without at least an extra copy, usually including an additional function call to the non-inlined conversion as well).
If you've got any insight, or any thoughts on other avenues to try, I'd be extremely interested to hear!
> I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin
This is a common bug in how you're doing your logic. You should implement something akin to a floating origin and chunk your world so that you're never doing computations in a way that could cause such large errors to exist.
There are many talks about this presented at events like GDC, including deep dives given by AAA game companies; they all tend towards using `float` for performance (both CPU and GPU), and some have even switched to `half` (GPU) in really hot code for performance reasons.
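For illustration, a minimal floating-origin sketch (assuming positions are stored as doubles; names are illustrative):

```csharp
using System.Numerics;

internal static class FloatingOriginSketch
{
    // Keep authoritative positions in double precision, but do per-frame
    // math in float space relative to a nearby origin so values stay small.
    internal static Vector3 ToLocal(
        double worldX, double worldY, double worldZ,
        double originX, double originY, double originZ)
        => new Vector3(
            (float)(worldX - originX),
            (float)(worldY - originY),
            (float)(worldZ - originZ));
}
```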
> I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, which I've reproduced below:
Notably, in .NET 7+ you can also just use `DOTNET_JitDisasm="MethodName"` and run your program; that's a bit easier/more reliable and will let you see both the T0 (unoptimized) and T1 (optimized) codegen, as well as whether or not it was invoked enough to allow rejit to occur and therefore T1 code to be produced.
@EgorBo has also provided the amazing https://github.com/EgorBo/Disasmo extension for VS, which makes it very trivial to right-click -> "Disasm this" and see the optimized disassembly output (with UI to help with a number of other common scenarios as well).
> As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more.
This should be fixed in .NET 8+ (not enough code here for me to 100% confirm that, however). I expect the issue was due to the shadow copy and lack of forward substitution. If you had taken the parameters as `in`, then the JIT would likely do the right thing on .NET 7 as well, as it would avoid the shadow copy and see they're not mutated after inlining.
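A sketch of that suggestion (the helper name is illustrative):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class InParamSketch
{
    // Taking the operands as 'in' avoids the defensive shadow copy on
    // .NET 7, letting the JIT keep the vectors in registers after inlining.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static Vector256<double> Select(
        in Vector256<double> mask, in Vector256<double> whenTrue, in Vector256<double> whenFalse)
        => Vector256.ConditionalSelect(mask, whenTrue, whenFalse);
}
```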
Notably, rather than multiplying by `-1` or `+1`, just use `x ^ V256.Create(-0.0)` to flip the sign (float/double use a sign-magnitude representation, so you just need to toggle the sign bit, hence `xor` with negative zero).
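Concretely (using the plain Vector256 API rather than the thread's V256 alias):

```csharp
using System.Runtime.Intrinsics;

internal static class SignFlipSketch
{
    // -0.0 has only the sign bit set, so XOR negates every lane without a
    // multiply.
    internal static Vector256<double> FlipSigns(Vector256<double> x)
        => x ^ Vector256.Create(-0.0);
}
```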
> Notably, rather than multiplying by -1 or +1, just use x ^ V256.Create(-0.0) to flip the sign (float/double use a sign-magnitude representation, so you just need to toggle the sign bit, hence xor with negative zero).
Brilliant, using -0.0! I'd thought of xor'ing the sign bit, but I was feeling lazy and couldn't recall what the exact representation of a double was. (And in any case, I was just doing this as a proof-of-concept, to see what the runtime does with the SIMD vectors in general - most especially, whether they could be passed in registers across method calls. This game passes a lot of vectors across method calls. It's also worth noting that the implementation in question does not, in fact, actually work - I didn't bother to work out any of the bugs, I just wanted a set of representative SIMD operations in a method that does vector math.)
> You should implement something akin to a floating origin and chunk your world so that you're never doing computations in a way that could cause such large errors to exist.
Yeah, that'd have been my thought as well, actually - I may have seen one or two of those GDC talks 😂 This isn't my game or my code, though, so I mainly wanted to see if it was even a possibility to make a drop-in replacement for the existing double-based vector structs, because if so, I could toggle between the two implementations with nothing more than a build flag.
> Notably, in .NET 7+ you can also just use DOTNET_JitDisasm="MethodName" and run your program
Oooh, that's neat! I'm definitely gonna have to play around with both that and the Disasmo extension, because I love poking into implementations and seeing how they work.
> If you had taken the parameters as `in`, then the JIT would likely do the right thing on .NET 7 as well
I did try putting `in` or `scoped in` on various parameters, as well as moving things between instance methods and extension methods (when I was using the `readonly struct` wrapper), but I wasn't able to figure out a combination that would allow it to inline an intrinsic. (That doesn't mean there isn't one, but if there is, I couldn't figure out what it should be.) That said, even if I could get that working properly, my gut instinct is that without being able to pass a SIMD vector in a single register, there will just be too much overhead for it to grant any performance benefit.
Which, I mean, again - this isn't what SIMD is for, so I knew I was gonna be working against the grain when I started poking into this 😅
Is the code open source on GitHub? I might be able to give it a lookover and provide suggested fixes for the obvious cases and ensure we have tracking bugs for any of the less obvious ones.
Not yet, but I'll clean this up a touch and push it so you can take a look! It's a fork of https://github.com/anegostudios/vsapi, with the existing non-readonly structs in https://github.com/anegostudios/vsapi/tree/master/Math/Vector (they're the `FastVec` types - the `Vec` types are classes).
I've gone ahead and pushed my code to https://github.com/dmchurch/vsapi/commit/simd-experiments but please, don't spend too much time on this! Especially since Disasmo has shown me that what I was looking at was, in fact, the T0 compilation - looks like the learn.microsoft link I followed is a bit out of date on its recommendations, ha.
Anyway, feel free to take a look, but please don't take this as anything other than me experimenting with the SIMD capabilities of C#. 😂
The SIMD HWIntrinsic types (`Vector64<T>`, `Vector128<T>`, and `Vector256<T>`) are special and represent the `__m64`, `__m128`, and `__m256` ABI types. These types have special handling in both the System V and Windows ABIs and are treated as "scalar" (e.g. non-aggregate and non-union) types for the purpose of parameter passing or value returns. They additionally play some role in the selection of MultiReg or HVA (also known as HFA) structs.
We should add the appropriate support for these types to ensure we are meeting the requirement of the underlying ABI for a given platform/system.
category:correctness theme:runtime skill-level:expert cost:large impact:medium