dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Add the appropriate ABI handling for the SIMD HWIntrinsic types #9578

Open tannergooding opened 6 years ago

tannergooding commented 6 years ago

The SIMD HWIntrinsic types (Vector64<T>, Vector128<T>, and Vector256<T>) are special and represent the __m64, __m128, and __m256 ABI types.

These types have special handling in both the System V and Windows ABIs and are treated as "scalar" (i.e. non-aggregate and non-union) types for the purposes of parameter passing and value returns. They additionally play some role in the selection of MultiReg or HVA (also known as HFA) structs.

We should add the appropriate support for these types to ensure we are meeting the requirements of the underlying ABI for a given platform/system.

category:correctness theme:runtime skill-level:expert cost:large impact:medium

jkotas commented 6 years ago

This would change the calling convention for these types. In general, calling convention changes invalidate precompiled code out there. I think it would be a good idea to disable loading of these types during crossgen so that they are not baked into any precompiled code and we do not have our hands tied when changing the calling convention.

tannergooding commented 6 years ago

@jkotas, is there a good example of an existing type being disabled during crossgen?

I would like to include that change in https://github.com/dotnet/coreclr/pull/15942, if possible.

jkotas commented 6 years ago

Take a look at how Vector<T> is handled in vm\methodtablebuilder.cpp. Look for IDS_EE_SIMD_NGEN_DISALLOWED.

tannergooding commented 6 years ago

@jkotas, @CarolEidt. Part of the ABI work for these types is respecting their larger packing sizes (8 for __m64, 16 for __m128, 32 for __m256).

Do you think it is reasonable to have the packing sizes respected for v1 (it looks like it only needs a relatively small update in the VM layer)?

jkotas commented 6 years ago

Do you think it is reasonable to have the packing sizes respected for v1

I think it is reasonable.

AndyAyersMS commented 6 years ago

Should this be in 2.1 (not 2.0.x) ?

CarolEidt commented 6 years ago

Should this be in 2.1 (not 2.0.x) ?

Yes, I believe so.

tannergooding commented 5 years ago

An explicit example of where the current ABI is wrong is for x64 Windows with SIMD returns.

The default calling convention for x64 Windows specifies that __m128, __m128i, and __m128d are returned in XMM0: https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#return-values

These types correspond to the System.Runtime.Intrinsics.Vector128 type on the managed side, which is not currently being returned in XMM0.
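
To make the shape concrete, here is a minimal sketch (hypothetical library and function names; whether such a signature is even accepted by the runtime today is part of what this issue covers):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class NativeMath
{
    // C prototype: __m128 add4(__m128 a, __m128 b);
    // Per the x64 Windows convention the native callee returns the result in XMM0,
    // so the managed side must treat the Vector128<float> return value the same way.
    [DllImport("nativemath", EntryPoint = "add4")]
    internal static extern Vector128<float> Add4(Vector128<float> a, Vector128<float> b);
}
```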

dotnet/coreclr#23899 adds support for passing Vector128 across interop boundaries, so this will need to be handled correctly.

AndyAyersMS commented 2 years ago

@tannergooding is this still relevant?

tannergooding commented 2 years ago

Yes. We still need to ensure that these types are being correctly handled at the ABI level so we can enable them for interop scenarios and so we know we have good/correct/performant codegen even in managed only land.

jakobbotsch commented 2 years ago

This won't make 7.0.

tgjones commented 1 year ago

I came up with a hacky workaround for Vector128<T> value returns on Windows x64 in interop scenarios, which I'm posting here in case it's helpful for others: https://gist.github.com/tgjones/2e8ea8e365837e67c1f2d8d67d92105e

I use a custom marshaller for the return type of a [LibraryImport] declaration:

[LibraryImport("MyNativeLibrary")]
[return: MarshalUsing(typeof(Vector128ReturnMarshaller<>))]
public static partial Vector128<float> MyNativeFunction(Vector128<float>* input);

And in the custom marshaller, I create a little bit of assembly code to copy the XMM0 register into an instance field in the marshaller. I write this assembly code to some native memory that is marked as executable, and then create a function pointer for it. Then I call that function pointer, and the instance field is returned from the marshaller's ToManaged implementation.

This is just for Windows at x64 - I haven't looked at other platforms / architectures.

As I say, this is just a hacky workaround, and there may well be better ways to do it, but this is just what I came up with.
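
For a rough idea of the mechanism (the gist has the real, complete version wired into the marshaller; everything below uses made-up names and glosses over the hard part, namely guaranteeing that XMM0 hasn't been clobbered before the stub runs):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static unsafe class Xmm0Capture
{
    [DllImport("kernel32")]
    private static extern nint VirtualAlloc(nint lpAddress, nuint dwSize, uint flAllocationType, uint flProtect);

    private const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_EXECUTE_READWRITE = 0x40;

    // movups xmmword ptr [rcx], xmm0 ; ret
    // Stores XMM0 into the buffer whose address arrives in RCX (first integer arg on Windows x64).
    private static readonly byte[] Stub = { 0x0F, 0x11, 0x01, 0xC3 };

    private static readonly delegate* unmanaged<Vector128<float>*, void> CaptureXmm0 = Create();

    private static delegate* unmanaged<Vector128<float>*, void> Create()
    {
        nint mem = VirtualAlloc(0, (nuint)Stub.Length, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        Marshal.Copy(Stub, 0, mem, Stub.Length);
        return (delegate* unmanaged<Vector128<float>*, void>)(void*)mem;
    }

    // Intended to be called immediately after the native call returns; whether XMM0
    // is still live at that point depends entirely on the surrounding codegen, which
    // is the fragile part the gist's marshaller is built around.
    public static Vector128<float> Read()
    {
        Vector128<float> value;
        CaptureXmm0(&value);
        return value;
    }
}
```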

xoofx commented 1 year ago

I also recently ran into a similar case where I had to call an interop function taking/returning a Vector128<float>, and I couldn't make it work without hacks like the ones @tgjones did.

One thing I also discovered is that even function pointers over managed functions that take/return Vector128<float> (e.g. delegate*<Vector128<float>, Vector128<float>>) generate invalid code as well (they assume the vector is a struct and pass it by pointer on the stack).

I'm also wondering what happened to #32278 - why was it never merged?

tannergooding commented 1 year ago

they assume the Vector is a struct and pass it by pointer on the stack

This is correct for Windows. The default calling convention for x64 Windows does not pass any SIMD values in register; it only returns them in register (and we aren't even doing the latter today).

To be passed in register, you must use __vectorcall which is on the backlog to eventually support.

xoofx commented 1 year ago

To be passed in register, you must use __vectorcall which is on the backlog to eventually support.

But for function pointers to regular static managed functions, would that require this? Shouldn't it be part of the built-in support for managed functions? (Or should we make a Roslyn error when trying to take the address of a static managed function that takes e.g. Vector128<float>?)

tannergooding commented 1 year ago

But for function pointers to regular static managed functions, would that require this?

The managed calling convention currently defaults to the "default calling convention" for the operating system. That is ms_abi for Windows and sysv_abi on Unix. We could certainly change this in the future to be __vectorcall on Windows instead, but that would be a transparent change for end users.

For unmanaged you must annotate your function as UnmanagedCallersOnly, this also defaults to the "default calling convention" for the operating system. When we eventually get __vectorcall support added, there will be a corresponding CallConvVectorcall member added and you would annotate your UnmanagedCallersOnly with this as well as ensuring your fnptr is marked as delegate* unmanaged[Vectorcall]<...>.
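
Roughly, today's shape vs. the hypothetical future one (sketch only - CallConvVectorcall and unmanaged[Vectorcall] do not exist yet, and correct vector handling through these signatures is exactly what this issue tracks):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static unsafe class Exports
{
    // Today: exposed with the platform's default unmanaged calling convention.
    [UnmanagedCallersOnly]
    public static Vector128<float> DoubleIt(Vector128<float> v) => v + v;

    public static void Use()
    {
        delegate* unmanaged<Vector128<float>, Vector128<float>> fp = &DoubleIt;
        Vector128<float> r = fp(Vector128.Create(1f, 2f, 3f, 4f));

        // Hypothetical future shape once __vectorcall support is added:
        // [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvVectorcall) })]
        // delegate* unmanaged[Vectorcall]<Vector128<float>, Vector128<float>> fpVc = &DoubleIt;
    }
}
```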

(Or should we make a Roslyn error when trying to take the address of a static managed function that has e.g Vector128?

I don't see the need or benefit; it would simply block a scenario that already works, and works correctly. The scenario that doesn't work today is interop with a native function that uses __vectorcall (we also block Vector128<T> in general due to the bug in how vector returns work, but there are workarounds you can do to make it "work").

xoofx commented 1 year ago

I don't see the need or benefit; it would simply block a scenario that already works, and works correctly.

Oh right, I got mixed up - the managed code is still passing VectorXXX<T> arguments through the stack, but I thought it was not and was somehow expecting a vectorcall calling convention.

So that's indeed only for the case of unmanaged.

jakobbotsch commented 1 year ago

Sadly won't get to this one in 8.0 either.

dmchurch commented 4 months ago

The managed calling convention currently defaults to the "default calling convention" for the operating system. That is ms_abi for Windows and sysv_abi on Unix. We could certainly change this in the future to be __vectorcall on Windows instead, but that would be a transparent change for end users.

Out of curiosity, what's the roadblock in executing on this? I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math - originally wrapping them in a readonly struct for ease-of-use but eventually just using the native types and methods directly - and determined that it was wildly unsuitable for that. (I'm aware that, despite their name, vector math is not the primary use case for SIMD vector instructions.) The SIMD intrinsic methods (Vector128.AndNot and such, and also the operators) don't allow themselves to be inlined - they have to appear by name in the source of the method where they're used, or they'll emit a function call into the asm.

That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just vaddps xmm0,xmm0,xmm1; ret - I'd expect the CPU cache to be able to inline that away itself, making it nearly as good as a compiler inline, but all these stack operations make that an impossibility.

It's obviously and sadly too late to do anything about this on current desktop systems, but is there anything that can be done to help move this along so that maybe in another five years or so, it might be possible to use SIMD vectors as vector math primitives in C# in an end-user-facing application?

Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway? I'd have thought that the CLR JIT would make better use of system resources, given that - unlike basically every other language platform in the world - it knows what hardware it will run on as it compiles, and it has full executive control over both sides of every function call (at least, the ones that are managed-to-managed code). I don't think any other platform could say "we could change the calling convention to __vectorcall and it would be transparent to end users", and I don't understand why that hasn't been done, given the potential performance benefits. I'm sure there's a very good reason for the internal implementation of the code to respect the ABI even in cases where no unmanaged code is being used, but I have to admit I have not the slightest clue what it could be.

tannergooding commented 4 months ago

Out of curiosity, what's the roadblock in executing on this?

Mostly just being low priority. Interop with native code that directly takes SIMD parameters is rare, especially when most functionality can be trivially written in C#/F# instead and the places where it isn't trivial typically take arrays/pointers as input, not SIMD types.

I just finished experimenting with using the SIMD intrinsic classes as a primitive for 4-component float/double vector math

Why not just use System.Numerics.Vector4, which is already optimized to do this? It doesn't handle double today, but using double for this is rare in the first place.

The SIMD intrinsic methods (Vector128.AndNot and such, and also the operators) don't allow themselves to be inlined - they have to appear by name in the source of the method where they're used, or they'll emit a function call into the asm.

Not sure what you mean? Vector128.AndNot(x, y) emits andnps, andnpd, or pandn on x86/x64; there is no inlining involved as it's directly intrinsic. The same is true for most of the vector APIs.
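
For example (standalone sketch, not code from this thread):

```csharp
using System.Runtime.Intrinsics;

public static class AndNotExample
{
    // Vector128.AndNot(left, right) computes left & ~right. With optimized (tier-1)
    // codegen on x64 this is expected to be a single andnps/vandnps plus the return,
    // because the API is expanded as an intrinsic rather than emitted as a call.
    public static Vector128<float> MaskOut(Vector128<float> value, Vector128<float> mask)
        => Vector128.AndNot(value, mask);
}
```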

That wouldn't be quite as much of a problem as it is, if the vectors weren't going through a store/load with each and every call and return, and function prologs and epilogs everywhere. I wouldn't mind the extra calls nearly so much if they were just vaddps xmm0,xmm0,xmm1; ret - I'd expect the CPU cache to be able to inline that away itself, making it nearly as good as a compiler inline, but all these stack operations make that an impossibility.

Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.

Also, I had one other question: why are managed calls executed solely within the runtime required to conform to the system ABI, anyway?

They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach.

That is, if we deviate from the standard, then every context switch now has to account for that and do additional save/restore work, which often doesn't pay off as compared to simply matching the underlying default ABI.

Additionally, things like __vectorcall aren't that common; you typically have the calls inlined, which avoids the need to pass things across a method boundary in register altogether.

dmchurch commented 4 months ago

Why not just use System.Numerics.Vector4, which is already optimized to do this? It doesn't handle double today, but using double for this is rare in the first place.

I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin. (Also, ideally, it should be possible to use an int-based vector for doing block/tile math, without having to go through FP conversion.) The existing code uses unaligned structs to pass around vectors of all three types, and I'd really been hoping I could make a drop-in replacement for them based on the SIMD vectors.

They are not, but as a matter of convenience and overall performance (especially as it pertains to interop, which regularly happens at all layers of the stack), it tends to be the best approach.

Ah, okay, that makes sense! Yeah, if the managed codegen isn't using the system ABI, you'd have to keep two copies of the jitted code in memory or dynamically check and recompile on encountering an unmanaged ⇒ managed transition - and yeah, that's extra processor time spent on something that doesn't generally provide a benefit. Makes sense. It's good to know that isn't a hard requirement of the CLR, though.

Could you share some code, your CPU, what runtime you're targeting, and what you're seeing vs expecting to see? From what you're describing, it sounds like you're doing something unique or not looking in the right places to see the real codegen output.

I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, which I've reproduced below:

public EnumCollideFlags ParticlePhysics.UpdateMotion(pos, ref motion, size)

```csharp
using V128 = System.Runtime.Intrinsics.Vector128;
using V256 = System.Runtime.Intrinsics.Vector256;
using SIMDVec4i = System.Runtime.Intrinsics.Vector128<int>;
using SIMDVec4f = System.Runtime.Intrinsics.Vector128<float>;
using SIMDVec4l = System.Runtime.Intrinsics.Vector256<long>;
using SIMDVec4d = System.Runtime.Intrinsics.Vector256<double>;

public partial class ParticlePhysics
{
    public EnumCollideFlags UpdateMotion(SIMDVec4d pos, ref SIMDVec4f motion, float size)
    {
        double halfSize = size / 2;
        SIMDVec4d collMin = pos - V256.Create(halfSize, 0d, halfSize, 0d);
        SIMDVec4d collMax = pos + V256.Create(halfSize, halfSize, halfSize, 0d);
        FastCuboidd particleCollBox = new(collMin, collMax);

        motion = motion.Clamp(-MotionCap, MotionCap);

        SIMDVec4i minLoc = (particleCollBox.Min + motion.ToDouble()).ToInt();
        minLoc -= V128.Create(0, 1, 0, 0); // -1 for the extra high collision box of fences
        SIMDVec4i maxLoc = (particleCollBox.Max + motion.ToDouble()).ToInt();

        minPos.Set(minLoc.X(), minLoc.Y(), minLoc.Z());
        maxPos.Set(maxLoc.X(), maxLoc.Y(), maxLoc.Z());

        EnumCollideFlags flags = 0;

        // It's easier to compute collisions if we're traveling forward on each axis
        // BREAKPOINT BELOW ↓
        SIMDVec4d negativeComponents = motion.ToDouble().LessThan(SIMDVec4d.Zero);
        SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
        //SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
        SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;

        particleCollBox.From *= flipSigns;
        particleCollBox.To *= flipSigns;
        particleCollBox.Normalize();

        SIMDVec4d motionMask = flippedMotion.GreaterThan(SIMDVec4d.Zero);

        fastBoxCount = 0;
        BlockAccess.WalkBlocks(minPos, maxPos, (cblock, x, y, z) =>
        {
            Cuboidf[] collisionBoxes = cblock.GetParticleCollisionBoxes(BlockAccess, tmpPos.Set(x, y, z));
            if (collisionBoxes != null)
            {
                foreach (Cuboidf collisionBox in collisionBoxes)
                {
                    FastCuboidd box = collisionBox;
                    box.From *= flipSigns;
                    box.To *= flipSigns;
                    box.Normalize();
                    while (fastBoxCount >= fastBoxList.Length)
                    {
                        Array.Resize(ref fastBoxList, fastBoxList.Length * 2);
                    }
                    fastBoxList[fastBoxCount++] = box;
                }
            }
        }, false);

        for (int i = 0; i < fastBoxCount; i++)
        {
            ref FastCuboidd box = ref fastBoxList[i];
            flags |= box.PushOutNormalized(particleCollBox, ref flippedMotion);
        }

        // Restore motion to non-flipped
        motion = (flippedMotion * flipSigns).ToFloat();

        return flags;
    }
}
```

There are two declarations of the flipSigns variable, one commented out. When the code is in the state as I've listed it here, using all the SIMD intrinsics directly in the UpdateMotion method, the generated code shows up as follows in the Disassembly window:

            ;SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
00007FFA6CF2DFE8  vmovupd     ymm0,ymmword ptr [rbp-130h]  
00007FFA6CF2DFF0  vmovupd     ymm1,ymmword ptr [rbp-130h]  
00007FFA6CF2DFF8  vandpd      ymm0,ymm0,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07D0h (07FFA6CF2E340h)]  
00007FFA6CF2E000  vandnpd     ymm1,ymm1,ymmword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+07F0h (07FFA6CF2E360h)]  
00007FFA6CF2E008  vorpd       ymm0,ymm0,ymm1  
00007FFA6CF2E00C  mov         rdx,qword ptr [rbp-40h]  
00007FFA6CF2E010  vmovupd     ymmword ptr [rdx+28h],ymm0  
            ;//SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
            ;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;

When I switch to the other declaration, which uses convenience static/extension methods which simply call and return the relevant intrinsic, the codegen is as follows:

            ;//SIMDVec4d flipSigns = V256.ConditionalSelect(negativeComponents, V256.Create(-1d), V256.Create(1d));
            ;SIMDVec4d flipSigns = negativeComponents.ConditionalSelect(SIMDVec.Double(-1), SIMDVec.Double(1));
00007FFA6FCFA518  lea         rcx,[rbp-290h]  
00007FFA6FCFA51F  vmovsd      xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0800h (07FFA6FCFA8B0h)]  
00007FFA6FCFA527  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)]  ; pointer to SIMDVec.Double
00007FFA6FCFA52D  lea         rcx,[rbp-2B0h]  
00007FFA6FCFA534  vmovsd      xmm1,qword ptr [Vintagestory.API.Client.ParticlePhysics.UpdateMotion(System.Runtime.Intrinsics.Vector256`1<Double>, System.Runtime.Intrinsics.Vector128`1<Single> ByRef, Single)+0808h (07FFA6FCFA8B8h)]  
00007FFA6FCFA53C  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FDA7630 (07FFA6FDA7630h)]  ; pointer to SIMDVec.Double
00007FFA6FCFA542  mov         rcx,qword ptr [rbp-40h]  
00007FFA6FCFA546  cmp         byte ptr [rcx],cl  
00007FFA6FCFA548  mov         rcx,qword ptr [rbp-40h]  
00007FFA6FCFA54C  add         rcx,28h  
00007FFA6FCFA550  mov         qword ptr [rbp-538h],rcx  
00007FFA6FCFA557  vmovupd     ymm0,ymmword ptr [rbp-130h]  
00007FFA6FCFA55F  vmovupd     ymmword ptr [rbp-4F0h],ymm0  
00007FFA6FCFA567  vmovupd     ymm0,ymmword ptr [rbp-290h]  
00007FFA6FCFA56F  vmovupd     ymmword ptr [rbp-510h],ymm0  
00007FFA6FCFA577  vmovupd     ymm0,ymmword ptr [rbp-2B0h]  
00007FFA6FCFA57F  vmovupd     ymmword ptr [rbp-530h],ymm0  
00007FFA6FCFA587  mov         rcx,qword ptr [rbp-538h]  
00007FFA6FCFA58E  lea         rdx,[rbp-4F0h]  
00007FFA6FCFA595  lea         r8,[rbp-510h]  
00007FFA6FCFA59C  lea         r9,[rbp-530h]  
00007FFA6FCFA5A3  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFA6FE959C0 (07FFA6FE959C0h)]  ;pointer to SIMDVec.ConditionalSelect
            ;SIMDVec4d flippedMotion = motion.ToDouble() * flipSigns;
public static SIMDVec4d Double(double xyzw) codegen

```asm
;[MethodImpl(MethodImplOptions.AggressiveInlining)] public static SIMDVec4d Double(double xyzw) => V256.Create(xyzw);
00007FFA6FD01C00  push         rbp
00007FFA6FD01C01  vzeroupper
00007FFA6FD01C04  mov          rbp,rsp
00007FFA6FD01C07  mov          qword ptr [rbp+10h],rcx
00007FFA6FD01C0B  vmovsd       qword ptr [rbp+18h],xmm1
00007FFA6FD01C10  vbroadcastsd ymm0,mmword ptr [rbp+18h]
00007FFA6FD01C16  mov          rax,qword ptr [rbp+10h]
00007FFA6FD01C1A  vmovupd      ymmword ptr [rax],ymm0
00007FFA6FD01C1E  mov          rax,qword ptr [rbp+10h]
00007FFA6FD01C22  vzeroupper
00007FFA6FD01C25  pop          rbp
00007FFA6FD01C26  ret
```
public static Vector256 ConditionalSelect(this condition, valueIfTrue, valueIfFalse) codegen

```asm
;public static Vector256<T> ConditionalSelect<T>(this Vector256<T> condition, Vector256<T> valueIfTrue, Vector256<T> valueIfFalse) where T : struct => V256.ConditionalSelect(condition, valueIfTrue, valueIfFalse);
00007FFA6FD0C730  push         rbp
00007FFA6FD0C731  sub          rsp,30h
00007FFA6FD0C735  vzeroupper
00007FFA6FD0C738  lea          rbp,[rsp+30h]
00007FFA6FD0C73D  mov          qword ptr [rbp+10h],rcx
00007FFA6FD0C741  mov          qword ptr [rbp+18h],rdx
00007FFA6FD0C745  mov          qword ptr [rbp+20h],r8
00007FFA6FD0C749  mov          qword ptr [rbp+28h],r9
00007FFA6FD0C74D  mov          rax,qword ptr [rbp+18h]
00007FFA6FD0C751  vmovupd      ymm0,ymmword ptr [rax]
00007FFA6FD0C755  vmovupd      ymmword ptr [rbp-30h],ymm0
00007FFA6FD0C75A  vmovupd      ymm0,ymmword ptr [rbp-30h]
00007FFA6FD0C75F  vmovupd      ymm1,ymmword ptr [rbp-30h]
00007FFA6FD0C764  mov          rax,qword ptr [rbp+20h]
00007FFA6FD0C768  vandpd       ymm0,ymm0,ymmword ptr [rax]
00007FFA6FD0C76C  mov          rax,qword ptr [rbp+28h]
00007FFA6FD0C770  vandnpd      ymm1,ymm1,ymmword ptr [rax]
00007FFA6FD0C774  vorpd        ymm0,ymm0,ymm1
00007FFA6FD0C778  mov          rax,qword ptr [rbp+10h]
00007FFA6FD0C77C  vmovupd      ymmword ptr [rax],ymm0
00007FFA6FD0C780  mov          rax,qword ptr [rbp+10h]
00007FFA6FD0C784  vzeroupper
00007FFA6FD0C787  add          rsp,30h
00007FFA6FD0C78B  pop          rbp
00007FFA6FD0C78C  ret
```

As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more. This pattern - that the intrinsic instructions only ever get emitted into the method that mentions the VectorNNN class by name (or, in this case, by type alias) - held no matter what MethodImpl attributes I applied and what qualifiers I added to the various parameters (in, out, ref, scoped, etc). When I was using a readonly struct to hold the VectorNNN value (in order to take advantage of custom operator overloads, type conversions, and so forth) the result was the same, but with an additional layer of indirection (despite my efforts, I never managed to convince the runtime to reinterpret a VectorNNN as a wrapper struct, or a wrapper struct as a VectorNNN, without at least an extra copy, and usually including an additional function call to the non-inlined conversion, as well).

If you've got any insight, or any thoughts on other avenues to try, I'd be extremely interested to hear!

tannergooding commented 4 months ago

I'd love to, but this is for game code - collision detection and the like, in a custom engine. It has to be at double precision, or the FP errors just add up too fast, especially when checking collisions at some distance from the measurement origin

This is a common bug in how you're doing your logic. You should implement something akin to a floating-origin and chunking your world so that you're never doing computations in a way that could cause such large errors to exist.

There are many talks about this presented at events like GDC, including deep dives from AAA game companies; they all tend towards using float for performance (both CPU and GPU), and some have even switched to half (GPU) for performance reasons in really hot code.
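
For reference, the floating-origin idea is just periodic re-basing so coordinates near the player stay small; a minimal sketch (made-up types, nothing from the engine being discussed):

```csharp
using System.Collections.Generic;
using System.Numerics;

public interface IRebasable
{
    void ShiftBy(Vector3 offset);
}

// Keep the absolute origin in double precision, do per-frame math in float relative
// to it, and re-base whenever the player drifts far enough that float error matters.
public sealed class FloatingOrigin
{
    private const float RebaseDistance = 4096f;
    private double originX, originY, originZ;

    public Vector3 ToLocal(double x, double y, double z)
        => new((float)(x - originX), (float)(y - originY), (float)(z - originZ));

    public void MaybeRebase(Vector3 playerLocal, IEnumerable<IRebasable> objects)
    {
        if (playerLocal.Length() < RebaseDistance)
            return;

        originX += playerLocal.X;
        originY += playerLocal.Y;
        originZ += playerLocal.Z;

        // Shift everything by the same offset so relative positions are unchanged.
        foreach (var obj in objects)
            obj.ShiftBy(-playerLocal);
    }
}
```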

I'd be delighted! I'm running release-optimized .NET 7 code compiled by Visual Studio Community 2022 on a Windows 11 laptop with a 12th-gen Core i9 processor. I followed these steps for peeking at the codegen - I'm compiling in release mode, it's including symbols, it's not disabling managed code optimization in the debug settings, and I'm setting a breakpoint in the method in question, which I've reproduced below:

Notably, in .NET 7+ you can also just use DOTNET_JitDisasm="MethodName" and run your program; that's a bit easier/more reliable and will let you see both the T0 (unoptimized) and T1 (optimized) codegen, as well as whether or not the method was invoked enough to allow rejit to occur and therefore T1 code to be produced.

@EgorBo has also provided the amazing https://github.com/EgorBo/Disasmo extension for VS which makes it very trivial to right click->Disasm this and see the optimized disassembly output (with UI to help with a number of other common scenarios as well).

As you can see, all the relevant opcodes eventually get executed, but the amount of overhead dwarfs the actual business logic by an order of magnitude or more.

This should be fixed in .NET 8+ (not enough code here for me to 100% confirm that, however). I expect the issue was due to the shadow copy and lack of forward substitution. If you had taken the parameters as in, then the JIT would likely do the right thing on .NET 7 as well, since it would avoid the shadow copy and see that the value is not mutated after inlining.

Notably, rather than multiplying by -1 or +1, just x ^ V256.Create(-0.0) to flip the sign (float/double use a sign-magnitude representation, so you just need to toggle the sign bit, hence the xor with negative zero).
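
A minimal sketch of that (applying the flip conditionally using the comparison mask; illustrative only, not verbatim from this thread):

```csharp
using System.Runtime.Intrinsics;

public static class SignFlip
{
    // negativeComponents is an all-ones/all-zeros mask per lane (e.g. from Vector256.LessThan).
    // ANDing it with -0.0 leaves only the sign bit set in the selected lanes, and XORing that
    // into x flips the sign of exactly those lanes - no multiply needed.
    public static Vector256<double> FlipWhere(Vector256<double> x, Vector256<double> negativeComponents)
        => x ^ (negativeComponents & Vector256.Create(-0.0));
}
```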

dmchurch commented 4 months ago

Notably, rather than multiplying by -1 or +1, just x ^ V256.Create(-0.0) to flip the sign (float/double use a sign-magnitude representation, so you just need to toggle the sign bit, hence the xor with negative zero).

Brilliant, using -0.0! I'd thought of xor'ing the sign bit, but I was feeling lazy and couldn't recall what the exact representation of a double was. (And in any case, I was just doing this as a proof-of-concept, to see what the runtime does with the SIMD vectors in general - most especially, whether they could be passed in registers across method calls. This game passes a lot of vectors across method calls. It's also worth noting that the implementation in question does not, in fact, actually work - I didn't bother to work out any of the bugs, I just wanted a set of representative SIMD operations in a method that does vector math.)

You should implement something akin to a floating-origin and chunking your world so that you're never doing computations in a way that could cause such large errors to exist.

Yeah, that'd have been my thought as well, actually - I may have seen one or two of those GDC talks 😂 This isn't my game or my code, though, so I mainly wanted to see if it was even a possibility to make a drop-in replacement for the existing double-based vector structs, because if so, I could toggle between the two implementations with nothing more than a build flag.

Notably, in .NET 7+ you can also just use DOTNET_JitDisasm="MethodName" and run your program

Oooh, that's neat! I'm definitely gonna have to play around with both that and the Disasmo extension, because I love poking into implementations and seeing how they work.

If you had taken the parameters as in then the JIT would likely do the right thing on .NET 7 as well

I did try putting in or scoped in on various parameters, as well as moving things between instance methods and extension methods (when I was using the readonly struct wrapper), but I wasn't able to figure out a combination that would allow it to inline an intrinsic. (That doesn't mean there isn't one, but if there is, I couldn't figure out what it should be.) That said, even if I could get that working properly, my gut instinct is that without being able to pass a SIMD vector in a single register, there will just be too much overhead for it to grant any performance benefit.

Which, I mean, again - this isn't what SIMD is for, so I knew I was gonna be working against the grain when I started poking into this 😅

tannergooding commented 4 months ago

Is the code open source on GitHub? I might be able to give it a lookover and provide suggested fixes for the obvious cases and ensure we have tracking bugs for any of the less obvious ones.

dmchurch commented 4 months ago

Not yet but I'll clean this up a touch and push it so you can take a look! It's a fork of https://github.com/anegostudios/vsapi, with the existing non-readonly structs in https://github.com/anegostudios/vsapi/tree/master/Math/Vector (they're the FastVec types - the Vec types are classes).

dmchurch commented 4 months ago

I've gone ahead and pushed my code to https://github.com/dmchurch/vsapi/commit/simd-experiments but please, don't spend too much time on this! Especially since Disasmo has shown me that what I was looking at was, in fact, the T0 compilation - looks like the learn.microsoft link I followed is a bit out of date on its recommendations, ha.

Anyway, feel free to take a look, but please don't take this as anything other than me experimenting with the SIMD capabilities of C#. 😂