Support batching GC transitions for P/Invokes

tannergooding commented 3 years ago

Summary

Today, when performing a normal P/Invoke the JIT inserts a GC-transition before and after the call.

This means that if you have logic such as:

PInvoke();
PInvoke();
PInvoke();

You will end up actually getting something like:

// gc-transition: coop to preemp
PInvoke();
// gc-transition: preemp to coop

// gc-transition: coop to preemp
PInvoke();
// gc-transition: preemp to coop

// gc-transition: coop to preemp
PInvoke();
// gc-transition: preemp to coop

This can be undesirable for interop heavy code as it means many transitions are occuring. Ideally we could instead recognize this pattern and simplify it to something more like:

// gc-transition: coop to preemp
PInvoke();
PInvoke();
PInvoke();
// gc-transition: preemp to coop

This would increase the performance of interop heavy code, but would have the code in question in in pre-emptive mode for longer which may also have negative side effects and so may need additional consideration.

category:cq theme:pinvoke skill-level:expert cost:medium impact:medium

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

tannergooding commented 3 years ago

Likewise, it would be beneficial if UnmanagedCallersOnly didn't incur the inverse transition (from preemp to coop) in certain cases.

For example, given that a UnmanagedCallersOnly function can only have blittable parameters, then you can only have GC tracked data if you do your own allocations or deal with ref/out/in and so one might conceive that the gc-transition: preemp to coop is only needed if you do something like perform an allocation, create a string, or call into a managed method.

This would allow a user to write something like the following and expect that, since its only calling other unmanaged code (P/Invokes) that no transitions exist and this could be used as a way to efficiently batch native calls together:

[UnmanagedCallersOnly]
private static void TransitionMethod()
{
    PInvoke();
    PInvoke();
    PInvoke();
}

tannergooding commented 3 years ago

CC. @AaronRobinsonMSFT, @jkoritzinsky

tannergooding commented 3 years ago

https://github.com/dotnet/runtime/issues/54106 is related and requests a way to explicitly mark transition regions

AaronRobinsonMSFT commented 3 years ago

@tannergooding It would be helpful to see the general pattern for this in practice and how we might compute the fall over point where we insert the poll. Batching was discussed during the SuppressGCTransition development. Its design was to be correct by construction and ensure it permitted cases where it added benefit but didn't negatively impact our GC SLA goals. Loops was another case where this case up. This proposal would appear to enable violating those goals and as such seems to be not appropriate given its potential impact. The example appears to enable writing a native code callback ("macro") in pure C# so one doesn't need any native code assets – a laudable goal. However, I don't see arguments in these cases, are the P/Invokes just calls?, would the arguments simply be forwarded to these P/Invokes?, or are we expecting users to write non-GC impacting statements?

I'd like to see more than "Helps a niche scenario" (i.e., Vulkan) and a concrete proposal as to how the JIT or the user determines how many to batch is too much, too little, or just right. It isn't obvious to me how this feature can be used safely and correctly and adhere to our GC SLA goals.

Likewise, it would be beneficial if UnmanagedCallersOnly didn't incur the inverse transition (from preemp to coop) in certain cases.

When would we make the transition then? How would it be determined? Does the JIT simply make a transitions when a call to something not with a SuppressGCTransition or UnmanagedCallersOnly is made?

AaronRobinsonMSFT commented 3 years ago

I've given this some more thought. I'm struggling to see any other use case here than "writing a native macro function in pure C#". The goal here is for Vulkan like scenarios where one wants to pass a low overhead callback to native and not pay the cost of owning native assets – very interesting. It seems to me that if this is the use case then what is desired is being able to define a UnmanagedCallersOnly and have it operate in what we could call an "unsafe" and "GC agnostic" context. I could imagine updating UnmanagedCallersOnly with a bool that indicates GC agnostic and this avoids the transition when call the method from native code. The code within this method could then operate on non-ref types and call unmanaged function pointers. We could also consider accepting P/Invokes marked with SuppressGCTransition. Roslyn could help enforce this I think.

I believe this would enable the specific use case and not add any potentially complicated batching logic – which I find dubious in value. If there is another scenario here I'd like to understand them.

jkotas commented 3 years ago

Note that we do batch the most expensive part of the GC transition (saving of callee saved registers, etc.) already. The part that is not batched is just few instructions that flip the GC mode that is very cheap, especially relative to call that it wraps.

When it comes to micro-optimizations of PInvoke transitions like this, there are more generally applicable ones we can do: Make InlinedPInvokeFrame cheaper or reduce number of indirections required to get to the preemptive flag.

jkotas commented 3 years ago

I could imagine updating UnmanagedCallersOnly with a bool that indicates GC agnostic and this avoids the transition when call the method from native code.

I do not think we would want to compilicate the public surface for these sort of optimizations. This would make sense to do as a JIT optimization - make the JIT recognize certain patterns where the transitions can be batched or omitted.

tannergooding commented 3 years ago

I'm struggling to see any other use case here than "writing a native macro function in pure C#".

@AaronRobinsonMSFT, I think I didn't explain well.

The general premise is that there are many scenarios, such as DirectX or Vulkan, where you may have many "back to back" P/Invoke calls. These back to back calls each have a transition frame inserted that adds up and can negatively impact performance/throughput in some scenarios (and for graphics scenarios, where you are targeting say 60fps, this can be meaningful).

A simple example of DirectX 12 in C# can be seen here: https://source.terrafx.dev/#TerraFX.Samples.DirectX/D3D12/Shared/DX12Sample.cs

Because of this, I think it would be good if the JIT could be smarter here (either implicitly or via some explicit opt-in) such that the general cost here can be reduced.

The workaround for this is to write your own C library that batches calls itself, so you end up with 1 transition. But this has its own complications, such as building and shipping a C library for all platforms you need to support; where all the logic is writeable in C#, even without ever interacting with reference types/the GC in many scenarios.

The simplest comparison is likely in a method that just does:

        ulong start;
        QueryPerformanceCounter(&start);

        ulong end;
        QueryPerformanceCounter(&end);

        return end - start;

Such a method (assuming its static) doesn't ever interact with the GC in a meaningful way (that is, it never touches a GC tracked type, it doesn't even use byrefs, and it never allocates anything). However, we get 245 bytes of codegen, spilling many registers, and performing multiple memory reads/writes, all of which add up:

    L0000: push rbp
    L0001: push r15
    L0003: push r14
    L0005: push r13
    L0007: push r12
    L0009: push rdi
    L000a: push rsi
    L000b: push rbx
    L000c: sub rsp, 0x78
    L0010: lea rbp, [rsp+0xb0]
    L0018: xor ecx, ecx
    L001a: mov [rbp-0x40], rcx
    L001e: mov [rbp-0x48], rcx
    L0022: lea rcx, [rbp-0x88]
    L0029: mov rdx, r10
    L002c: call 0x00007ffaa5ca1660
    L0031: mov rsi, rax
    L0034: mov rcx, rsp
    L0037: mov [rbp-0x68], rcx
    L003b: mov rcx, rbp
    L003e: mov [rbp-0x58], rcx
    L0042: lea rcx, [rbp-0x40]
    L0046: mov rax, 0x7ffa4ce4ccb0
    L0050: mov [rbp-0x78], rax
    L0054: lea rax, [L0074]
    L005b: mov [rbp-0x60], rax
    L005f: lea rax, [rbp-0x88]
    L0066: mov [rsi+0x10], rax
    L006a: mov byte ptr [rsi+0xc], 0
    L006e: call qword ptr [0x7ffa4ce4cd08]
    L0074: mov byte ptr [rsi+0xc], 1
    L0078: cmp dword ptr [0x7ffaa606c07c], 0
    L007f: je short L0087
    L0081: call qword ptr [0x7ffaa605e368]
    L0087: mov rcx, [rbp-0x80]
    L008b: mov [rsi+0x10], rcx
    L008f: lea rcx, [rbp-0x48]
    L0093: mov rax, 0x7ffa4ce4ccb0
    L009d: mov [rbp-0x78], rax
    L00a1: lea rax, [L00c1]
    L00a8: mov [rbp-0x60], rax
    L00ac: lea rax, [rbp-0x88]
    L00b3: mov [rsi+0x10], rax
    L00b7: mov byte ptr [rsi+0xc], 0
    L00bb: call qword ptr [0x7ffa4ce4cd08]
    L00c1: mov byte ptr [rsi+0xc], 1
    L00c5: cmp dword ptr [0x7ffaa606c07c], 0
    L00cc: je short L00d4
    L00ce: call qword ptr [0x7ffaa605e368]
    L00d4: mov rax, [rbp-0x80]
    L00d8: mov [rsi+0x10], rax
    L00dc: mov rax, [rbp-0x48]
    L00e0: sub rax, [rbp-0x40]
    L00e4: lea rsp, [rbp-0x38]
    L00e8: pop rbx
    L00e9: pop rsi
    L00ea: pop rdi
    L00eb: pop r12
    L00ed: pop r13
    L00ef: pop r14
    L00f1: pop r15
    L00f3: pop rbp
    L00f4: ret

Compare this with the codegen for MSVC of the same method, which is roughly 1/5th the size at ~42 bytes:

        sub     rsp, 40
        lea     rcx, QWORD PTR start$[rsp]
        call    QWORD PTR __imp_QueryPerformanceCounter
        lea     rcx, QWORD PTR end$[rsp]
        call    QWORD PTR __imp_QueryPerformanceCounter
        mov     rax, QWORD PTR end$[rsp]
        sub     rax, QWORD PTR start$[rsp]
        add     rsp, 40
        ret     0

-- I definitely understand this can't ever be a 1-to-1 comparison and there are limitations with what the JIT can do here given we have a GC and need to transition and that being in pre-emptive mode for too long can have negative side effects as well. However, I also think this is something that we could be better in and that we could improve perf and throughput for these types of scenarios in a meaningful and measurable way.

Likewise its worth calling out that while SuppressGCTransition helps in a number of scenarios, there are others where its not possible to use. There are many functions (especially for Win32) where the functionality is closed source and so its impossible to determine if it does something like take a lock. Likewise, there are many functions (including things like malloc) which qualify for SuppressGCTransition in the majority scenario, but where some edge case scenario ends up doing something like taking a lock and this blocks SuppressGCTransition from being used (at least according to the documentation).

malloc is a good example I think because it often tries to be as fast as possible and ends up heavily optimized. Many open source implementations (including Microsofts own mimalloc) only lock in the "unlikely" scenario where it needs to either create a heap or a new page and uses things like thread locals to be thread-safe.

AndyAyersMS commented 2 years ago

Don't think we'll get to this for 7.0 (if ever) so moving to future.

tannergooding commented 1 year ago

Was talking with @AaronRobinsonMSFT and @jkoritzinsky about this a bit more in relation to some questions I had on P/Invokes in relation to https://github.com/dotnet/runtime/issues/82132.

Ended up iterating here that the exact mechanic through which this "batching" is supported isn't of particular interest to me, but rather that the general scenario be supported in some fashion.

To that end, allowing the user to write something like the following would also satisfy my interest here:

GC.TransitionToNative();
{
    SgctPinvoke1();
    SgctPinvoke2();
    SgctPinvoke3();
}
GC.TransitionToManaged();

This would of course come with special rules about what you could/could not do within the confines of the transitioned area, but would achieve the same general effect with less JIT/VM work required.

tannergooding commented 1 year ago

@jkotas, any thoughts on an API such as the following to achieve this:

namespace System;

public static partial class GC
{
    public static void TransitionToNative();    // Switches the thread to preemptive mode
    public static void TransitionToManaged();   // Switches the thread to coooperative mode
}

There are of course a few variations on the general shape that could be done, such as being closer to TryStartNoGCRegion/EndNoGCRegion, but it does avoid the considerations around needing to otherwise model such transitions more generally in IR and having the JIT implicitly do the analysis to determine "safe or not", so it becomes more pay to play.

It would ultimately be similar to what we have internally in the JIT (via the GCX_COOP and GCX_PREEMP macros) and with the general limitations such code has. Thus once you transition to native you can not use any managed references and can only interact with unmanaged data or pointers to pinned data that were pinned outside the "transitioned region".

This then allows the original premise of avoiding repeated transitions in interop heavy code, such as might appear for certain types of UI, games, GPU compute, or other scenarios. I imagine the biggest concern is around whether or not power users can use such APIs safely/correctly given the special rules/considerations required and that using these is similar to writing a QCALL/FCALL helper internally.

jkotas commented 1 year ago

it does avoid the considerations around needing to otherwise model such transitions more generally in IR

I am not convinced about this. The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.

given the special rules/considerations required and that using these is similar to writing a QCALL/FCALL helper internally.

Right. FCalls and manually managed code are a bug farm and these bugs are very hard to diagnose.

I would strongly prefer that the JIT does the analysis on whether it is safe to perform this optimization.

tannergooding commented 1 year ago

Thanks for the insight!

I think for the JIT to do it implicitly, it would effectively require us modeling the two transitions in IR (likely as "side-effecting intrinsics") and then looking for neighboring TransitionToNative, TransitionToManaged (simple case) or for TransitionToNative, ..., TransitionToManaged where ... touches no object refs (complex case).

This could potentially be made more "pay for play" by only modeling the transitions in IR for methods which have multiple intrinsic calls.

This is also notably similar to how the JIT could handle such public APIs (mark as side effecting to prevent operation movement, and when present do a pass to validate the user didn't use any gcrefs), but I understand the concerns around the potential bug farm it could end up with.

alexrp commented 1 year ago

I am not convinced about this. The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.

I'm not hugely familiar with RyuJIT IR, but my first thought upon reading Tanner's idea is that, conceptually, it just sounds like a variation on memory barriers. Would it be much more complicated to model than those?

tannergooding commented 1 year ago

Memory barriers block reordering of reads/writes but don't have any special rules beyond that.

This is more like modeling state transitions where what you are allowed to do in state A is different from what you're allowed to do in state B.

Doing the "wrong thing" after transitioning into "preemptive" mode can be very bad for the JIT/GC and can lead to all kinds of unexpected crashes/failures (much like using SuppressGCTransition incorrectly).

jkotas commented 1 year ago

Also, memory barriers block reordering of reads/writes in globally visible data only. They do not necessarily block reordering of read/writes of non-escaping data. (https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#thread-local-memory-accesses)

AaronRobinsonMSFT commented 1 year ago

The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.

@jkotas That is a fair criticism that I'd not considered. It would be an interesting exercise to see the cost of having the JIT do all the work and make the decision relative to the user providing an intrinsic that acted as an indicator the JIT can use to limit optimizations. I'm not convinced the JIT can make these decisions without some user gesture, that is my primary reason for advocating a similar model to what @tannergooding is proposing.

jkotas commented 1 year ago

I think that @tannergooding description of this optimization in https://github.com/dotnet/runtime/issues/54107#issuecomment-1431532634 is fairly accurate and it looks pretty doable to me based on that description. It does not look like a very expensive optimization to me.

To re-iterate the estimated benefit of this optimization: The motivating code example from https://github.com/dotnet/runtime/issues/54107#issuecomment-860127522 would go from 63 instructions to 58 instructions (wall clock time improvement would be less). It would be still a far cry from 9 instructions that the equivalent C++ code compiles to.

tannergooding commented 1 year ago

Wouldn't it be a bit better than -5 instructions or are we still needing to create a P/Invoke frame between the two P/Invokes?

That is, assuming we don't need to create a secondary P/Invoke frame since we wouldn't be transitioning to cooperative and then back into pre-emptive, we currently get and should be able to remove something like this:

; Final local variable assignments
;
;  V00 loc0         [V00    ] (  2,  2   )    long  ->  [rbp-40H]   do-not-enreg[X] addr-exposed ld-addr-op
;  V01 loc1         [V01    ] (  1,  1   )    long  ->  [rbp-48H]   do-not-enreg[X] addr-exposed ld-addr-op
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  2,  4   )    long  ->  rax         "Single return block return value"
;  V04 FramesRoot   [V04,T00] ( 10, 10   )    long  ->  rsi         "Pinvoke FrameListRoot"
;  V05 PInvokeFrame [V05    ] ( 12, 12   )     blk (72) [rbp-90H]   do-not-enreg[X] addr-exposed "Pinvoke FrameVar"
;
; Lcl frame size = 120

; Method Program:Delta():ulong
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC78             sub      rsp, 120
       488DAC24B0000000     lea      rbp, [rsp+B0H]

G_M000_IG02:                ;; offset=0018H
       488D8D78FFFFFF       lea      rcx, [rbp-88H]
       498BD2               mov      rdx, r10
       E8A939A65F           call     CORINFO_HELP_INIT_PINVOKE_FRAME
       488BF0               mov      rsi, rax
       488BCC               mov      rcx, rsp
       48894D98             mov      qword ptr [rbp-68H], rcx
       488BCD               mov      rcx, rbp
       48894DA8             mov      qword ptr [rbp-58H], rcx
       488D4DC0             lea      rcx, [rbp-40H]
       48B8009DE65FFA7F0000 mov      rax, 0x7FFA5FE69D00
       48894588             mov      qword ptr [rbp-78H], rax
       488D0519000000       lea      rax, G_M000_IG04
       488945A0             mov      qword ptr [rbp-60H], rax
       488D8578FFFFFF       lea      rax, bword ptr [rbp-88H]
       48894610             mov      qword ptr [rsi+10H], rax
       C6460C00             mov      byte  ptr [rsi+0CH], 0

G_M000_IG03:                ;; offset=0064H
       FF1536632300         call     [Program:QueryPerformanceCounter(ulong):int]

-G_M000_IG04:                ;; offset=006AH
-       C6460C01             mov      byte  ptr [rsi+0CH], 1
-       833D8F9EE25F00       cmp      dword ptr [(reloc 0x7ffabfa5d8a4)], 0
-       7406                 je       SHORT G_M000_IG05
-       FF151BA9E15F         call     [CORINFO_HELP_STOP_FOR_GC]

G_M000_IG05:                ;; offset=007DH
-       488B4D80             mov      rcx, bword ptr [rbp-80H]
-       48894E10             mov      qword ptr [rsi+10H], rcx
       488D4DB8             lea      rcx, [rbp-48H]
-       48B8009DE65FFA7F0000 mov      rax, 0x7FFA5FE69D00
-       48894588             mov      qword ptr [rbp-78H], rax
-       488D0519000000       lea      rax, G_M000_IG07
-       488945A0             mov      qword ptr [rbp-60H], rax
-       488D8578FFFFFF       lea      rax, bword ptr [rbp-88H]
-       48894610             mov      qword ptr [rsi+10H], rax
-       C6460C00             mov      byte  ptr [rsi+0CH], 0

G_M000_IG06:                ;; offset=00B1H
       FF15E9622300         call     [Program:QueryPerformanceCounter(ulong):int]

G_M000_IG07:                ;; offset=00B7H
       C6460C01             mov      byte  ptr [rsi+0CH], 1
       833D429EE25F00       cmp      dword ptr [(reloc 0x7ffabfa5d8a4)], 0
       7406                 je       SHORT G_M000_IG08
       FF15CEA8E15F         call     [CORINFO_HELP_STOP_FOR_GC]

G_M000_IG08:                ;; offset=00CAH
       488B4580             mov      rax, bword ptr [rbp-80H]
       48894610             mov      qword ptr [rsi+10H], rax
       488B45B8             mov      rax, qword ptr [rbp-48H]
       482B45C0             sub      rax, qword ptr [rbp-40H]

G_M000_IG09:                ;; offset=00DAH
       4883C478             add      rsp, 120
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      
; Total bytes of code: 235

That would end up remove some branches and multiple dependent memory accesses which are taking at least 4-12 cycles each.

Basically allowing us to remote the additional epilogue after the first P/Invoke

jkotas commented 1 year ago

You are right that it would be more with the current scheme. I have been thinking about the ideal P/Invoke transition sequence that would be smaller (maybe not as small as 5 instructions, but close).

-       48B8009DE65FFA7F0000 mov      rax, 0x7FFA5FE69D00
-       48894588             mov      qword ptr [rbp-78H], rax

This is keeps track of the PInvoke that we are in. It is necessary to produce accurate stacktraces. We would end up producing inaccurate stacktraces if we removed this.

tannergooding commented 1 year ago

I have been thinking about the ideal P/Invoke transition sequence that would be smaller (maybe not as small as 5 instructions, but close).

Gotcha. Would definitely be nice to get it smaller in general. The overhead isn't terrible or anything, but it does add up and can impact the general frame to frame latency for things like games from what I've seen.

This is keeps track of the PInvoke that we are in. It is necessary to produce accurate stacktraces. We would end up producing inaccurate stacktraces if we removed this.

Is that an issue for release code? We notably don't seem to emit them for SuppressGCTransition methods.

In general removing them for release code seems "no worse" than the type of stack trace/debugging info we lose from inlining or other optimizations. We would likely not be doing this opt for min-opts or debuggable code, but could preserve that info for such code if we did want a partial opt.

jkotas commented 1 year ago

SuppressGCTransition methods have issues with diagnostics in general. They cannot be debugged, stacktraces are less unreliable, etc.

Is that an issue for release code?

It is take-back. It would have to be announced as diagnostic breaking change. It is fairly common for code to spend a non-trivial amount of time in unmanaged code. The diagnostic tools that work with stacktraces would stop seeing the name of the method that is P/Invoked for these cases and the diagnostic of these cases would be harder.

rickbrew commented 1 year ago

As someone who might benefit from this in my code, I think it could be useful to have a language construct for this:

native
{
    NativeCall1(a, b, c);
    NativeCall2(d, e);
}

Inside the native region, the compiler would only permit the use of constructs that are compatible with preemptive mode. Which (IIUC) means P/Invokes and unmanaged types. Anything you use inside of the native region would probably have to be prepared outside of it, even including the use of an overloaded + operator since that is a managed method call.

This would help me get the most out of this, as otherwise an area of code that is highly optimized for this (assuming the JIT infers it automagically) could end up becoming de-optimized by some later maintenance work and there would be no indication that performance had regressed or why.

dotnet / runtime

Support batching GC transitions for P/Invokes #54107

Summary