Open tannergooding opened 3 years ago
Likewise, it would be beneficial if UnmanagedCallersOnly didn't incur the inverse transition (from preemptive to cooperative) in certain cases.
For example, given that an UnmanagedCallersOnly function can only have blittable parameters, you can only have GC tracked data if you do your own allocations or deal with ref/out/in, and so one might conceive that the GC transition (preemptive to cooperative) is only needed if you do something like perform an allocation, create a string, or call into a managed method.
This would allow a user to write something like the following and expect that, since it's only calling other unmanaged code (P/Invokes), no transitions are needed, and this could be used as a way to efficiently batch native calls together:
[UnmanagedCallersOnly]
private static void TransitionMethod()
{
PInvoke();
PInvoke();
PInvoke();
}
CC. @AaronRobinsonMSFT, @jkoritzinsky
https://github.com/dotnet/runtime/issues/54106 is related and requests a way to explicitly mark transition regions
@tannergooding It would be helpful to see the general pattern for this in practice and how we might compute the fall over point where we insert the poll. Batching was discussed during the SuppressGCTransition development. Its design was to be correct by construction, ensuring it permitted cases where it added benefit without negatively impacting our GC SLA goals. Loops were another case where this came up. This proposal would appear to enable violating those goals and as such seems inappropriate given its potential impact. The example appears to enable writing a native code callback ("macro") in pure C# so one doesn't need any native code assets – a laudable goal. However, I don't see arguments in these cases: are the P/Invokes just calls? Would the arguments simply be forwarded to these P/Invokes? Or are we expecting users to write non-GC impacting statements?
I'd like to see more than "helps a niche scenario" (i.e., Vulkan) and a concrete proposal as to how the JIT or the user determines how much batching is too much, too little, or just right. It isn't obvious to me how this feature can be used safely and correctly while adhering to our GC SLA goals.
Likewise, it would be beneficial if UnmanagedCallersOnly didn't incur the inverse transition (from preemptive to cooperative) in certain cases.
When would we make the transition then? How would it be determined? Does the JIT simply make a transition when a call to something not marked with SuppressGCTransition or UnmanagedCallersOnly is made?
I've given this some more thought. I'm struggling to see any other use case here than "writing a native macro function in pure C#". The goal here is for Vulkan-like scenarios where one wants to pass a low overhead callback to native and not pay the cost of owning native assets – very interesting. It seems to me that if this is the use case then what is desired is being able to define an UnmanagedCallersOnly method and have it operate in what we could call an "unsafe" and "GC agnostic" context. I could imagine updating UnmanagedCallersOnly with a bool that indicates "GC agnostic" and avoids the transition when calling the method from native code. The code within this method could then operate on non-ref types and call unmanaged function pointers. We could also consider accepting P/Invokes marked with SuppressGCTransition. Roslyn could help enforce this, I think.
I believe this would enable the specific use case without adding any potentially complicated batching logic – which I find dubious in value. If there is another scenario here I'd like to understand it.
Note that we already batch the most expensive part of the GC transition (saving of callee saved registers, etc.). The part that is not batched is just a few instructions that flip the GC mode, which is very cheap, especially relative to the call that it wraps.
When it comes to micro-optimizations of P/Invoke transitions like this, there are more generally applicable ones we can do: make InlinedPInvokeFrame cheaper, or reduce the number of indirections required to get to the preemptive flag.
I could imagine updating UnmanagedCallersOnly with a bool that indicates "GC agnostic" and avoids the transition when calling the method from native code.
I do not think we would want to complicate the public surface for this sort of optimization. This would make sense to do as a JIT optimization: make the JIT recognize certain patterns where the transitions can be batched or omitted.
I'm struggling to see any other use case here than "writing a native macro function in pure C#".
@AaronRobinsonMSFT, I think I didn't explain well.
The general premise is that there are many scenarios, such as DirectX or Vulkan, where you may have many "back to back" P/Invoke calls. Each of these back to back calls has a transition frame inserted; that adds up and can negatively impact performance/throughput in some scenarios (and for graphics scenarios, where you are targeting say 60fps, this can be meaningful).
Because of this, I think it would be good if the JIT could be smarter here (either implicitly or via some explicit opt-in) such that the general cost here can be reduced.
The simplest comparison is likely in a method that just does:
ulong start;
QueryPerformanceCounter(&start);
ulong end;
QueryPerformanceCounter(&end);
return end - start;
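A complete, compilable form of that snippet might look like the following sketch (Windows-only; the exact DllImport declaration shape here is illustrative, mirroring the fragment above):

```csharp
using System.Runtime.InteropServices;

internal static unsafe class PerfCounter
{
    // kernel32 export; Windows-only. Signature shown for illustration,
    // matching the ulong-based fragment above.
    [DllImport("kernel32.dll")]
    private static extern int QueryPerformanceCounter(ulong* lpPerformanceCount);

    internal static ulong Delta()
    {
        ulong start;
        QueryPerformanceCounter(&start);
        ulong end;
        QueryPerformanceCounter(&end);
        return end - start;
    }
}
```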
Such a method (assuming it's static) doesn't ever interact with the GC in a meaningful way (that is, it never touches a GC tracked type, it doesn't even use byrefs, and it never allocates anything). However, we get 245 bytes of codegen, spilling many registers and performing multiple memory reads/writes, all of which add up:
L0000: push rbp
L0001: push r15
L0003: push r14
L0005: push r13
L0007: push r12
L0009: push rdi
L000a: push rsi
L000b: push rbx
L000c: sub rsp, 0x78
L0010: lea rbp, [rsp+0xb0]
L0018: xor ecx, ecx
L001a: mov [rbp-0x40], rcx
L001e: mov [rbp-0x48], rcx
L0022: lea rcx, [rbp-0x88]
L0029: mov rdx, r10
L002c: call 0x00007ffaa5ca1660
L0031: mov rsi, rax
L0034: mov rcx, rsp
L0037: mov [rbp-0x68], rcx
L003b: mov rcx, rbp
L003e: mov [rbp-0x58], rcx
L0042: lea rcx, [rbp-0x40]
L0046: mov rax, 0x7ffa4ce4ccb0
L0050: mov [rbp-0x78], rax
L0054: lea rax, [L0074]
L005b: mov [rbp-0x60], rax
L005f: lea rax, [rbp-0x88]
L0066: mov [rsi+0x10], rax
L006a: mov byte ptr [rsi+0xc], 0
L006e: call qword ptr [0x7ffa4ce4cd08]
L0074: mov byte ptr [rsi+0xc], 1
L0078: cmp dword ptr [0x7ffaa606c07c], 0
L007f: je short L0087
L0081: call qword ptr [0x7ffaa605e368]
L0087: mov rcx, [rbp-0x80]
L008b: mov [rsi+0x10], rcx
L008f: lea rcx, [rbp-0x48]
L0093: mov rax, 0x7ffa4ce4ccb0
L009d: mov [rbp-0x78], rax
L00a1: lea rax, [L00c1]
L00a8: mov [rbp-0x60], rax
L00ac: lea rax, [rbp-0x88]
L00b3: mov [rsi+0x10], rax
L00b7: mov byte ptr [rsi+0xc], 0
L00bb: call qword ptr [0x7ffa4ce4cd08]
L00c1: mov byte ptr [rsi+0xc], 1
L00c5: cmp dword ptr [0x7ffaa606c07c], 0
L00cc: je short L00d4
L00ce: call qword ptr [0x7ffaa605e368]
L00d4: mov rax, [rbp-0x80]
L00d8: mov [rsi+0x10], rax
L00dc: mov rax, [rbp-0x48]
L00e0: sub rax, [rbp-0x40]
L00e4: lea rsp, [rbp-0x38]
L00e8: pop rbx
L00e9: pop rsi
L00ea: pop rdi
L00eb: pop r12
L00ed: pop r13
L00ef: pop r14
L00f1: pop r15
L00f3: pop rbp
L00f4: ret
Compare this with the codegen for MSVC of the same method, which is roughly 1/5th the size at ~42 bytes:
sub rsp, 40
lea rcx, QWORD PTR start$[rsp]
call QWORD PTR __imp_QueryPerformanceCounter
lea rcx, QWORD PTR end$[rsp]
call QWORD PTR __imp_QueryPerformanceCounter
mov rax, QWORD PTR end$[rsp]
sub rax, QWORD PTR start$[rsp]
add rsp, 40
ret 0
I definitely understand this can't ever be a 1-to-1 comparison, and there are limitations on what the JIT can do here given that we have a GC, need to transition, and that being in pre-emptive mode for too long can have negative side effects as well. However, I also think this is something where we could do better, and that we could improve perf and throughput for these types of scenarios in a meaningful and measurable way.
Likewise, it's worth calling out that while SuppressGCTransition helps in a number of scenarios, there are others where it's not possible to use. There are many functions (especially for Win32) where the functionality is closed source and so it's impossible to determine if it does something like take a lock. Likewise, there are many functions (including things like malloc) which qualify for SuppressGCTransition in the majority scenario, but where some edge case ends up doing something like taking a lock, and this blocks SuppressGCTransition from being used (at least according to the documentation).
malloc is a good example I think because it often tries to be as fast as possible and ends up heavily optimized. Many open source implementations (including Microsoft's own mimalloc) only lock in the "unlikely" scenario where they need to create a heap or a new page, and use things like thread locals to be thread-safe.
Don't think we'll get to this for 7.0 (if ever) so moving to future.
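For context, SuppressGCTransition is applied directly on the P/Invoke declaration; a minimal sketch (the library name and entry point are hypothetical):

```csharp
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Per the documented requirements, this is only valid when the native
    // function is trivial: short-running, non-blocking, takes no locks,
    // and never calls back into managed code.
    [DllImport("nativelib")]
    [SuppressGCTransition]
    internal static extern int FastNativeCall(int value);
}
```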
Was talking with @AaronRobinsonMSFT and @jkoritzinsky about this a bit more in relation to some questions I had on P/Invokes in relation to https://github.com/dotnet/runtime/issues/82132.
Ended up iterating here that the exact mechanic through which this "batching" is supported isn't of particular interest to me, but rather that the general scenario be supported in some fashion.
To that end, allowing the user to write something like the following would also satisfy my interest here:
GC.TransitionToNative();
{
SgctPinvoke1();
SgctPinvoke2();
SgctPinvoke3();
}
GC.TransitionToManaged();
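To make the rules concrete, a fuller sketch of such a region, assuming the hypothetical GC.TransitionToNative/GC.TransitionToManaged APIs and illustrative SuppressGCTransition P/Invokes; any GC tracked data would need to be pinned before entering the region:

```csharp
fixed (byte* pBuffer = buffer) // pin outside the transitioned region
{
    GC.TransitionToNative();
    // Inside the region: only unmanaged data and pointers to
    // already-pinned data may be touched; no allocations, no managed calls.
    SgctPinvoke1(pBuffer);
    SgctPinvoke2(pBuffer);
    SgctPinvoke3(pBuffer);
    GC.TransitionToManaged();
}
```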
This would of course come with special rules about what you could/could not do within the confines of the transitioned area, but would achieve the same general effect with less JIT/VM work required.
@jkotas, any thoughts on an API such as the following to achieve this:
namespace System;
public static partial class GC
{
public static void TransitionToNative(); // Switches the thread to preemptive mode
public static void TransitionToManaged(); // Switches the thread to cooperative mode
}
There are of course a few variations on the general shape that could be done, such as being closer to TryStartNoGCRegion/EndNoGCRegion, but it does avoid the considerations around needing to otherwise model such transitions more generally in IR and having the JIT implicitly do the analysis to determine "safe or not", so it becomes more pay to play.
It would ultimately be similar to what we have internally in the runtime (via the GCX_COOP and GCX_PREEMP macros) and carry the general limitations such code has. Thus once you transition to native you cannot use any managed references and can only interact with unmanaged data or pointers to pinned data that were pinned outside the "transitioned region".
This then enables the original goal of avoiding repeated transitions in interop heavy code, such as might appear for certain types of UI, games, GPU compute, or other scenarios. I imagine the biggest concern is around whether power users can use such APIs safely/correctly given the special rules/considerations required, and that using these is similar to writing a QCALL/FCALL helper internally.
it does avoid the considerations around needing to otherwise model such transitions more generally in IR
I am not convinced about this. The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.
given the special rules/considerations required and that using these is similar to writing a QCALL/FCALL helper internally.
Right. FCalls and manually managed code are a bug farm and these bugs are very hard to diagnose.
I would strongly prefer that the JIT does the analysis on whether it is safe to perform this optimization.
Thanks for the insight!
I think for the JIT to do it implicitly, it would effectively require us modeling the two transitions in IR (likely as "side-effecting intrinsics") and then looking for neighboring TransitionToNative, TransitionToManaged (simple case) or for TransitionToNative, ..., TransitionToManaged where ... touches no object refs (complex case).
This could potentially be made more "pay for play" by only modeling the transitions in IR for methods which have multiple intrinsic calls.
This is also notably similar to how the JIT could handle such public APIs (mark them as side-effecting to prevent operation movement and, when present, do a pass to validate the user didn't use any gcrefs), but I understand the concerns around the potential bug farm it could end up with.
I am not convinced about this. The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.
I'm not hugely familiar with RyuJIT IR, but my first thought upon reading Tanner's idea is that, conceptually, it just sounds like a variation on memory barriers. Would it be much more complicated to model than those?
Memory barriers block reordering of reads/writes but don't have any special rules beyond that.
This is more like modeling state transitions, where what you are allowed to do in state A is different from what you're allowed to do in state B.
Doing the "wrong thing" after transitioning into "preemptive" mode can be very bad for the JIT/GC and can lead to all kinds of unexpected crashes/failures (much like using SuppressGCTransition incorrectly).
Also, memory barriers block reordering of reads/writes in globally visible data only. They do not necessarily block reordering of read/writes of non-escaping data. (https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#thread-local-memory-accesses)
The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.
@jkotas That is a fair criticism that I'd not considered. It would be an interesting exercise to see the cost of having the JIT do all the work and make the decision relative to the user providing an intrinsic that acted as an indicator the JIT can use to limit optimizations. I'm not convinced the JIT can make these decisions without some user gesture, that is my primary reason for advocating a similar model to what @tannergooding is proposing.
I think that @tannergooding's description of this optimization in https://github.com/dotnet/runtime/issues/54107#issuecomment-1431532634 is fairly accurate, and it looks pretty doable to me based on that description. It does not look like a very expensive optimization to me.
To re-iterate the estimated benefit of this optimization: The motivating code example from https://github.com/dotnet/runtime/issues/54107#issuecomment-860127522 would go from 63 instructions to 58 instructions (wall clock time improvement would be less). It would be still a far cry from 9 instructions that the equivalent C++ code compiles to.
Wouldn't it be a bit better than -5 instructions, or are we still needing to create a P/Invoke frame between the two P/Invokes?
That is, assuming we don't need to create a secondary P/Invoke frame (since we wouldn't be transitioning to cooperative and then back into pre-emptive), we currently get, and should be able to remove, something like this:
; Final local variable assignments
;
; V00 loc0 [V00 ] ( 2, 2 ) long -> [rbp-40H] do-not-enreg[X] addr-exposed ld-addr-op
; V01 loc1 [V01 ] ( 1, 1 ) long -> [rbp-48H] do-not-enreg[X] addr-exposed ld-addr-op
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T01] ( 2, 4 ) long -> rax "Single return block return value"
; V04 FramesRoot [V04,T00] ( 10, 10 ) long -> rsi "Pinvoke FrameListRoot"
; V05 PInvokeFrame [V05 ] ( 12, 12 ) blk (72) [rbp-90H] do-not-enreg[X] addr-exposed "Pinvoke FrameVar"
;
; Lcl frame size = 120
; Method Program:Delta():ulong
G_M000_IG01: ;; offset=0000H
55 push rbp
4157 push r15
4156 push r14
4155 push r13
4154 push r12
57 push rdi
56 push rsi
53 push rbx
4883EC78 sub rsp, 120
488DAC24B0000000 lea rbp, [rsp+B0H]
G_M000_IG02: ;; offset=0018H
488D8D78FFFFFF lea rcx, [rbp-88H]
498BD2 mov rdx, r10
E8A939A65F call CORINFO_HELP_INIT_PINVOKE_FRAME
488BF0 mov rsi, rax
488BCC mov rcx, rsp
48894D98 mov qword ptr [rbp-68H], rcx
488BCD mov rcx, rbp
48894DA8 mov qword ptr [rbp-58H], rcx
488D4DC0 lea rcx, [rbp-40H]
48B8009DE65FFA7F0000 mov rax, 0x7FFA5FE69D00
48894588 mov qword ptr [rbp-78H], rax
488D0519000000 lea rax, G_M000_IG04
488945A0 mov qword ptr [rbp-60H], rax
488D8578FFFFFF lea rax, bword ptr [rbp-88H]
48894610 mov qword ptr [rsi+10H], rax
C6460C00 mov byte ptr [rsi+0CH], 0
G_M000_IG03: ;; offset=0064H
FF1536632300 call [Program:QueryPerformanceCounter(ulong):int]
-G_M000_IG04: ;; offset=006AH
- C6460C01 mov byte ptr [rsi+0CH], 1
- 833D8F9EE25F00 cmp dword ptr [(reloc 0x7ffabfa5d8a4)], 0
- 7406 je SHORT G_M000_IG05
- FF151BA9E15F call [CORINFO_HELP_STOP_FOR_GC]
G_M000_IG05: ;; offset=007DH
- 488B4D80 mov rcx, bword ptr [rbp-80H]
- 48894E10 mov qword ptr [rsi+10H], rcx
488D4DB8 lea rcx, [rbp-48H]
- 48B8009DE65FFA7F0000 mov rax, 0x7FFA5FE69D00
- 48894588 mov qword ptr [rbp-78H], rax
- 488D0519000000 lea rax, G_M000_IG07
- 488945A0 mov qword ptr [rbp-60H], rax
- 488D8578FFFFFF lea rax, bword ptr [rbp-88H]
- 48894610 mov qword ptr [rsi+10H], rax
- C6460C00 mov byte ptr [rsi+0CH], 0
G_M000_IG06: ;; offset=00B1H
FF15E9622300 call [Program:QueryPerformanceCounter(ulong):int]
G_M000_IG07: ;; offset=00B7H
C6460C01 mov byte ptr [rsi+0CH], 1
833D429EE25F00 cmp dword ptr [(reloc 0x7ffabfa5d8a4)], 0
7406 je SHORT G_M000_IG08
FF15CEA8E15F call [CORINFO_HELP_STOP_FOR_GC]
G_M000_IG08: ;; offset=00CAH
488B4580 mov rax, bword ptr [rbp-80H]
48894610 mov qword ptr [rsi+10H], rax
488B45B8 mov rax, qword ptr [rbp-48H]
482B45C0 sub rax, qword ptr [rbp-40H]
G_M000_IG09: ;; offset=00DAH
4883C478 add rsp, 120
5B pop rbx
5E pop rsi
5F pop rdi
415C pop r12
415D pop r13
415E pop r14
415F pop r15
5D pop rbp
C3 ret
; Total bytes of code: 235
That would end up removing some branches and multiple dependent memory accesses which take at least 4-12 cycles each, basically allowing us to remove the additional epilogue after the first P/Invoke.
You are right that it would be more with the current scheme. I have been thinking about the ideal P/Invoke transition sequence that would be smaller (maybe not as small as 5 instructions, but close).
- 48B8009DE65FFA7F0000 mov rax, 0x7FFA5FE69D00
- 48894588 mov qword ptr [rbp-78H], rax
This keeps track of the P/Invoke that we are in. It is necessary to produce accurate stacktraces; we would end up producing inaccurate stacktraces if we removed this.
I have been thinking about the ideal P/Invoke transition sequence that would be smaller (maybe not as small as 5 instructions, but close).
Gotcha. Would definitely be nice to get it smaller in general. The overhead isn't terrible or anything, but it does add up and can impact the general frame to frame latency for things like games from what I've seen.
This keeps track of the P/Invoke that we are in. It is necessary to produce accurate stacktraces; we would end up producing inaccurate stacktraces if we removed this.
Is that an issue for release code? We notably don't seem to emit them for SuppressGCTransition methods.
In general, removing them for release code seems "no worse" than the type of stack trace/debugging info we lose from inlining or other optimizations. We would likely not be doing this opt for min-opts or debuggable code, but could preserve that info for such code if we did want a partial opt.
SuppressGCTransition methods have issues with diagnostics in general. They cannot be debugged, stacktraces are less reliable, etc.
Is that an issue for release code?
It is a take-back. It would have to be announced as a diagnostic breaking change. It is fairly common for code to spend a non-trivial amount of time in unmanaged code. The diagnostic tools that work with stacktraces would stop seeing the name of the method that is P/Invoked in these cases, and diagnosing them would be harder.
As someone who might benefit from this in my code, I think it could be useful to have a language construct for this:
native
{
NativeCall1(a, b, c);
NativeCall2(d, e);
}
Inside the native region, the compiler would only permit the use of constructs that are compatible with preemptive mode, which (IIUC) means P/Invokes and unmanaged types. Anything you use inside the native region would probably have to be prepared outside of it, even including the use of an overloaded + operator, since that is a managed method call.
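For example (hypothetical syntax), any managed work such as string marshalling would have to happen before entering the region:

```csharp
// 'native' is a hypothetical keyword; NativeCall1/NativeCall2 are illustrative P/Invokes.
byte* pName = (byte*)Marshal.StringToHGlobalAnsi(name); // managed call done up front
native
{
    NativeCall1(pName, x, y); // only P/Invokes and unmanaged types allowed here
    NativeCall2(pName);
}
Marshal.FreeHGlobal((IntPtr)pName);
```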
This would help me get the most out of this, as otherwise an area of code that is highly optimized for this (assuming the JIT infers it automagically) could end up becoming de-optimized by some later maintenance work and there would be no indication that performance had regressed or why.
Summary
Today, when performing a normal P/Invoke, the JIT inserts a GC transition before and after the call.
This means that if you have logic with several back-to-back P/Invokes, each call ends up wrapped in its own transition pair.
This can be undesirable for interop heavy code as it means many transitions are occurring. Ideally we could instead recognize this pattern and hoist the transitions so a single pair surrounds the whole sequence.
This would increase the performance of interop heavy code, but would keep the code in question in pre-emptive mode for longer, which may also have negative side effects and so may need additional consideration.
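Schematically (the transition helpers shown here are illustrative, not real APIs), the difference is:

```csharp
// Today: each P/Invoke is wrapped in its own transition pair.
TransitionToPreemptive(); PInvoke1(); TransitionToCooperative();
TransitionToPreemptive(); PInvoke2(); TransitionToCooperative();
TransitionToPreemptive(); PInvoke3(); TransitionToCooperative();

// Proposed: one transition pair around the whole batch.
TransitionToPreemptive();
PInvoke1();
PInvoke2();
PInvoke3();
TransitionToCooperative();
```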
category:cq theme:pinvoke skill-level:expert cost:medium impact:medium