Open rolfbjarne opened 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Unrelated, just a note: there is also NativeMemory.Alloc, which should be faster than AllocHGlobal.
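For illustration, a minimal sketch of allocating and freeing with NativeMemory directly (this example is illustrative rather than taken from the discussion; it needs .NET 6+ and AllowUnsafeBlocks):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class NativeMemoryExample
{
    public static unsafe void Run()
    {
        // NativeMemory.Alloc returns a raw void*; the caller owns the memory
        // and must free it exactly once with NativeMemory.Free.
        void* buffer = NativeMemory.Alloc((nuint)256);
        try
        {
            // Wrap the raw memory in a Span<byte> to work with it safely.
            new Span<byte>(buffer, 256).Clear();
        }
        finally
        {
            NativeMemory.Free(buffer);
        }
    }
}
```

Unlike AllocHGlobal, NativeMemory.Alloc takes a nuint byte count and returns a raw void*; like malloc, it does not zero the memory (NativeMemory.AllocZeroed does).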
On Unix systems Marshal.AllocHGlobal is just a wrapper around NativeMemory.Alloc. On legacy Mono it was an icall; now it's a P/Invoke. It doesn't have the SuppressGCTransition attribute applied, so it goes through the expensive GC transitions.
malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.
I agree it would be a correctness issue to do so. I am still thinking about benchmarking the impact though.
Depending on the particular use case there may be other ways to allocate the memory which are less costly (e.g. stackalloc for small allocations).
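To illustrate, a minimal sketch of that kind of size-based fallback (the 512-byte threshold and the helper names are made up for the example):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class ScratchBufferExample
{
    // Illustrative threshold; real code would pick a limit that fits its
    // stack budget.
    private const int StackAllocThreshold = 512;

    public static unsafe void Process(int length)
    {
        if (length <= StackAllocThreshold)
        {
            // Small buffers live on the stack: no allocator call at all.
            Span<byte> buffer = stackalloc byte[length];
            Fill(buffer);
        }
        else
        {
            // Larger buffers fall back to unmanaged memory.
            byte* ptr = (byte*)NativeMemory.Alloc((nuint)length);
            try
            {
                Fill(new Span<byte>(ptr, length));
            }
            finally
            {
                NativeMemory.Free(ptr);
            }
        }
    }

    private static void Fill(Span<byte> buffer) => buffer.Fill(0xFF);
}
```

On the small path there is no free call and no P/Invoke at all; the stackalloc memory is released automatically when the method returns.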
I think malloc is an interesting case where common implementations try to ensure it is as fast as possible. If you look at the various implementations available, most only take locks or do syscalls in the rare edge case, not in the common path. That is, the calls that make it "incompatible" with SuppressGCTransition largely only happen when the underlying heap needs to be created or expanded. Otherwise, small allocations avoid all of this, as do many medium-sized allocations.
It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.
Like emitting it as:

    if (len >= someLimit)
        // emit machinery for gc transition
    NativeMemory.Alloc(len);

where someLimit is some small value that won't cause any mmap under the hood?
You cannot tell. Any malloc call, no matter how small the requested block size is, can end up taking locks or calling mmap.
Right, just discovered that in glibc sources 👍
I think the root cause of this problem is the high-overhead implementation strategy used for P/Invoke transitions on tvOS. This high overhead is a problem for every other P/Invoke; for example, globalization P/Invokes will hit it too.
I'll have to revisit https://github.com/mono/mono/pull/17110 to see if the optimizations can be done correctly.
On my MacBook Air M1 I get these numbers with the provided test case:
- CoreCLR under Rosetta: Iterations per second: 12,019,291
- Mono under Rosetta: Iterations per second: 4,003,878
- Mono ARM64: Iterations per second: 6,845,245
There's something weird with my local runtime builds because they behave quite differently to the official ones. I'll have to figure that out first.
- Local dotnet/runtime, MacCatalyst ARM64 Debug: Iterations per second: 2,591,953
- The same with some optimizations cherry-picked from my earlier attempts to speed up the GC transitions: Iterations per second: 2,750,647

Release builds are way more comparable to the numbers above. My local changes produce Iterations per second: 6,402,773, so basically it's even worse, or there's a problem with my benchmarking numbers. Definitely not the improvement I hoped for.
Another experiment is measuring the overhead by adding SuppressGCTransition (MacCatalyst, Mono, x64 under Rosetta):
- Baseline (no SuppressGCTransition): Iterations per second: 4,003,878
- SuppressGCTransition on Free: Iterations per second: 5,573,670
- SuppressGCTransition on Free and Malloc: Iterations per second: 15,170,537
So, yeah, the transitions are crazy expensive.
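For context, a hedged sketch of the sort of P/Invoke declarations such an experiment might compare; the library name and method names here are assumptions, and as noted above, suppressing the transition for malloc/free is only acceptable for measuring the overhead, not for correct production code:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class MallocTransitionExperiment
{
    // Regular P/Invoke: every call pays the full GC transition.
    [DllImport("libSystem.dylib", EntryPoint = "malloc")]
    private static extern IntPtr Malloc(nuint size);

    [DllImport("libSystem.dylib", EntryPoint = "free")]
    private static extern void Free(IntPtr ptr);

    // Experimental variants that skip the GC transition. Not correct in
    // general: malloc/free can take locks or call into the OS (e.g. mmap),
    // which is unsafe while the runtime still treats the thread as running
    // managed code.
    [DllImport("libSystem.dylib", EntryPoint = "malloc")]
    [SuppressGCTransition]
    private static extern IntPtr MallocNoTransition(nuint size);

    [DllImport("libSystem.dylib", EntryPoint = "free")]
    [SuppressGCTransition]
    private static extern void FreeNoTransition(IntPtr ptr);
}
```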
I've done some experiments locally that ensure registers are saved to the stack and then save only part of the context on the thread state transition. That saves some memory copying. It would need a lot of polishing and validation to ensure it does not break anything. Mono ARM64 gets Iterations per second: 8,807,528 with the changes, or about a 28% improvement.

Not sure if I can get it ready anytime soon, but here's a gist of what I was testing:
- … (eg. SafeHandle) exist somewhere on the stack before the native method is called or before the GC transition frame is established.
- … (thread_state_init can be simplified). This saves about 12% of the run time.
- The copy_stack method is no longer needed in the P/Invoke flow since everything is on the stack already by the time mono_threads_enter_gc_safe_region_unbalanced is called and the stack is not unwound. This saves another ±15% of the run time.
- … save_lmf logic. It should emit the llvm.eh.unwind.init intrinsic to spill the callee-saved registers.

/cc @vargaz
Fixing/improving this would require risky changes, so this is unlikely to be fixed for 6.0.
I agree that this is not generally fixable in the .NET 6 timeframe. Something like PR #59029 may help a bit while being backportable. PR #58992 is exploring a more radical solution (but definitely not ready for prime time).
> On my MacBook Air M1 I get these numbers with the provided test case:
> - CoreCLR under Rosetta: Iterations per second: 12,019,291
> - Mono under Rosetta: Iterations per second: 4,003,878
> - Mono ARM64: Iterations per second: 6,845,245
What about CoreCLR ARM64?
Moving to 8.0
Description
Calling Marshal.AllocHGlobal / Marshal.FreeHGlobal is ~150x slower in .NET compared to legacy Mono when running on a tvOS device.
Sample test code: https://gist.github.com/rolfbjarne/b22b844e6f351ad40c4f30e20a2a36d8
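The gist itself isn't reproduced here; a minimal benchmark in the spirit of that description (the iteration count, allocation size, and output formatting are illustrative, not the gist's exact code) could look like this:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

internal static class AllocHGlobalBenchmark
{
    public static void Main()
    {
        const int iterations = 1_000_000; // illustrative count
        var stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
        {
            // The pair of calls being measured.
            IntPtr memory = Marshal.AllocHGlobal(16);
            Marshal.FreeHGlobal(memory);
        }

        stopwatch.Stop();
        double perSecond = iterations / stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine($"Iterations per second: {perSecond:N0}");
    }
}
```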
With legacy Mono (Xamarin.iOS from d16-10) the output works out to roughly 15.5M calls to Marshal.AllocHGlobal+FreeHGlobal per second.
Now in .NET I get roughly 103k calls to Marshal.AllocHGlobal+FreeHGlobal per second; ~150x slower.
This is on an Apple TV 4K from 2017.
There's a difference in the simulator too (on an iMac Pro), just not as stark: .NET is ~4x slower than legacy Mono.
I profiled the .NET version on device using Instruments: Marshal.trace.zip
Here's a preview:
It seems most of the time is spent inside mono_threads_enter_gc_safe_region_unbalanced. This function isn't even called in legacy Mono.
Here's an Instruments trace: MarshalMono.trace.zip
and a preview:
I don't know if this applies to other platforms as well; I only tested tvOS.