Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS)

rolfbjarne commented 3 years ago

Description

Calling Marshal.AllocHGlobal / Marshal.FreeHGlobal is ~150x slower in .NET compared to legacy Mono when running on a tvOS device.

Sample test code: https://gist.github.com/rolfbjarne/b22b844e6f351ad40c4f30e20a2a36d8

Outputs something like this with legacy Mono (Xamarin.iOS from d16-10):

[...]
Elapsed: 17.03 seconds Counter: 265,779,440.00 Iterations per second: 15,608,870
Elapsed: 18.03 seconds Counter: 281,406,670.00 Iterations per second: 15,608,886
Elapsed: 19.03 seconds Counter: 297,098,501.00 Iterations per second: 15,612,343

which is roughly 15.5M calls to Marshal.AllocHGlobal+FreeHGlobal per second.

Now in .NET I get this:

[...]
Elapsed: 17.10 seconds Counter: 1,773,883.00 Iterations per second: 103,745
Elapsed: 18.10 seconds Counter: 1,878,446.00 Iterations per second: 103,784
Elapsed: 19.10 seconds Counter: 1,982,100.00 Iterations per second: 103,771

that's roughly 103k calls to Marshal.AllocHGlobal+FreeHGlobal per second; ~150x slower.

This is on an Apple TV 4K from 2017.

There's a difference in the simulator too, just not so stark (on an iMac Pro)

Legacy Mono:

[...]
Elapsed: 12.02 seconds Counter: 165,442,863.00 Iterations per second: 13,759,790
Elapsed: 13.02 seconds Counter: 179,568,197.00 Iterations per second: 13,787,606
Elapsed: 14.02 seconds Counter: 193,717,056.00 Iterations per second: 13,812,942

and with .NET:

[...]
Elapsed: 12.03 seconds Counter: 39,917,875.00 Iterations per second: 3,317,120
Elapsed: 13.03 seconds Counter: 43,252,347.00 Iterations per second: 3,318,392
Elapsed: 14.04 seconds Counter: 46,617,318.00 Iterations per second: 3,321,424

so ~4x slower.

I profiled the .NET version on device using instruments: Marshal.trace.zip

Here's a preview:

It seems most of the time is spent inside mono_threads_enter_gc_safe_region_unbalanced.

This function isn't even called in legacy Mono.

Here's an Instruments trace: MarshalMono.trace.zip

and a preview:

I don't know if this applies to other platforms as well, I only tested tvOS.

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

EgorBo commented 3 years ago

Unrelated, just a note: there is also NativeMemory.Alloc which should be faster than AllocHGlobal

filipnavara commented 3 years ago

here is also NativeMemory.Alloc which should be faster than AllocHGlobal

On Unix systems Marshal.AllocHGlobal is just wrapper around NativeMemory.Alloc. On legacy Mono it was an icall, now it's a P/Invoke. It doesn't have the SuppressGCTransition attribute applied so it goes through the expensive GC transitions.

jkotas commented 3 years ago

It doesn't have the SuppressGCTransition attribute applied so it goes through the expensive GC transitions.

malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.

filipnavara commented 3 years ago

malloc/free can take locks or it can end up calling OS syscalls like mmap. It would not be appropriate to mark it with SuppressGCTransition.

I agree it would be a correctness issue to do so. I am still thinking about benchmarking the impact though.

Depending on particular use case there may be other ways to allocate the memory which are less costly (eg. stackalloc for small allocations).

tannergooding commented 3 years ago

I think malloc is an interesting case where common implementations try to ensure it is as fast as possible. If you look at the various implementations available, most only take locks or do syscalls in the rare edge case, not in the common path.

That is, these calls that make it "incompatible" with SuppressGCTransition largely only happen when the underlying heap needs to be created or expanded. Otherwise, small allocations avoid all of this as do many medium sized allocations.

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

EgorBo commented 3 years ago

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

like emitting it as

if (len >= someLimit)
   emit machinery for gc transition
NativeMemory.Alloc(len);

? where someLimit is some small value that won't cause any mmap under the hood

jkotas commented 3 years ago

? where someLimit is some small value that won't cause any mmap under the hood

You cannot tell. Any malloc call, no matter how small the request block size is, can end up taking locks or calling mmap.

EgorBo commented 3 years ago

? where someLimit is some small value that won't cause any mmap under the hood

You cannot tell. Any malloc call, no matter how small the request block size is, can end up taking locks or calling mmap.

Right, just discovered that in glibc sources 👍

jkotas commented 3 years ago

I think the root cause of this problem is the high overhead implementation strategy used for PInvoke transitions on tvOS. This high overhead is a problem for every other PInvoke. For example, globalization PInvokes will hit it too.

filipnavara commented 3 years ago

I'll have to revisit https://github.com/mono/mono/pull/17110 to see if the optimizations can be done correctly.

On my MacBook Air M1 I get these numbers with the provided test case:

CoreCLR under Rosetta: Iterations per second: 12,019,291
Mono under Rosetta: Iterations per second: 4,003,878
Mono ARM64: Iterations per second: 6,845,245

filipnavara commented 3 years ago

There's something weird with my local runtime builds because they behave quite differently to the official ones. I'll have to figure that out first.

Local dotnet/runtime, MacCatalyst ARM64 Debug, Iterations per second: 2,591,953 The same with some optimizations cherry-picked from my earlier attempts to speed up the GC transitions, Iterations per second: 2,750,647

Release builds are way more comparable to the numbers above. My local changes produce Iterations per second: 6,402,773 so basically it's even worse or a problem with my benchmarking numbers. Definitely not the improvement I hoped for.

Another experiment is measuring the overhead by adding SuppressGCTransition (MacCatalyst, Mono, x64 under Rosetta):

Baseline: Iterations per second: 4,003,878
SuppressGCTransition on Free: Iterations per second: 5,573,670
SuppressGCTransition on both Free and Malloc: Iterations per second: 15,170,537

So, yeah, the transitions are crazy expensive.

filipnavara commented 3 years ago

I've done some experiment locally that ensures registers are saved to stack and then saves only part of the context on the thread state transition. Saves some memory copying. It would need a lot of polishing and validation to ensure it does not break anything. Mono ARM64 gets Iterations per second: 8,807,528 with the changes, or about 28% improvement.

Not sure if I can get it ready anytime soon but here's a gist of what I was testing:

The marshalling method already saves the LMF structure (last managed frame) which contains callee saved registers. I added saving of the reference-type arguments into stack variables. This ensures that all the passed parameters (such as SafeHandle) exist somewhere on the stack before the native method is called or before the GC transition frame is established.
Since everything is on the stack there's no need to accurately capture all registers since variables/arguments will need to be spilled to the stack. Thus we need to capture only IP/SP (ie. thread_state_init can be simplified). This saves about 12% of the run time.
The copy_stack method is no longer needed in the P/Invoke flow since everything is on the stack already by the time mono_threads_enter_gc_safe_region_unbalanced is called and the stack is not unwound. This saves another ±15% of the run time.
LLVM-only mode is missing support for save_lmf logic. It should emit llvm.eh.unwind.init intrinsic to spill the callee saved registers.

marek-safar commented 3 years ago

/cc @vargaz

vargaz commented 3 years ago

Fixing/improving this would require risky changes so this will unlikely to be fixed for 6.0.

filipnavara commented 3 years ago

I agree that this is not generally fixable for .NET 6 timeframe. Something like PR #59029 may help a bit while being backportable. PR #58992 is exploring a more radical solution (but definitely not ready for prime time).

SupinePandora43 commented 2 years ago

I'll have to revisit https://github.com/mono/mono/pull/17110 to see if the optimizations can be done correctly.

On my MacBook Air M1 I get these numbers with the provided test case:

CoreCLR under Rosetta: Iterations per second: 12,019,291

Mono under Rosetta: Iterations per second: 4,003,878

Mono ARM64: Iterations per second: 6,845,245

What about CoreCLR ARM64?

akoeplinger commented 2 years ago

Moving to 8.0

dotnet / runtime

Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS) #58939

Description