Enable storing pinned locals in registers

Sergio0694 commented 2 years ago

This is a follow up from (RIP) #55273, specifically this comment from @jkotas. Opening this issue for tracking and avoid that getting lost.

Overview

Consider this snippet:

static unsafe void A(ref byte r)
{
    fixed (void* p = &r)
    {
        B(p);
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
static unsafe void B(void* p)
{
}

This currently results in:

sub rsp, 0x28
mov [rsp+0x20], rcx
call 0x00007ffaecb70028
xor eax, eax
mov [rsp+0x20], rax
add rsp, 0x28
ret

Currently, all pinned locals are always stored on the stack. This makes pinning not really ideal for hot paths. It would be nice if the JIT added support for using a register to store pinned locals, when possible. As mentioned by @tannergooding, the register would need to be cleared when out of scope to stop tracking. The method A from above could then become something like this:

mov rbx, rcx
call 0x00007ffaecb70028
xor rbx, rbx
ret

Here I just used rbx to store the pinned local (just picked the first callee-saved register). I do realize there's plenty of work to make this work and all the various GC data structures need to be updated accordingly to enable tracking, this is the general idea.

Goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code 😄 Additionally, given this could be used to restore the RuntimeHelpers.GetHashCode optimization by porting the happy path to C# (as I did in #55273, but possibly without the GC hole ahah), it would automatically speedup virtually every dictionary out there using random reference types as keys. Or, any other data structure that would call GetHashCode at some point on an object that didn't override the default object.GetHashCode implementation.

cc. @EgorBo @SingleAccretion

category:cq theme:pinning

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Issue Details

This is a follow up from (RIP) #55273, specifically [this comment](https://github.com/dotnet/runtime/pull/55273#discussion_r738528298) from @jkotas. Opening this issue for tracking and avoid that getting lost. ### Overview Consider this snippet: ```csharp static unsafe void A(ref byte r) { fixed (void* p = &r) { B(p); } } [MethodImpl(MethodImplOptions.NoInlining)] static unsafe void B(void* p) { } ``` This currently results in: ```asm sub rsp, 0x28 mov [rsp+0x20], rcx call 0x00007ffaecb70028 xor eax, eax mov [rsp+0x20], rax add rsp, 0x28 ret ``` Currently, all pinned locals are always stored on the stack. This makes pinning not really ideal for hot paths. It would be nice if the JIT added support for using a register to store pinned locals, when possible. As mentioned by @tannergooding, the register would need to be cleared when out of scope to stop tracking. The method `A` from above could then become something like this: ```asm mov rbx, rcx call 0x00007ffaecb70028 xor rbx, rbx ret ``` In this example I just used `rbx` to store the pinned local (just picked the first callee-saved register). Goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code 😄 cc. @EgorBo @SingleAccretion

Author:	Sergio0694
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

tannergooding commented 2 years ago

Goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code

I'm not super convinced of the wins here myself.

I think that being able to elide the pin in the case that the pinned value is coming from the stack would be better: https://github.com/dotnet/runtime/issues/40553

The "friendly signature" for something like BOOL QueryPerformanceFrequency(LARGE_INTEGER* lpFrequency) is bool QueryPerformanceFrequency(out long frequency), in which case the typical usage is likely if (QueryPerformanceFrequency(out var frequency)). In that scenario, the out has to be pinned, even though the P/Invoke wrapper, once inlined, could see that pinning a stack local isn't required.

Regular pinning is effectively a stack spill however, and in the terms of "typical" method calls, isn't going to be much more expensive than the spilling or register shuffling that already typically happens to meet the callee/caller saved register requirements. Further, in the case of P/Invoke, this spill is nothing compared to the transition stub which already spills basically everything to ensure that any GC tracked data is not in a register (see https://github.com/dotnet/runtime/issues/54107#issuecomment-860127522, which shows the disassembly for such a transition).

Sergio0694 commented 2 years ago

I can still think of many cases where the stack spill for the pinning would be the only one. Like, this would be the case for RuntimeHelpers.GetHashCode, which otherwise is all entirely inlined and with all locals enregistered. I would imagine that'd be nice to optimize given the speedup it could give downstream to all users indirectly relying on it for hashcode-based data structures. Also, there could be plenty of other cases where the pinning is the biggest overhead in a method (eg. ComPtr<T>.GetPinnableReference() comes to mind, which would otherwise basically be free if the pinned local was enregistered. All in all I guess I'm just saying, #40553 would sure be nice to have too, but I don't see why this one would be mutually exclusive. Wouldn't it be nice to have both? 😄

JulieLeeMSFT commented 2 years ago

cc @kunalspathak.

Sergio0694 commented 2 years ago

For reference, I did try to use fixed as a follow up to #55273, ie:

public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        uint syncBlockValue;

        fixed (byte* pData = &o.GetRawData())
        {
            syncBlockValue = ((ObjectHeader*)&((nint*)pData)[-2])->SyncBlockValue;
        }

        const uint BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX = 0x08000000;
        const uint BIT_SBLK_IS_HASHCODE = 0x04000000;
        const uint BITS_IS_VALID_HASHCODE = BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX | BIT_SBLK_IS_HASHCODE;
        const int HASHCODE_BITS = 26;
        const uint MASK_HASHCODE = (1u << HASHCODE_BITS) - 1u;

        if ((syncBlockValue & BITS_IS_VALID_HASHCODE) == BITS_IS_VALID_HASHCODE)
        {
            return unchecked((int)(syncBlockValue & MASK_HASHCODE));
        }
    }

    return InternalGetHashCode(o);
}

The performance though was way worse than the current one:

Method	Branch	Mean	Error	StdDev	Ratio	RatioSD	Code Size
GetHashCode	PR	0.6446 ns	0.0021 ns	0.0018 ns	1.45	0.03	78 B
GetHashCode	main	0.4462 ns	0.0102 ns	0.0091 ns	1.00	0.00	9 B

The asm in particular was... Interesting:

; ObjectGetHashCodeBenchmark.GetHashCode()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+20],rax
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L00
       lea       rax,[rcx+8]
       mov       [rsp+20],rax
       mov       rax,[rsp+20]
       mov       eax,[rax+0FFF4]
       xor       edx,edx
       mov       [rsp+20],rdx
       mov       edx,eax
       and       edx,0C000000
       cmp       edx,0C000000
       jne       short M00_L00
       and       eax,3FFFFFF
       jmp       short M00_L01
M00_L00:
       call      System.Runtime.CompilerServices.RuntimeHelpers.InternalGetHashCode(System.Object)
M00_L01:
       nop
       add       rsp,28
       ret
; Total bytes of code 78

Looks like there's lots of room for improvements to the codegen in this case? 🤔

I will also say: I still think an Unsafe.AtomicAddByteOffsetAndRead intrinsic would be useful. It'd both solve this case (no need to use fixed at all), and it'd potentially be useful in other scenarios as well.

As in, the code above could then just be:

public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        ref byte dataRef = ref o.GetRawData();
        nint headerData = Unsafe.AtomicAddByteOffsetAndRead<nint>(ref dataRef, -8);
        uint syncBlockValue = ((ObjectHeader*)&headerData)->SyncBlockValue;

        // Rest of the code (with no GC holes)
    }

    return InternalGetHashCode(o);
}

SupinePandora43 commented 2 years ago

@Sergio0694 more like Unsafe.AtomicSubstractByteOffsetAndRead

Sergio0694 commented 2 years ago

I mean yes, that was just an example (plus you'd get the same anyway with a negative offset). My point was just "some API that can atomically subtract and read with GC tracking and no pinning" 🙂

Of course, such an API would only be allowed to read primitives, or even just int/nuint, doesn't really matter.

jakobbotsch commented 1 year ago

This would likely fix #35748 too.

VSadov commented 8 months ago

I'd like to +1 this.

In NativeAOT we do have GetHashcode implemented in managed code. Also thin locks. All operations with ObjectHeader need to pin and some code paths are otherwise fairly simple. Releasing a thin lock, for example is just a few checks for rare cases and then setting a bit in the header.

The part that pinning introduces stack locals hurts the scenario somewhat.

VSadov commented 8 months ago

Also see: https://github.com/dotnet/runtime/pull/97997 , that reminded me of this issue.

dotnet / runtime

Enable storing pinned locals in registers #63397

Overview