Open Sergio0694 opened 2 years ago
Tagging subscribers to this area: @JulieLeeMSFT. See info in area-owners.md if you want to be subscribed.
| Author: | Sergio0694 |
|---|---|
| Assignees: | - |
| Labels: | `tenet-performance`, `area-CodeGen-coreclr`, `untriaged` |
| Milestone: | - |
> Goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code
I'm not super convinced of the wins here myself. I think that being able to elide the pin in the case that the pinned value is coming from the stack would be better: https://github.com/dotnet/runtime/issues/40553
The "friendly signature" for something like BOOL QueryPerformanceFrequency(LARGE_INTEGER* lpFrequency)
is bool QueryPerformanceFrequency(out long frequency)
, in which case the typical usage is likely if (QueryPerformanceFrequency(out var frequency))
. In that scenario, the out
has to be pinned, even though the P/Invoke wrapper, once inlined, could see that pinning a stack local isn't required.
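To make that concrete, here is a minimal sketch of what such a raw + "friendly" pair looks like (the attributes and wrapper body are illustrative, not copied from any actual CoreLib or generated interop code):

```csharp
using System.Runtime.InteropServices;

internal static unsafe class Kernel32
{
    // Raw P/Invoke matching the native signature, which takes a LARGE_INTEGER*.
    [DllImport("kernel32.dll", ExactSpelling = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool QueryPerformanceFrequency(long* lpFrequency);

    // "Friendly" overload: callers just write 'QueryPerformanceFrequency(out var frequency)'.
    // The out parameter is pinned before its address crosses the native boundary; after
    // inlining at a call site where 'frequency' is a stack local, the pin is unnecessary,
    // but today it still forces a stack spill for the pinned pointer.
    public static bool QueryPerformanceFrequency(out long frequency)
    {
        fixed (long* pFrequency = &frequency)
        {
            return QueryPerformanceFrequency(pFrequency);
        }
    }
}
```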
Regular pinning is effectively a stack spill, however, and in terms of "typical" method calls it isn't going to be much more expensive than the spilling or register shuffling that already typically happens to meet the callee/caller saved register requirements. Further, in the case of P/Invoke, this spill is nothing compared to the transition stub, which already spills basically everything to ensure that any GC tracked data is not in a register (see https://github.com/dotnet/runtime/issues/54107#issuecomment-860127522, which shows the disassembly for such a transition).
I can still think of many cases where the stack spill for the pinning would be the only one. For instance, this would be the case for `RuntimeHelpers.GetHashCode`, which is otherwise entirely inlined with all locals enregistered. I would imagine that'd be nice to optimize given the speedup it could give downstream to all users indirectly relying on it for hashcode-based data structures. There could also be plenty of other cases where the pinning is the biggest overhead in a method (e.g. `ComPtr<T>.GetPinnableReference()` comes to mind, which would otherwise basically be free if the pinned local was enregistered). All in all I guess I'm just saying, #40553 would sure be nice to have too, but I don't see why this one would be mutually exclusive. Wouldn't it be nice to have both?
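For context, the `GetPinnableReference` pattern mentioned above works roughly like this (a minimal sketch; names and layout are illustrative, not the actual TerraFX/ComputeSharp `ComPtr<T>` implementation):

```csharp
// A tiny smart-pointer-like struct that opts into the C# 'fixed' pattern by exposing
// a public GetPinnableReference method.
public unsafe struct ComPtr<T> where T : unmanaged
{
    private T* pointer;

    public ComPtr(T* pointer) => this.pointer = pointer;

    // Lets call sites write 'fixed (T** pp = comPtr)' to pass the address of the wrapped
    // pointer to native code. A struct method can't 'return ref this.pointer' directly
    // (CS8170), so the reference is obtained by pinning 'this' first; once inlined, the
    // pin itself is essentially the only cost that remains.
    public ref T* GetPinnableReference()
    {
        fixed (ComPtr<T>* pThis = &this)
        {
            return ref pThis->pointer;
        }
    }
}
```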
cc @kunalspathak.
For reference, I did try to use `fixed` as a follow-up to #55273, i.e.:
```csharp
public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        uint syncBlockValue;

        fixed (byte* pData = &o.GetRawData())
        {
            syncBlockValue = ((ObjectHeader*)&((nint*)pData)[-2])->SyncBlockValue;
        }

        const uint BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX = 0x08000000;
        const uint BIT_SBLK_IS_HASHCODE = 0x04000000;
        const uint BITS_IS_VALID_HASHCODE = BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX | BIT_SBLK_IS_HASHCODE;
        const int HASHCODE_BITS = 26;
        const uint MASK_HASHCODE = (1u << HASHCODE_BITS) - 1u;

        if ((syncBlockValue & BITS_IS_VALID_HASHCODE) == BITS_IS_VALID_HASHCODE)
        {
            return unchecked((int)(syncBlockValue & MASK_HASHCODE));
        }
    }

    return InternalGetHashCode(o);
}
```
The performance, though, was way worse than the current implementation:
| Method | Branch | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| GetHashCode | PR | 0.6446 ns | 0.0021 ns | 0.0018 ns | 1.45 | 0.03 | 78 B |
| GetHashCode | main | 0.4462 ns | 0.0102 ns | 0.0091 ns | 1.00 | 0.00 | 9 B |
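For reference, these numbers come from a `RuntimeHelpers.GetHashCode` micro-benchmark; a rough reconstruction of what such a benchmark looks like (the actual benchmark source isn't included in this thread) would be:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class ObjectGetHashCodeBenchmark
{
    // A plain object that doesn't override GetHashCode, so the default identity hash path is hit.
    private readonly object obj = new object();

    [Benchmark]
    public new int GetHashCode() => RuntimeHelpers.GetHashCode(this.obj);
}
```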
The asm in particular was... Interesting:
```asm
; ObjectGetHashCodeBenchmark.GetHashCode()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+20],rax
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L00
       lea       rax,[rcx+8]
       mov       [rsp+20],rax
       mov       rax,[rsp+20]
       mov       eax,[rax+0FFF4]
       xor       edx,edx
       mov       [rsp+20],rdx
       mov       edx,eax
       and       edx,0C000000
       cmp       edx,0C000000
       jne       short M00_L00
       and       eax,3FFFFFF
       jmp       short M00_L01
M00_L00:
       call      System.Runtime.CompilerServices.RuntimeHelpers.InternalGetHashCode(System.Object)
M00_L01:
       nop
       add       rsp,28
       ret
; Total bytes of code 78
```
Looks like there's lots of room for improvement in the codegen in this case?
I will also say: I still think an `Unsafe.AtomicAddByteOffsetAndRead` intrinsic would be useful. It'd both solve this case (no need to use `fixed` at all), and it'd potentially be useful in other scenarios as well.
As in, the code above could then just be:
```csharp
public static unsafe int GetHashCode(object? o)
{
    if (o is not null)
    {
        ref byte dataRef = ref o.GetRawData();
        nint headerData = Unsafe.AtomicAddByteOffsetAndRead<nint>(ref dataRef, -8);
        uint syncBlockValue = ((ObjectHeader*)&headerData)->SyncBlockValue;

        // Rest of the code (with no GC holes)
    }

    return InternalGetHashCode(o);
}
```
@Sergio0694 more like `Unsafe.AtomicSubtractByteOffsetAndRead`
I mean yes, that was just an example (plus you'd get the same anyway with a negative offset). My point was just "some API that can atomically subtract and read with GC tracking and no pinning".
Of course, such an API would only be allowed to read primitives, or even just `int`/`nuint`, it doesn't really matter.
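To make the proposal a bit more concrete, a possible shape for such an API (purely hypothetical: nothing with this name or signature exists on `System.Runtime.CompilerServices.Unsafe` today) might be:

```csharp
public static partial class Unsafe
{
    // Hypothetical JIT intrinsic: adds 'byteOffset' to the address behind 'source' and loads
    // a T from there as a single GC-tracked operation, so 'source' never has to be pinned
    // and no interior pointer escapes into a stack slot. Restricted to unmanaged payloads
    // (or even just int/nuint) so the read can never produce a GC reference.
    public static T AtomicAddByteOffsetAndRead<T>(ref byte source, nint byteOffset)
        where T : unmanaged
        => throw new System.NotImplementedException("Would be implemented as a JIT intrinsic.");
}
```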
This would likely fix #35748 too.
I'd like to +1 this.
In NativeAOT we do have `GetHashCode` implemented in managed code, and also thin locks. All operations with `ObjectHeader` need to pin, and some code paths are otherwise fairly simple. Releasing a thin lock, for example, is just a few checks for rare cases and then setting a bit in the header. The fact that pinning introduces stack locals hurts this scenario somewhat.
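As a rough illustration of why the pinning gets in the way here (a minimal sketch only: the header offset, the bit mask, and the fast-path checks are simplified assumptions modeled on the `GetHashCode` snippet above, not the actual NativeAOT `ObjectHeader`/thin-lock code):

```csharp
using System.Threading;

internal static unsafe class ThinLockSketch
{
    private const uint SBLK_MASK_LOCK_THREADID = 0x0000FFFF; // assumed owner-thread-id bits

    // Fast path for releasing a thin lock: a couple of checks plus an interlocked update of
    // the header. The only reason a stack slot is needed at all is the pinned 'pData' local,
    // since the header lives at a negative offset from the object's data.
    public static bool TryReleaseFastPath(object obj, uint currentThreadId)
    {
        fixed (byte* pData = &obj.GetRawData()) // internal CoreLib helper exposing the first field
        {
            // Header assumed to sit two pointers before the raw data, with the sync block
            // value in its upper 4 bytes, as in the GetHashCode snippet earlier in this thread.
            int* pSyncBlockValue = (int*)((nint*)pData - 2) + 1;
            int value = Volatile.Read(ref *pSyncBlockValue);

            // Rare cases (not owned by this thread, recursion, sync block index, ...) go elsewhere.
            if (((uint)value & SBLK_MASK_LOCK_THREADID) != currentThreadId)
            {
                return false;
            }

            // Clear the owner bits; retry/fallback paths omitted for brevity.
            return Interlocked.CompareExchange(ref *pSyncBlockValue, value & ~(int)SBLK_MASK_LOCK_THREADID, value) == value;
        }
    }
}
```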
Also see https://github.com/dotnet/runtime/pull/97997, which reminded me of this issue.
This is a follow-up from (RIP) #55273, specifically this comment from @jkotas. Opening this issue for tracking so it doesn't get lost.

Overview

Consider this snippet:

This currently results in:

Currently, all pinned locals are always stored on the stack. This makes pinning not really ideal for hot paths. It would be nice if the JIT added support for using a register to store pinned locals, when possible. As mentioned by @tannergooding, the register would need to be cleared when out of scope to stop tracking. The method `A` from above could then become something like this:

Here I just used `rbx` to store the pinned local (I just picked the first callee-saved register). I do realize there's plenty of work needed to make this happen, and all the various GC data structures would need to be updated accordingly to enable tracking, but this is the general idea.

Goes without saying that this optimization could bring some nice codegen/perf wins in interop-heavy code. Additionally, given this could be used to restore the `RuntimeHelpers.GetHashCode` optimization by porting the happy path to C# (as I did in #55273, but possibly without the GC hole ahah), it would automatically speed up virtually every dictionary out there using random reference types as keys, or any other data structure that at some point calls `GetHashCode` on an object that didn't override the default `object.GetHashCode` implementation.

cc. @EgorBo @SingleAccretion
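Since the snippet and the before/after disassembly referenced above aren't reproduced in this thread, here is a hypothetical stand-in for the kind of method being discussed (the name `A` and the body are illustrative only): a method whose only stack traffic today comes from the pinned local.

```csharp
public static unsafe int A(int[] array)
{
    // 'p' is a pinned local: today the JIT always gives it a stack slot, reports that slot
    // to the GC, and zeroes it when the 'fixed' scope ends. With the proposed optimization,
    // 'p' could instead live in a GC-reported callee-saved register such as rbx, cleared
    // when it goes out of scope.
    fixed (int* p = array)
    {
        return *p;
    }
}
```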
category:cq theme:pinning