dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.91k stars 4.63k forks source link

Should `System.Int64.GetHashCode()` avoid dereferencing `m_value` twice? #105142

Open jakobbotsch opened 1 month ago

jakobbotsch commented 1 month ago

https://github.com/dotnet/runtime/blob/f8bcd05a73e2c102eb940e14bf3f0f9fd23745e5/src/tests/JIT/Regression/JitBlue/GitHub_19149/GitHub_19149.cs#L44-L56

In this test we seem to end up duplicating the loads of x:

; Assembly listing for method CommandBytes:GetHashCode():int:this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T02] (  5,  5   )   byref  ->  rcx         this single-def
;  V01 loc0         [V01    ] (  1,  1   )   byref  ->  [rsp+0x00]  pinned single-def
;  V02 loc1         [V02,T01] (  6,  6   )     int  ->  rcx
;  V03 loc2         [V03,T03] (  6,  6   )    long  ->  registers
;* V04 loc3         [V04,T05] (  0,  0   )     int  ->  zero-ref
;# V05 OutArgs      [V05    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V06 tmp1         [V06,T00] ( 11, 22   )    long  ->  registers   "impSpillLclRefs"
;  V07 tmp2         [V07,T04] (  2,  4   )    long  ->  rcx         "Cast away GC"
;
; Lcl frame size = 8

G_M45379_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
       push     rax
                                                ;; size=1 bbWeight=1 PerfScore 1.00
G_M45379_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0002 {rcx}, byref
                            ; byrRegs +[rcx]
       cmp      byte  ptr [rcx], cl
       mov      bword ptr [rsp], rcx
       lea      rax, [rcx+0x08]
       mov      edx, dword ptr [rcx]
       mov      rcx, qword ptr [rcx]  ; duplicated load
                            ; byrRegs -[rcx]
       sar      rcx, 32
       xor      ecx, edx
       add      ecx, 0xFFFFFFFFF66AE3D3
       lea      rdx, [rax+0x08]
       mov      r8d, dword ptr [rax]
       mov      rax, qword ptr [rax]
       sar      rax, 32
       xor      eax, r8d
       imul     ecx, ecx, 0xFFFFFFFFA5555529
       add      ecx, eax
       mov      rax, rdx
       mov      edx, dword ptr [rax]
       mov      rax, qword ptr [rax]
       sar      rax, 32
       xor      eax, edx
       imul     ecx, ecx, 0xFFFFFFFFA5555529
       add      ecx, eax
       mov      eax, ecx
                                                ;; size=76 bbWeight=1 PerfScore 24.50
G_M45379_IG03:        ; bbWeight=1, epilog, nogc, extend
       add      rsp, 8
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 82, prolog size 1, PerfScore 26.75, instruction count 26, allocated bytes for code 82 (MethodHash=fdc84ebc) for method CommandBytes:GetHashCode():int:this (FullOpts)
; ============================================================
dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 1 month ago

Hmm, this really comes down to the implementation of System.Int64.GetHashCode() that dereferences m_value twice:

https://github.com/dotnet/runtime/blob/f8bcd05a73e2c102eb940e14bf3f0f9fd23745e5/src/libraries/System.Private.CoreLib/src/System/Int64.cs#L106-L109

I wonder if we should introduce a 64-bit implementation that only has a single load.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

EgorBo commented 1 month ago

Hmm, this really comes down to the implementation of System.Int64.GetHashCode() that dereferences m_value twice:

@jakobbotsch Other methods like CompareTo, etc and in all other primitives we do that. Why don't these load get the same VN?

stephentoub commented 1 month ago

I wonder if we should introduce a 64-bit implementation that only has a single load.

Meaning just changing the impl to be:

 public override int GetHashCode() 
 { 
     long value = m_value;
     return unchecked((int)value ^ (int)(value >> 32); 
 } 

? We certainly can, but it'd be nice if it weren't necessary. The value isn't volatile; why can't the JIT eliminate the duplicate load?

EgorBo commented 1 month ago

I guess the minimal repro is:

int Test(long* ptr)
{
    return (*ptr).GetHashCode();
}
       mov      eax, dword ptr [rdx]
       mov      rcx, qword ptr [rdx]
       sar      rcx, 32
       xor      eax, ecx
jakobbotsch commented 1 month ago

I wonder if we should introduce a 64-bit implementation that only has a single load.

Meaning just changing the impl to be:

 public override int GetHashCode() 
 { 
     long value = m_value;
     return unchecked((int)value ^ (int)(value >> 32); 
 } 

Yeah, something like that.

@jakobbotsch Other methods like CompareTo, etc and in all other primitives we do that. Why don't these load get the same VN? ? It'd be nice if it weren't necessary. The value isn't volatile; why can't the JIT eliminate the duplicate load?

Most likely because we optimize one of the indirection to a 32-bit indirection, and then we aren't able to collapse them.

Even if we did collapse them that optimization would likely only happen in Tier-1, while this seems like a reliability thing we would want to guarantee?

EgorBo commented 1 month ago

Do you mean the current C# code is potentially not thread-safe?

stephentoub commented 1 month ago

while this seems like a reliability thing we would want to guarantee?

I don't think we make any guarantees about such operations in the face of concurrent reads/writes / tearing (which presumably is what you're alluding to).

jakobbotsch commented 1 month ago

I don't think we make any guarantees about such operations in the face of concurrent reads/writes / tearing (which presumably is what you're alluding to).

Yeah, that's what I was thinking. I agree it's be nice if the JIT optimized this in some way, but I came at this from the perspective of "it's illegal to duplicate loads in the memory model", and this looked to me like illegal behavior from the JIT initially. In reality what looks like a dereference in the C# source code is not actually one.

EgorBo commented 1 month ago

Legality of the existing C# impl aside, I am curious what would be a proper optimization in JIT, we have a tree like this:

[000003] --C--------                         *  RETURN    int   
[000011] ---XG------                         \--*  XOR       int   
[000005] ---XG------                            +--*  CAST      int <- long
[000004] ---XG------                            |  \--*  IND       long  
[000000] -----------                            |     \--*  LCL_VAR   byref  V01 arg1         
[000010] ---XG------                            \--*  CAST      int <- long
[000009] ---XG------                               \--*  RSH       long  
[000007] ---XG------                                  +--*  IND       long  
[000006] -----------                                  |  \--*  LCL_VAR   byref  V01 arg1          (last use)
[000008] -----------                                  \--*  CNS_INT   int    32

So far so good, both IND should be CSE'd just fine (without violating our MM), but then morph kicks in (optNarrowTree to be precise) and optimized the leading (int)IND<long> into IND<int> so we no longer can do any CSE here (unless CSE can do some super-smart alias analysis)

tannergooding commented 1 month ago

I don't think we make any guarantees about such operations in the face of concurrent reads/writes / tearing (which presumably is what you're alluding to).

I agree with this. At the same time it would be pretty trivial to update the managed code and make this "better" (at least for 64-bit) for the core Int64/UInt64 types and so it might be nice to do so for .NET 10.

but then morph kicks in (optNarrowTree to be precise) and optimized the leading (int)IND into IND so we no longer can do any CSE here (unless CSE can do some super-smart alias analysis)

Like I mentioned on Discord, I think this one feels like the cast(ind(x)) shouldn't be done in morph but rather instead in lowering. I'd expect that's "better" for typical user code and that perf oriented code is likely manually doing tricks like hoisting fields/etc so it won't pessimize them

Alternatively, it feels like we might be missing a simpler opt that looks for common aliases from the same address. That is, we're allowed to fold neighboring loads and the like and we already peephole optimize to ldp on arm64 and similar. So maybe we're missing something that looks for cases like ind32(x) and ind64(x) and still allows CSE, for primitive indirections at least (the CSE would hoist to ind64 and then the ind32 becomes a cast from that). That feels like it wouldn't need super-smart alias analysis to be covered and should also handle this scenario

jeffhandley commented 3 weeks ago

@tannergooding I'm assigning this to you, but I wasn't sure if this is something we'd still consider for .NET 9 or if we should punt it out. If it is something we should do in .NET 9, please coordinate who can take it on and make sure the assignment matches that. Thanks.

tannergooding commented 3 weeks ago

This isn't a correctness thing and is relatively low priority, moving to 10