JIT: one case of value tuple compare perf test much slower for x86

AndyAyersMS commented 6 years ago

The 3 other value tuple compare tests show x86 is roughly 10-30% slower, but for some reason this test has x86 over 3x slower (~1730 vs ~550). Seems worth investigating.

Also, this is the variant that shows the benefitfrom the EqualityComparer<T>.Default optimization added in dotnet/coreclr#13125, yet on x86 it is slower than other variants that do not get optimized.

EG

Arch	optimized	cached
x64	550	950
x86	1730	1077

category:cq theme:basic-cq skill-level:expert cost:small

AndyAyersMS commented 6 years ago

For the x64 vs x86 issue: x86 seems to copy less efficiently.

;; X64 Devirtualization...Compare(byref,byref):bool:this

G_M31473_IG01:
       4883EC58             sub      rsp, 88
       C5F877               vzeroupper

G_M31473_IG02:
       C4E17A6F02           vmovdqu  xmm0, qword ptr [rdx]
       C4E17A7F442448       vmovdqu  qword ptr [rsp+48H], xmm0
       C4C17A6F00           vmovdqu  xmm0, qword ptr [r8]
       C4E17A7F442438       vmovdqu  qword ptr [rsp+38H], xmm0
       488D4C2448           lea      rcx, bword ptr [rsp+48H]
       C4E17A6F442438       vmovdqu  xmm0, qword ptr [rsp+38H]
       C4E17A7F442428       vmovdqu  qword ptr [rsp+28H], xmm0
       488D542428           lea      rdx, bword ptr [rsp+28H]
       E8D4FAFFFF           call     System.ValueTuple`3[Byte,E,Int32][System.Byte,Devirtualization.EqualityComparer+E,System.Int32]:Equals(struct):bool:this
       0FB6C0               movzx    rax, al

G_M31473_IG03:
       4883C458             add      rsp, 88
       C3                   ret

;; X86 Devirtualization...Compare(byref,byref):bool:this

G_M31473_IG01:
       55           push     ebp
       8BEC         mov      ebp, esp
       83EC18       sub      esp, 24
       C5F877       vzeroupper

G_M31473_IG02:
       8B0A         mov      ecx, dword ptr [edx]
       894DF4       mov      dword ptr [ebp-0CH], ecx
       8B4A04       mov      ecx, dword ptr [edx+4]
       894DF8       mov      dword ptr [ebp-08H], ecx
       8B4A08       mov      ecx, dword ptr [edx+8]
       894DFC       mov      dword ptr [ebp-04H], ecx
       8B4508       mov      eax, bword ptr [ebp+08H]
       8B08         mov      ecx, dword ptr [eax]
       894DE8       mov      dword ptr [ebp-18H], ecx
       8B4804       mov      ecx, dword ptr [eax+4]
       894DEC       mov      dword ptr [ebp-14H], ecx
       8B4808       mov      ecx, dword ptr [eax+8]
       894DF0       mov      dword ptr [ebp-10H], ecx
       8B4DF0       mov      ecx, dword ptr [ebp-10H]
       51           push     ecx
       C4E17A7E45E8 vmovq    xmm0, qword ptr [ebp-18H]
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    dword ptr [esp], xmm0
       8D4DF4       lea      ecx, bword ptr [ebp-0CH]
       FF15A0519F00 call     [System.ValueTuple`3[Byte,E,Int32][System.Byte,Devirtualization.EqualityComparer+E,System.Int32]:Equals(struct):bool:this]
       0FB6C0       movzx    eax, al

G_M31473_IG03:
       8BE5         mov      esp, ebp
       5D           pop      ebp
       C20400       ret      4

The tuple Equals asm looks comparable, so suspect the perf diff between x86 and x64 is likely from the above.

AndyAyersMS commented 6 years ago

For Compare vs CompareCached, the latter codegen is:


;; x64 Devirtualization...CompareCached(byref,byref):bool:this

G_M6247_IG01:
       4883EC48             sub      rsp, 72
       C5F877               vzeroupper 

G_M6247_IG02:
       488B4908             mov      rcx, gword ptr [rcx+8]
       C4E17A6F02           vmovdqu  xmm0, qword ptr [rdx]
       C4E17A7F442438       vmovdqu  qword ptr [rsp+38H], xmm0
       C4C17A6F00           vmovdqu  xmm0, qword ptr [r8]
       C4E17A7F442428       vmovdqu  qword ptr [rsp+28H], xmm0
       4C8D442428           lea      r8, bword ptr [rsp+28H]
       488D542438           lea      rdx, bword ptr [rsp+38H]
       49BB20003A8BFD7F0000 mov      r11, 0x7FFD8B3A0020
       3909                 cmp      dword ptr [rcx], ecx
       FF15B1EEEEFF         call     [System.Collections.Generic.IEqualityComparer`1[ValueTuple`3][System.ValueTuple`3[System.Byte,Devirtualization.EqualityComparer+E,System.Int32]]:Equals(struct,struct):bool:this]
       90                   nop      

G_M6247_IG03:
       4883C448             add      rsp, 72
       C3                   ret      

;; x86  Devirtualization...CompareCached(byref,byref):bool:this

G_M6247_IG01:
       55           push     ebp
       8BEC         mov      ebp, esp
       C5F877       vzeroupper 

G_M6247_IG02:
       8B4904       mov      ecx, gword ptr [ecx+4]
       8B4208       mov      eax, dword ptr [edx+8]
       50           push     eax
       C4E17A7E02   vmovq    xmm0, qword ptr [edx]
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    dword ptr [esp], xmm0
       8B5508       mov      edx, bword ptr [ebp+08H]
       8B4208       mov      eax, dword ptr [edx+8]
       50           push     eax
       C4E17A7E02   vmovq    xmm0, qword ptr [edx]
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    dword ptr [esp], xmm0
       FF151000C500 call     [System.Collections.Generic.IEqualityComparer`1[ValueTuple`3][System.ValueTuple`3[System.Byte,Devirtualization.EqualityComparer+E,System.Int32]]:Equals(struct,struct):bool:this]

G_M6247_IG03:
       5D           pop      ebp
       C20400       ret      4

So one fewer copy in both cases, but then an interface dispatch to compare.

AndyAyersMS commented 6 years ago

So part of the divergence here is that the struct layout differs on x86 and x64. On x64 we end up with a padding void at the end and so shy away from field-by-field copying. So likely we need to re-examine the block copy heuristics for x86 and see if they should do block copying more often.

It might also be interesting to compare codegen for examples with no padding and see where we end up (say three longs).

Another part is that the arg passing on x86 seems to do copying independent of how locals are copied. So we get field-by-field for the local copies and then a different approach for args. I wonder if we're encountering HW penalties for mixing pushes and stores via esp as IIRC the hardware would prefer not to see these intermingle.

@fiigii do you know offhand if a sequence like:

       50           push     eax
       C4E17A7E02   vmovq    xmm0, qword ptr [edx]
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    dword ptr [esp], xmm0

is going to cause a perf hit?

Ideally we'd just get rid of these copies though on x86 with by-value semantics at least one copy is required by the ABI (though we might be able to tail call if we were more clever and avoid that too).

Don't see anything actionable here for 2.1 so am going to move to future.

fiigii commented 6 years ago

@AndyAyersMS Sorry for the delay to reply. I think the possible issue in the above code sequence may be store forward. If the value of edx and esp are overlapping (or same), the different size of memory read and write may hit store forward.

       C4E17A7E02   vmovq    xmm0, qword ptr [edx]
                                    ^^^^
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    dword ptr [esp], xmm0
                             ^^^^
       8B5508       mov      edx, bword ptr [ebp+08H]
       8B4208       mov      eax, dword ptr [edx+8]
       50           push     eax
       C4E17A7E02   vmovq    xmm0, qword ptr [edx]
                                   ^^^^
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    dword ptr [esp], xmm0
                             ^^^^

Please see the chapter 3.6.5 of Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.

mikedn commented 6 years ago

Those dword movq instruction are dubious, they're likely qword sized, not dword.

Mixing ALU instructions that modify ESP and push instructions will probably result in extra uops needed to synchronize the stack engine with the ALU. But it's probably a wash since the alternative of using only pushes would require extra instructions. Or you can try to not use any pushes but then the code gets larger.

fiigii commented 6 years ago

More profiling data (i.e., PMU data from VTune) would be useful to root cause this issue.

mikedn commented 6 years ago

Those dword movq instruction are dubious, they're likely qword sized, not dword.

Yeah, that instruction is vmovq QWORD PTR [esp], xmm0. CodeGen manages to emit the instruction using EA_4BYTE and while the emitter produces the correct encoding the "disassembler" gets confused.

mikedn commented 6 years ago

Quick benchmark:

    struct foo { public int x, y; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void sink(int x, int y, foo f) { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void sink(int x, int y, int z, int w) { }

    foo f;

    [Benchmark]
    public void MOVQ() => sink(0, 0, f);

    [Benchmark]
    public void PUSH() => sink(0, 0, f.x, f.y);

Method	Mean	Error	StdDev
MOVQ	1.706 ns	0.0676 ns	0.0632 ns
PUSH	1.135 ns	0.0557 ns	0.0521 ns

; Assembly listing for method Bench:PUSH():this
; Emitting BLENDED_CODE for generic X86 CPU
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  4,  4   )     ref  ->  ecx         this class-hnd
;
; Lcl frame size = 0
G_M22916_IG01:
G_M22916_IG02:
       8B5104       mov      edx, dword ptr [ecx+4]
       52           push     edx
       8B4908       mov      ecx, dword ptr [ecx+8]
       51           push     ecx
       33C9         xor      ecx, ecx
       33D2         xor      edx, edx
       FF150C428C00 call     [Bench:sink(int,int,int,int)]
G_M22916_IG03:
       C3           ret
; Total bytes of code 19, prolog size 0 for method Bench:PUSH():this

; Assembly listing for method Bench:MOVQ():this
; Emitting BLENDED_CODE for generic X86 CPU
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  3,  3   )     ref  ->  ecx         this class-hnd
;
; Lcl frame size = 0
G_M1695_IG01:
       C5F877       vzeroupper
G_M1695_IG02:
       83C104       add      ecx, 4
       C4E17A7E01   vmovq    xmm0, qword ptr [ecx]
       83EC08       sub      esp, 8
       C4E179D60424 vmovq    qword ptr [esp], xmm0
       33C9         xor      ecx, ecx
       33D2         xor      edx, edx
       FF15EC419F02 call     [Bench:sink(int,int,struct)]
G_M1695_IG03:
       C3           ret
; Total bytes of code 31, prolog size 3 for method Bench:MOVQ():this

Apparently using MOVQ isn't such a good idea.

dotnet / runtime

JIT: one case of value tuple compare perf test much slower for x86 #9848