Open AndyAyersMS opened 6 years ago
For the x64 vs x86 issue: x86 seems to copy less efficiently.
;; X64 Devirtualization...Compare(byref,byref):bool:this
G_M31473_IG01:
4883EC58 sub rsp, 88
C5F877 vzeroupper
G_M31473_IG02:
C4E17A6F02 vmovdqu xmm0, qword ptr [rdx]
C4E17A7F442448 vmovdqu qword ptr [rsp+48H], xmm0
C4C17A6F00 vmovdqu xmm0, qword ptr [r8]
C4E17A7F442438 vmovdqu qword ptr [rsp+38H], xmm0
488D4C2448 lea rcx, bword ptr [rsp+48H]
C4E17A6F442438 vmovdqu xmm0, qword ptr [rsp+38H]
C4E17A7F442428 vmovdqu qword ptr [rsp+28H], xmm0
488D542428 lea rdx, bword ptr [rsp+28H]
E8D4FAFFFF call System.ValueTuple`3[Byte,E,Int32][System.Byte,Devirtualization.EqualityComparer+E,System.Int32]:Equals(struct):bool:this
0FB6C0 movzx rax, al
G_M31473_IG03:
4883C458 add rsp, 88
C3 ret
;; X86 Devirtualization...Compare(byref,byref):bool:this
G_M31473_IG01:
55 push ebp
8BEC mov ebp, esp
83EC18 sub esp, 24
C5F877 vzeroupper
G_M31473_IG02:
8B0A mov ecx, dword ptr [edx]
894DF4 mov dword ptr [ebp-0CH], ecx
8B4A04 mov ecx, dword ptr [edx+4]
894DF8 mov dword ptr [ebp-08H], ecx
8B4A08 mov ecx, dword ptr [edx+8]
894DFC mov dword ptr [ebp-04H], ecx
8B4508 mov eax, bword ptr [ebp+08H]
8B08 mov ecx, dword ptr [eax]
894DE8 mov dword ptr [ebp-18H], ecx
8B4804 mov ecx, dword ptr [eax+4]
894DEC mov dword ptr [ebp-14H], ecx
8B4808 mov ecx, dword ptr [eax+8]
894DF0 mov dword ptr [ebp-10H], ecx
8B4DF0 mov ecx, dword ptr [ebp-10H]
51 push ecx
C4E17A7E45E8 vmovq xmm0, qword ptr [ebp-18H]
83EC08 sub esp, 8
C4E179D60424 vmovq dword ptr [esp], xmm0
8D4DF4 lea ecx, bword ptr [ebp-0CH]
FF15A0519F00 call [System.ValueTuple`3[Byte,E,Int32][System.Byte,Devirtualization.EqualityComparer+E,System.Int32]:Equals(struct):bool:this]
0FB6C0 movzx eax, al
G_M31473_IG03:
8BE5 mov esp, ebp
5D pop ebp
C20400 ret 4
The tuple Equals
asm looks comparable, so suspect the perf diff between x86 and x64 is likely from the above.
For Compare
vs CompareCached
, the latter codegen is:
;; x64 Devirtualization...CompareCached(byref,byref):bool:this
G_M6247_IG01:
4883EC48 sub rsp, 72
C5F877 vzeroupper
G_M6247_IG02:
488B4908 mov rcx, gword ptr [rcx+8]
C4E17A6F02 vmovdqu xmm0, qword ptr [rdx]
C4E17A7F442438 vmovdqu qword ptr [rsp+38H], xmm0
C4C17A6F00 vmovdqu xmm0, qword ptr [r8]
C4E17A7F442428 vmovdqu qword ptr [rsp+28H], xmm0
4C8D442428 lea r8, bword ptr [rsp+28H]
488D542438 lea rdx, bword ptr [rsp+38H]
49BB20003A8BFD7F0000 mov r11, 0x7FFD8B3A0020
3909 cmp dword ptr [rcx], ecx
FF15B1EEEEFF call [System.Collections.Generic.IEqualityComparer`1[ValueTuple`3][System.ValueTuple`3[System.Byte,Devirtualization.EqualityComparer+E,System.Int32]]:Equals(struct,struct):bool:this]
90 nop
G_M6247_IG03:
4883C448 add rsp, 72
C3 ret
;; x86 Devirtualization...CompareCached(byref,byref):bool:this
G_M6247_IG01:
55 push ebp
8BEC mov ebp, esp
C5F877 vzeroupper
G_M6247_IG02:
8B4904 mov ecx, gword ptr [ecx+4]
8B4208 mov eax, dword ptr [edx+8]
50 push eax
C4E17A7E02 vmovq xmm0, qword ptr [edx]
83EC08 sub esp, 8
C4E179D60424 vmovq dword ptr [esp], xmm0
8B5508 mov edx, bword ptr [ebp+08H]
8B4208 mov eax, dword ptr [edx+8]
50 push eax
C4E17A7E02 vmovq xmm0, qword ptr [edx]
83EC08 sub esp, 8
C4E179D60424 vmovq dword ptr [esp], xmm0
FF151000C500 call [System.Collections.Generic.IEqualityComparer`1[ValueTuple`3][System.ValueTuple`3[System.Byte,Devirtualization.EqualityComparer+E,System.Int32]]:Equals(struct,struct):bool:this]
G_M6247_IG03:
5D pop ebp
C20400 ret 4
So one fewer copy in both cases, but then an interface dispatch to compare.
So part of the divergence here is that the struct layout differs on x86 and x64. On x64 we end up with a padding void at the end and so shy away from field-by-field copying. So likely we need to re-examine the block copy heuristics for x86 and see if they should do block copying more often.
It might also be interesting to compare codegen for examples with no padding and see where we end up (say three longs).
Another part is that the arg passing on x86 seems to do copying independent of how locals are copied. So we get field-by-field for the local copies and then a different approach for args. I wonder if we're encountering HW penalties for mixing pushes and stores via esp
as IIRC the hardware would prefer not to see these intermingle.
@fiigii do you know offhand if a sequence like:
50 push eax
C4E17A7E02 vmovq xmm0, qword ptr [edx]
83EC08 sub esp, 8
C4E179D60424 vmovq dword ptr [esp], xmm0
is going to cause a perf hit?
Ideally we'd just get rid of these copies though on x86 with by-value semantics at least one copy is required by the ABI (though we might be able to tail call if we were more clever and avoid that too).
Don't see anything actionable here for 2.1 so am going to move to future.
@AndyAyersMS Sorry for the delay to reply. I think the possible issue in the above code sequence may be store forward. If the value of edx
and esp
are overlapping (or same), the different size of memory read and write may hit store forward.
C4E17A7E02 vmovq xmm0, qword ptr [edx]
^^^^
83EC08 sub esp, 8
C4E179D60424 vmovq dword ptr [esp], xmm0
^^^^
8B5508 mov edx, bword ptr [ebp+08H]
8B4208 mov eax, dword ptr [edx+8]
50 push eax
C4E17A7E02 vmovq xmm0, qword ptr [edx]
^^^^
83EC08 sub esp, 8
C4E179D60424 vmovq dword ptr [esp], xmm0
^^^^
Please see the chapter 3.6.5 of Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
Those dword movq instruction are dubious, they're likely qword sized, not dword.
Mixing ALU instructions that modify ESP and push instructions will probably result in extra uops needed to synchronize the stack engine with the ALU. But it's probably a wash since the alternative of using only pushes would require extra instructions. Or you can try to not use any pushes but then the code gets larger.
More profiling data (i.e., PMU data from VTune) would be useful to root cause this issue.
Those dword movq instruction are dubious, they're likely qword sized, not dword.
Yeah, that instruction is vmovq QWORD PTR [esp], xmm0
. CodeGen manages to emit the instruction using EA_4BYTE
and while the emitter produces the correct encoding the "disassembler" gets confused.
Quick benchmark:
struct foo { public int x, y; }
[MethodImpl(MethodImplOptions.NoInlining)]
static void sink(int x, int y, foo f) { }
[MethodImpl(MethodImplOptions.NoInlining)]
static void sink(int x, int y, int z, int w) { }
foo f;
[Benchmark]
public void MOVQ() => sink(0, 0, f);
[Benchmark]
public void PUSH() => sink(0, 0, f.x, f.y);
Method | Mean | Error | StdDev |
---|---|---|---|
MOVQ | 1.706 ns | 0.0676 ns | 0.0632 ns |
PUSH | 1.135 ns | 0.0557 ns | 0.0521 ns |
; Assembly listing for method Bench:PUSH():this
; Emitting BLENDED_CODE for generic X86 CPU
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> ecx this class-hnd
;
; Lcl frame size = 0
G_M22916_IG01:
G_M22916_IG02:
8B5104 mov edx, dword ptr [ecx+4]
52 push edx
8B4908 mov ecx, dword ptr [ecx+8]
51 push ecx
33C9 xor ecx, ecx
33D2 xor edx, edx
FF150C428C00 call [Bench:sink(int,int,int,int)]
G_M22916_IG03:
C3 ret
; Total bytes of code 19, prolog size 0 for method Bench:PUSH():this
; Assembly listing for method Bench:MOVQ():this
; Emitting BLENDED_CODE for generic X86 CPU
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 3, 3 ) ref -> ecx this class-hnd
;
; Lcl frame size = 0
G_M1695_IG01:
C5F877 vzeroupper
G_M1695_IG02:
83C104 add ecx, 4
C4E17A7E01 vmovq xmm0, qword ptr [ecx]
83EC08 sub esp, 8
C4E179D60424 vmovq qword ptr [esp], xmm0
33C9 xor ecx, ecx
33D2 xor edx, edx
FF15EC419F02 call [Bench:sink(int,int,struct)]
G_M1695_IG03:
C3 ret
; Total bytes of code 31, prolog size 3 for method Bench:MOVQ():this
Apparently using MOVQ isn't such a good idea.
The 3 other value tuple compare tests show x86 is roughly 10-30% slower, but for some reason this test has x86 over 3x slower (~1730 vs ~550). Seems worth investigating.
Also, this is the variant that shows the benefitfrom the
EqualityComparer<T>.Default
optimization added in dotnet/coreclr#13125, yet on x86 it is slower than other variants that do not get optimized.EG
category:cq theme:basic-cq skill-level:expert cost:small