dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Consider soft-reserving a register for "zero" on x64 #70287

Open tannergooding opened 2 years ago

tannergooding commented 2 years ago

Platforms like Arm64 have a dedicated "zero register", which means zero is nearly always trivially accessible to codegen. Platforms like Arm32, x86, and x64, however, do not directly expose the concept of a zero register; x86 and x64 in particular support the concept only internally, via the register renamer, which is not exposed to assembly. Likewise, Arm64 has no dedicated zero register for SIMD, even though one exists for general-purpose registers.
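
To make the difference concrete, here is a small illustration of my own (not from the issue); the assembly in the comments is approximate and assumes .NET 7+ with unsafe code enabled:

```csharp
using System.Runtime.Intrinsics;

public static unsafe class ZeroUses
{
    // Arm64 can write zeros through its dedicated zero register, e.g.
    // "stp xzr, xzr, [x0]", without occupying an allocatable register.
    // x64 typically materializes zero first ("xor eax, eax" or
    // "vxorps xmm0, xmm0, xmm0") and then stores that register.
    public static void Clear(long* p)
    {
        p[0] = 0;
        p[1] = 0;
    }

    // Neither architecture has a dedicated SIMD zero register, so comparing
    // against Vector128<float>.Zero generally requires zeroing a SIMD
    // register (or folding the zero into the comparison) on both.
    public static bool IsZero(Vector128<float> v) => v == Vector128<float>.Zero;
}
```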

Due to using SIMD to zero stack locals and the frequent need to use or compare against zero, it is often the case that at least one register in a function is zeroed. On the other hand, not many functions are complex enough to use all 16 registers of a given kind. Because of this, I believe it would be beneficial, at least on x64 (where 16 general-purpose and 16 SIMD registers are available, rising to 32 SIMD registers with AVX-512), to "soft reserve" a register to represent zero. The register allocator would have special support for assigning zero to this register and for making it the "least preferenced" register for other values (so it would likely be the last caller-saved register chosen), ensuring it stays zero for as long as possible.

For most methods, this ensures we initialize zero no more than once per register kind, and in the off chance it needs to be "spilled", we do not actually have to incur the cost of storing the value to the stack: it can be trivially reconstituted when the call returns. For methods that use many registers, there will ideally be no difference from what is generated today, since the desired register will be unavailable and allocation will fall back to whatever is available.

This may also be profitable on x86, but given that it has half the number of registers (8 general-purpose and 8 SIMD, regardless of AVX2 or AVX-512 support), it likely needs more testing and consideration than x64.

category:proposal theme:register-allocator

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

tannergooding commented 2 years ago

CC. @kunalspathak

kunalspathak commented 2 years ago

Thanks @tannergooding. I will look into this, but I don't think it will happen in .NET 7. Marking this for Future.

EgorBo commented 2 years ago

It'd be nice to have a simulated case showing that this makes sense and quantifying the benefits; I suspect it might already be handled under the hood with register renaming, mov elimination, etc.

tannergooding commented 2 years ago

> It'd be nice to have a simulated case showing that this makes sense and quantifying the benefits; I suspect it might already be handled under the hood with register renaming, mov elimination, etc.

Even with register renaming and micro-op caching, we've seen that the additional 2-4 bytes (2 for general-purpose, 4 for SIMD) have a real cost for code size and can negatively impact alignment and other things.

We have several methods where "zero" is initialized multiple times or repeatedly initialized, so this would be an opportunity to reduce that while also simplifying the necessary spill/restore logic, since we know zero can be special-cased.
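
As a hypothetical shape of the problem (the method and names are mine; the commented codegen is approximate):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class RepeatedZero
{
    // "Zero" is consumed on both sides of a call. Because caller-saved SIMD
    // registers do not survive the call, today's codegen typically emits a
    // fresh 4-byte "vxorps" for each use; with a soft-reserved zero register
    // the second materialization could be skipped whenever the register
    // survives, and trivially re-emitted when it does not.
    public static float Sum(Vector128<float> a, Vector128<float> b)
    {
        Vector128<float> x = Vector128.Max(a, Vector128<float>.Zero); // vxorps #1
        Opaque();
        Vector128<float> y = Vector128.Max(b, Vector128<float>.Zero); // vxorps #2
        return (x + y).ToScalar();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Opaque() { }
}
```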

EgorBo commented 2 years ago

Overall it's a general problem with our CSE, which is afraid of all constants (on x86/x64 only) and thinks it's better to re-materialize them than to run out of registers and spill some important loop-dependent variable, or spill/restore a constant because it is "live across call". This can be changed with DOTNET_JitConstCSE=3. I agree 0 is the most popular constant, but if we could properly solve it for the general case it'd be even better.
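
A minimal sketch of the trade-off being described (the constant and names are mine); whether the constant is CSE'd into a register or re-materialized at each use can be observed by flipping DOTNET_JitConstCSE as noted above:

```csharp
public static class ConstCseExample
{
    private const long K = unchecked((long)0x9E3779B97F4A7C15);

    // The same 8-byte constant is used three times. On x86/x64 the JIT's
    // default policy tends to re-materialize it ("mov reg, imm64" per use)
    // rather than CSE it into a register that might later need to be
    // spilled/restored around calls or in a register-hungry loop.
    public static long Mix(long a, long b, long c)
        => (a * K) ^ (b * K) ^ (c * K);
}
```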

tannergooding commented 2 years ago

One of the issues is that we explicitly don't want to CSE zero; or, if we did, we'd need some special LCL_CNS node or another way to observe that the original value was a constant and what that constant was.

This is because there are many places in lowering where we check whether op2 is a constant and contain/specialize it if so (particularly if it is zero). Today, if we CSE those, we just see a LCL_VAR instead and can no longer convert a movmsk reg, ...; cmp ..., ... into a ptest ..., ...
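
Roughly the pattern in question (my example; the exact instructions depend on ISA support). When lowering can still see that op2 is the constant zero, the comparison below can be specialized; an opaque LCL_VAR would defeat that check:

```csharp
using System.Runtime.Intrinsics;

public static class ZeroCompare
{
    // With the zero visible as a constant, lowering can emit something like
    //   vptest xmm0, xmm0
    //   sete   al
    // If the zero had been CSE'd into a plain LCL_VAR, codegen would instead
    // extract a mask (movmsk) and compare it against an integer constant.
    public static bool AllZero(Vector128<int> v) => v == Vector128<int>.Zero;
}
```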

EgorBo commented 2 years ago

> Today, if we CSE those, we just see a LCL_VAR instead and can no longer convert a movmsk

Well, that LCL_VAR is going to have a "constant" VN that we can use to get the original value - but it most likely won't survive until lowering 😞

EgorBo commented 2 years ago

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGueBLAHYYG2ANx9ukpo2IoGAWXIAKAJTTO1HtoYAzaA2VCR/ALwAGMQ34AecucvWA1E/Vad3TR4/YGT0wwoEu7ePL7+gVYA9FEMkIK4GNjC1rgMghAiLADKAKIA5AAmooLFGSIAFhD8iTCF0toAvtLNIbxt0vRM8gqkygCSACI1AA4QuNjAADYwDIVu2l7e+lCGxtYWVrb2jvwuC95LodyFAHTDuGO4MGrBxzxnF1c3qtGx8YnJJmnlrHlFJTKmQYVRqGDq7XuUO0MUMGAgoji2CmM3gEwAbhDYABzA4eVpNGiNIA=

tannergooding commented 2 years ago

Yep. I think it would be good if we had a better way to handle that and ensure we get the benefits from both ends. Maybe a LCL_CNS node, a flag on LCL_VAR, or some other special pattern here would be "goodness", ensuring we can always CSE constants while still being able to do specialized containment, etc.

huoyaoyuan commented 2 years ago

How do we prevent it from being clobbered by interop code? Would this add overhead to interop calls? Or do we just rely on the register-saving convention?

EgorBo commented 2 years ago

> How do we prevent it from being clobbered by interop code? Would this add overhead to interop calls? Or do we just rely on the register-saving convention?

If we use a callee-saved register, then interop code is expected to preserve it (save it in its prologue and restore it in its epilogue). The problem is that there are not that many callee-saved registers (it depends on the ABI), and I suspect we already rely on some of them for other things.

kunalspathak commented 2 years ago

I happened to investigate the performance of DMath and noticed we create zeros many times in the hot blocks IG03 through IG06.

For reference, C++ compilers store it in a register at the beginning: https://godbolt.org/z/osTrMfooj

Assembly code:

```asm
; Assembly listing for method Benchstone.BenchF.DMath:Bench(int):bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No matching PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0    [V00,T03] ( 4,   7  )  int    -> rsi  single-def
;  V01 loc0    [V01,T01] ( 3,  18  )  ref    -> rdi  class-hnd exact single-def
;  V02 loc1    [V02,T14] ( 6,  84  )  double -> mm8
;  V03 loc2    [V03,T12] ( 2, 144  )  double -> mm10
;  V04 loc3    [V04,T09] ( 3, 384  )  double -> mm12
;  V05 loc4    [V05,T08] ( 5, 416  )  double -> registers
;  V06 loc5    [V06,T10] ( 3, 384  )  double -> mm15
;  V07 loc6    [V07,T07] ( 6, 656  )  double -> mm11
;* V08 loc7    [V08    ] ( 0,   0  )  double -> zero-ref
;  V09 loc8    [V09,T02] ( 4,  13  )  int    -> rbx
;  V10 OutArgs [V10    ] ( 1,   1  )  lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;* V11 tmp1    [V11    ] ( 0,   0  )  double -> zero-ref "impAppendStmt"
;* V12 tmp2    [V12    ] ( 0,   0  )  double -> zero-ref "impAppendStmt"
;  V13 tmp3    [V13,T05] ( 4, 768  )  double -> [rsp+20H] "Inline stloc first use temp"
;  V14 tmp4    [V14,T04] ( 5,2048  )  double -> mm0  "Inlining Arg"
;  V15 cse0    [V15,T00] ( 3,  48  )  int    -> rax  "CSE - aggressive"
;  V16 cse1    [V16,T06] ( 6, 660  )  double -> mm9  "CSE - aggressive"
;  V17 cse2    [V17,T11] ( 2, 272  )  double -> mm13 "CSE - aggressive"
;  V18 cse3    [V18,T13] ( 2, 144  )  double -> mm14 "CSE - aggressive"
;  V19 cse4    [V19,T15] ( 2,  17  )  double -> mm7  "CSE - aggressive"
;  V20 cse5    [V20,T16] ( 2,  17  )  double -> mm6  "CSE - aggressive"
;  TEMP_01                            double -> [rsp+0x28]
;
; Lcl frame size = 208

G_M59317_IG01:
       push rdi
       push rsi
       push rbx
       sub rsp, 208
       vzeroupper
       vmovaps qword ptr [rsp+C0H], xmm6
       vmovaps qword ptr [rsp+B0H], xmm7
       vmovaps qword ptr [rsp+A0H], xmm8
       vmovaps qword ptr [rsp+90H], xmm9
       vmovaps qword ptr [rsp+80H], xmm10
       vmovaps qword ptr [rsp+70H], xmm11
       vmovaps qword ptr [rsp+60H], xmm12
       vmovaps qword ptr [rsp+50H], xmm13
       vmovaps qword ptr [rsp+40H], xmm14
       vmovaps qword ptr [rsp+30H], xmm15
       mov esi, ecx
       ;; size=90 bbWeight=1 PerfScore 24.50

G_M59317_IG02:
       mov rcx, 0xD1FFAB1E ; System.Double[]
       mov edx, 91
       call CORINFO_HELP_NEWARR_1_VC
       mov rdi, rax
       mov ebx, 1
       cmp esi, 1
       jl G_M59317_IG10
       vmovsd xmm6, qword ptr [reloc @RWD00]
       vmovsd xmm7, qword ptr [reloc @RWD08]
       ;; size=53 bbWeight=1 PerfScore 9.25

G_M59317_IG03:
       vxorps xmm8, xmm8
       vmovsd xmm9, qword ptr [reloc @RWD16]
       ;; size=13 bbWeight=4 PerfScore 13.33

G_M59317_IG04:
       vdivsd xmm10, xmm8, xmm7
       vxorps xmm11, xmm11
       vxorps xmm12, xmm12
       vmovsd xmm13, qword ptr [reloc @RWD24]
       vmovsd xmm14, qword ptr [reloc @RWD32]
       ;; size=30 bbWeight=16 PerfScore 298.67

G_M59317_IG05:
       vaddsd xmm0, xmm11, xmm11
       vaddsd xmm15, xmm0, xmm9
       vmovaps xmm0, xmm11
       vmovaps xmm1, xmm9
       vxorps xmm2, xmm2
       vucomisd xmm0, xmm2
       jbe G_M59317_IG13
       align [0 bytes for IG06]
       ;; size=34 bbWeight=128 PerfScore 1258.67

G_M59317_IG06:
       vmulsd xmm1, xmm1, xmm13
       vmovsd qword ptr [rsp+20H], xmm1
       vsubsd xmm0, xmm0, xmm9
       vxorps xmm2, xmm2
       vucomisd xmm0, xmm2
       ja G_M59317_IG12
       ;; size=30 bbWeight=256 PerfScore 2645.33

G_M59317_IG07:
       vmovaps xmm0, xmm15
       call [Benchstone.BenchF.DMath:Fact(double):double]
       vmovsd xmm1, qword ptr [rsp+20H]
       vdivsd xmm2, xmm1, xmm0
       vmovsd qword ptr [rsp+28H], xmm2
       vmovaps xmm0, xmm10
       vmovaps xmm1, xmm15
       call [Benchstone.BenchF.DMath:Power(double,double):double]
       vmovsd xmm2, qword ptr [rsp+28H]
       vmulsd xmm0, xmm2, xmm0
       vaddsd xmm0, xmm0, xmm12
       vaddsd xmm11, xmm11, xmm9
       vsubsd xmm1, xmm12, xmm0
       vandps xmm1, xmm1, qword ptr [reloc @RWD48]
       vucomisd xmm1, xmm14
       vmovaps xmm12, xmm0
       ja G_M59317_IG05
       ;; size=90 bbWeight=128 PerfScore 5504.00

G_M59317_IG08:
       vcvttsd2si eax, xmm8
       cmp eax, 91
       jae G_M59317_IG14
       mov eax, eax
       vmovsd qword ptr [rdi+8*rax+16], xmm12
       vaddsd xmm8, xmm8, xmm9
       vucomisd xmm6, xmm8
       jae G_M59317_IG04
       ;; size=38 bbWeight=16 PerfScore 248.00

G_M59317_IG09:
       inc ebx
       cmp ebx, esi
       jle G_M59317_IG03
       ;; size=10 bbWeight=4 PerfScore 6.00

G_M59317_IG10:
       mov rcx, rdi
       call [hackishModuleName:hackishMethodName()]
       mov eax, 1
       ;; size=14 bbWeight=1 PerfScore 3.50

G_M59317_IG11:
       vmovaps xmm6, qword ptr [rsp+C0H]
       vmovaps xmm7, qword ptr [rsp+B0H]
       vmovaps xmm8, qword ptr [rsp+A0H]
       vmovaps xmm9, qword ptr [rsp+90H]
       vmovaps xmm10, qword ptr [rsp+80H]
       vmovaps xmm11, qword ptr [rsp+70H]
       vmovaps xmm12, qword ptr [rsp+60H]
       vmovaps xmm13, qword ptr [rsp+50H]
       vmovaps xmm14, qword ptr [rsp+40H]
       vmovaps xmm15, qword ptr [rsp+30H]
       add rsp, 208
       pop rbx
       pop rsi
       pop rdi
       ret
       ;; size=86 bbWeight=1 PerfScore 42.75

G_M59317_IG12:
       vmovsd xmm1, qword ptr [rsp+20H]
       jmp G_M59317_IG06
       ;; size=11 bbWeight=128 PerfScore 640.00

G_M59317_IG13:
       vmovsd qword ptr [rsp+20H], xmm1
       jmp G_M59317_IG07
       ;; size=11 bbWeight=64 PerfScore 192.00

G_M59317_IG14:
       call CORINFO_HELP_RNGCHKFAIL
       int3
       ;; size=6 bbWeight=0 PerfScore 0.00

RWD00 dq 4056800000000000h ; 90
RWD08 dq 404CA5DC1A5D2372h ; 57.2957795
RWD16 dq 3FF0000000000000h ; 1
RWD24 dq BFF0000000000000h ; -1
RWD32 dq 3E45798EE2308C3Ah ; 1e-08
RWD40 dd 00000000h, 00000000h
RWD48 dq 7FFFFFFFFFFFFFFFh ; nan
      dq 7FFFFFFFFFFFFFFFh ; nan

; Total bytes of code 516, prolog size 90, PerfScore 10941.80, instruction count 98, allocated bytes for code 558 (MethodHash=6ff7184a) for method Benchstone.BenchF.DMath:Bench(int):bool
; ============================================================
```

tannergooding commented 2 years ago

> If we use a callee-saved register, then interop code is expected to preserve it (save it in its prologue and restore it in its epilogue). The problem is that there are not that many callee-saved registers (it depends on the ABI), and I suspect we already rely on some of them for other things.

Right. This is why I think a caller-saved register is likely better. It means the method can do the "most efficient" thing: it can avoid spilling (floating-point and SIMD constants are either single instructions or CLS_VAR constants with a dedicated location to restore from) and easily rematerialize the value if it's still needed.
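
To illustrate "rematerialize instead of spill" (my sketch; the assembly in the comments is illustrative only):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class RematZero
{
    // A value known to be zero never needs a stack slot. Instead of
    //   vmovaps qword ptr [rsp+20H], xmm2   ; spill before the call
    //   vmovaps xmm2, qword ptr [rsp+20H]   ; reload after it
    // the allocator can simply re-emit the one-instruction idiom
    //   vxorps xmm2, xmm2, xmm2
    // after the call returns.
    public static Vector128<float> ClampToZero(Vector128<float> a)
    {
        Vector128<float> zero = Vector128<float>.Zero;
        Opaque();                          // zero does not survive the call
        return Vector128.Max(a, zero);     // reconstituted, not reloaded
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Opaque() { }
}
```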