dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

HWIntrinsics: Load folding to immediate address? #12308

Open Zhentar opened 5 years ago

Zhentar commented 5 years ago

I've been having a go at porting the XXH3 hash algorithm, including the SSE & AVX versions. My current AVX2 code for the hot loop is here: https://github.com/Zhentar/xxHash3.NET/blob/ee6a626e87f2a829ec786690d4dfa560d876dda7/xxHash3/xxHash3_AVX2.cs#L103

So far I've gotten it up to 36 GB/s, against the clang-compiled native version's ~40 GB/s.

One sub-piece of the clang output looks like this:

vmovdqu ymm3, ymmword ptr [rax-360h]
vpaddd  ymm4, ymm3, cs:ymmword_40BDC0
vpshufd ymm6, ymm4, 31h
vpmuludq ymm4, ymm6, ymm4
vpaddq  ymm3, ymm5, ymm3
vpaddq  ymm0, ymm0, ymm3

While my version looks like this:

vmovupd ymm8,ymmword ptr [r10+88h]
vmovupd ymm9,ymmword ptr [r11+360h]
vpaddd  ymm8,ymm9,ymm8
vpshufd ymm10,ymm8,31h
vpmuludq ymm8,ymm8,ymm10
vpaddq  ymm1,ymm9,ymm1
vpaddq  ymm1,ymm8,ymm1

Or, if I arrange the code such that folding kicks in (by uncommenting the in on the ProcessStripePiece_AVX2 key argument), this:

lea     r14,[rax+20h]
vmovupd ymm4,ymmword ptr [rbp+100h]
vpaddd  ymm5,ymm4,ymmword ptr [r14]
vpshufd ymm6,ymm5,31h
vpmuludq ymm5,ymm5,ymm6
vpaddq  ymm0,ymm4,ymm0
vpaddq  ymm0,ymm5,ymm0

However, the folded version performs worse, because the lea competes with the add/shuf/mul instructions for an integer ALU port instead of a load port.

Is there any way to get an immediate address folded into the vpaddd, rather than a displacement computed at execution time? I've tried a static readonly field, but that still resulted in an lea displacement calculation.
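As a hedged illustration of the two shapes being compared (hypothetical names and a placeholder key constant, not the exact code from the linked repo), the difference comes down to how the key vector is passed:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Sketch
{
    // Assumption: a key vector that is constant across the whole hash.
    static readonly Vector256<uint> Key = Vector256.Create(0x9E3779B1u);

    // By value: per the codegen shown above, RyuJIT reloads the key with a
    // separate vmovupd, so vpaddd only ever sees a register operand.
    static Vector256<ulong> PieceByValue(Vector256<uint> data, Vector256<uint> key)
    {
        var sum = Avx2.Add(data, key);     // vpaddd
        var hi  = Avx2.Shuffle(sum, 0x31); // vpshufd
        return Avx2.Multiply(sum, hi);     // vpmuludq (32x32 -> 64)
    }

    // By readonly reference ("in"): the load can be folded into vpaddd as a
    // memory operand, but the address may be materialized with an lea first.
    static Vector256<ulong> PieceByRef(Vector256<uint> data, in Vector256<uint> key)
    {
        var sum = Avx2.Add(data, key);
        var hi  = Avx2.Shuffle(sum, 0x31);
        return Avx2.Multiply(sum, hi);
    }
}
```

What is being asked for is a third shape, matching clang's output: the folded memory operand with the key's address baked in as a fixed displacement, so that no lea is needed at all.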

category:cq theme:hardware-intrinsics skill-level:expert cost:medium

fiigii commented 5 years ago

https://github.com/dotnet/coreclr/pull/22944 may help.

RussKeldorph commented 5 years ago

@CarolEidt @tannergooding @dotnet/jit-contrib

4creators commented 5 years ago

@Zhentar LLVM uses a special lea optimization pass that works over the whole function frame, which lets it eliminate unnecessary offset calculations and produce much better addressing. RyuJIT does not have such functionality at this time. Nevertheless, as @fiigii pointed out, dotnet/coreclr#22944 may help to some extent.

CarolEidt commented 4 years ago

@Zhentar - I believe this may have been addressed with dotnet/coreclr#22944. Can you verify?

Zhentar commented 3 years ago

I'm not entirely clear on what effect was expected from that PR, but it doesn't appear to have affected this scenario at all. Testing without the in, I'm unfortunately getting worse codegen than when I filed this issue; I get the same code plus a completely unnecessary extra mov:

       vmovupd   ymm3,[rsp+60]
       vmovaps   ymm2,ymm3
       vmovupd   ymm5,[r15+360]
       vpaddd    ymm2,ymm5,ymm2
       vpshufd   ymm3,ymm2,31
       vpmuludq  ymm2,ymm2,ymm3
       vpaddq    ymm1,ymm5,ymm1
       vpaddq    ymm1,ymm2,ymm1

With the in, the code is basically identical to before:

       lea       rsi,[r10+68]
       vmovupd   ymm4,[r11+340]
       vpaddd    ymm5,ymm4,[rsi]
       vpshufd   ymm6,ymm5,31
       vpmuludq  ymm5,ymm5,ymm6
       vpaddq    ymm0,ymm4,ymm0
       vpaddq    ymm0,ymm5,ymm0

On the bright side, I do see it's doing a much better job of leveraging available registers to hoist some of the reads and lea calculations out of the loop.

tannergooding commented 9 months ago

Just before forward sub we have a tree shape like the following:

***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
               [000033] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         
               [000003] n--XG------                         \--*  IND       simd32
               [000002] ---X-------                            \--*  FIELD_ADDR byref  xxHash3.xxHash3+Vec256Pair`1[uint]:A
               [000001] ---X-------                               \--*  FIELD_ADDR byref  xxHash3.xxHash3+Keys_AVX2:K00
               [000000] -----------                                  \--*  LCL_VAR   byref  V03 arg2          (last use)

***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
               [000034] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000008] n--XG------                         \--*  IND       simd32
               [000007] ---X-------                            \--*  FIELD_ADDR byref  xxHash3.xxHash3+Vec256Pair`1[uint]:A
               [000006] ---X-------                               \--*  FIELD_ADDR byref  xxHash3.xxHash3+StripeBlock_AVX2:S00
               [000005] -----------                                  \--*  LCL_VAR   byref  V01 arg0          (last use)

***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
               [000018] DA---------                         *  STORE_LCL_VAR simd32 V07 tmp3         
               [000017] -----------                         \--*  HWINTRINSIC simd32 uint Add
               [000015] -----------                            +--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000016] -----------                            \--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2          (last use)

This gets fixed up in global morph to be more like:

***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
               [000033] DA-XG+-----                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         
               [000003] ---XG+-----                         \--*  IND       simd32
               [000000] -----+-----                            \--*  LCL_VAR   byref  V03 arg2          (last use)

***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
               [000034] DA-XG+-----                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000008] ---XG+-----                         \--*  IND       simd32
               [000005] -----+-----                            \--*  LCL_VAR   byref  V01 arg0          (last use)

***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
               [000018] DA---+-----                         *  STORE_LCL_VAR simd32 V07 tmp3         
               [000017] -----+-----                         \--*  HWINTRINSIC simd32 uint Add
               [000015] -----+-----                            +--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000016] -----+-----                            \--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2          (last use)

It persists this way all the way through LSRA, just getting a few pieces of additional metadata:

***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
N003 (  3,  3) [000033] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         d:1 $1c1
N002 (  3,  2) [000003] ---XG------                         \--*  IND       simd32 <l:$200, c:$201>
N001 (  1,  1) [000000] -----------                            \--*  LCL_VAR   byref  V03 arg2         u:1 (last use) $83

***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
N003 (  3,  3) [000034] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         d:1 $1c3
N002 (  3,  2) [000008] ---XG------                         \--*  IND       simd32 <l:$202, c:$203>
N001 (  1,  1) [000005] -----------                            \--*  LCL_VAR   byref  V01 arg0         u:1 (last use) $81

***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
N004 (  3,  3) [000018] DA---------                         *  STORE_LCL_VAR simd32 V07 tmp3         d:1 $VN.Void
N003 (  3,  3) [000017] -----------                         \--*  HWINTRINSIC simd32 uint Add <l:$102, c:$103>
N001 (  1,  1) [000015] -----------                            +--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         u:1 <l:$101, c:$141>
N002 (  1,  1) [000016] -----------                            \--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         u:1 (last use) <l:$100, c:$140>

In the above, we have a store to a local from an indirection, followed by an immediate use of that local which is notably also its last use.

In such a scenario, we should really support consuming that indirection directly, but we miss out on it today. We miss the containment check because the operand is a LCL_VAR whose lvDoNotEnregister is false. We do then mark it regOptional, but it doesn't ultimately get picked.

I think this might be an opportunity that forward sub is missing, namely because the STORE_LCL_VAR for tmp1 is in the way. If we explicitly change the parameter order to ensure that key is the final parameter, then it works as expected.
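A minimal sketch of that workaround, under the assumption that the method takes both the stripe data and the key by readonly reference (hypothetical signature; the real method lives in the linked xxHash3 port):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ParamOrder
{
    // With key as the final parameter, the IND that reads it is stored to a
    // temp immediately before that temp's only (and last) use, so forward sub
    // can substitute the indirection and containment can fold the load
    // straight into vpaddd as a memory operand.
    static Vector256<ulong> Mix(in Vector256<uint> stripe, in Vector256<uint> key)
    {
        var sum = Avx2.Add(stripe, key);
        var hi  = Avx2.Shuffle(sum, 0x31);
        return Avx2.Multiply(sum, hi);
    }
}
```

With the opposite order, the stripe's STORE_LCL_VAR intervenes between the key's indirection and its use, and forward sub gives up.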

Can we minimally have forward sub look past other STORE_LCL_VARs that aren't side-effecting? CC @AndyAyersMS