Open Zhentar opened 5 years ago
@CarolEidt @tannergooding @dotnet/jit-contrib
@Zhentar llvm uses special lea
optimization stage which works on the whole function frame. This allows for much better whole frame lea
addressing optimization by eliminating any unnecessary offset calculations. RyuJIT does not have such functionality at this time. Nevertheless, as @fiigii pointed dotnet/coreclr#22944 may help to some extent.
@Zhentar - I believe this may have been addressed with dotnet/coreclr#22944. Can you verify?
I'm not entirely clear on what effect was expected from that PR, but it doesn't appear to have affected this scenario at all. Testing without the in
, I am unfortunately getting worse codegen than I was when I wrote this; I get the same code plus an added completely unnecessary mov:
vmovupd ymm3,[rsp+60]
vmovaps ymm2,ymm3
vmovupd ymm5,[r15+360]
vpaddd ymm2,ymm5,ymm2
vpshufd ymm3,ymm2,31
vpmuludq ymm2,ymm2,ymm3
vpaddq ymm1,ymm5,ymm1
vpaddq ymm1,ymm2,ymm1
With the in
, the code is basically identical:
lea rsi,[r10+68]
vmovupd ymm4,[r11+340]
vpaddd ymm5,ymm4,[rsi]
vpshufd ymm6,ymm5,31
vpmuludq ymm5,ymm5,ymm6
vpaddq ymm0,ymm4,ymm0
vpaddq ymm0,ymm5,ymm0
On the bright side, I do see it's doing a much better job of leveraging available registers to save off some of the reads or lea calculations outside of the loop.
Just before forward sub we have a tree shape like the following:
***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
[000033] DA-XG------ * STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2
[000003] n--XG------ \--* IND simd32
[000002] ---X------- \--* FIELD_ADDR byref xxHash3.xxHash3+Vec256Pair`1[uint]:A
[000001] ---X------- \--* FIELD_ADDR byref xxHash3.xxHash3+Keys_AVX2:K00
[000000] ----------- \--* LCL_VAR byref V03 arg2 (last use)
***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
[000034] DA-XG------ * STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1
[000008] n--XG------ \--* IND simd32
[000007] ---X------- \--* FIELD_ADDR byref xxHash3.xxHash3+Vec256Pair`1[uint]:A
[000006] ---X------- \--* FIELD_ADDR byref xxHash3.xxHash3+StripeBlock_AVX2:S00
[000005] ----------- \--* LCL_VAR byref V01 arg0 (last use)
***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
[000018] DA--------- * STORE_LCL_VAR simd32 V07 tmp3
[000017] ----------- \--* HWINTRINSIC simd32 uint Add
[000015] ----------- +--* LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1
[000016] ----------- \--* LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2 (last use)
This gets fixed up in global morph to be more like:
***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
[000033] DA-XG+----- * STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2
[000003] ---XG+----- \--* IND simd32
[000000] -----+----- \--* LCL_VAR byref V03 arg2 (last use)
***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
[000034] DA-XG+----- * STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1
[000008] ---XG+----- \--* IND simd32
[000005] -----+----- \--* LCL_VAR byref V01 arg0 (last use)
***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
[000018] DA---+----- * STORE_LCL_VAR simd32 V07 tmp3
[000017] -----+----- \--* HWINTRINSIC simd32 uint Add
[000015] -----+----- +--* LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1
[000016] -----+----- \--* LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2 (last use)
It persists this way all the way through LSRA, just getting a few pieces of additional metadata:
***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
N003 ( 3, 3) [000033] DA-XG------ * STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2 d:1 $1c1
N002 ( 3, 2) [000003] ---XG------ \--* IND simd32 <l:$200, c:$201>
N001 ( 1, 1) [000000] ----------- \--* LCL_VAR byref V03 arg2 u:1 (last use) $83
***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
N003 ( 3, 3) [000034] DA-XG------ * STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1 d:1 $1c3
N002 ( 3, 2) [000008] ---XG------ \--* IND simd32 <l:$202, c:$203>
N001 ( 1, 1) [000005] ----------- \--* LCL_VAR byref V01 arg0 u:1 (last use) $81
***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
N004 ( 3, 3) [000018] DA--------- * STORE_LCL_VAR simd32 V07 tmp3 d:1 $VN.Void
N003 ( 3, 3) [000017] ----------- \--* HWINTRINSIC simd32 uint Add <l:$102, c:$103>
N001 ( 1, 1) [000015] ----------- +--* LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1 u:1 <l:$101, c:$141>
N002 ( 1, 1) [000016] ----------- \--* LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2 u:1 (last use) <l:$100, c:$140>
In the above, we have this case of a store to a local from an indirection, followed by an immediate use which is also notably the last use.
In such a scenario, we should really support taking that indirection directly, but we miss out on it today. We miss out on the containment check because it's a LCL_VAR
where lvDoNotEnregister
is false. We then do mark it regOptional
but it doesn't ultimately get picked.
I think this might be an opportunity that forward sub is missing out on and it namely looks to be because the STORE_LCL_VAR
for tmp1
is in the way. If we explicitly change the parameter order to ensure that key
is the final parameter, then it works as expected.
Can we minimally have forward sub look past other STORE_LCL_VAR
that aren't side effecting? CC. @AndyAyersMS
I've been taking a go at porting the XXH3 hash algorithm including SSE & AVX versions. My current AVX2 code for the hot loop is here: https://github.com/Zhentar/xxHash3.NET/blob/ee6a626e87f2a829ec786690d4dfa560d876dda7/xxHash3/xxHash3_AVX2.cs#L103
So far I've gotten it up to 36GB/s, against the clang compiled native version's ~40GB/s.
One sub-piece by clang looks like this:
While my version looks like this:
Or, if I arrange the code such that folding kicks in (uncommenting the
in
for theProcessStripePiece_AVX2
key argument), this:However, the folded version performs worse, because the
lea
competes with the add/shuf/mul instructions for an integer ALU port instead of a load port.Is there any way to get an immediate address folded into the
vpaddd
instead of an execution time calculated displacement? I've tried a static readonly field, but that still resulted in an lea displacement calculation.category:cq theme:hardware-intrinsics skill-level:expert cost:medium