Closed gfoidl closed 6 years ago
For instance
https://github.com/gfoidl/Stochastics/blob/73d3e6fd7127896c72c85cccf6cf5ff2b83d9300/source/gfoidl.Stochastics/Statistics/Sample.Delta.cs#L109-L112
produces
```asm
C4E17D1039           vmovupd      ymm7, ymmword ptr[rcx]
C4E1455CFA           vsubpd       ymm7, ymm2
C4627D190578010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD72]
C4C14554F8           vandpd       ymm7, ymm8
C4E16558DF           vaddpd       ymm3, ymm7
C4E17D107920         vmovupd      ymm7, ymmword ptr[rcx+32]
C4E1455CFA           vsubpd       ymm7, ymm2
C4627D190562010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD80]
C4C14554F8           vandpd       ymm7, ymm8
C4E15D58E7           vaddpd       ymm4, ymm7
C4E17D107940         vmovupd      ymm7, ymmword ptr[rcx+64]
C4E1455CFA           vsubpd       ymm7, ymm2
C4627D19054C010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD88]
C4C14554F8           vandpd       ymm7, ymm8
C4E15558EF           vaddpd       ymm5, ymm7
C4E17D107960         vmovupd      ymm7, ymmword ptr[rcx+96]
C4E1455CFA           vsubpd       ymm7, ymm2
C4627D190536010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD96]
C4C14554F8           vandpd       ymm7, ymm8
C4E14D58F7           vaddpd       ymm6, ymm7
```
Every group of instructions goes through `ymm7`, so the code has a strong read-after-write (RAW) dependency on that register.
This RAW can be eliminated by:
```csharp
var tmp0 = Unsafe.Read<Vector<double>>(current + 0 * Vector<double>.Count);
var tmp1 = Unsafe.Read<Vector<double>>(current + 1 * Vector<double>.Count);
var tmp2 = Unsafe.Read<Vector<double>>(current + 2 * Vector<double>.Count);
var tmp3 = Unsafe.Read<Vector<double>>(current + 3 * Vector<double>.Count);

tmp0 -= avgVec;
tmp1 -= avgVec;
tmp2 -= avgVec;
tmp3 -= avgVec;

deltaVec0 += Vector.Abs(tmp0);
deltaVec1 += Vector.Abs(tmp1);
deltaVec2 += Vector.Abs(tmp2);
deltaVec3 += Vector.Abs(tmp3);
```
which results in
```asm
C4E17D1039           vmovupd      ymm7, ymmword ptr[rcx]
C4617D104120         vmovupd      ymm8, ymmword ptr[rcx+32]
C4617D104940         vmovupd      ymm9, ymmword ptr[rcx+64]
C4617D105160         vmovupd      ymm10, ymmword ptr[rcx+96]
C4E1455CFA           vsubpd       ymm7, ymm2
C4613D5CC2           vsubpd       ymm8, ymm2
C461355CCA           vsubpd       ymm9, ymm2
C4612D5CD2           vsubpd       ymm10, ymm2
C4627D191D54010000   vbroadcastsd ymm11, ymmword ptr[reloc @RWD72]
C4C14554FB           vandpd       ymm7, ymm11
C4E16558DF           vaddpd       ymm3, ymm7
C4E27D193D49010000   vbroadcastsd ymm7, ymmword ptr[reloc @RWD80]
C4613D54C7           vandpd       ymm8, ymm7
C4C15D58E0           vaddpd       ymm4, ymm8
C4E27D193D3E010000   vbroadcastsd ymm7, ymmword ptr[reloc @RWD88]
C4613554CF           vandpd       ymm9, ymm7
C4C15558E9           vaddpd       ymm5, ymm9
C4E27D193D33010000   vbroadcastsd ymm7, ymmword ptr[reloc @RWD96]
C4612D54D7           vandpd       ymm10, ymm7
C4C14D58F2           vaddpd       ymm6, ymm10
```
where no such dependency is present.
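For anyone who wants to try the pattern outside the linked repository, here is a minimal self-contained sketch. It assumes the surrounding loop sums absolute deviations from an average; `DeltaSketch` and `SumAbsDeviation` are hypothetical names, and it uses the safe `Vector<double>(double[], int)` constructor instead of `Unsafe.Read` on a pointer, so the generated code may differ from the listings above depending on the JIT version.

```csharp
using System;
using System.Numerics;

public static class DeltaSketch
{
    // Hypothetical helper (not from the linked repository):
    // returns the sum of |x - avg| over all elements of data.
    public static double SumAbsDeviation(double[] data, double avg)
    {
        int n = Vector<double>.Count;
        var avgVec = new Vector<double>(avg);

        // Four independent accumulators, as in the unrolled loop above.
        var deltaVec0 = Vector<double>.Zero;
        var deltaVec1 = Vector<double>.Zero;
        var deltaVec2 = Vector<double>.Zero;
        var deltaVec3 = Vector<double>.Zero;

        int i = 0;
        for (; i <= data.Length - 4 * n; i += 4 * n)
        {
            // Load all four vectors first, each into its own temporary.
            var tmp0 = new Vector<double>(data, i + 0 * n);
            var tmp1 = new Vector<double>(data, i + 1 * n);
            var tmp2 = new Vector<double>(data, i + 2 * n);
            var tmp3 = new Vector<double>(data, i + 3 * n);

            tmp0 -= avgVec;
            tmp1 -= avgVec;
            tmp2 -= avgVec;
            tmp3 -= avgVec;

            deltaVec0 += Vector.Abs(tmp0);
            deltaVec1 += Vector.Abs(tmp1);
            deltaVec2 += Vector.Abs(tmp2);
            deltaVec3 += Vector.Abs(tmp3);
        }

        // Horizontal reduction of the four accumulators.
        var total = deltaVec0 + deltaVec1 + deltaVec2 + deltaVec3;
        double sum = Vector.Dot(total, Vector<double>.One);

        // Scalar loop for the elements that do not fill a full unrolled step.
        for (; i < data.Length; i++)
        {
            sum += Math.Abs(data[i] - avg);
        }

        return sum;
    }

    public static void Main()
    {
        var data = new double[37];
        for (int k = 0; k < data.Length; k++) data[k] = k;

        Console.WriteLine(SumAbsDeviation(data, 18.0));
    }
}
```

Whether the JIT keeps the temporaries in separate registers here is something to verify in the disassembly; the sketch only shows the source-level shape of the workaround.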
A trivial benchmark shows a good perf-improvement:
```
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
```
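A comparison of this kind could be set up roughly as follows. This is only a sketch and not the benchmark used for the measurement: the class `DeltaBenchmarks`, the array length, the average value, and the exact shape of the interleaved baseline are all assumptions, and it needs the BenchmarkDotNet NuGet package.

```csharp
using System;
using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DeltaBenchmarks
{
    // Assumed size: a power of two, so no scalar remainder loop is needed.
    private const int Length = 1 << 20;
    private const double Avg = 0.5;      // arbitrary value for illustration
    private double[] _data;

    [GlobalSetup]
    public void Setup()
    {
        var rnd = new Random(42);
        _data = new double[Length];
        for (int i = 0; i < _data.Length; i++) _data[i] = rnd.NextDouble();
    }

    // Assumed baseline: load, subtract, abs and accumulate in one statement.
    [Benchmark(Baseline = true)]
    public double Interleaved()
    {
        int n = Vector<double>.Count;
        var avgVec = new Vector<double>(Avg);
        Vector<double> d0 = Vector<double>.Zero, d1 = d0, d2 = d0, d3 = d0;

        for (int i = 0; i <= _data.Length - 4 * n; i += 4 * n)
        {
            d0 += Vector.Abs(new Vector<double>(_data, i + 0 * n) - avgVec);
            d1 += Vector.Abs(new Vector<double>(_data, i + 1 * n) - avgVec);
            d2 += Vector.Abs(new Vector<double>(_data, i + 2 * n) - avgVec);
            d3 += Vector.Abs(new Vector<double>(_data, i + 3 * n) - avgVec);
        }

        return Vector.Dot(d0 + d1 + d2 + d3, Vector<double>.One);
    }

    // Workaround from this issue: load everything first, then compute.
    [Benchmark]
    public double LoadFirst()
    {
        int n = Vector<double>.Count;
        var avgVec = new Vector<double>(Avg);
        Vector<double> d0 = Vector<double>.Zero, d1 = d0, d2 = d0, d3 = d0;

        for (int i = 0; i <= _data.Length - 4 * n; i += 4 * n)
        {
            var tmp0 = new Vector<double>(_data, i + 0 * n);
            var tmp1 = new Vector<double>(_data, i + 1 * n);
            var tmp2 = new Vector<double>(_data, i + 2 * n);
            var tmp3 = new Vector<double>(_data, i + 3 * n);

            tmp0 -= avgVec;
            tmp1 -= avgVec;
            tmp2 -= avgVec;
            tmp3 -= avgVec;

            d0 += Vector.Abs(tmp0);
            d1 += Vector.Abs(tmp1);
            d2 += Vector.Abs(tmp2);
            d3 += Vector.Abs(tmp3);
        }

        return Vector.Dot(d0 + d1 + d2 + d3, Vector<double>.One);
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<DeltaBenchmarks>();
}
```

Both methods perform the same arithmetic per accumulator, so they should return the same value; any timing difference then comes from the instruction order the JIT emits.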