gfoidl / Stochastics

Stochastic tools, distribution, analysis
MIT License

Avoid RAW for better ILP #68

Closed gfoidl closed 6 years ago

gfoidl commented 6 years ago

For instance

https://github.com/gfoidl/Stochastics/blob/73d3e6fd7127896c72c85cccf6cf5ff2b83d9300/source/gfoidl.Stochastics/Statistics/Sample.Delta.cs#L109-L112

produces

       C4E17D1039           vmovupd  ymm7, ymmword ptr[rcx]
       C4E1455CFA           vsubpd   ymm7, ymm2
       C4627D190578010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD72]
       C4C14554F8           vandpd   ymm7, ymm8
       C4E16558DF           vaddpd   ymm3, ymm7
       C4E17D107920         vmovupd  ymm7, ymmword ptr[rcx+32]
       C4E1455CFA           vsubpd   ymm7, ymm2
       C4627D190562010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD80]
       C4C14554F8           vandpd   ymm7, ymm8
       C4E15D58E7           vaddpd   ymm4, ymm7
       C4E17D107940         vmovupd  ymm7, ymmword ptr[rcx+64]
       C4E1455CFA           vsubpd   ymm7, ymm2
       C4627D19054C010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD88]
       C4C14554F8           vandpd   ymm7, ymm8
       C4E15558EF           vaddpd   ymm5, ymm7
       C4E17D107960         vmovupd  ymm7, ymmword ptr[rcx+96]
       C4E1455CFA           vsubpd   ymm7, ymm2
       C4627D190536010000   vbroadcastsd ymm8, ymmword ptr[reloc @RWD96]
       C4C14554F8           vandpd   ymm7, ymm8
       C4E14D58F7           vaddpd   ymm6, ymm7

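so there is a strong read-after-write (RAW) dependency on ymm7: each of the four blocks loads into ymm7 and then reads it again for the subtract, the mask and the add.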

This RAW can be eliminated by:

// Issue all four loads first, each into its own temporary.
var tmp0 = Unsafe.Read<Vector<double>>(current + 0 * Vector<double>.Count);
var tmp1 = Unsafe.Read<Vector<double>>(current + 1 * Vector<double>.Count);
var tmp2 = Unsafe.Read<Vector<double>>(current + 2 * Vector<double>.Count);
var tmp3 = Unsafe.Read<Vector<double>>(current + 3 * Vector<double>.Count);

// Subtract the average from every temporary.
tmp0 -= avgVec;
tmp1 -= avgVec;
tmp2 -= avgVec;
tmp3 -= avgVec;

// Accumulate the absolute deviations into four separate accumulators.
deltaVec0 += Vector.Abs(tmp0);
deltaVec1 += Vector.Abs(tmp1);
deltaVec2 += Vector.Abs(tmp2);
deltaVec3 += Vector.Abs(tmp3);

which results in

       C4E17D1039           vmovupd  ymm7, ymmword ptr[rcx]
       C4617D104120         vmovupd  ymm8, ymmword ptr[rcx+32]
       C4617D104940         vmovupd  ymm9, ymmword ptr[rcx+64]
       C4617D105160         vmovupd  ymm10, ymmword ptr[rcx+96]
       C4E1455CFA           vsubpd   ymm7, ymm2
       C4613D5CC2           vsubpd   ymm8, ymm2
       C461355CCA           vsubpd   ymm9, ymm2
       C4612D5CD2           vsubpd   ymm10, ymm2
       C4627D191D54010000   vbroadcastsd ymm11, ymmword ptr[reloc @RWD72]
       C4C14554FB           vandpd   ymm7, ymm11
       C4E16558DF           vaddpd   ymm3, ymm7
       C4E27D193D49010000   vbroadcastsd ymm7, ymmword ptr[reloc @RWD80]
       C4613D54C7           vandpd   ymm8, ymm7
       C4C15D58E0           vaddpd   ymm4, ymm8
       C4E27D193D3E010000   vbroadcastsd ymm7, ymmword ptr[reloc @RWD88]
       C4613554CF           vandpd   ymm9, ymm7
       C4C15558E9           vaddpd   ymm5, ymm9
       C4E27D193D33010000   vbroadcastsd ymm7, ymmword ptr[reloc @RWD96]
       C4612D54D7           vandpd   ymm10, ymm7
       C4C14D58F2           vaddpd   ymm6, ymm10

where no such dependency is present.
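
For anyone who wants to try the pattern outside the library, here is a minimal self-contained sketch of the same load / subtract / accumulate ordering. It is not the code from Sample.Delta.cs: the class and method names are made up for illustration, and it assumes an array input with the `Vector<double>(double[], int)` constructor instead of `Unsafe.Read` on a pointer. Whether the JIT emits the same instruction order as shown above depends on the runtime version, so treat it as a starting point for your own disassembly check.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of gfoidl.Stochastics.
public static class DeltaSketch
{
    // Sum of |x - avg| over all values, unrolled 4x with independent temporaries.
    public static double SumOfAbsDeviations(double[] values, double avg)
    {
        int width  = Vector<double>.Count;
        var avgVec = new Vector<double>(avg);

        var deltaVec0 = Vector<double>.Zero;
        var deltaVec1 = Vector<double>.Zero;
        var deltaVec2 = Vector<double>.Zero;
        var deltaVec3 = Vector<double>.Zero;

        int i = 0;
        for (; i + 4 * width <= values.Length; i += 4 * width)
        {
            // All loads first, each into its own temporary.
            var tmp0 = new Vector<double>(values, i + 0 * width);
            var tmp1 = new Vector<double>(values, i + 1 * width);
            var tmp2 = new Vector<double>(values, i + 2 * width);
            var tmp3 = new Vector<double>(values, i + 3 * width);

            tmp0 -= avgVec;
            tmp1 -= avgVec;
            tmp2 -= avgVec;
            tmp3 -= avgVec;

            deltaVec0 += Vector.Abs(tmp0);
            deltaVec1 += Vector.Abs(tmp1);
            deltaVec2 += Vector.Abs(tmp2);
            deltaVec3 += Vector.Abs(tmp3);
        }

        // Horizontal reduction of the four accumulators.
        double delta = Vector.Dot(deltaVec0 + deltaVec1 + deltaVec2 + deltaVec3, Vector<double>.One);

        // Scalar tail for the elements that do not fill a 4x-unrolled block.
        for (; i < values.Length; i++)
            delta += Math.Abs(values[i] - avg);

        return delta;
    }

    // Plain scalar reference, used only to cross-check the vectorized version.
    public static double SumOfAbsDeviationsScalar(double[] values, double avg)
    {
        double delta = 0;
        foreach (double v in values)
            delta += Math.Abs(v - avg);
        return delta;
    }
}
```

The scalar method is only there as a cross-check. With small integer inputs both methods should agree exactly; with arbitrary doubles the results can differ in the last bits because the summation order is different.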

A trivial benchmark shows an improvement of about 20 % (1.491 us down to 1.199 us):


BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT

 Method |     Mean |     Error |    StdDev | Scaled |
------- |---------:|----------:|----------:|-------:|
   Base | 1.491 us | 0.0142 us | 0.0133 us |   1.00 |
    Alt | 1.199 us | 0.0081 us | 0.0076 us |   0.80 |