Closed gfoidl closed 6 years ago
Based on the numbers of latency and throughput they should be faster. But don't get fooled by these numbers, because instruction level parallelism comes into account, as can be seen by benchmarks:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742187 Hz, Resolution=364.6724 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
[Host] : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
Method | Mean | Error | StdDev | Scaled | ScaledSD |
---|---|---|---|---|---|
A | 2.176 ns | 0.0831 ns | 0.0816 ns | 1.00 | 0.00 |
B | 2.266 ns | 0.1006 ns | 0.0941 ns | 1.04 | 0.06 |
C | 2.260 ns | 0.0630 ns | 0.0589 ns | 1.04 | 0.04 |
In MinMaxCore0
almost always two instructions can be done in parallel.
Whereas in MinMaxCore1
and MinMaxCore2
there's more or less a strict sequential dependency, thus disabling ISP.
; MinMaxCore0
00007FFD1ECD28E0 vzeroupper
00007FFD1ECD28E3 vmovupd ymm0,ymmword ptr [rcx]
00007FFD1ECD28E8 vextractf128 xmm1,ymm0,1
00007FFD1ECD28EE vpermilpd xmm2,xmm1,1
00007FFD1ECD28F4 vextractf128 xmm0,ymm0,0
00007FFD1ECD28FA vpermilpd xmm3,xmm0,1
00007FFD1ECD2900 vminpd xmm1,xmm1,xmm2
00007FFD1ECD2905 vminpd xmm0,xmm0,xmm3
00007FFD1ECD290A vminpd xmm0,xmm0,xmm1
00007FFD1ECD290F vzeroupper
00007FFD1ECD2912 ret
; MinMaxCore1
00007FFD1ECC2960 vzeroupper
00007FFD1ECC2963 vmovupd ymm0,ymmword ptr [rcx]
00007FFD1ECC2968 vperm2f128 ymm1,ymm0,ymm0,3
00007FFD1ECC296E vminpd ymm0,ymm0,ymm1
00007FFD1ECC2973 vextractf128 xmm0,ymm0,0
00007FFD1ECC2979 vpermilpd xmm1,xmm0,1
00007FFD1ECC297F vminpd xmm0,xmm0,xmm1
00007FFD1ECC2984 vzeroupper
00007FFD1ECC2987 ret
; MinMaxCore2
00007FFD1ECF43B0 vzeroupper
00007FFD1ECF43B3 vmovupd ymm0,ymmword ptr [rcx]
00007FFD1ECF43B8 vperm2f128 ymm1,ymm0,ymm0,3
00007FFD1ECF43BE vminpd ymm0,ymm0,ymm1
00007FFD1ECF43C3 vpermilpd ymm1,ymm0,5
00007FFD1ECF43C9 vminpd ymm0,ymm0,ymm1
00007FFD1ECF43CE vextractf128 xmm0,ymm0,0
00007FFD1ECF43D4 vzeroupper
00007FFD1ECF43D7 ret
Though the aforementioned dependency can be better seen in VS with the C# code.
Maybe
MinMaxCore1
orMinMaxCore2
are better implementations than current (MinMaxCore0
)?Cf. https://github.com/gfoidl/Stochastics/pull/49