gfoidl / Stochastics

Stochastic tools, distrubution, analysis
MIT License
3 stars 0 forks source link

Better SIMD algorithm for min/max? #56

Closed gfoidl closed 6 years ago

gfoidl commented 6 years ago

Maybe MinMaxCore1 or MinMaxCore2 are better implementations than current (MinMaxCore0)?

public static double MinMaxCore0(Vector<double> vector)
{
    Vector256<double> vec256 = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
    Vector128<double> hi128  = Avx.ExtractVector128(vec256, 1);                 // lat 3        thr 1
    Vector128<double> lo128  = Avx.ExtractVector128(vec256, 0);                 // lat 3        thr 1

    Vector128<double> tmp1 = Avx.Permute(hi128, 0b_01);                         // lat 1        thr 1
    Vector128<double> tmp2 = Avx.Permute(lo128, 0b_01);                         // lat 1        thr 1

    hi128 = Sse2.Min(hi128, tmp1);                                              // lat 4        thr 0.5
    lo128 = Sse2.Min(lo128, tmp2);                                              // lat 4        thr 0.5
    lo128 = Sse2.Min(lo128, hi128);                                             // lat 4        thr 0.5

    return Sse2.ConvertToDouble(lo128);                                         // lat 1        thr 1
}                                                                               // =  21        =   6.5

public static double MinMaxCore1(Vector<double> vector)
{
    Vector256<double> vec256       = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
    Vector256<double> swapped      = Avx.Permute2x128(vec256, vec256, 0x03);    // lat 3        thr 1
    Vector256<double> min256       = Avx.Min(vec256, swapped);                  // lat 4        thr 0.5
    Vector128<double> lo128        = Avx.ExtractVector128(min256, 0);           // lat 3        thr 1
    Vector128<double> lo128Swapped = Avx.Permute(lo128, 0b_01);                 // lat 1        thr 1
    Vector128<double> min128       = Sse2.Min(lo128, lo128Swapped);             // lat 4        thr 0.5

    return Sse2.ConvertToDouble(min128);                                        // lat 1        thr 1
}                                                                               // =  16        =   5

public static unsafe double MinMaxCore2(Vector<double> vector)
{
    Vector256<double> vec256        = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
    Vector256<double> swapped       = Avx.Permute2x128(vec256, vec256, 0x03);   // lat 3        thr 1
    Vector256<double> min256        = Avx.Min(vec256, swapped);                 // lat 4        thr 0.5
    Vector256<double> min256Swapped = Avx.Permute(min256, 0b_0101);             // lat 1        thr 1
    min256                          = Avx.Min(min256, min256Swapped);           // lat 4        thr 0.5
    Vector128<double> min128        = Avx.ExtractVector128(min256, 0);          // lat 3        thr 1

    return Sse2.ConvertToDouble(min128);                                        // lat 1        thr 1
}                                                                               // =  16        thr 5

Cf. https://github.com/gfoidl/Stochastics/pull/49

gfoidl commented 6 years ago

Based on the numbers of latency and throughput they should be faster. But don't get fooled by these numbers, because instruction level parallelism comes into account, as can be seen by benchmarks:


BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742187 Hz, Resolution=364.6724 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
Method Mean Error StdDev Scaled ScaledSD
A 2.176 ns 0.0831 ns 0.0816 ns 1.00 0.00
B 2.266 ns 0.1006 ns 0.0941 ns 1.04 0.06
C 2.260 ns 0.0630 ns 0.0589 ns 1.04 0.04

In MinMaxCore0 almost always two instructions can be done in parallel. Whereas in MinMaxCore1 and MinMaxCore2 there's more or less a strict sequential dependency, thus disabling ISP.

; MinMaxCore0
00007FFD1ECD28E0  vzeroupper
00007FFD1ECD28E3  vmovupd      ymm0,ymmword ptr [rcx]
00007FFD1ECD28E8  vextractf128 xmm1,ymm0,1
00007FFD1ECD28EE  vpermilpd    xmm2,xmm1,1
00007FFD1ECD28F4  vextractf128 xmm0,ymm0,0
00007FFD1ECD28FA  vpermilpd    xmm3,xmm0,1
00007FFD1ECD2900  vminpd       xmm1,xmm1,xmm2
00007FFD1ECD2905  vminpd       xmm0,xmm0,xmm3
00007FFD1ECD290A  vminpd       xmm0,xmm0,xmm1
00007FFD1ECD290F  vzeroupper
00007FFD1ECD2912  ret

; MinMaxCore1
00007FFD1ECC2960  vzeroupper
00007FFD1ECC2963  vmovupd      ymm0,ymmword ptr [rcx]
00007FFD1ECC2968  vperm2f128   ymm1,ymm0,ymm0,3
00007FFD1ECC296E  vminpd       ymm0,ymm0,ymm1
00007FFD1ECC2973  vextractf128 xmm0,ymm0,0
00007FFD1ECC2979  vpermilpd    xmm1,xmm0,1
00007FFD1ECC297F  vminpd       xmm0,xmm0,xmm1
00007FFD1ECC2984  vzeroupper
00007FFD1ECC2987  ret

; MinMaxCore2
00007FFD1ECF43B0  vzeroupper
00007FFD1ECF43B3  vmovupd      ymm0,ymmword ptr [rcx]
00007FFD1ECF43B8  vperm2f128   ymm1,ymm0,ymm0,3
00007FFD1ECF43BE  vminpd       ymm0,ymm0,ymm1
00007FFD1ECF43C3  vpermilpd    ymm1,ymm0,5
00007FFD1ECF43C9  vminpd       ymm0,ymm0,ymm1
00007FFD1ECF43CE  vextractf128 xmm0,ymm0,0
00007FFD1ECF43D4  vzeroupper
00007FFD1ECF43D7  ret

Though the aforementioned dependency can be better seen in VS with the C# code.