gfoidl / Stochastics

Stochastic tools, distribution, analysis
MIT License

Better Simd reduction #49

Closed gfoidl closed 6 years ago

gfoidl commented 6 years ago

Fixes #43

#44 made some improvements on this, but the codegen wasn't perfect. Cf. opened the door for better codegen, and this is the implementation.

The biggest improvement over #44 is in ReduceMinMax:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742187 Hz, Resolution=364.6724 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
 Method |      Mean |     Error |    StdDev | Scaled |
------- |----------:|----------:|----------:|-------:|
   Base | 14.598 ns | 0.0441 ns | 0.0413 ns |   1.00 |
    New |  1.644 ns | 0.0216 ns | 0.0202 ns |   0.11 |

With wonderful dasm :wink:

00007FFD1DDE7510  vzeroupper
00007FFD1DDE7513  vmovupd       ymm0,ymmword ptr [rcx]
00007FFD1DDE7518  vextractf128  xmm1,ymm0,1
00007FFD1DDE751E  vextractf128  xmm0,ymm0,0
00007FFD1DDE7524  vpermilpd     xmm2,xmm1,1
00007FFD1DDE752A  vpermilpd     xmm3,xmm0,1
00007FFD1DDE7530  vminpd        xmm1,xmm1,xmm2
00007FFD1DDE7535  vminpd        xmm0,xmm0,xmm3
00007FFD1DDE753A  vminpd        xmm0,xmm0,xmm1
00007FFD1DDE753F  vmovsd        qword ptr [r8],xmm0
00007FFD1DDE7544  vmovupd       ymm0,ymmword ptr [rdx]
00007FFD1DDE7549  vextractf128  xmm1,ymm0,1
00007FFD1DDE754F  vextractf128  xmm0,ymm0,0
00007FFD1DDE7555  vpermilpd     xmm2,xmm1,1
00007FFD1DDE755B  vpermilpd     xmm3,xmm0,1
00007FFD1DDE7561  vmaxpd        xmm1,xmm1,xmm2
00007FFD1DDE7566  vmaxpd        xmm0,xmm0,xmm3
00007FFD1DDE756B  vmaxpd        xmm0,xmm0,xmm1
00007FFD1DDE7570  vmovsd        qword ptr [r9],xmm0
00007FFD1DDE7575  vzeroupper
00007FFD1DDE7578  ret
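
For readers who prefer intrinsics to disassembly, the listing above maps roughly onto the C++ sketch below: split the 256-bit vector into two 128-bit halves, swap the two lanes inside each half, then take min (or max) three times. This is not the C# code from the PR. The names `reduce_min` and `reduce_max` are hypothetical, and the scalar fallback exists only so the example builds when AVX is not enabled; it assumes the inputs contain no NaN.

```cpp
#include <algorithm>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: horizontal minimum of four doubles starting at p.
double reduce_min(const double* p)
{
#if defined(__AVX__)
    __m256d v  = _mm256_loadu_pd(p);            // vmovupd
    __m128d hi = _mm256_extractf128_pd(v, 1);   // vextractf128 ..., 1 (lanes 2, 3)
    __m128d lo = _mm256_extractf128_pd(v, 0);   // vextractf128 ..., 0 (lanes 0, 1)
    __m128d hs = _mm_permute_pd(hi, 1);         // vpermilpd ..., 1 (swap the two lanes)
    __m128d ls = _mm_permute_pd(lo, 1);
    hi = _mm_min_pd(hi, hs);                    // vminpd
    lo = _mm_min_pd(lo, ls);                    // vminpd
    return _mm_cvtsd_f64(_mm_min_pd(lo, hi));   // vminpd, then read lane 0
#else
    // Scalar fallback with the same pairing of lanes.
    const double lo = std::min(p[0], p[1]);
    const double hi = std::min(p[2], p[3]);
    return std::min(lo, hi);
#endif
}

// Hypothetical helper: horizontal maximum of four doubles starting at p.
double reduce_max(const double* p)
{
#if defined(__AVX__)
    __m256d v  = _mm256_loadu_pd(p);
    __m128d hi = _mm256_extractf128_pd(v, 1);
    __m128d lo = _mm256_extractf128_pd(v, 0);
    __m128d hs = _mm_permute_pd(hi, 1);
    __m128d ls = _mm_permute_pd(lo, 1);
    hi = _mm_max_pd(hi, hs);                    // vmaxpd
    lo = _mm_max_pd(lo, ls);                    // vmaxpd
    return _mm_cvtsd_f64(_mm_max_pd(lo, hi));   // vmaxpd, then read lane 0
#else
    const double lo = std::max(p[0], p[1]);
    const double hi = std::max(p[2], p[3]);
    return std::max(lo, hi);
#endif
}
```

The two min/max operations on the halves do not depend on each other, so only the final one has to wait for both.
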
gfoidl commented 6 years ago

Just for reference a portion of C++:

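#include <iostream>
#include <immintrin.h>

// Horizontal max of 4 floats: rotate the lanes and take the max, three times.
float max_sse(float* a)
{
    // Unaligned load: the caller's array is not guaranteed to be 16-byte aligned.
    __m128 maxval = _mm_loadu_ps(a);

    for (int i = 0; i < 3; ++i)
    {
        __m128 tmp = _mm_shuffle_ps(maxval, maxval, 0x93);   // rotate lanes by one
        maxval     = _mm_max_ps(maxval, tmp);
    }

    float res;
    _mm_store_ss(&res, maxval);
    return res;
}

// Horizontal max of 4 doubles (needs AVX2 for _mm256_permute4x64_pd).
double max_sse(double* a)
{
    // Unaligned load: the caller's array is not guaranteed to be 32-byte aligned.
    __m256d maxval = _mm256_loadu_pd(a);

    for (int i = 0; i < 3; ++i)
    {
        __m256d tmp = _mm256_permute4x64_pd(maxval, 0x39);   // rotate lanes by one
        maxval      = _mm256_max_pd(maxval, tmp);
    }

    // Store only the lowest lane; _mm256_store_pd would write four doubles into one.
    double res;
    _mm_store_sd(&res, _mm256_castpd256_pd128(maxval));
    return res;
}

#define MM_SHUFFLE(fp0,fp1,fp2,fp3) (((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)))

int main()
{
    // Values for inspecting the shuffle immediate. This macro takes its arguments
    // in the opposite order to the standard _MM_SHUFFLE, so b is 0xC6, not 0x93.
    int a = 0x93;
    int b = MM_SHUFFLE(2, 1, 0, 3);
    (void)a;
    (void)b;

    float arr[] = {1, 2, 3, 4};
    float max   = max_sse(arr);

    double darr[] = {1, 2, 3, 4};
    double dmax   = max_sse(darr);

    using namespace std;

    cout << max  << endl;
    cout << dmax << endl;
}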
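
To make the two immediates less magic, here is a small scalar model of what `0x93` and `0x39` do. Each pair of bits selects the source lane for one result lane, so both values are rotations by one lane, in opposite directions. The names `decode_shuffle`, `shuffle` and `rotate_max` are made up for this sketch. It models only the case where both shuffle operands are the same register, as in the snippet above.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

using Lanes = std::array<double, 4>;

// Split an 8-bit shuffle immediate into four 2-bit source-lane indices:
// result lane i reads source lane idx[i].
std::array<std::size_t, 4> decode_shuffle(unsigned imm)
{
    return {{ imm & 3u, (imm >> 2) & 3u, (imm >> 4) & 3u, (imm >> 6) & 3u }};
}

// Scalar model of a four-lane shuffle whose two operands are the same vector.
Lanes shuffle(const Lanes& v, unsigned imm)
{
    const std::array<std::size_t, 4> idx = decode_shuffle(imm);
    return {{ v[idx[0]], v[idx[1]], v[idx[2]], v[idx[3]] }};
}

// The rotate-and-max loop, lane by lane: after three rounds every lane
// holds the maximum of all four inputs, so lane 0 is returned.
double rotate_max(Lanes v, unsigned imm)
{
    for (int i = 0; i < 3; ++i)
    {
        const Lanes t = shuffle(v, imm);
        for (std::size_t k = 0; k < v.size(); ++k)
            v[k] = std::max(v[k], t[k]);
    }
    return v[0];
}
```

`decode_shuffle(0x93)` gives the lane indices 3, 0, 1, 2 and `decode_shuffle(0x39)` gives 1, 2, 3, 0, so either immediate works for the reduction.
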

I haven't tested the double version in C#, because _mm256_permute4x64_pd is missing from the reference assembly (though it's available in CoreLib). But I believe the implemented variant is faster, because it's just

instead of rotating and min/max.