gfoidl / Stochastics

Stochastic tools, distribution, analysis
MIT License

Better SIMD code generation #5

Closed gfoidl closed 6 years ago

gfoidl commented 6 years ago

Fixes https://github.com/gfoidl/Stochastics/issues/4

gfoidl commented 6 years ago

With the new SIMD codegen, a benchmark for Kurtosis would be interesting, because Kurtosis does a lot of work in the SIMD registers.

gfoidl commented 6 years ago

Results for Kurtosis

Benchmark


BenchmarkDotNet=v0.10.11, OS=Windows 7 SP1 (6.1.7601.0)
Processor=Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge), ProcessorCount=8
Frequency=2241064 Hz, Resolution=446.2166 ns, Timer=TSC
.NET Core SDK=2.1.2
  [Host]     : .NET Core 2.0.3 (Framework 4.6.25815.02), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.3 (Framework 4.6.25815.02), 64bit RyuJIT
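
|     Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
|----------- |----------:|----------:|----------:|-------:|---------:|
| Sequential | 11.240 us | 0.2209 us | 0.3691 us |   1.00 |     0.00 |
| UnsafeSimd |  5.369 us | 0.1042 us | 0.1391 us |   0.48 |     0.02 |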

dasm

000007fe`7792d490 0f1019          movups  xmm3,xmmword ptr [rcx]        ; loop start
000007fe`7792d493 4883c110        add     rcx,10h
000007fe`7792d497 660f5cd8        subpd   xmm3,xmm0
000007fe`7792d49b 0f28e3          movaps  xmm4,xmm3
000007fe`7792d49e 660f59e3        mulpd   xmm4,xmm3
000007fe`7792d4a2 660f59e3        mulpd   xmm4,xmm3
000007fe`7792d4a6 660f59e3        mulpd   xmm4,xmm3
000007fe`7792d4aa 660f58d4        addpd   xmm2,xmm4
000007fe`7792d4ae 0f1019          movups  xmm3,xmmword ptr [rcx]
000007fe`7792d4b1 4883c110        add     rcx,10h
000007fe`7792d4b5 660f5cd8        subpd   xmm3,xmm0
000007fe`7792d4b9 0f28e3          movaps  xmm4,xmm3
000007fe`7792d4bc 660f59e3        mulpd   xmm4,xmm3
000007fe`7792d4c0 660f59e3        mulpd   xmm4,xmm3
000007fe`7792d4c4 660f59e3        mulpd   xmm4,xmm3
000007fe`7792d4c8 660f58d4        addpd   xmm2,xmm4
000007fe`7792d4cc 4183c004        add     r8d,4
000007fe`7792d4d0 453bc1          cmp     r8d,r9d
000007fe`7792d4d3 7cbb            jl      000007fe`7792d490             ; loop end

Pretty code 😄
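
For readers who want to map the listing back to source, here is a minimal sketch of a loop that could produce this instruction pattern. It is not the repository's code: the class name `KurtosisSketch`, the method `SumOfFourthPowers`, and its parameters are hypothetical, and it uses the safe `Vector<double>` array constructor rather than the pointer-based loads that the `UnsafeSimd` name suggests. It accumulates the sum of `(x - mean)^4`, which is what the `subpd` / `mulpd` x3 / `addpd` sequence computes. The listing repeats the loop body twice per iteration (an unroll by two); the sketch leaves that out.

```csharp
using System;
using System.Numerics;

public static class KurtosisSketch
{
    // Hypothetical helper: sum of (x - mean)^4 over values[start..end).
    public static double SumOfFourthPowers(double[] values, double mean, int start, int end)
    {
        int width = Vector<double>.Count;        // 2 lanes with SSE2, 4 with AVX2
        var meanVec = new Vector<double>(mean);  // mean broadcast to all lanes (xmm0 in the listing)
        var acc = Vector<double>.Zero;           // per-lane running sum (xmm2 in the listing)
        int i = start;

        // Vectorized part: one full vector per iteration.
        for (; i <= end - width; i += width)
        {
            var d = new Vector<double>(values, i) - meanVec;  // load + subpd
            acc += d * d * d * d;                             // three mulpd, then addpd
        }

        // Reduce the lanes to a scalar; a dot product with all-ones adds them up.
        double sum = Vector.Dot(acc, Vector<double>.One);

        // Scalar tail for elements that do not fill a whole vector.
        for (; i < end; i++)
        {
            double d = values[i] - mean;
            sum += d * d * d * d;
        }

        return sum;
    }
}
```

Keeping `meanVec` and `acc` as locals outside the loop is what lets the JIT hold them in registers for the whole loop, as the listing does with `xmm0` and `xmm2`.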

gfoidl commented 6 years ago

The code for the sequential and parallel versions is similar, except for the range each one processes. It can be refactored into cleaner code.
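
One possible shape for that refactoring, as a sketch only: a single core method takes the range as `[start, end)`, the sequential path calls it once with the whole array, and the parallel path calls it once per partition. All names here (`RangeRefactorSketch`, `SumCore`, `SequentialSum`, `ParallelSum`) are hypothetical and not taken from the repository, and the core is written as a scalar loop to keep the example short.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class RangeRefactorSketch
{
    // Shared core: the only thing that differs between callers is the range.
    private static double SumCore(double[] values, double mean, int start, int end)
    {
        double sum = 0;
        for (int i = start; i < end; i++)
        {
            double d = values[i] - mean;
            sum += d * d * d * d;
        }
        return sum;
    }

    // Sequential: one call over the whole array.
    public static double SequentialSum(double[] values, double mean)
        => SumCore(values, mean, 0, values.Length);

    // Parallel: one call per partition, partial sums combined under a lock.
    public static double ParallelSum(double[] values, double mean)
    {
        // Partitioner.Create(from, to) requires to > from, so guard the empty case.
        if (values.Length == 0) return 0;

        double total = 0;
        object gate = new object();

        Parallel.ForEach(Partitioner.Create(0, values.Length), range =>
        {
            double partial = SumCore(values, mean, range.Item1, range.Item2);
            lock (gate) { total += partial; }
        });

        return total;
    }
}
```

The lock is taken once per partition rather than once per element, so contention stays low. Because the partial sums are added in a nondeterministic order, the parallel result can differ from the sequential one in the last bits for general floating-point input.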