gfoidl / Stochastics

Stochastic tools, distribution, analysis
MIT License

AutoCorrelationToArrayParallelSimd custom partitioner #11

Closed gfoidl closed 6 years ago

gfoidl commented 6 years ago

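A custom partitioner is used to address the shape of the autocorrelation workload: partitions are sized so that each covers the same area (A1 = A2 = A3 in the sketch) rather than the same number of lags.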

        A
        |\
        |  \
        |   |\
        |   |  \ X
        |   |    \
        |   |    | \
        |   |    |   \ B
        |   |    |     \
        | A1| A2 | A3  |
        |   |    |     |
        |___|____|_____|
        0        n     N
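
The equal-area idea in the sketch can be written down as a small calculation. This is a minimal sketch, not the code from this PR: it assumes that lag k of a series with N samples costs about N - k operations, so the work over the lags [0, n) is the area below the line w(x) = N - x. The class `TrapezeBoundaries` and its `Compute` method are hypothetical names used for illustration only.

```csharp
using System;

public static class TrapezeBoundaries
{
    // Hypothetical helper. Assumption: lag k costs about (N - k) operations,
    // so the cumulative work up to x is W(x) = N*x - x^2/2.
    public static int[] Compute(int N, int n, int partitions)
    {
        if (N <= 0) throw new ArgumentOutOfRangeException(nameof(N));
        if (n <= 0 || n > N) throw new ArgumentOutOfRangeException(nameof(n));
        if (partitions <= 0) throw new ArgumentOutOfRangeException(nameof(partitions));

        // Total area of the trapeze between 0 and n.
        double total = (double)N * n - 0.5 * n * n;
        var bounds = new int[partitions + 1];   // bounds[0] is 0

        for (int i = 1; i < partitions; ++i)
        {
            // Solve W(x) = i / partitions * total for x (smaller root).
            double target = total * i / partitions;
            double x = N - Math.Sqrt((double)N * N - 2.0 * target);

            // Round to a lag index and keep the boundaries monotonic.
            int rounded = (int)Math.Round(x);
            bounds[i] = Math.Max(bounds[i - 1], Math.Min(n, rounded));
        }

        bounds[partitions] = n;
        return bounds;
    }
}
```

Under these assumptions, `TrapezeBoundaries.Compute(1000, 1000, 4)` returns the boundaries 0, 134, 293, 500, 1000: the first partition covers only 134 lags, the last one 500, yet each carries roughly a quarter of the work.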

Fixes https://github.com/gfoidl/Stochastics/issues/9

gfoidl commented 6 years ago

General

The following illustration shows the utilization patterns for

It can be clearly seen that the Pfx default and static partitioners don't fit the problem domain.
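
To make the mismatch concrete for the static case, the share of work that each equal-width range receives can be estimated. This is a rough sketch under the assumption that lag k costs about N - k operations (a continuous approximation of the triangular workload); `StaticSplitDemo` and `WorkShares` are hypothetical names, and the sketch says nothing about how the Pfx default partitioner behaves.

```csharp
using System;

public static class StaticSplitDemo
{
    // Hypothetical helper. Assumption: lag k costs about (N - k) operations.
    // Returns the fraction of the total work that falls into each of the
    // equal-width ranges a static split of [0, n) would produce.
    public static double[] WorkShares(int N, int n, int partitions)
    {
        double total = (double)N * n - 0.5 * n * n;
        var shares = new double[partitions];

        for (int i = 0; i < partitions; ++i)
        {
            double lo = (double)n * i / partitions;
            double hi = (double)n * (i + 1) / partitions;

            // Area below w(x) = N - x between lo and hi.
            double work = N * (hi - lo) - 0.5 * (hi * hi - lo * lo);
            shares[i] = work / total;
        }

        return shares;
    }
}
```

For N = n = 1000 and four ranges this gives shares of about 44 %, 31 %, 19 % and 6 %, so the worker with the first range has roughly seven times the work of the worker with the last one.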

Trapeze Partitioner

Utilization Patterns

Default Loop

Custom Loop

Benchmarks


BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=4
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.0.0), 64bit RyuJIT
| Method          |    N | PartitionMultiplier |       Mean |      Error |     StdDev | Scaled | ScaledSD |
|-----------------|-----:|--------------------:|-----------:|-----------:|-----------:|-------:|---------:|
| TrapezeWorkload | 1000 |                   1 |   234.2 us |   4.450 us |   4.371 us |   1.00 |     0.00 |
| CustomLoop      | 1000 |                   1 |   219.8 us |   2.722 us |   2.546 us |   0.94 |     0.02 |
| TrapezeWorkload | 1000 |                   2 |   234.1 us |   3.264 us |   3.053 us |   1.00 |     0.00 |
| CustomLoop      | 1000 |                   2 |   217.0 us |   2.339 us |   2.188 us |   0.93 |     0.01 |
| TrapezeWorkload | 1000 |                   3 |   232.3 us |   1.523 us |   1.425 us |   1.00 |     0.00 |
| CustomLoop      | 1000 |                   3 |   218.3 us |   2.659 us |   2.488 us |   0.94 |     0.01 |
| TrapezeWorkload | 1000 |                   4 |   236.0 us |   1.843 us |   1.634 us |   1.00 |     0.00 |
| CustomLoop      | 1000 |                   4 |   220.6 us |   1.595 us |   1.492 us |   0.93 |     0.01 |
| TrapezeWorkload | 1000 |                   8 |   229.6 us |   2.151 us |   2.012 us |   1.00 |     0.00 |
| CustomLoop      | 1000 |                   8 |   219.2 us |   1.758 us |   1.644 us |   0.95 |     0.01 |
| TrapezeWorkload | 2000 |                   1 |   804.3 us |   7.051 us |   6.596 us |   1.00 |     0.00 |
| CustomLoop      | 2000 |                   1 |   785.5 us |   9.119 us |   8.530 us |   0.98 |     0.01 |
| TrapezeWorkload | 2000 |                   2 |   801.7 us |   4.145 us |   3.675 us |   1.00 |     0.00 |
| CustomLoop      | 2000 |                   2 |   785.7 us |   5.515 us |   5.159 us |   0.98 |     0.01 |
| TrapezeWorkload | 2000 |                   3 |   803.0 us |   8.320 us |   7.782 us |   1.00 |     0.00 |
| CustomLoop      | 2000 |                   3 |   784.3 us |   7.768 us |   7.266 us |   0.98 |     0.01 |
| TrapezeWorkload | 2000 |                   4 |   802.4 us |   6.811 us |   6.371 us |   1.00 |     0.00 |
| CustomLoop      | 2000 |                   4 |   782.5 us |   5.974 us |   5.296 us |   0.98 |     0.01 |
| TrapezeWorkload | 2000 |                   8 |   800.4 us |   7.422 us |   6.943 us |   1.00 |     0.00 |
| CustomLoop      | 2000 |                   8 |   781.0 us |   4.718 us |   4.413 us |   0.98 |     0.01 |
| TrapezeWorkload | 5000 |                   1 | 4,664.5 us |  67.967 us |  60.251 us |   1.00 |     0.00 |
| CustomLoop      | 5000 |                   1 | 4,767.7 us | 109.661 us | 102.577 us |   1.02 |     0.02 |
| TrapezeWorkload | 5000 |                   2 | 4,630.1 us |  31.805 us |  29.751 us |   1.00 |     0.00 |
| CustomLoop      | 5000 |                   2 | 4,681.8 us |  41.649 us |  36.921 us |   1.01 |     0.01 |
| TrapezeWorkload | 5000 |                   3 | 4,657.0 us |  40.327 us |  37.722 us |   1.00 |     0.00 |
| CustomLoop      | 5000 |                   3 | 4,636.1 us |  46.268 us |  43.279 us |   1.00 |     0.01 |
| TrapezeWorkload | 5000 |                   4 | 4,655.6 us |  44.765 us |  41.873 us |   1.00 |     0.00 |
| CustomLoop      | 5000 |                   4 | 4,663.1 us |  36.008 us |  33.682 us |   1.00 |     0.01 |
| TrapezeWorkload | 5000 |                   8 | 4,633.2 us |  22.115 us |  20.686 us |   1.00 |     0.00 |
| CustomLoop      | 5000 |                   8 | 4,656.5 us |  37.476 us |  35.055 us |   1.01 |     0.01 |


Discussion

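The custom loop displays a better utilization pattern, and for smaller arrays its runtime is better than that of the default parallel foreach loop: about 5 to 7 % faster for N = 1000 (Scaled 0.93 to 0.95) and about 2 % faster for N = 2000 (Scaled 0.98). For N = 5000 the custom loop shows no timing benefit (Scaled 1.00 to 1.02), independent of the partition multiplier.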

The custom loop has no means of cooperative multitasking. Thus it may look better in the benchmark, but in a real-world application it can have a negative effect.
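
One reading of this remark can be illustrated with a deliberately simple loop. The sketch below is an assumption about what such a loop might look like, not the implementation from this PR: `FixedRangeLoop.Run` is a hypothetical helper that starts one dedicated thread per precomputed range. Each thread owns its range until it has finished, so ranges cannot be rebalanced and the threads are not shared with other work, whereas `Parallel.ForEach` schedules its work through the task scheduler.

```csharp
using System;
using System.Threading;

public static class FixedRangeLoop
{
    // Hypothetical helper: runs body(from, to) for every range
    // [bounds[i], bounds[i + 1]) on its own dedicated thread.
    public static void Run(int[] bounds, Action<int, int> body)
    {
        if (bounds == null) throw new ArgumentNullException(nameof(bounds));
        if (body == null) throw new ArgumentNullException(nameof(body));
        if (bounds.Length < 2) return;   // no range to process

        var threads = new Thread[bounds.Length - 1];

        for (int i = 0; i < threads.Length; ++i)
        {
            // Copy the limits so that each closure captures its own values.
            int from = bounds[i];
            int to = bounds[i + 1];

            threads[i] = new Thread(() => body(from, to));
            threads[i].Start();
        }

        // Block the caller until every range has been processed.
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
    }
}
```

In an isolated benchmark such a loop has the machine to itself. In an application that already keeps the cores busy, the extra threads compete with that work, which may be the kind of negative effect meant above.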

Conclusion