General

The following illustration shows the utilization patterns for the:

Pfx default partitioner
static partitioner
Trapeze partitioner
Trapeze partitioner with custom loop

The Pfx default partitioner and the static partitioner clearly do not fit the problem domain.
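
To illustrate why equal-sized ranges fit poorly, here is a minimal sketch of how trapeze-shaped partition boundaries could be computed. It is not the code from this PR: the class and method names are hypothetical, and it assumes that the work for index i is proportional to (n - i), so the total workload forms a triangle that is cut into slices of equal area.

```csharp
using System;

public static class TrapezePartitionSketch
{
    // Splits [0, n) into `partitions` ranges of roughly equal work, under the
    // assumption (not taken from the PR) that index i costs (n - i) units.
    public static (int Start, int End)[] CreateRanges(int n, int partitions)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        if (partitions < 1) throw new ArgumentOutOfRangeException(nameof(partitions));

        var ranges = new (int Start, int End)[partitions];
        int start = 0;

        for (int k = 1; k <= partitions; ++k)
        {
            // Continuous approximation: the work left of x is n*x - x^2/2.
            // Setting it to k/partitions of the total n^2/2 and solving for x
            // gives x = n * (1 - sqrt(1 - k/partitions)).
            double fraction = (double)k / partitions;
            int end = k == partitions
                ? n
                : (int)Math.Round(n * (1.0 - Math.Sqrt(1.0 - fraction)));

            end = Math.Max(end, start);   // keep the boundaries monotonic
            ranges[k - 1] = (start, end);
            start = end;
        }

        return ranges;
    }
}
```

Under this assumption, `CreateRanges(1000, 4)` yields the ranges [0, 134), [134, 293), [293, 500) and [500, 1000), each carrying about a quarter of the work. A static split into four ranges of 250 indices would give the first range roughly seven times the work of the last one.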

Trapeze Partitioner

Utilization Patterns

Default Loop

Custom Loop

Benchmarks


BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=4
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.0.0), 64bit RyuJIT

Method	N	PartitionMultiplier	Mean	Error	StdDev	Scaled	ScaledSD
TrapezeWorkload	1000	1	234.2 us	4.450 us	4.371 us	1.00	0.00
CustomLoop	1000	1	219.8 us	2.722 us	2.546 us	0.94	0.02
TrapezeWorkload	1000	2	234.1 us	3.264 us	3.053 us	1.00	0.00
CustomLoop	1000	2	217.0 us	2.339 us	2.188 us	0.93	0.01
TrapezeWorkload	1000	3	232.3 us	1.523 us	1.425 us	1.00	0.00
CustomLoop	1000	3	218.3 us	2.659 us	2.488 us	0.94	0.01
TrapezeWorkload	1000	4	236.0 us	1.843 us	1.634 us	1.00	0.00
CustomLoop	1000	4	220.6 us	1.595 us	1.492 us	0.93	0.01
TrapezeWorkload	1000	8	229.6 us	2.151 us	2.012 us	1.00	0.00
CustomLoop	1000	8	219.2 us	1.758 us	1.644 us	0.95	0.01
TrapezeWorkload	2000	1	804.3 us	7.051 us	6.596 us	1.00	0.00
CustomLoop	2000	1	785.5 us	9.119 us	8.530 us	0.98	0.01
TrapezeWorkload	2000	2	801.7 us	4.145 us	3.675 us	1.00	0.00
CustomLoop	2000	2	785.7 us	5.515 us	5.159 us	0.98	0.01
TrapezeWorkload	2000	3	803.0 us	8.320 us	7.782 us	1.00	0.00
CustomLoop	2000	3	784.3 us	7.768 us	7.266 us	0.98	0.01
TrapezeWorkload	2000	4	802.4 us	6.811 us	6.371 us	1.00	0.00
CustomLoop	2000	4	782.5 us	5.974 us	5.296 us	0.98	0.01
TrapezeWorkload	2000	8	800.4 us	7.422 us	6.943 us	1.00	0.00
CustomLoop	2000	8	781.0 us	4.718 us	4.413 us	0.98	0.01
TrapezeWorkload	5000	1	4,664.5 us	67.967 us	60.251 us	1.00	0.00
CustomLoop	5000	1	4,767.7 us	109.661 us	102.577 us	1.02	0.02
TrapezeWorkload	5000	2	4,630.1 us	31.805 us	29.751 us	1.00	0.00
CustomLoop	5000	2	4,681.8 us	41.649 us	36.921 us	1.01	0.01
TrapezeWorkload	5000	3	4,657.0 us	40.327 us	37.722 us	1.00	0.00
CustomLoop	5000	3	4,636.1 us	46.268 us	43.279 us	1.00	0.01
TrapezeWorkload	5000	4	4,655.6 us	44.765 us	41.873 us	1.00	0.00
CustomLoop	5000	4	4,663.1 us	36.008 us	33.682 us	1.00	0.01
TrapezeWorkload	5000	8	4,633.2 us	22.115 us	20.686 us	1.00	0.00
CustomLoop	5000	8	4,656.5 us	37.476 us	35.055 us	1.01	0.01


Discussion

The custom loop shows a better utilization pattern, and for smaller arrays it is faster than the default parallel foreach-loop: about 5 to 7 % for N = 1000 and about 2 % for N = 2000. For N = 5000 the custom loop shows no timing benefit; both variants are within about 2 % of each other.

The custom loop has no means of cooperative multitasking. It may therefore perform better in the benchmark, but in a real-world application it can have a negative effect.

Conclusion

The default parallel loop is kept for the sake of cooperative multitasking.
For the partition count, 8x the processor count is chosen.
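
The two decisions above could be combined roughly as follows. This is a sketch under assumptions, not the implementation from this PR: `AutoCorrelationSketch.Compute` is a hypothetical scalar stand-in for the library method, lag m is assumed to cost (n - m) multiply-adds, and the partition boundaries are computed by an equal-area split of that triangular workload.

```csharp
using System;
using System.Threading.Tasks;

public static class AutoCorrelationSketch
{
    // Hypothetical stand-in: r[m] = sum over k of x[k] * x[k + m].
    public static double[] Compute(double[] x)
    {
        int n = x.Length;
        var result = new double[n];
        if (n == 0) return result;

        // Partition count as chosen in the conclusion: 8x the processor count
        // (capped at n so that tiny inputs do not get more partitions than lags).
        int partitions = Math.Min(n, 8 * Environment.ProcessorCount);
        var ranges = new (int Start, int End)[partitions];

        // Equal-work boundaries, assuming lag m costs (n - m) multiply-adds.
        int start = 0;
        for (int p = 1; p <= partitions; ++p)
        {
            int end = p == partitions
                ? n
                : (int)Math.Round(n * (1.0 - Math.Sqrt(1.0 - (double)p / partitions)));
            end = Math.Max(end, start);
            ranges[p - 1] = (start, end);
            start = end;
        }

        // Default parallel loop: the TPL decides how many workers run and
        // shares the thread pool with other work in the process.
        Parallel.ForEach(ranges, range =>
        {
            // Each lag m is written by exactly one partition, so no locking is needed.
            for (int m = range.Start; m < range.End; ++m)
            {
                double sum = 0;
                for (int k = 0; k < n - m; ++k)
                    sum += x[k] * x[k + m];
                result[m] = sum;
            }
        });

        return result;
    }
}
```

For the input { 1, 2, 3 } this returns { 14, 8, 3 }.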

gfoidl / Stochastics

AutoCorrelationToArrayParallelSimd custom partitioner #11