HadrienG2 / parallel-histograms-bench

Exercising various strategies for parallel histogramming in HEP

Perf profile on i9-9980XE #2

Open mratsim opened 4 years ago

mratsim commented 4 years ago

Hello Hadrien,

I was looking for some nice map-reduce benchmarks to evaluate my multithreading runtime (Weave), and I was thinking that histograms could be a nice example.

Your benchmark's performance profile is very strange on my machine:

This is the output of the default configuration:

    const NUM_BINS: usize = 1000;
    const NUM_ROLLS: usize = 300_000_000;
    const BATCH_SIZE: usize = 100;
    const NUM_BUCKETS: usize = 2;
7.258854153333333 ns/iter, test tests::sequential_raw ... ok
7.643087426666667 ns/iter, test tests::sequential_thread_local ... ok
7.814808806666667 ns/iter, test tests::sequential_thread_bucketized ... ok
7.817408296666667 ns/iter, test tests::sequential_mutex ... ok
12.8888737 ns/iter, test tests::sequential_atomic ... ok
30.117832493333335 ns/iter, test tests::parallel_mutex ... ok
30.118382763333333 ns/iter, test tests::parallel_thread_bucketized ... ok
30.119495196666666 ns/iter, test tests::parallel_atomic ... ok
30.115383773333335 ns/iter, test tests::parallel_thread_local ... ok

I'm not sure why all the parallel versions are equally slow.

Side note on RNG

Also, you might want to preallocate the array of values so that it's easier to compare across languages without the influence of the RNG implementation.

AFAIK many languages (Nim, Lua, Julia) are using Xoroshiro instead of the XorShift that Rust seems to be using. Furthermore, the Xoroshiro family all have a jump function that allows them to jump ahead by 2^64 generated numbers (see http://prng.di.unimi.it/xoroshiro128plus.c). This would allow reproducible RNG sequences from a single seed without conflicts, as long as each thread is started 2^64 RNG iterations apart.

Reference: http://prng.di.unimi.it/

HadrienG2 commented 4 years ago

That's definitely not an expected output. At first I suspected that the compiler might have become more clever and optimized the microbenchmark out, but I just checked that as of latest stable (1.39) that's not true on my old i3-based machine:

$ cargo test --release -- --nocapture --test-threads=1
running 9 tests
test tests::parallel_atomic ... 8.75734467 ns/iter, ok
test tests::parallel_mutex ... 17.071206826666668 ns/iter, ok
test tests::parallel_thread_bucketized ... 6.003549363333334 ns/iter, ok
test tests::parallel_thread_local ... 4.479292656666667 ns/iter, ok
test tests::sequential_atomic ... 13.806369643333333 ns/iter, ok
test tests::sequential_mutex ... 10.22155482 ns/iter, ok
test tests::sequential_raw ... 9.2978603 ns/iter, ok
test tests::sequential_thread_bucketized ... 10.252130513333332 ns/iter, ok
test tests::sequential_thread_local ... 10.2098582 ns/iter, ok

Intuitively, the fact that all parallel versions take the same time on your machine makes me suspect that there's an issue with parallelization. Can you cross-check in your system monitor that all CPU cores actually get used during your parallel runs?


Regarding RNGs, I am not comfortable with precomputing all data in advance because it...

  1. Limits maximal benchmark size (too much and you exhaust your RAM).
  2. Quickly makes things bottlenecked by RAM bandwidth, which may hide other interesting effects.

We could probably agree on some common choice of RNG + seed though.

FYI, Rust's rand library uses a few rounds of the ChaCha block cipher as its default PRNG for security reasons. I didn't bother changing it at the time since it was fast enough not to hide the effect that I was interested in and speed of development was important.

But since you're interested in a more easily reproducible configuration, I updated the latest master to use xoshiro128+ with the initial seed [0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0, 0x0f, 0xed, 0xcb, 0xa9, 0x87, 0x56, 0x43, 0x21]. Each thread makes a clone of the global RNG then makes it jump() during its initialization phase.

Note that this is not enough to make execution reproducible between e.g. sequential and N-thread runs, only across runs with the same number of threads. That should be enough though: with 300M iterations, this benchmark has more than enough statistics for the exact choice of seed not to have a significant influence on the results...
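For reference, here is a minimal sketch of that per-thread RNG setup, assuming the rand_xoshiro crate (the benchmark's actual code may be organized differently):

    // Sketch of the per-thread RNG scheme described above; the crate and
    // helper names are assumptions, not necessarily the benchmark's code.
    use rand_xoshiro::rand_core::SeedableRng;
    use rand_xoshiro::Xoshiro128Plus;

    const SEED: [u8; 16] = [
        0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0,
        0x0f, 0xed, 0xcb, 0xa9, 0x87, 0x56, 0x43, 0x21,
    ];

    fn make_thread_rng(thread_idx: u64) -> Xoshiro128Plus {
        // Every thread starts from the same global seed...
        let mut rng = Xoshiro128Plus::from_seed(SEED);
        // ...then jumps ahead by thread_idx * 2^64 draws, so the per-thread
        // streams cannot overlap as long as each thread draws fewer than
        // 2^64 numbers.
        for _ in 0..thread_idx {
            rng.jump();
        }
        rng
    }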

HadrienG2 commented 4 years ago

Also, bear in mind that this benchmark was built to study and compare the overhead of various concurrent histogramming strategies, and therefore puts a somewhat unfair amount of worst-case emphasis on said overhead as opposed to trying to be a fair representation of how people fill histograms in practice.

mratsim commented 4 years ago

What is strange is that I see multiple threads during the sequential execution but they seem idle during the parallel execution:

During sequential execution: [screenshot]

Transitioning to "parallel": [screenshot]

[screenshot]

I don't really mind the non-reproducibility as long as it gives me a ballpark order of magnitude for the running time and the speedup of parallel over sequential.

In any case, the good news is that I ended up finding and porting another 2D histogram benchmark to my runtime (and the bad news is that it's a pathological case for it).

Anyway feel free to close the issue, unfortunately I don't think I'll have the time to relearn Rust and then understand why it behaves this way on my machine.

HadrienG2 commented 4 years ago

Aha, I see what your problem is. You ran cargo test --release -- --nocapture --test-threads=36, whereas you should use --test-threads=1 here. It's totally my fault though, I should really stop misusing Rust's built-in test framework as a microbenchmarking framework.

You see, the thing is that Rust provides a nice built-in test runner, and I find it great to be able to just annotate a bunch of functions as #[test] and get them all to run automagically via a simple cargo test. But although a benchmark-oriented cousin exists (unsurprisingly called #[bench] and invoked via cargo bench), it has not been stabilized yet because it has some design issues that the Rust team would like to take care of first.

So what I do in the meantime is to abuse the test runner as a benchmark runner: each benchmark is written as a #[test] that measures and prints its own timings (hence --nocapture), and the whole suite is run with --test-threads=1.

This latter flag does not prevent an individual "test" from spawning multiple threads internally, and that's exactly what happens here when running with test-threads=1, as intended.
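In case it helps, the pattern boils down to something like the following sketch (the timing helper and test body are illustrative, not the repository's actual code):

    use std::time::Instant;

    /// Times `body` over `iters` iterations and prints ns/iter.
    /// The output is only visible when running with `--nocapture`.
    fn time_it(name: &str, iters: u64, mut body: impl FnMut()) {
        let start = Instant::now();
        for _ in 0..iters {
            body();
        }
        let ns_per_iter = start.elapsed().as_nanos() as f64 / iters as f64;
        println!("{} ... {} ns/iter", name, ns_per_iter);
    }

    #[test]
    fn sequential_raw() {
        let mut bins = vec![0u64; 1000];
        time_it("sequential_raw", 10_000_000, || {
            bins[7] += 1; // stand-in for "roll the RNG, increment the right bin"
        });
    }

Running this with cargo test --release -- --nocapture --test-threads=1 prints one timing per "test", with no other "test" running concurrently to pollute the measurement.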

Anyhow, good luck with your project, reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;)

I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!

mratsim commented 4 years ago

Anyhow, good luck with your project, reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;)

I wanted to add some special support for them but I guess I'll just do for loop + critical sections/atomics ;)

I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!

You're in for a treat then: for both task parallelism and data parallelism, my library actually has significantly less overhead and similar or better load distribution than:

Unfortunately, I didn't find any task-parallel Rust library to compare with.

For the scheduler-overhead-focused benchmarks in my bench suite (https://github.com/mratsim/weave/tree/master/benchmarks), the difference can be very significant:

It can also deal with properly nested loops, not just OpenMP collapsed loops, which was the main reason I started this runtime: for the linear algebra/deep learning compiler I'm planning, I want nested loop parallelism to simplify, for example, batched matrix multiplication implementations.

Example on a 2D tiled matrix transposition algorithm (you can embed pure C in Nim): https://github.com/mratsim/weave/tree/333375df1e268f42207bc88a4881157de3af054e/benchmarks/matrix_transposition

Perf numbers (the bench is quite unstable due to the matrices being small, but it gives a ±20% ballpark approximation): [screenshot]

HadrienG2 commented 4 years ago

Can you give me a sample build command for the purpose of experimenting with this further?

I've never built a Nim program before, and did not find anything looking like a build system in the weave repo so I just downloaded the latest Nim nightly (as stable wouldn't build your benchmarks), looked at nim --help and tried to build with every flag which sounded like it could possibly have an effect on performance (nim compile --opt:speed --stackTrace:off --lineTrace:off --threads:on --checks:off --assertions:off -d:release weave_fib.nim). But I'm not sure if that's the right way to go.

So I just wanted to check with you what's the recommended way to build a Nim program in an "as fast as possible" configuration.

HadrienG2 commented 4 years ago

In any case, here are quick and dirty Rayon ports of your simplest benchmarks. On my machine, they exhibit performance that's roughly on par with the Nim version (compiled with either the bunch of flags above or the --threads:on -d:danger combo, which I've discovered more recently and which performs equally well while being shorter).

Which, honestly, is already a pretty impressive achievement from your library, given the design constraints that you're operating under.

mratsim commented 4 years ago

Nice that you figured this out; yes, --threads:on and -d:danger is the right combination.
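Putting those flags together, the build command looks something like this (using the weave_fib.nim file mentioned above; nim c is the short form of nim compile):

    nim c --threads:on -d:danger weave_fib.nim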

Note that I think in your fib benchmark you are spawning an extra thread compared to my bench if I read this correctly: https://github.com/HadrienG2/weave-parallel-benchmarks-rs/blob/004a559970bf74ebd22f69fd0abe101f85765c0c/src/bin/fib.rs#L5

Here are the build command line arguments for Weave:

HadrienG2 commented 4 years ago

So, actually that was a benchmark bug, in the sense that I was inadvertently cheating by using a Rayon construct that's idiomatic for users of that library, but less general-purpose than the Weave spawn construct and therefore more amenable to optimization by the Rayon implementors.

The latest master provides a clearer choice between "use the idiomatic Rayon tool for the job" and "closely match what the Nim version of the benchmark does". It does the latter by default (let's play fair), and does the former when built with --features idiomatic.

You will notice that the fib benchmark performs much worse in the non-idiomatic configuration, because Rayon's scope() construct could really use some optimization love by the implementors. IIUC, as currently implemented, it forces migration of the task inside the scope to a different thread on the thread pool, and uses one heap allocation per spawned task, both of which are unpleasantly costly. The reason why it hasn't been optimized more, as far as I know, is that most algorithms are actually well covered by join()-style binary fork join...
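To make the contrast concrete, here is a rough sketch of the two styles for the fib benchmark (illustrative code, not the repository's exact benchmarks): join() is the idiomatic Rayon binary fork-join, while scope()/spawn() more closely mirrors Weave's spawn-based API.

    fn fib_join(n: u64) -> u64 {
        if n < 2 {
            return n;
        }
        // Idiomatic Rayon: binary fork-join, no per-task heap allocation.
        let (a, b) = rayon::join(|| fib_join(n - 1), || fib_join(n - 2));
        a + b
    }

    fn fib_scope(n: u64) -> u64 {
        if n < 2 {
            return n;
        }
        let (mut a, mut b) = (0, 0);
        rayon::scope(|s| {
            // Each spawn() heap-allocates a task that may migrate to another
            // worker thread, which is where the extra overhead comes from.
            s.spawn(|_| a = fib_scope(n - 1));
            b = fib_scope(n - 2);
        });
        a + b
    }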

HadrienG2 commented 4 years ago

...and after further reflection, the non-idiomatic version of dfs also needs to become even more ugly (from a Rust + rayon user's pov of course) before it becomes a fair imitation of what the Weave version is doing. Sigh.

Note that the Rayon team has plans to integrate support for Rust futures in the longer term, which could be more directly comparable with Weave's future-based design (and hopefully will motivate the Rayon team to improve scope's performance ;)).

mratsim commented 4 years ago

Well, my goal with those benchmarks is ultimately to produce a tool that helps me write fast code; as long as there is a way to express the computation and you don't have to bend too much compared to something natural, I think it's OK in the end.

In those benches though, the goal was to compare scheduler overhead, so the benchmarks need to spawn a couple trillion tasks to be meaningful ;).

Now I think it's fine to have both implementations side by side. In my own library, reductions are unfortunately broken, so I would probably fare badly with a map-reduce implementation. (But I have ways around that: OpenMP reductions are also a bit inflexible, so I used a plain parallel for when I was still hoping to salvage OpenMP - https://github.com/numforge/laser/blob/master/examples/ex05_tensor_parallel_reduction.nim)

The pattern for both fib and dfs is pretty regular though; I think nqueens might be a bit harder without futures.

HadrienG2 commented 4 years ago

In those benches though, the goal was to compare scheduler overhead, so the benchmarks need to spawn a couple trillion tasks to be meaningful ;).

At the risk of sounding pedantic even though I understand what you mean (and largely agree!), the word "task" can have two meanings in modern parallel runtimes:

  1. The elementary unit of independent work that is specified by the user, which can be run in parallel with others.
  2. The concrete work-item that ends up being shuffled around in the runtime's scheduling queues.

The former is an API-visible concern for the runtime's users; the latter is a runtime implementor concern that should remain hidden from the API and only emerge when the runtime makes bad scheduling decisions that force the user to exert some manual control over it.

A very fertile ground for parallel runtime ergonomics and optimization over the years has been to realize that these two kinds of work items do not need to have the same granularity:

  1. In the OpenMP era, a first generation of clever runtime implementors realized that they did not need to push one work-item in their scheduling queues per iteration of a "parallel for" loop, but could instead split the iteration space into chunks of larger granularity, only as far as demanded by the machine's level of parallelism.
  2. Later on, in the TBB era, a second generation of even more clever runtime implementors realized that work splitting did not have to be a static compile-time or configuration decision, but could instead be something that the runtime does dynamically as demanded by scheduling circumstances, core processing ability and load imbalance.
  3. I wonder if there are runtimes in the wild that also try to merge back work items for the sake of reducing pressure on scheduling queues when there are too many tasks in flight? From reading through either Weave's description or the Picasso RFC, I got the impression that this was something that you were interested in doing or already experimenting with.

Runtimes with dynamic work splitting make more complex dynamic scheduling decisions, and therefore are intrinsically slower (assuming an equal degree of optimization) in the ideal case of a homogeneous "parallel for" iteration workload whose execution schedule can be easily decided at compile time. But they make up for it by handling more complex scenarios like quicksort or heterogeneous hardware efficiently without requiring manual user intervention like specifications of work granularity.
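To make the two generations concrete, here is a minimal Rayon sketch (illustrative only): the first function emulates OpenMP-style static chunking with one chunk per core, the second lets the runtime split the range on demand as idle workers steal work.

    use rayon::prelude::*;

    fn sum_static(data: &[u64], num_cores: usize) -> u64 {
        // "OpenMP era": split the iteration space up front into one chunk
        // per core, so only `num_cores` work items hit the scheduling queues.
        let chunk = ((data.len() + num_cores - 1) / num_cores).max(1);
        data.par_chunks(chunk).map(|c| c.iter().sum::<u64>()).sum()
    }

    fn sum_dynamic(data: &[u64]) -> u64 {
        // "TBB era": the runtime splits the range dynamically, as demanded
        // by load imbalance (this is what Rayon does by default).
        data.par_iter().sum()
    }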

Therefore, although I agree with you that scheduling overhead can and should be benchmarked at the granularity of individual work-item handling in scheduling queues, it's important to keep in mind that such a benchmark is not 100% fair to the cleverness and flexibility of modern dynamic runtimes (which cannot win against an equally well-optimized static runtime with dumber and faster scheduling decisions, only get close), and that we also need benchmarks which stress a runtime's ability to make good work splitting and merging decisions.

And this is why I thought it was interesting to have both versions of the benchmark. One which is a pure stress test for the runtime's scheduling queues (and therefore tests the runtime's low-level scheduling queue optimizations), and one which uses the native runtime API in a way which is both more convenient to the user and a bit less stressful for the runtime's queues (and therefore tests the runtime's high-level scheduling logic and "exposing the right API" optimizations).

Anyway, enough philosophical babbling about APIs and dynamic scheduling, let's try to add an nqueens port to my repo ;)

mratsim commented 4 years ago

In those benches though, the goal was to compare scheduler overhead, so the benchmarks need to spawn a couple trillion tasks to be meaningful ;).

At the risk of sounding pedantic even though I understand what you mean (and largely agree!), the word "task" can have two meanings in modern parallel runtimes:

1. The elementary unit of independent work that is _specified_ by the user, which _can_ be run in parallel with others.

2. The concrete work-item that ends up being shuffled around in the runtime's scheduling queues.

The former is an API-visible concern for the runtime's users; the latter is a runtime implementor concern that should remain hidden from the API and only emerge when the runtime makes bad scheduling decisions that force the user to exert some manual control over it.

A very fertile ground for parallel runtime ergonomics and optimization over the years has been to realize that these two kinds of work items do not need to have the same granularity:

1. In the OpenMP era, a first generation of clever runtime implementors realized that they did not need to push one work-item in their scheduling queues per iteration of a "parallel for" loop, but could instead split the iteration space into chunks of larger granularity, only as far as demanded by the machine's level of parallelism.

2. Later on, in the TBB era, a second generation of even more clever runtime implementors realized that work splitting did not have to be a static compile-time or configuration decision, but could instead be something that the runtime does dynamically as demanded by scheduling circumstances, core processing ability and load imbalance.

Agreed

3. I wonder if there are runtimes in the wild that also try to _merge back_ work items for the sake of reducing pressure on scheduling queues when there are too many tasks in flight? From reading through either Weave's description or the Picasso RFC, I got the impression that this was something that you were interested in doing or already experimenting with.

Actually, Weave doesn't merge back parallel loops; what it does is shown here: https://github.com/mratsim/weave/blob/333375df1e268f42207bc88a4881157de3af054e/weave/parallel_for.nim#L23-L47

template parallelForWrapper(
    idx: untyped{ident},
    prologue, loopBody, epilogue,
    remoteAccum, resultTy,
    returnStmt: untyped): untyped =
  ## To be called within a loop task
  ## Gets the loop bounds and iterate the over them
  ## Also poll steal requests in-between iterations
  ##
  ## Loop prologue, epilogue,
  ## remoteAccum, resultTy and returnStmt
  ## are unused

  block:
    let this = myTask()
    ascertain: this.isLoop
    ascertain: this.start == this.cur

    var idx {.inject.} = this.start
    this.cur += this.stride
    while idx < this.stop:
      loopBody
      idx += this.stride
      this.cur += this.stride
      loadBalance(Weave)

After each loop iteration, the runtime checks for incoming steal requests (an unfortunate overhead of using message passing: the workers that have actual work also have to do the load balancing, whereas in classic Chase-Lev deque work-stealing it's the thieves that have nothing to do that incur this overhead). The this.stop bound is dynamic: if there are incoming requests, the worker splits the remaining iterations to satisfy all of them and sends the pieces to the thieves (a very nice side benefit of using message passing; lazy loop splitting would require synchronization between workers and thieves if the loop were the last task of a Chase-Lev deque).
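Conceptually, the splitting behaves roughly like the Rust sketch below (types and names are made up for illustration; Weave's real Nim implementation differs):

    use std::sync::mpsc::Sender;

    struct LoopTask {
        cur: usize,
        stop: usize,
        stride: usize,
    }

    struct StealRequest {
        reply: Sender<LoopTask>, // channel back to the thief
    }

    fn run_loop(task: &mut LoopTask, pending: &mut Vec<StealRequest>, body: impl Fn(usize)) {
        let mut idx = task.cur;
        while idx < task.stop {
            body(idx);
            idx += task.stride;
            task.cur = idx;

            // "loadBalance": answer thieves by splitting off part of the
            // *remaining* iteration range and sending it to them.
            while let Some(req) = pending.pop() {
                let remaining = task.stop.saturating_sub(task.cur) / task.stride;
                if remaining < 2 {
                    break; // not enough work left to be worth splitting
                }
                let split = task.stop - (remaining / 2) * task.stride;
                let _ = req.reply.send(LoopTask { cur: split, stop: task.stop, stride: task.stride });
                task.stop = split; // the victim keeps only the front part
            }
        }
    }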

There are many benefits to this: it makes writing generic algorithms like map much easier, whereas static scheduling needs inflexible heuristics about when a loop should or should not be split. This is shown in these OpenMP-for-PyTorch benchmarks https://github.com/zy97140/omp-benchmark-for-pytorch and issue https://github.com/pytorch/pytorch/issues/3146 which, in summary, tell us for 3 classes of processors from how many tensor elements it becomes worth parallelizing:

| CPU Model | Sockets | Cores/Socket | Frequency |
| --- | --- | --- | --- |
| Intel(R) Xeon(R) CPU E5-2699 v4 | 2 | 22 | 2.20GHz |
| Intel(R) Xeon(R) Platinum 8180 CPU | 2 | 28 | 2.50GHz |
| Intel(R) Core(TM) i7-5960X CPU | 1 | 8 | 3.00GHz |

Tested operations:

copy add div sin exp sum prod

For contiguous tensor operations:

| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
| --- | --- | --- | --- |
| copy | 80k | 20k | 8k |
| add | 80k | 20k | 8k |
| div | 50k | 10k | 2k |
| exp | 1k | 1k | 1k |
| sin | 1k | 1k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |

For discontiguous tensor operations:

| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
| --- | --- | --- | --- |
| copy | 20k | 8k | 2k |
| add | 20k | 8k | 2k |
| div | 10k | 8k | 1k |
| exp | 1k | 1k | 1k |
| sin | 2k | 2k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |

This leads to optimizing for the lowest common denominator (say a 4~8 core CPU) instead of properly using the hardware. My own low-level library, which I'm building as a linear-algebra/HPC/deep-learning backend, struggles with the same thing:

const OMP_MEMORY_BOUND_GRAIN_SIZE*{.intdefine.} = 1024
  ## This is the minimum amount of work per physical cores
  ## for memory-bound processing.
  ## - "copy" and "addition" are considered memory-bound
  ## - "float division" can be considered 2x~4x more complex
  ##   and should be scaled down accordingly
  ## - "exp" and "sin" operations are compute-bound and
  ##   there is a perf boost even when processing
  ##   only 1000 items on 28 cores
  ##
  ## Launching 2 threads per core (HyperThreading) is probably desirable:
  ##   - https://medium.com/data-design/destroying-the-myth-of-number-of-threads-number-of-physical-cores-762ad3919880
  ##
  ## Raising the following parameters can have the following impact:
  ##   - number of sockets: higher, more overhead per memory fetch
  ##   - number of memory channel: lower, less overhead per memory fetch
  ##   - RAM speed: lower, less overhead per memory fetch
  ##   - Private L2 cache: higher, feed more data per CPU
  ##   - Hyperthreading and cache associativity
  ##   - Cores, shared L3 cache: Memory contention
  ##
  ## Note that setting num_threads manually might impact performance negatively:
  ##   - http://studio.myrian.fr/openmp-et-num_threads/
  ##     > 2x2ms overhead when changing num_threads from 16->6->16

const OMP_NON_CONTIGUOUS_SCALE_FACTOR*{.intdefine.} = 4
  ## Due to striding computation, we can use a lower grainsize
  ## for non-contiguous tensors
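As a rough illustration of how such a grain-size constant gets consumed (hypothetical helper, not Laser's actual code): the element count, grain size and core count together bound how many chunks are worth creating. For example, with a grain size of 1024, a 20_000-element memory-bound operation yields at most 19 chunks, so it cannot even saturate a 28-core socket.

    // Hypothetical sketch: derive a useful chunk count from a grain size.
    // Assumes num_cores >= 1; returning 1 means "don't parallelize at all".
    fn num_chunks(elements: usize, grain_size: usize, num_cores: usize) -> usize {
        // Never create chunks smaller than the grain size, and never more
        // chunks than there are cores to run them.
        let max_useful = elements / grain_size.max(1);
        max_useful.clamp(1, num_cores)
    }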

Unfortunately, I'm also pretty sure that Intel TBB suffers from the same curse of "how to choose a grain size": the splitting is eager and there is no feedback loop that would inform the runtime "is it worth parallelizing if I have 1000 exponentiations?", "what if it's 1000 cos?". So it's up to the developer to figure out the grain size, and you get issues that cannot be solved: https://github.com/pytorch/pytorch/issues/3146

Now, maybe re-merging would be interesting but I think not splitting in the first place is best ;).

Also, lazy loop splitting composes much better with nested parallelism, where parallelism is found at a higher level like another for loop (batched matrix multiplication) or graph-level parallelism for neural networks with split code paths, see https://github.com/pytorch/glow/issues/1749, https://github.com/cpp-taskflow/cpp-taskflow/issues/97.

Dealing with nested parallelism is actually one of the main Julia focuses with PARTR, which uses something relatively unknown: a parallel depth-first scheduler instead of a work-stealing scheduler. Instead of spawning tasks from the root of the task tree (assuming spawn/sync semantics), it goes depth-first to reuse the hard parallelization work done by the people behind BLAS or FFTW. Actually, they went one step beyond that and made the FFTW threadpool backend modular https://github.com/FFTW/fftw3/issues/175 and are in the process of doing the same for OpenBLAS https://github.com/xianyi/OpenBLAS/pull/2255. So this is also an opportunity for Rayon ;)


In short, I think that in today's hardware era, a static scheduler cannot keep up, even for simple for loops.


If you like reading about runtimes, I have a treasure chest of papers there: https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md

For dealing with overheads, I have laid out my techniques for multithreaded memory management here:

mratsim commented 4 years ago

Anyhow, good luck with your project, reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;)

I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!

Some good news on my side: I finally understood why I had very low performance on parallel reductions in https://github.com/mratsim/weave/pull/83. It was because my "loadBalance" routine didn't split parallel loop tasks, meaning I had sequential speed plus the overhead of steal requests that couldn't be satisfied. Following the fix, Weave is now faster than OpenMP on my used-to-be-pathological histogram benchmark.

The last frontier is implementing a matrix multiplication that is as fast as Intel MKL GEMM. On my machine, with a max throughput of 4TFlops, for a 1920x1920 by 1920x1920 matrix multiplication

mratsim commented 4 years ago

I released v0.3.0 of Weave with several fixes and improvements on nested data parallelism:

On Matrix Multiplication on my 18-core machine with a single-threaded baseline of:

I get the following speedup with Weave (backoff means the worker goes to sleep on steal failure, not just pause::memory).

The main problem with a message-passing scheduler is that for coarse-grained tasks, the thief needs the cooperation of the victim, which might be busy in a heavy computation kernel. So even in a worst-case scenario, with a relatively long inner matmul kernel during which the victim cannot reply to thieves and lots of synchronization due to double or triple nested parallelism, message-passing based runtimes can scale.

I tried looking into pure Rust matrix multiplication as well but: