HadrienG2 / parallel-histograms-bench

Exercising various strategies for parallel histogramming in HEP

Perf profile on i9-9980XE #2

Open mratsim opened 4 years ago

mratsim commented 4 years ago

Hello Hadrien,

I was looking for some nice map-reduce benchmarks to evaluate my multithreading runtime (Weave), and I was thinking that histograms could be a nice example.

Your benchmark's performance profile is very strange on my machine:

This is the output of the default configuration:

    const NUM_BINS: usize = 1000;
    const NUM_ROLLS: usize = 300_000_000;
    const BATCH_SIZE: usize = 100;
    const NUM_BUCKETS: usize = 2;
7.258854153333333 ns/iter, test tests::sequential_raw ... ok
7.643087426666667 ns/iter, test tests::sequential_thread_local ... ok
7.814808806666667 ns/iter, test tests::sequential_thread_bucketized ... ok
7.817408296666667 ns/iter, test tests::sequential_mutex ... ok
12.8888737 ns/iter, test tests::sequential_atomic ... ok
30.117832493333335 ns/iter, test tests::parallel_mutex ... ok
30.118382763333333 ns/iter, test tests::parallel_thread_bucketized ... ok
30.119495196666666 ns/iter, test tests::parallel_atomic ... ok
30.115383773333335 ns/iter, test tests::parallel_thread_local ... ok

I'm not sure why all the parallel versions are equally slow.

Side note on RNG

Also, you might want to preallocate the array of values so that it's easier to compare across languages without the influence of the RNG implementation.

AFAIK many languages (Nim, Lua, Julia) are using Xoroshiro instead of the XorShift that Rust seems to be using. Furthermore, the Xoroshiro family all have a jump function that allows them to jump ahead by 2^64 generated numbers (see http://prng.di.unimi.it/xoroshiro128plus.c). This would allow reproducible RNG sequences from a single seed without conflicts, as long as each thread is started 2^64 RNG iterations apart.

Reference: http://prng.di.unimi.it/

HadrienG2 commented 4 years ago

That's definitely not an expected output. At first I suspected that the compiler might have become more clever and optimized the microbenchmark out, but I just checked that as of latest stable (1.39) that's not true on my old i3-based machine:

$ cargo test --release -- --nocapture --test-threads=1
running 9 tests
test tests::parallel_atomic ... 8.75734467 ns/iter, ok
test tests::parallel_mutex ... 17.071206826666668 ns/iter, ok
test tests::parallel_thread_bucketized ... 6.003549363333334 ns/iter, ok
test tests::parallel_thread_local ... 4.479292656666667 ns/iter, ok
test tests::sequential_atomic ... 13.806369643333333 ns/iter, ok
test tests::sequential_mutex ... 10.22155482 ns/iter, ok
test tests::sequential_raw ... 9.2978603 ns/iter, ok
test tests::sequential_thread_bucketized ... 10.252130513333332 ns/iter, ok
test tests::sequential_thread_local ... 10.2098582 ns/iter, ok

Intuitively, the fact that all parallel versions take the same time on your machine makes me suspect that there's an issue with parallelization. Can you cross-check in your system monitor that all CPU cores actually get used during your parallel runs?


Regarding RNGs, I am not comfortable with precomputing all data in advance because it...

  1. Limits maximal benchmark size (too much and you exhaust your RAM).
  2. Quickly makes things bottlenecked by RAM bandwidth, which may hide other interesting effects.

We could probably agree on some common choice of RNG + seed though.

FYI, Rust's rand library uses a few rounds of the ChaCha block cipher as its default PRNG for security reasons. I didn't bother changing it at the time since it was fast enough not to hide the effect that I was interested in and speed of development was important.

But since you're interested in a more easily reproducible configuration, I updated the latest master to use xoshiro128+ with the initial seed [0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0, 0x0f, 0xed, 0xcb, 0xa9, 0x87, 0x56, 0x43, 0x21]. Each thread makes a clone of the global RNG then makes it jump() during its initialization phase.

Note that this is not enough to make execution reproducible between e.g. sequential and N-thread runs, only across runs with the same number of threads. That should be enough though: with 300M iterations, this benchmark has more than enough statistics for the exact choice of seed not to have a significant influence on the results...
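For reference, here is a minimal sketch of that per-thread RNG setup, assuming the rand_xoshiro crate (the benchmark's actual code may be organized differently):

    // Sketch of the per-thread RNG scheme described above; the crate and
    // helper names are assumptions, not necessarily the benchmark's code.
    use rand_xoshiro::rand_core::SeedableRng;
    use rand_xoshiro::Xoshiro128Plus;

    const SEED: [u8; 16] = [
        0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0,
        0x0f, 0xed, 0xcb, 0xa9, 0x87, 0x56, 0x43, 0x21,
    ];

    fn make_thread_rng(thread_idx: u64) -> Xoshiro128Plus {
        // Every thread starts from the same global seed...
        let mut rng = Xoshiro128Plus::from_seed(SEED);
        // ...then jumps ahead by thread_idx * 2^64 draws, so the per-thread
        // streams cannot overlap as long as each thread draws fewer than
        // 2^64 numbers.
        for _ in 0..thread_idx {
            rng.jump();
        }
        rng
    }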

HadrienG2 commented 4 years ago

Also, bear in mind that this benchmark was built to study and compare the overhead of various concurrent histogramming strategies, and therefore puts a somewhat unfair amount of worst-case emphasis on said overhead as opposed to trying to be a fair representation of how people fill histograms in practice.

mratsim commented 4 years ago

What is strange is that I see multiple threads during the sequential execution but they seem idle during the parallel execution:

During sequential execution: [screenshot]

Transitioning to "parallel": [screenshot]

[screenshot]

I don't really mind the non-reproducibility as long as it gives me a ballpark order of magnitude for the running time and the speedup of parallel over sequential.

In any case, the good news is that I ended up finding and porting another 2D histogram benchmark to my runtime (and the bad news is that it's a pathological case for it).

Anyway feel free to close the issue, unfortunately I don't think I'll have the time to relearn Rust and then understand why it behaves this way on my machine.

HadrienG2 commented 4 years ago

Aha, I see what your problem is. You ran cargo test --release -- --nocapture --test-threads=36, whereas you should use --test-threads=1 here. It's totally my fault though, I should really stop misusing Rust's built-in test framework as a microbenchmarking framework.

You see, the thing is that Rust provides a nice built-in test runner, and I find it great to be able to just annotate a bunch of functions as #[test] and get them all to run automagically via a simple cargo test. But although a benchmark-oriented cousin exists (unsurprisingly called #[bench] and invoked via cargo bench), it has not been stabilized yet because it has some design issues that the Rust team would like to take care of first.

So what I do in the meantime is to abuse the test runner as a benchmark runner: each benchmark is written as a #[test] that measures and prints its own timings (hence --nocapture), and the whole suite is run with --test-threads=1.

This latter flag does not prevent an individual "test" from spawning multiple threads internally, and that's exactly what happens here when running with test-threads=1, as intended.
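In case it helps, the pattern boils down to something like the following sketch (the timing helper and test body are illustrative, not the repository's actual code):

    use std::time::Instant;

    /// Times `body` over `iters` iterations and prints ns/iter.
    /// The output is only visible when running with `--nocapture`.
    fn time_it(name: &str, iters: u64, mut body: impl FnMut()) {
        let start = Instant::now();
        for _ in 0..iters {
            body();
        }
        let ns_per_iter = start.elapsed().as_nanos() as f64 / iters as f64;
        println!("{} ... {} ns/iter", name, ns_per_iter);
    }

    #[test]
    fn sequential_raw() {
        let mut bins = vec![0u64; 1000];
        time_it("sequential_raw", 10_000_000, || {
            bins[7] += 1; // stand-in for "roll the RNG, increment the right bin"
        });
    }

Running this with cargo test --release -- --nocapture --test-threads=1 prints one timing per "test", with no other "test" running concurrently to pollute the measurement.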

Anyhow, good luck with your project, reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;)

I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!

mratsim commented 4 years ago

Anyhow, good luck with your project, reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;)

I wanted to add some special support for them but I guess I'll just do for loop + critical sections/atomics ;)

I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!

You're in for a treat then: for both task parallelism and data parallelism, my library actually has significantly less overhead and similar or better load distribution than:

Unfortunately, I didn't find any task-parallel Rust library to compare with.

For the scheduler-overhead-focused benchmarks in my bench suite (https://github.com/mratsim/weave/tree/master/benchmarks), the difference can be very significant:

It can also deal with properly nested loops, not just OpenMP collapsed loops, which was the main reason I started this runtime: for the linear algebra/deep learning compiler I'm planning, I want nested loop parallelism to simplify, for example, batched matrix multiplication implementations.

Example on a 2D tiled matrix transposition algorithm (you can embed pure C in Nim): https://github.com/mratsim/weave/tree/333375df1e268f42207bc88a4881157de3af054e/benchmarks/matrix_transposition

Perf numbers (the bench is quite unstable due to the matrices being small, but it gives a ±20% ballpark approximation): [screenshot]

HadrienG2 commented 4 years ago

Can you give me a sample build command for the purpose of experimenting with this further?

I've never built a Nim program before, and did not find anything looking like a build system in the weave repo so I just downloaded the latest Nim nightly (as stable wouldn't build your benchmarks), looked at nim --help and tried to build with every flag which sounded like it could possibly have an effect on performance (nim compile --opt:speed --stackTrace:off --lineTrace:off --threads:on --checks:off --assertions:off -d:release weave_fib.nim). But I'm not sure if that's the right way to go.

So I just wanted to check with you what's the recommended way to build a Nim program in an "as fast as possible" configuration.

HadrienG2 commented 4 years ago

In any case, here are quick and dirty Rayon ports of your simplest benchmarks. On my machine, they exhibit performance that's roughly on par with the Nim version (compiled with either the bunch of flags above or the --threads:on -d:danger combo, which I've discovered more recently and which performs equally well while being shorter).

Which, honestly, is already a pretty impressive achievement from your library, given the design constraints that you're operating under.

mratsim commented 4 years ago

Nice that you figured this out; yes, --threads:on and -d:danger is the right combination.
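Putting those flags together, the build command looks something like this (using the weave_fib.nim file mentioned above; nim c is the short form of nim compile):

    nim c --threads:on -d:danger weave_fib.nim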

Note that I think in your fib benchmark you are spawning an extra thread compared to my bench if I read this correctly: https://github.com/HadrienG2/weave-parallel-benchmarks-rs/blob/004a559970bf74ebd22f69fd0abe101f85765c0c/src/bin/fib.rs#L5

Here are the build command line arguments for Weave:

HadrienG2 commented 4 years ago

So, actually that was a benchmark bug, in the sense that I was inadvertently cheating by using a Rayon construct that's idiomatic for users of that library, but less general-purpose than the Weave spawn construct and therefore more amenable to optimization by the Rayon implementors.

The latest master provides a clearer choice between "use the idiomatic Rayon tool for the job" and "closely match what the Nim version of the benchmark does". It does the latter by default (let's play fair), and does the former when built with --features idiomatic.

You will notice that the fib benchmark performs much worse in the non-idiomatic configuration, because Rayon's scope() construct could really use some optimization love by the implementors. IIUC, as currently implemented, it forces migration of the task inside the scope to a different thread on the thread pool, and uses one heap allocation per spawned task, both of which are unpleasantly costly. The reason why it hasn't been optimized more, as far as I know, is that most algorithms are actually well covered by join()-style binary fork join...
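To make the contrast concrete, here is a rough sketch of the two styles for the fib benchmark (illustrative code, not the repository's exact benchmarks): join() is the idiomatic Rayon binary fork-join, while scope()/spawn() more closely mirrors Weave's spawn-based API.

    fn fib_join(n: u64) -> u64 {
        if n < 2 {
            return n;
        }
        // Idiomatic Rayon: binary fork-join, no per-task heap allocation.
        let (a, b) = rayon::join(|| fib_join(n - 1), || fib_join(n - 2));
        a + b
    }

    fn fib_scope(n: u64) -> u64 {
        if n < 2 {
            return n;
        }
        let (mut a, mut b) = (0, 0);
        rayon::scope(|s| {
            // Each spawn() heap-allocates a task that may migrate to another
            // worker thread, which is where the extra overhead comes from.
            s.spawn(|_| a = fib_scope(n - 1));
            b = fib_scope(n - 2);
        });
        a + b
    }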

HadrienG2 commented 4 years ago

...and after further reflection, the non-idiomatic version of dfs also needs to become even more ugly (from a Rust + rayon user's pov of course) before it becomes a fair imitation of what the Weave version is doing. Sigh.

Note that the Rayon team has plans to integrate support for Rust futures in the longer term, which could be more directly comparable with Weave's future-based design (and hopefully will motivate the Rayon team to improve scope's performance ;)).

mratsim commented 4 years ago

Well, my goal with those benchmarks is ultimately to produce a tool that helps me write fast code; as long as there is a way to express the computation and you don't have to bend too much compared to something natural, I think it's OK in the end.

In those benches though, the goal was to compare scheduler overhead, so the benchmarks need to spawn a couple trillion tasks to be meaningful ;).

Now I think it's fine to have both implementations side by side. In my own library, reductions are unfortunately broken, so I would probably fare badly with a map-reduce implementation. (But I have ways around that: OpenMP reductions are also a bit inflexible, so I used a plain parallel for when I was still hoping to salvage OpenMP - https://github.com/numforge/laser/blob/master/examples/ex05_tensor_parallel_reduction.nim)

The pattern for both fib and dfs is pretty regular though; I think nqueens might be a bit harder without futures.

HadrienG2 commented 4 years ago

In those benches though, the goal was to compare scheduler overhead, so the benchmarks need to spawn a couple trillion tasks to be meaningful ;).

At the risk of sounding pedantic even though I understand what you mean (and largely agree!), the word "task" can have two meanings in modern parallel runtimes:

  1. The elementary unit of independent work that is specified by the user, which can be run in parallel with others.
  2. The concrete work-item that ends up being shuffled around in the runtime's scheduling queues.

The former is an API-visible concern for the runtime's users; the latter is a runtime implementor concern that should remain hidden from the API and only emerge when the runtime makes bad scheduling decisions that force the user to exert some manual control over it.

A very fertile ground for parallel runtime ergonomics and optimization over the years has been to realize that these two kinds of work items do not need to have the same granularity:

  1. In the OpenMP era, a first generation of clever runtime implementors realized that they did not need to push one work-item in their scheduling queues per iteration of a "parallel for" loop, but could instead split the iteration space into chunks of larger granularity, only as far as demanded by the machine's level of parallelism.
  2. Later on, in the TBB era, a second generation of even more clever runtime implementors realized that work splitting did not have to be a static compile-time or configuration decision, but could instead be something that the runtime does dynamically as demanded by scheduling circumstances, core processing ability and load imbalance.
  3. I wonder if there are runtimes in the wild that also try to merge back work items for the sake of reducing pressure on scheduling queues when there are too many tasks in flight? From reading through either Weave's description or the Picasso RFC, I got the impression that this was something that you were interested in doing or already experimenting with.

Runtimes with dynamic work splitting make more complex dynamic scheduling decisions, and therefore are intrinsically slower (assuming an equal degree of optimization) in the ideal case of a homogeneous "parallel for" iteration workload whose execution schedule can be easily decided at compile time. But they make up for it by handling more complex scenarios like quicksort or heterogeneous hardware efficiently without requiring manual user intervention like specifications of work granularity.
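To make the two generations concrete, here is a minimal Rayon sketch (illustrative only): the first function emulates OpenMP-style static chunking with one chunk per core, the second lets the runtime split the range on demand as idle workers steal work.

    use rayon::prelude::*;

    fn sum_static(data: &[u64], num_cores: usize) -> u64 {
        // "OpenMP era": split the iteration space up front into one chunk
        // per core, so only `num_cores` work items hit the scheduling queues.
        let chunk = ((data.len() + num_cores - 1) / num_cores).max(1);
        data.par_chunks(chunk).map(|c| c.iter().sum::<u64>()).sum()
    }

    fn sum_dynamic(data: &[u64]) -> u64 {
        // "TBB era": the runtime splits the range dynamically, as demanded
        // by load imbalance (this is what Rayon does by default).
        data.par_iter().sum()
    }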

Therefore, although I agree with you that scheduling overhead can and should be benchmarked at the granularity of individual work-item handling in scheduling queues, it's important to keep in mind that such a benchmark is not 100% fair to the cleverness and flexibility of modern dynamic runtimes (which cannot win against an equally well-optimized static runtime with dumber and faster scheduling decisions, only get close), and that we also need benchmarks which stress a runtime's ability to make good work splitting and merging decisions.

And this is why I thought it was interesting to have both versions of the benchmark. One which is a pure stress test for the runtime's scheduling queues (and therefore tests the runtime's low-level scheduling queue optimizations), and one which uses the native runtime API in a way which is both more convenient to the user and a bit less stressful for the runtime's queues (and therefore tests the runtime's high-level scheduling logic and "exposing the right API" optimizations).

Anyway, enough philosophical babbling about APIs and dynamic scheduling, let's try to add an nqueens port to my repo ;)

mratsim commented 4 years ago

In those benches though, the goal was to compare scheduler overhead, so the benchmarks need to spawn a couple trillion tasks to be meaningful ;).

At the risk of sounding pedantic even though I understand what you mean (and largely agree!), the word "task" can have two meanings in modern parallel runtimes:

1. The elementary unit of independent work that is _specified_ by the user, which _can_ be run in parallel with others.

2. The concrete work-item that ends up being shuffled around in the runtime's scheduling queues.

The former is an API-visible concern for the runtime's users; the latter is a runtime implementor concern that should remain hidden from the API and only emerge when the runtime makes bad scheduling decisions that force the user to exert some manual control over it.

A very fertile ground for parallel runtime ergonomics and optimization over the years has been to realize that these two kinds of work items do not need to have the same granularity:

1. In the OpenMP era, a first generation of clever runtime implementors realized that they did not need to push one work-item in their scheduling queues per iteration of a "parallel for" loop, but could instead split the iteration space into chunks of larger granularity, only as far as demanded by the machine's level of parallelism.

2. Later on, in the TBB era, a second generation of even more clever runtime implementors realized that work splitting did not have to be a static compile-time or configuration decision, but could instead be something that the runtime does dynamically as demanded by scheduling circumstances, core processing ability and load imbalance.

Agreed

3. I wonder if there are runtimes in the wild that also try to _merge back_ work items for the sake of reducing pressure on scheduling queues when there are too many tasks in flight? From reading through either Weave's description or the Picasso RFC, I got the impression that this was something that you were interested in doing or already experimenting with.

Actually, Weave doesn't merge back parallel loops; what it does is shown here: https://github.com/mratsim/weave/blob/333375df1e268f42207bc88a4881157de3af054e/weave/parallel_for.nim#L23-L47

template parallelForWrapper(
    idx: untyped{ident},
    prologue, loopBody, epilogue,
    remoteAccum, resultTy,
    returnStmt: untyped): untyped =
  ## To be called within a loop task
  ## Gets the loop bounds and iterate the over them
  ## Also poll steal requests in-between iterations
  ##
  ## Loop prologue, epilogue,
  ## remoteAccum, resultTy and returnStmt
  ## are unused

  block:
    let this = myTask()
    ascertain: this.isLoop
    ascertain: this.start == this.cur

    var idx {.inject.} = this.start
    this.cur += this.stride
    while idx < this.stop:
      loopBody
      idx += this.stride
      this.cur += this.stride
      loadBalance(Weave)

After each loop iteration, the runtime checks for incoming steal requests (an unfortunate overhead of using message passing: the workers that have actual work also have to do the load balancing, whereas in classic Chase-Lev deque work-stealing it's the thieves that have nothing to do that incur this overhead). The this.stop bound is dynamic: if there are incoming requests, the worker splits the remaining iterations to satisfy all of them and sends the pieces to the thieves (a very nice side benefit of using message passing; lazy loop splitting would require synchronization between workers and thieves if the loop were the last task of a Chase-Lev deque).
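Conceptually, the splitting behaves roughly like the Rust sketch below (types and names are made up for illustration; Weave's real Nim implementation differs):

    use std::sync::mpsc::Sender;

    struct LoopTask {
        cur: usize,
        stop: usize,
        stride: usize,
    }

    struct StealRequest {
        reply: Sender<LoopTask>, // channel back to the thief
    }

    fn run_loop(task: &mut LoopTask, pending: &mut Vec<StealRequest>, body: impl Fn(usize)) {
        let mut idx = task.cur;
        while idx < task.stop {
            body(idx);
            idx += task.stride;
            task.cur = idx;

            // "loadBalance": answer thieves by splitting off part of the
            // *remaining* iteration range and sending it to them.
            while let Some(req) = pending.pop() {
                let remaining = task.stop.saturating_sub(task.cur) / task.stride;
                if remaining < 2 {
                    break; // not enough work left to be worth splitting
                }
                let split = task.stop - (remaining / 2) * task.stride;
                let _ = req.reply.send(LoopTask { cur: split, stop: task.stop, stride: task.stride });
                task.stop = split; // the victim keeps only the front part
            }
        }
    }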

There are many benefits to this: it makes writing generic algorithms like map much easier, whereas static scheduling needs inflexible heuristics about when a loop should or should not be split. This is shown in these OpenMP-for-PyTorch benchmarks https://github.com/zy97140/omp-benchmark-for-pytorch and issue https://github.com/pytorch/pytorch/issues/3146 which, in summary, tell us for 3 classes of processors from how many tensor elements it becomes worth parallelizing:

| CPU Model | Sockets | Cores/Socket | Frequency |
| --- | --- | --- | --- |
| Intel(R) Xeon(R) CPU E5-2699 v4 | 2 | 22 | 2.20GHz |
| Intel(R) Xeon(R) Platinum 8180 CPU | 2 | 28 | 2.50GHz |
| Intel(R) Core(TM) i7-5960X CPU | 1 | 8 | 3.00GHz |

Tested operations:

copy add div sin exp sum prod

For contiguous tensor operations:

| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
| --- | --- | --- | --- |
| copy | 80k | 20k | 8k |
| add | 80k | 20k | 8k |
| div | 50k | 10k | 2k |
| exp | 1k | 1k | 1k |
| sin | 1k | 1k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |

For discontiguous tensor operations:

| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
| --- | --- | --- | --- |
| copy | 20k | 8k | 2k |
| add | 20k | 8k | 2k |
| div | 10k | 8k | 1k |
| exp | 1k | 1k | 1k |
| sin | 2k | 2k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |

This leads to optimizing for the lowest common denominator (say a 4~8 core CPU) instead of properly using the hardware. My own low-level library, which I'm building as a linear-algebra/HPC/deep-learning backend, struggles with the same thing:

const OMP_MEMORY_BOUND_GRAIN_SIZE*{.intdefine.} = 1024
  ## This is the minimum amount of work per physical cores
  ## for memory-bound processing.
  ## - "copy" and "addition" are considered memory-bound
  ## - "float division" can be considered 2x~4x more complex
  ##   and should be scaled down accordingly
  ## - "exp" and "sin" operations are compute-bound and
  ##   there is a perf boost even when processing
  ##   only 1000 items on 28 cores
  ##
  ## Launching 2 threads per core (HyperThreading) is probably desirable:
  ##   - https://medium.com/data-design/destroying-the-myth-of-number-of-threads-number-of-physical-cores-762ad3919880
  ##
  ## Raising the following parameters can have the following impact:
  ##   - number of sockets: higher, more overhead per memory fetch
  ##   - number of memory channel: lower, less overhead per memory fetch
  ##   - RAM speed: lower, less overhead per memory fetch
  ##   - Private L2 cache: higher, feed more data per CPU
  ##   - Hyperthreading and cache associativity
  ##   - Cores, shared L3 cache: Memory contention
  ##
  ## Note that setting num_threads manually might impact performance negatively:
  ##   - http://studio.myrian.fr/openmp-et-num_threads/
  ##     > 2x2ms overhead when changing num_threads from 16->6->16

const OMP_NON_CONTIGUOUS_SCALE_FACTOR*{.intdefine.} = 4
  ## Due to striding computation, we can use a lower grainsize
  ## for non-contiguous tensors
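As a rough illustration of how such a grain-size constant gets consumed (hypothetical helper, not Laser's actual code): the element count, grain size and core count together bound how many chunks are worth creating. For example, with a grain size of 1024, a 20_000-element memory-bound operation yields at most 19 chunks, so it cannot even saturate a 28-core socket.

    // Hypothetical sketch: derive a useful chunk count from a grain size.
    // Assumes num_cores >= 1; returning 1 means "don't parallelize at all".
    fn num_chunks(elements: usize, grain_size: usize, num_cores: usize) -> usize {
        // Never create chunks smaller than the grain size, and never more
        // chunks than there are cores to run them.
        let max_useful = elements / grain_size.max(1);
        max_useful.clamp(1, num_cores)
    }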

Unfortunately, I'm also pretty sure that Intel TBB suffers from the same curse of "how to choose a grain size": the splitting is eager and there is no feedback loop that would inform the runtime "is it worth parallelizing if I have 1000 exponentiations?", "what if it's 1000 cos?". So it's up to the developer to figure out the grain size, and you get issues that cannot be solved: https://github.com/pytorch/pytorch/issues/3146

Now, maybe re-merging would be interesting but I think not splitting in the first place is best ;).

Also, lazy loop splitting composes much better with nested parallelism, where parallelism is found at a higher level like another for loop (batched matrix multiplication) or graph-level parallelism for neural networks with split code paths, see https://github.com/pytorch/glow/issues/1749, https://github.com/cpp-taskflow/cpp-taskflow/issues/97.

Dealing with nested parallelism is actually one of the main Julia focuses with PARTR, which uses something relatively unknown: a parallel depth-first scheduler instead of a work-stealing scheduler. Instead of spawning tasks from the root of the task tree (assuming spawn/sync semantics), it goes depth-first to reuse the hard parallelization work done by the people behind BLAS or FFTW. Actually, they went one step beyond that and made the FFTW threadpool backend modular https://github.com/FFTW/fftw3/issues/175 and are in the process of doing the same for OpenBLAS https://github.com/xianyi/OpenBLAS/pull/2255. So this is also an opportunity for Rayon ;)


In short, I think that in today's hardware era, a static scheduler cannot keep up, even for simple for loops.


If you like reading about runtimes, I have a treasure chest of papers there: https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md

For dealing with overheads, I have laid out my techniques for multithreaded memory management here:

mratsim commented 4 years ago

Anyhow, good luck with your project, reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;)

I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!

Some good news on my side: I finally understood why I had very low performance on parallel reductions in https://github.com/mratsim/weave/pull/83. It was because my "loadBalance" routine didn't split parallel loop tasks, meaning I had sequential speed plus the overhead of steal requests that couldn't be satisfied. Following the fix, Weave is now faster than OpenMP on my used-to-be-pathological histogram benchmark.

The last frontier is implementing a matrix multiplication that is as fast as Intel MKL GEMM. On my machine, with a max throughput of 4TFlops, for a 1920x1920 by 1920x1920 matrix multiplication

mratsim commented 4 years ago

I released v0.3.0 of Weave with several fixes and improvements on nested data parallelism:

On Matrix Multiplication on my 18-core machine with a single-threaded baseline of:

I get the following speedup with Weave (backoff means the worker goes to sleep on steal failure, not just pause::memory).

The main problem with a message-passing scheduler is that for coarse-grained tasks, the thief needs the cooperation of the victim, which might be busy in a heavy computation kernel. So even in a worst-case scenario, with a relatively long inner matmul kernel during which the victim cannot reply to thieves and lots of synchronization due to double or triple nested parallelism, message-passing based runtimes can scale.

I tried looking into pure Rust matrix multiplication as well but: