Open: mratsim opened this issue 5 years ago
Thanks for reporting! However, I don't think that the trivial Fibonacci implementation is a good benchmark for any runtime system... what is your real use case?
It's a good benchmark to measure the overhead of runtimes.
I'm currently building my own runtime for the Nim programming language with a focus on very fine-grained parallelism. My goal is to expose parallelism to users even for tasks on the order of 1000 cycles, because as we get more cores the minimum grain size becomes an issue.
Given that each fibonacci task performs essentially a single addition, it's effectively a pure overhead benchmark to compare runtimes on.
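To make the benchmark concrete, here is a minimal sketch of the pattern being measured; std::async is only a stand-in for whatever spawn primitive the runtime under test provides (the HPX quickstart examples follow essentially the same shape with hpx::async and hpx::future):

```cpp
// Naive task-parallel Fibonacci: one task spawned per recursive call, no
// grain-size control, so almost all of the runtime is scheduling overhead.
// std::async is only a stand-in here for a runtime's spawn primitive.
#include <cstdint>
#include <future>
#include <iostream>

std::uint64_t fib(std::uint64_t n) {
    if (n < 2)
        return n;
    // Spawn one branch as a task, compute the other in the current task.
    auto child = std::async(std::launch::async, fib, n - 1);
    std::uint64_t other = fib(n - 2);
    return child.get() + other;
}

int main() {
    // Keep n small here: std::launch::async creates a real OS thread per task,
    // which is exactly the kind of per-task cost a tasking runtime must avoid.
    std::cout << fib(10) << '\n';
    return 0;
}
```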
For reference, on fibonacci(40) on my machine, the lowest overheads among the lesser-known C and C++ runtimes come from Staccato and fibril; both achieve them by paying close attention to cache and memory locality.
On my own WIP runtime I also achieve 180 ms at the moment. However, my runtime is message/channel based instead of atomics/shared-memory based, and I want portability across OSes and CPU architectures, so the main optimization in fibril is not applicable: it relies on both Linux and the x86-64 calling convention (though it may be emulable via setjmp/longjmp).
My runtime is being developed here: https://github.com/mratsim/weave
As HPX / ParalleX is, according to the documentation, "message-driven", I was interested in how it fares on the fibonacci overhead front.
As HPX launches a new 'thread' for each invocation of the simple Fibonacci implementation (this includes a stack allocation and a full future instantiation), it will compare very badly against anything that does not do that. The overhead of one task/thread in HPX is about 1 microsecond per task; you can measure that easily using the future_overhead benchmark (https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/future_overhead.cpp).
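For a rough idea of what such a measurement looks like, here is a hedged sketch of the general approach (not the actual future_overhead benchmark, which uses HPX futures): spawn a large number of empty tasks one after another and divide the elapsed time by the task count.

```cpp
// Sketch of a per-task overhead measurement: spawn-and-join many empty tasks
// and report the average cost per task. With std::async this mostly measures
// OS thread creation; a tasking runtime with lightweight threads should be
// orders of magnitude cheaper, which is what a benchmark like this quantifies.
#include <chrono>
#include <cstddef>
#include <future>
#include <iostream>

int main() {
    constexpr std::size_t num_tasks = 10000;

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < num_tasks; ++i) {
        auto f = std::async(std::launch::async, [] { /* empty task */ });
        f.get();
    }
    auto elapsed = std::chrono::steady_clock::now() - start;

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    std::cout << "average overhead per task: "
              << ns / static_cast<long long>(num_tasks) << " ns\n";
    return 0;
}
```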
BTW, if all you do is to run fib(40) sequentially it will most likely not take longer than a couple of milliseconds - so really what's the point...
The point is to get an idea of the worst-case scheduling overhead for tasks that are too fine-grained for the runtime system.
For some algorithms, especially recursive ones such as tree algorithms, a user may not, or cannot, size the tasks into sufficiently large chunks to amortize the runtime overhead: by the time it is known that we are at a leaf computation too small to be scheduled efficiently by a traditional runtime, it is too late.
This is exacerbated by the growing number of cores available in current systems, which requires developers to create more tasks. A runtime with low overhead greatly assists with that by allowing developers to confidently create new tasks and parallelism opportunities, knowing that the runtime will handle them efficiently instead of choking.
Now, optimizing for such scenarios might not be one of HPX's priorities, just as GCC's OpenMP also cannot handle fibonacci. But given how many of the quickstart examples are based on naive fibonacci, I think it would be valuable to address that overhead, or at least mention in the docs that fibonacci is an unrealistic workload and that HPX recommends a minimum task size of x cycles to amortize scheduling overhead (for example, TBB mentions 10000 cycles).
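To illustrate what such a recommendation amounts to in user code, the usual mitigation is a grain-size cutoff: below some threshold the recursion stays sequential, so every spawned task carries enough work to amortize the scheduling overhead. A sketch, again with std::async standing in for the runtime's spawn primitive and an arbitrary threshold rather than any number HPX recommends:

```cpp
// Fibonacci with a grain-size cutoff: only the coarse top of the recursion
// tree is turned into tasks; the fine-grained leaves run sequentially.
#include <cstdint>
#include <future>
#include <iostream>

std::uint64_t fib_seq(std::uint64_t n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Arbitrary, illustrative threshold; real code would tune it so that a task's
// work comfortably exceeds the runtime's per-task overhead.
constexpr std::uint64_t cutoff = 20;

std::uint64_t fib_par(std::uint64_t n) {
    if (n < cutoff)
        return fib_seq(n);
    auto child = std::async(std::launch::async, fib_par, n - 1);
    return fib_par(n - 2) + child.get();
}

int main() {
    std::cout << fib_par(30) << '\n';  // only a few hundred tasks spawned instead of over a million
    return 0;
}
```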
@mratsim Sure, I understand where you're coming from. The main question that interests me in the context of this discussion however would be where other runtimes 'cut corners' by operating under well defined constraints that allow for optimizations. We usually can't apply those optimizations to HPX because of its general purpose nature (at least not in the general case). OpenMP (before V5) for instance can rely on the fact that no task will exit the scope it was created in (which is the case for your Fibonacci example). This alone allows for significant optimizations. There are other implicit assumptions that people rely on while implementing their runtimes that give some (possibly non-trivial) benefit.
All of this said - we'd be more than happy to accept pull requests for special case scheduler implementations that a user could explicitly ask for in certain, well defined situations.
Yes, of course. For example, Staccato currently requires declaring the computation as a task and also indicating at compile time how many tasks an async computation can spawn (for fibonacci this is 1 if we only fork(n-1), or 2 if we also fork(n-2)). This is used to great effect to enable a StackAllocator. Even more special-case: fibril requires the x86-64 calling convention plus assembly to implement resumable tasks from a cactus stack, and also mmap/madvise, which doesn't work on Windows (but can probably be worked around).
And obviously both assume shared-memory.
Now, regarding my own runtime, there are two key parts:
Batching stolen tasks so that the overhead is amortized over many tasks. This is covered in the thesis "Embracing Explicit Communication in Work-Stealing Runtime Systems" by Andreas Prell (@aprell). C implementation at https://github.com/aprell/tasking-2.0; the important part is an adaptive switch between stealing one task and stealing half of the victim's tasks: https://github.com/aprell/tasking-2.0/blob/303ce3e09a7c118241b9d5482219e3e3b848212e/src/runtime.c#L791-L828 (a simplified sketch follows after this list).
Avoiding allocations, either through a caching mechanism or by using alloca for nested futures until it is actually required to allocate them because they are about to escape their scope (for example, when they are stolen by another thread).
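As referenced in the first item, here is a much-simplified sketch of what an adaptive steal-one/steal-half switch can look like. It is not the heuristic from the linked runtime.c, and all names are made up; the point is only that a thief that exhausts its stolen batch immediately starts asking for half of the victim's queue, while a thief that keeps finding local work goes back to cheap single-task steals.

```cpp
// Illustrative sketch of adaptive work stealing: a worker switches between
// requesting one task and requesting half of the victim's queue, based on how
// quickly it exhausted the previously stolen batch. Names and the exact rule
// are made up for illustration; see aprell/tasking-2.0 for the real heuristic.
#include <cstddef>
#include <iostream>

enum class StealPolicy { One, Half };

struct Worker {
    StealPolicy policy = StealPolicy::One;
    std::size_t executed_since_last_steal = 0;

    // Called when the local queue runs dry and another steal is needed.
    void adapt_before_steal() {
        // Exhausting the last batch after at most one task means steals are
        // too fine-grained: grab half of the victim's queue next time.
        policy = (executed_since_last_steal <= 1) ? StealPolicy::Half
                                                  : StealPolicy::One;
        executed_since_last_steal = 0;
    }

    // How many tasks to take from a victim that has victim_size tasks queued.
    std::size_t steal_amount(std::size_t victim_size) const {
        if (victim_size == 0)
            return 0;
        return policy == StealPolicy::Half ? (victim_size + 1) / 2 : 1;
    }
};

int main() {
    Worker w;
    w.adapt_before_steal();  // ran out immediately: switch to steal-half
    std::cout << w.steal_amount(8) << " of the victim's 8 tasks would be stolen\n";
    return 0;
}
```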
I'm not aware of any other implicit tradeoffs in my own runtime, though the new scheduler from Intel/Julia (Partr) does raise some intriguing points at 12:30 of this presentation (https://youtu.be/YdiZa0Y3F3c?t=751) and in their forum (https://discourse.julialang.org/t/question-regarding-julias-planned-approach-to-composable-safe-and-easy-multithreading/17692/4).
Thanks for the links!
You're welcome. I actually have plenty more here, but I fear there is not enough time in a year to digest them all: https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md
I was re-reading an old comparison of multithreading runtimes from Sandia National Laboratories, a US Department of Energy national laboratory: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf
They had to exclude HPX (along with GCC OpenMP and Nemesis QThreads) in order to meaningfully analyze the other runtimes (Cilk, Intel OpenMP and Intel TBB). It would be nice to address that.
Let's work on improving these particular use cases in HPX first. Then I'd be more than happy to have those comparisons done again. That comparison was made a long time ago, I think we can do much better nowadays.
I've been measuring the overhead of various multithreading runtimes. HPX seems to have a couple of issues with a naive fibonacci.
Expected Behavior
Fibonacci examples complete in a reasonable time
Actual Behavior
HPX either segfaults or takes a very long time for no reason
Steps to Reproduce the Problem
Problem 1: Sudden spike in time required to compute fibonacci.
bin/fibonacci_one --n-value 22
bin/fibonacci_one --n-value 23
fibonacci_direct changes from 0.218 s to 8.263 s.
Problem 2: Segfaults
I managed to segfault the fibonacci binary with a value of 30 and I couldn't reclaim control of my console even with Ctrl+C; following that, the performance of everything was very degraded.
Problem 3: a rare segfault will make all subsequent calls take a very long time
Before a segfault on fibonacci 30:
After a segfault on fibonacci 30:
bin/fibonacci --n-value 23 takes 12 s.
Specifications