mratsim opened this issue 5 years ago
Did you set `JULIA_NUM_THREADS`? It defaults to 1.
Even if your numbers aren't quite accurate, it is true that we haven't done a whole lot of optimization on the runtime overheads at this time; it's a known TODO. I agree that task spawn time can be a bottleneck in some codes. We'll get to it. Probably faster when there's a real code for which it is an issue. :) Regular parallel loops are a greater concern at the moment.
`JULIA_NUM_THREADS`: that was it. After setting it to 36 (18-core i9-9980XE) I see that the tasks are properly distributed, so issue 1 is solved.
To be honest, I just copy-pasted the code from the intro of this blog post, https://julialang.org/blog/2019/07/multithreading, and missed the part about `JULIA_NUM_THREADS`.
The overhead is still quite high:
However, if there is already a TODO that tracks this, I'm fine with having this issue closed to avoid cluttering the tracker.
Regarding real code and prioritizing regular parallel loops, I completely agree.
As a concrete workload that may be impacted, I expect Monte Carlo Tree Search for reinforcement learning. For example, in the game of Go, depending on the heaviness of the heuristics, engines simulate 1,000 to 20,000 games; while that number is stable throughout the lifetime of the application, it is very stressful for the garbage collector if there is no pooling scheme.
I dug around a bit and found this interesting paper and presentation, "Space Efficient Global Scheduler for Cilk" by Hickey and Quentmeyer:
They revisit Cilk, compare it against Narlikar and Blelloch's Parallel Depth-First scheduler, and implement a hybrid WS-PDF scheduler. The main insight is that Parallel Depth-First scheduling as implemented by Narlikar is 400x slower than work-stealing on some workloads due to its overhead; they propose a hybrid solution to overcome that.
The three issues highlighted are bad thread locality, global data structures, and bookkeeping overhead.
Now AFAIK in Julia:
While researching multithreading runtimes, I've been looking with interest into Julia's new multithreading backend, PARTR.
However, on a simple Fibonacci example, it seems like the runtime doesn't distribute tasks to other threads, unlike runtimes such as Clang OpenMP or Intel TBB.
This is clearly visible in `htop`. Source: the fib example from the blog post linked above.
Unfortunately I wasn't able to check whether the issue lies in `partr` itself, as the proof-of-concept partr code by @kpamnany segfaults on `fib 16` (https://github.com/kpamnany/partr/issues/1), or whether it is introduced by the Julia runtime (GC, JIT, ...).

Here is the Clang OpenMP version:
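The original snippet is missing here; below is a minimal sketch of the classic task-based OpenMP fib, presumably close to what was posted (compile with `clang++ -fopenmp fib.cpp`):

```cpp
#include <cstdio>
#include <cstdlib>

// Each fib(n-1) call is spawned as an OpenMP task; fib(n-2) runs inline.
long long fib(int n) {
    if (n < 2) return n;
    long long x, y;
    #pragma omp task shared(x)   // shared(x): the child writes into the parent's frame
    x = fib(n - 1);
    y = fib(n - 2);
    #pragma omp taskwait         // wait for the spawned child before summing
    return x + y;
}

int main(int argc, char** argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 40;
    long long result = 0;
    #pragma omp parallel  // start the thread team
    #pragma omp single    // one thread seeds the task tree; the others steal tasks
    result = fib(n);
    printf("fib(%d) = %lld\n", n, result);
    return 0;
}
```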
An Intel TBB version based on a task class:
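This block is also missing; a sketch using TBB's classic low-level `tbb::task` class API (since deprecated in oneTBB) would look roughly like this:

```cpp
#include <cstdio>
#include <cstdlib>
#include <tbb/task.h>

// One tbb::task object per fib call, in the style of the old TBB tutorial.
class FibTask : public tbb::task {
public:
    const long n;
    long* const sum;
    FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}

    tbb::task* execute() override {
        if (n < 2) {
            *sum = n;
        } else {
            long x, y;
            FibTask& a = *new (allocate_child()) FibTask(n - 1, &x);
            FibTask& b = *new (allocate_child()) FibTask(n - 2, &y);
            set_ref_count(3);           // 2 children + 1 for the wait
            spawn(b);                   // run b asynchronously
            spawn_and_wait_for_all(a);  // run a, then wait for both
            *sum = x + y;
        }
        return nullptr;
    }
};

long parallel_fib(long n) {
    long sum = 0;
    FibTask& root = *new (tbb::task::allocate_root()) FibTask(n, &sum);
    tbb::task::spawn_root_and_wait(root);
    return sum;
}

int main(int argc, char** argv) {
    long n = (argc > 1) ? atol(argv[1]) : 40;
    printf("fib(%ld) = %ld\n", n, parallel_fib(n));
    return 0;
}
```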
An Intel TBB version based on closures:
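And a sketch of the closure style, using `tbb::task_group` with lambdas:

```cpp
#include <cstdio>
#include <cstdlib>
#include <tbb/task_group.h>

// Same recursion, but each child is a closure run on a tbb::task_group.
long fib(long n) {
    if (n < 2) return n;
    long x = 0, y = 0;
    tbb::task_group g;
    g.run([&] { x = fib(n - 1); });  // spawned as a task
    y = fib(n - 2);                  // computed inline on this thread
    g.wait();                        // join the spawned task
    return x + y;
}

int main(int argc, char** argv) {
    long n = (argc > 1) ? atol(argv[1]) : 40;
    printf("fib(%ld) = %ld\n", n, fib(n));
    return 0;
}
```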
And the perf results:
Given that fibonacci does roughly 2^n additions, each of which should take only 1/CpuFreq seconds (about 0.3 ns on a 3 GHz CPU), this is a benchmark of framework overhead. It shows that the current overhead is 250x compared to established C/C++ runtimes.
Note that this is a very skewed benchmark. I didn't check whether partr shows the same issue for more reasonable task lengths of 1000 to 10000 CPU cycles (0.3 to 3 µs on a 3 GHz CPU).
Also, I intentionally did not benchmark the OpenMP version with GCC: GCC's OpenMP implementation is completely unsuited for fine-grained task parallelism, as it uses a single lock-protected global task queue with no load balancing, and it cannot complete fibonacci(40) in a reasonable time (under 10 minutes).
In summary, there are 2 issues that may or may not stem from the same root cause:

1. Tasks do not get distributed to the other threads.
2. Task spawn overhead is about 250x that of established C/C++ runtimes.
A better benchmark for understanding where Julia stands with regard to other runtimes would be the Unbalanced Tree Search benchmark, which is the de facto standard in multithreading runtime evaluation:
Lastly, with the growing number of cores, developers will have to find more and more opportunities for parallelism. A low-overhead runtime that can efficiently schedule very short tasks would enable many more of those opportunities; otherwise, recent processors with 30+ cores and 60+ hardware threads will not be used to their full extent.