JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

reduce overhead of task creation and switching #33762

Open mratsim opened 4 years ago

mratsim commented 4 years ago

While researching multithreading runtimes, I've been looking with interest into Julia's new multithreading backend, PARTR.

However, on a simple Fibonacci example, the runtime does not seem to distribute tasks to other threads, unlike other runtimes such as Clang OpenMP or Intel TBB.

This is clearly visible in htop.

Source:

import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    x = @spawn fib(n - 1)
    y = fib(n - 2)
    return fetch(x) + y
end

function main()
  n = parse(Int, ARGS[1])
  println(fib(n))
end

main()

(screenshot: htop showing thread utilization)

Unfortunately I wasn't able to check whether the issue lies in partr itself, as the proof-of-concept partr code by @kpamnany segfaults on fib(16) (https://github.com/kpamnany/partr/issues/1), or whether it is introduced by the Julia runtime (GC, JIT, ...).

Here is the Clang OpenMP version:

#include <stdio.h>
#include <stdlib.h>   // for atoi and exit
#include <omp.h>

// Compile with "clang -O3 -fopenmp omp_fib.c"

long long fib(long long n)
{
  long long i, j;
  if (n < 2)    // Note: be sure to compare n<2 -> return n
    return n;   //       instead of n<=2 -> return 1

  #pragma omp task shared(i) firstprivate(n)
  {
    i = fib(n - 1);
  }

  j = fib(n - 2);
  #pragma omp taskwait
  return i + j;
}

int main(int argc, char *argv[])
{
  if (argc != 2) {
    printf("Usage: fib <n-th fibonacci number requested>\n");
    exit(1);
  }

  long long n = atoi(argv[1]);

  #pragma omp parallel shared(n)
  {
    #pragma omp single
    printf("fib(%lld) = %lld\n", n, fib(n));
  }
}

An Intel TBB version based on a class

An Intel TBB version based on closures

And the perf results:

(screenshot: benchmark timing results)

Naive Fibonacci performs O(2^n) additions (more precisely fib(n+1) - 1), each of which should take only about 1/CpuFreq seconds (roughly 0.3 ns on a 3 GHz CPU), so this is essentially a benchmark of framework overhead.

It shows that the current overhead is about 250x that of the established C/C++ runtimes.

Note that this is a very skewed benchmark. I didn't check whether partr shows the same issue for reasonable task lengths like 1000~10000 CPU cycles (0.3 to 3 µs on a 3 GHz CPU).

Also, I intentionally did not benchmark the OpenMP test with GCC, because GCC's implementation of OpenMP is completely unsuited for fine-grained task parallelism: it uses a single global task queue with no load balancing, and it cannot complete fibonacci(40) in a reasonable time (under 10 minutes).

In summary, there are 2 issues that may or may not stem from the same root cause:

1. Tasks do not seem to be distributed to other threads.
2. Task creation/switching overhead is very high compared to established runtimes.

A better benchmark to understand where Julia stands with regard to other runtimes would be the Unbalanced Tree Search (UTS) benchmark, which is the de facto standard in multithreading runtime evaluation:

Lastly, with the growing number of cores, developers will have to find more and more parallelism opportunities. A runtime with low overhead that can efficiently schedule tasks that take very little time would enable many more of them; otherwise, recent processors with 30+ cores and 60+ hardware threads will not be used to their full extent.

Keno commented 4 years ago

Did you set JULIA_NUM_THREADS? It defaults to 1.

kpamnany commented 4 years ago

Even if your numbers aren't quite accurate, it is true that we haven't done a whole lot of optimization on the runtime overheads at this time; it's a known TODO. I agree that task spawn time can be a bottleneck in some codes. We'll get to it. Probably faster when there's a real code for which it is an issue. :) Regular parallel loops are a greater concern at the moment.

mratsim commented 4 years ago

JULIA_NUM_THREADS

That was it. After setting it to 36 (18-core i9-9980XE), I see that the tasks are properly distributed, so issue 1 is solved.

To be honest, I just copy-pasted the code from the intro of this blog post (https://julialang.org/blog/2019/07/multithreading) and missed the part about JULIA_NUM_THREADS.

The overhead is still quite high: (screenshot: updated benchmark timings)

However, if there is already a TODO that tracks this, I'm fine with having this issue closed to avoid cluttering the tracker.

Regarding real code and the priority on parallel for loops, I completely agree.

In terms of concrete workloads that may be impacted, I expect Monte-Carlo Tree Search for reinforcement learning to be affected. For example, in the game of go, depending on the heaviness of the heuristics, engines simulate 1,000 to 20,000 games; while that number is stable throughout the lifetime of the application, it is very stressful for the garbage collector if there is no pooling scheme.

mratsim commented 4 years ago

Research on Parallel Depth-First overhead

I dug around a bit and found this interesting paper and presentation, "Space Efficient Global Scheduler for Cilk" by Hickey and Quentmeyer:

They revisited Cilk, compared it against Narlikar and Blelloch's Parallel Depth-First scheduler, and implemented a hybrid WS-PDF scheduler. The main insight is that Parallel Depth-First scheduling as implemented by Narlikar is 400x slower than work-stealing on some workloads due to overhead. They propose a hybrid solution to overcome that.

(screenshot: slide from the presentation)

The 3 issues highlighted are poor thread locality, global data structures, and bookkeeping overhead.

(screenshot: slide from the presentation)

Now AFAIK in Julia: