An approach less counter-Julian than calling out to C for everything might be to rejuvenate Krys's very interesting framework mentioned in this thread: https://groups.google.com/d/topic/julia-dev/PZfOv4h94lk/discussion
IIUC, he designed it explicitly to support multiple "backends," one of which might be kernel threads. @stevengj, any interest in looking into it? It would be a fantastic contribution.
I have no interest in CUDA, which I see as an architectural dead end. The hard thing about parallel programming is mainly dealing with memory. Shared memory is hard enough (one kind of memory + synchronization), and distributed-memory programming is harder (local + remote memory). The CUDA/OpenCL model requires the programmer to manually manage 4-5 kinds of memory (main memory, GPU memory, thread-block shared memory, thread-local memory, and remote memory in a cluster situation), which is not a reasonable way forward for software engineering except in special cases.
I think it would be totally reasonable to start by adding some shared-memory loop-parallelism with restricted parallel regions to the core language. That way we can start by just slapping locks on everything in the runtime, still allowing for speedups in simple vector loops. A possible alternative is to add a separate shared heap as in Racket's "places".
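As a sketch of what such restricted loop-parallelism could look like, here it is written with the `Threads.@threads` macro that Julia later gained (nothing like it existed when this comment was made; it is used purely to illustrate the shape of the proposal):

```julia
using Base.Threads

# Sketch of restricted loop-level shared-memory parallelism: iterations are
# independent, and the runtime splits the index range across a fixed thread
# pool. (Threads.@threads is a later Julia API, shown for illustration only.)
function saxpy!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @threads for i in eachindex(x, y)
        @inbounds y[i] += a * x[i]
    end
    return y
end

saxpy!(zeros(10^6), 2.0, rand(10^6))
```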
@stevengj, my point was that while Krys was interested in CUDA, I seem to remember he explicitly designed it to handle multiple backends, of which kernel threads was another candidate. The main point of his framework was a way to implement lazy evaluation in Julia. He got some nice results even with CPU execution by removing temporaries, e.g., from operations like a = b.*c + d + e, which in standard Julia requires at least two temporaries.
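The temporary-elimination win being described can be seen with the fused broadcasting that Julia later gained (it did not exist at the time of this thread); a minimal sketch:

```julia
# The unfused form allocates intermediate arrays; the fused form evaluates
# the whole expression in one loop with a single output allocation.
b, c, d, e = rand(10^6), rand(10^6), rand(10^6), rand(10^6)

a_unfused = b .* c + d + e     # temporaries for b .* c and (b .* c + d)
a_fused   = @. b * c + d + e   # one fused loop, no intermediate arrays
```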
@JeffBezanson , I like that idea a lot!
I like to think of the simple case of implementing sum over the columns of a matrix. Say you have a parallel vector sum as your primitive. The implementation of the matrix column sum itself will then also be written to be parallel, so we very quickly get instances of parallel functions calling other parallel functions. Then you have cases of fat and skinny matrices, or just small matrices that will not benefit from parallelization. We need a design that can, at some level, incorporate at least some of these things.
Something like the OpenMP or Cilk runtime has no problem with parallel functions calling parallel functions, nor with problems that are too small to parallelize.
The key thing is that you don't implement shared-memory parallelism by actually spawning threads to do work, you tell the runtime system that work is available and you let it decide what thread to assign (or whether to just do the work serially).
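To make that concrete, here is the column-sum case from above in that style, written with the task-spawning API (`Threads.@spawn`) that modern Julia eventually gained; the runtime, not the caller, decides which thread runs each task:

```julia
using Base.Threads

# Each column's sum is exposed to the runtime as a task. Nested parallel
# calls just feed more tasks into the same pool, and tiny tasks simply run
# serially on whichever thread picks them up.
function colsums(A::AbstractMatrix)
    tasks = [Threads.@spawn sum(view(A, :, j)) for j in 1:size(A, 2)]
    return map(fetch, tasks)
end

colsums(rand(1000, 8))
```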
This is largely a solved problem (unlike distributed-memory parallelism). You just need to make the existing solutions accessible in Julia.
(The experts on this are in Charles Leiserson's group at MIT; I would be happy to help you get in touch with them to bounce around ideas.)
Yes, I'm familiar with the concepts in Cilk. My biggest concerns are ABI issues and portability. Different compilers seem to have their own OpenMP implementations. Targeting libgomp is probably enough for non-Darwin Unices, but I'd rather not have to maintain multiple backends.
Yes this is certainly a solved problem in shared memory.
@JeffBezanson In the short run, it sounds like targeting libgomp will be a good idea (it runs fine on MacOS too in my experience, and is supposedly possible to get running on Windows). That will give you a good platform on which to get something working, shake the thread-safety bugs out, and experiment with syntax, while users will get the benefit of multithreaded operations (and hopefully easier parallelism in their own code too).
In the long run (1-2 years at this rate), LLVM will probably support OpenMP directly, and it sounds like they are talking about supporting multiple backends so that difficulty will be taken care of for you.
Is there a non-GPL alternative? There is the Blocks runtime in LLVM, but I suspect it only works well on OS X, and is unlikely to be as well debugged and understood as OpenMP is.
Not that I know of. The gcc Cilk runtime is also GPL, and there is another OpenMP implementation called OpenUH that also seems to be derived from gcc and hence GPL.
But in the long term, once LLVM supports OpenMP, it will probably target proprietary backends like Intel's too, so Julia users wanting to license a different backend will have options.
http://www.mathworks.in/support/solutions/en/data/1-4PG4AN/
It seems that The MathWorks is just using pthreads, and they have hardcoded a bunch of special cases. I am guessing that they disable parallelism if a function is already running in parallel mode. It would be nice if someone who knows more about Matlab's threading model could chime in, just so that we know what's going on.
Of course, I believe we will do something systematic and better. I just dread putting all these restrictions on size and shape and whatnot all over our libraries to decide when to use one core or multiple cores.
@ViralBShah: restrictions on size and shape are highly dependent on machine architecture, so why not just have "best guess" values that the end user can tune? E.g., run a dumb optimization search over simple matrix operations, matrix sizes, and numbers of threads. I know the OpenBLAS people were talking about doing this, because right now they have default thresholds that they acknowledge may not be optimal on any given machine.
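A minimal sketch of such a dumb search, using modern Julia threading APIs (the kernel and helper names are invented for illustration; a real tuner would benchmark the library's actual kernels and average repeated timings):

```julia
using Base.Threads

# Invented example kernel: a chunked, task-based parallel sum.
function threaded_sum(x)
    nt = Threads.nthreads()
    chunks = Iterators.partition(eachindex(x), cld(length(x), nt))
    tasks = [Threads.@spawn sum(view(x, c)) for c in chunks]
    return sum(fetch, tasks)
end

# Dumb search: the first vector length at which the threaded kernel beats
# the serial one on this machine.
function find_crossover(sizes = round.(Int, 10 .^ (3:0.5:7)))
    threaded_sum(rand(10)); sum(rand(10))   # warm up to exclude compilation
    for n in sizes
        x = rand(n)
        t_serial   = @elapsed sum(x)
        t_threaded = @elapsed threaded_sum(x)
        t_threaded < t_serial && return n
    end
    return nothing   # threads never won over the tested sizes
end
```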
Matlab's threading model seems to have undergone changes over time. In particular, their having taken away the ability to interactively control maxNumCompThreads suggests they're using a more sophisticated model these days.
I don't really know anything about what's happening under the hood. But recently I had a weird experience with their threads, that may provide some hints about what is going on: we were in a situation where one MATLAB process was eating all 32 cores on the machine, and the machine was unusable for anyone else. I tried renice, then I tried taskset, but Matlab was like the undead: temporarily I could force the number of cores down to ~10, but within a few minutes it was right back at 32, even though it was the same PID. It was running lots of FFTs, for which Matlab just uses FFTW, so maybe it was really FFTW's "fault." But it was weird.
I've also seen cases where, when Matlab launches a Julia process via their system command, it's always assigned to CPU 0. So you can have a big machine, lots of people using it, and all the Matlab-launched Julia processes are competing for the same core (and therefore running slowly). This doesn't seem to be entirely reproducible, so I haven't yet dug in to figure out what is wrong.
--Tim
A stopgap solution using shmem and mmap.
https://groups.google.com/forum/?fromgroups=#!topic/julia-users/Y_ME9uO5tVQ
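For later readers, a minimal sketch of that stopgap using the `Mmap` standard library from modern Julia (the linked thread predates this API): back an array with a memory-mapped file so multiple processes can share it without copying.

```julia
using Mmap

# Back a Float64 vector with a memory-mapped file; any process mapping the
# same file sees the same bytes, giving crude shared memory.
io = open("/tmp/shared_buffer.bin", "w+")
A = Mmap.mmap(io, Vector{Float64}, (10^6,))  # file is grown to fit
A[1] = 3.14          # visible to other processes mapping this file
Mmap.sync!(A)        # flush changes to the backing file
close(io)
```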
@timholy, processor affinity is the devil here. A lot of software uses it to make benchmarks look better, but in the real world it is worse than useless except on dedicated hardware. It really needs to be disabled by default (e.g. there is an option in OpenBLAS to disable it; FFTW doesn't use it at all unless you manually run taskset, but maybe MathWorks hacked their FFTW version).
That, plus you need a way to control the total number of worker threads (e.g. OMP_NUM_THREADS or whatever), and last I checked this was a bit dicey with Matlab.
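As an aside, this kind of control did eventually land in Julia itself (none of these knobs existed when this comment was written):

```julia
# Thread-count control as it later materialized in Julia.
#   JULIA_NUM_THREADS=8 julia     # size Julia's own pool at startup
using LinearAlgebra

Threads.nthreads()             # size of Julia's thread pool
BLAS.set_num_threads(4)        # OpenBLAS worker threads
ENV["OMP_NUM_THREADS"] = "4"   # honored by OpenMP libraries loaded afterwards
```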
@stevengj, thanks for the explanation. I bet you're right about the hacked version, it's a behavior that seems relatively specific to FFTs (although for us that's a big part of our load on that server, when we're not doing them via CUDA, so maybe I wouldn't see it otherwise).
Related to #1790
Does the recent announcement of Intel open-sourcing its LLVM-OpenMP implementation make a difference for Julia in the short term? http://blog.llvm.org/2013/10/openmp-project.html
Is this something we can expect when the threading branch gets merged? I still see the 4x difference (on a quad-core laptop) with Matlab running the OP's example.
No. That will just give us threading capabilities. Once that stabilises we will start with threading array operations and other things.
This should be closed in favor of #19777, since dot calls have replaced vectorized operations for the most part.
Matlab users who employ "properly" vectorized code may notice a dramatic slowdown when moving to Julia, because most Matlab vector operations are multithreaded while Julia's are not. There is an intrinsic difficulty here because Julia code cannot be multithreaded (see issue #1790), but as a stopgap measure I suppose you could call out to C code that uses OpenMP to parallelize important vector operations.
For example, compare a typical vectorized elementwise expression in Matlab (R2012a) (on a 16-core x86_64 3 GHz AMD Opteron 4280, Debian GNU/Linux) with the Julia equivalent: Julia is more than 10x slower. The difference is almost entirely due to multithreading, since running Matlab with -singleCompThread brings its time roughly in line with Julia's.
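A representative kernel of this kind (an invented stand-in, not the exact benchmark expression from the original report) would be a long elementwise expression over a large vector:

```julia
# Illustrative stand-in for the kind of elementwise kernel being compared.
# (Written with modern dot syntax so that it runs on current Julia; in
# 2012-era Julia the undotted vectorized forms would have been used.)
x = rand(10^7)
@time y = @. sqrt(x^3 + x / (1 + x^2)) - 7
```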
Also needed would be a way to specify the number of threads to use (e.g. you could use the standard OMP_NUM_THREADS environment variable by default, while providing some way to change it from within Julia). The nice thing about using OpenMP is that you can also use OpenMP for the OpenBLAS library and for FFTW (and possibly other libraries?), and use the same thread-pool for all three.