An approach less counter-Julian than calling out to C for everything might be to rejuvenate Krys's very interesting framework mentioned in this thread: https://groups.google.com/d/topic/julia-dev/PZfOv4h94lk/discussion
IIUC, he designed it explicitly to support multiple "backends," one of which might be kernel threads. @stevengj, any interest in looking into it? It would be a fantastic contribution.
I have no interest in CUDA, which I see as an architectural dead end. The hard thing about parallel programming is mainly dealing with memory. Shared memory is hard enough (one kind of memory + synchronization), and distributed-memory programming is harder (local + remote memory). The CUDA/OpenCL model requires the programmer to manually manage 4-5 kinds of memory (main memory, GPU memory, thread-block shared memory, thread-local memory, and remote memory in a cluster situation), which is not a reasonable way forward for software engineering except in special cases.
I think it would be totally reasonable to start by adding some shared-memory loop-parallelism with restricted parallel regions to the core language. That way we can start by just slapping locks on everything in the runtime, still allowing for speedups in simple vector loops. A possible alternative is to add a separate shared heap as in Racket's "places".
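As a sketch of what such restricted loop-parallelism could look like, here it is written with the `Threads.@threads` macro that Julia later gained (nothing like it existed when this comment was made; it is used purely to illustrate the shape of the proposal):

```julia
using Base.Threads

# Sketch of restricted loop-level shared-memory parallelism: iterations are
# independent, and the runtime splits the index range across a fixed thread
# pool. (Threads.@threads is a later Julia API, shown for illustration only.)
function saxpy!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @threads for i in eachindex(x, y)
        @inbounds y[i] += a * x[i]
    end
    return y
end

saxpy!(zeros(10^6), 2.0, rand(10^6))
```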
@stevengj, my point was that while Krys was interested in CUDA, I seem to remember he explicitly designed it to handle multiple backends, of which kernel threads was another candidate. The main point of his framework was a way to implement lazy evaluation in Julia. He got some nice results even with CPU execution by removing temporaries, e.g., from operations like a = b.*c + d + e, which in standard Julia requires at least two temporaries.
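The temporary-elimination win being described can be seen with the fused broadcasting that Julia later gained (it did not exist at the time of this thread); a minimal sketch:

```julia
# The unfused form allocates intermediate arrays; the fused form evaluates
# the whole expression in one loop with a single output allocation.
b, c, d, e = rand(10^6), rand(10^6), rand(10^6), rand(10^6)

a_unfused = b .* c + d + e     # temporaries for b .* c and (b .* c + d)
a_fused   = @. b * c + d + e   # one fused loop, no intermediate arrays
```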
@JeffBezanson , I like that idea a lot!
I like to think of the simple case of implementing sum over the columns of a matrix. Say you have a parallel vector sum as your primitive. The implementation of the matrix column sum itself will then also be written to be parallel, so we very quickly get instances of parallel functions calling other parallel functions. Then you have cases of fat and skinny matrices, or just small matrices that will not benefit from parallelization. We need a design that can, at some level, incorporate at least some of these things.
Something like the OpenMP or Cilk runtime has no problem with parallel functions calling parallel functions, nor with problems that are too small to parallelize.
The key thing is that you don't implement shared-memory parallelism by actually spawning threads to do work, you tell the runtime system that work is available and you let it decide what thread to assign (or whether to just do the work serially).
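To make that concrete, here is the column-sum case from above in that style, written with the task-spawning API (`Threads.@spawn`) that modern Julia eventually gained; the runtime, not the caller, decides which thread runs each task:

```julia
using Base.Threads

# Each column's sum is exposed to the runtime as a task. Nested parallel
# calls just feed more tasks into the same pool, and tiny tasks simply run
# serially on whichever thread picks them up.
function colsums(A::AbstractMatrix)
    tasks = [Threads.@spawn sum(view(A, :, j)) for j in 1:size(A, 2)]
    return map(fetch, tasks)
end

colsums(rand(1000, 8))
```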
This is largely a solved problem (unlike distributed-memory parallelism). You just need to make the existing solutions accessible in Julia.
(The experts on this are in Charles Leiserson's group at MIT; I would be happy to help you get in touch with them to bounce around ideas.)
Yes, I'm familiar with the concepts in Cilk. My biggest concerns are ABI issues and portability. Different compilers seem to have their own OpenMP implementations. Targeting libgomp is probably enough for non-Darwin Unices, but I'd rather not have to maintain multiple backends.
Yes this is certainly a solved problem in shared memory.
@JeffBezanson In the short run, it sounds like targeting libgomp will be a good idea (it runs fine on MacOS too in my experience, and is supposedly possible to get running on Windows). That will give you a good platform on which to get something working, shake the thread-safety bugs out, and experiment with syntax, while users will get the benefit of multithreaded operations (and hopefully easier parallelism in their own code too).
In the long run (1-2 years at this rate), LLVM will probably support OpenMP directly, and it sounds like they are talking about supporting multiple backends so that difficulty will be taken care of for you.
Is there a non-GPL alternative? There is the Blocks runtime in LLVM, but I suspect it only works well on OS X, and is unlikely to be as well debugged and understood as OpenMP is.
Not that I know of. The gcc Cilk runtime is also GPL, and there is another OpenMP implementation called OpenUH that also seems to be derived from gcc and hence GPL.
But in the long term, once LLVM supports OpenMP, it will probably target proprietary backends like Intel's too, so Julia users wanting to license a different backend will have options.
http://www.mathworks.in/support/solutions/en/data/1-4PG4AN/
It seems that The MathWorks is just using pthreads, and they have hardcoded a bunch of special cases. I am guessing that they disable parallelism if a function is already running in parallel mode. It would be nice if someone who knows more about Matlab's threading model could chime in, just so that we know what's going on.
Of course, I believe we will do something systematic and better. I just dread putting all these restrictions on size and shape and whatnot all over our libraries to decide when to use one core or multiple cores.
@ViralBShah: restrictions on size and shape are highly dependent on machine architecture, so why not just have "best guess" values that the end user can tune? E.g., run a dumb optimization search over simple matrix operations, matrix sizes, and numbers of threads. I know the OpenBLAS people were talking about doing this, because right now they have default thresholds that they acknowledge may not be optimal on any given machine.
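A minimal sketch of such a dumb search, using modern Julia threading APIs (the kernel and helper names are invented for illustration; a real tuner would benchmark the library's actual kernels and average repeated timings):

```julia
using Base.Threads

# Invented example kernel: a chunked, task-based parallel sum.
function threaded_sum(x)
    nt = Threads.nthreads()
    chunks = Iterators.partition(eachindex(x), cld(length(x), nt))
    tasks = [Threads.@spawn sum(view(x, c)) for c in chunks]
    return sum(fetch, tasks)
end

# Dumb search: the first vector length at which the threaded kernel beats
# the serial one on this machine.
function find_crossover(sizes = round.(Int, 10 .^ (3:0.5:7)))
    threaded_sum(rand(10)); sum(rand(10))   # warm up to exclude compilation
    for n in sizes
        x = rand(n)
        t_serial   = @elapsed sum(x)
        t_threaded = @elapsed threaded_sum(x)
        t_threaded < t_serial && return n
    end
    return nothing   # threads never won over the tested sizes
end
```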
Matlab's threading model seems to have undergone changes over time. In particular, their having taken away the ability to interactively control maxNumCompThreads suggests they're using a more sophisticated model these days.
I don't really know anything about what's happening under the hood. But recently I had a weird experience with their threads, that may provide some hints about what is going on: we were in a situation where one MATLAB process was eating all 32 cores on the machine, and the machine was unusable for anyone else. I tried renice, then I tried taskset, but Matlab was like the undead: temporarily I could force the number of cores down to ~10, but within a few minutes it was right back at 32, even though it was the same PID. It was running lots of FFTs, for which Matlab just uses FFTW, so maybe it was really FFTW's "fault." But it was weird.
I've also seen cases where, when Matlab launches a Julia process via their system command, it's always assigned to CPU 0. So you can have a big machine, lots of people using it, and all the Matlab-launched Julia processes are competing for the same core (and therefore running slowly). This doesn't seem to be entirely reproducible, so I haven't yet dug in to figure out what is wrong.
--Tim
A stopgap solution using shmem and mmap.
https://groups.google.com/forum/?fromgroups=#!topic/julia-users/Y_ME9uO5tVQ
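For later readers, a minimal sketch of that stopgap using the `Mmap` standard library from modern Julia (the linked thread predates this API): back an array with a memory-mapped file so multiple processes can share it without copying.

```julia
using Mmap

# Back a Float64 vector with a memory-mapped file; any process mapping the
# same file sees the same bytes, giving crude shared memory.
io = open("/tmp/shared_buffer.bin", "w+")
A = Mmap.mmap(io, Vector{Float64}, (10^6,))  # file is grown to fit
A[1] = 3.14          # visible to other processes mapping this file
Mmap.sync!(A)        # flush changes to the backing file
close(io)
```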
@timholy, processor affinity is the devil here. A lot of software uses it to make benchmarks look better, but in the real world it is worse than useless except on dedicated hardware. It really needs to be disabled by default (e.g. there is an option in OpenBLAS to disable it; FFTW doesn't use it at all unless you manually run taskset, but maybe MathWorks hacked their FFTW version).
That, plus you need a way to control the total number of worker threads (e.g. OMP_NUM_THREADS or whatever), and last I checked this was a bit dicey with Matlab.
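As an aside, this kind of control did eventually land in Julia itself (none of these knobs existed when this comment was written):

```julia
# Thread-count control as it later materialized in Julia.
#   JULIA_NUM_THREADS=8 julia     # size Julia's own pool at startup
using LinearAlgebra

Threads.nthreads()             # size of Julia's thread pool
BLAS.set_num_threads(4)        # OpenBLAS worker threads
ENV["OMP_NUM_THREADS"] = "4"   # honored by OpenMP libraries loaded afterwards
```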
@stevengj, thanks for the explanation. I bet you're right about the hacked version, it's a behavior that seems relatively specific to FFTs (although for us that's a big part of our load on that server, when we're not doing them via CUDA, so maybe I wouldn't see it otherwise).
Related to #1790
Does the recent announcement of Intel open-sourcing its LLVM-OpenMP implementation make a difference for Julia in the short term? http://blog.llvm.org/2013/10/openmp-project.html
Is this something we can expect when the threading branch gets merged? I still see the 4x difference (on a quad-core laptop) with Matlab running the OP's example.
No. That will just give us threading capabilities. Once that stabilises we will start with threading array operations and other things.
This should be closed in favor of #19777, since dot calls have replaced vectorized operations for the most part.
Matlab users who employ "properly" vectorized code may notice a dramatic slowdown when moving to Julia, because most Matlab vector operations are multithreaded while Julia's are not. There is an intrinsic difficulty here because Julia code cannot be multithreaded (see issue #1790), but as a stopgap measure I suppose you could call out to C code that uses OpenMP to parallelize important vector operations.
For example, compare a typical vectorized elementwise expression in Matlab (R2012a) (on a 16-core x86_64 3 GHz AMD Opteron 4280, Debian GNU/Linux) with the Julia equivalent: Julia is more than 10x slower. The difference is almost entirely due to multithreading, since running Matlab with -singleCompThread brings its time roughly in line with Julia's.
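A representative kernel of this kind (an invented stand-in, not the exact benchmark expression from the original report) would be a long elementwise expression over a large vector:

```julia
# Illustrative stand-in for the kind of elementwise kernel being compared.
# (Written with modern dot syntax so that it runs on current Julia; in
# 2012-era Julia the undotted vectorized forms would have been used.)
x = rand(10^7)
@time y = @. sqrt(x^3 + x / (1 + x^2)) - 7
```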
Also needed would be a way to specify the number of threads to use (e.g. you could use the standard OMP_NUM_THREADS environment variable by default, while providing some way to change it from within Julia). The nice thing about using OpenMP is that you can also use OpenMP for the OpenBLAS library and for FFTW (and possibly other libraries?), and use the same thread-pool for all three.