JuliaLang / julia

The Julia Programming Language
https://julialang.org/

Parallel Computing Planning Issue #9167

Closed alanedelman closed 9 years ago

alanedelman commented 9 years ago

As we talk about parallel computing in various contexts, it might be handy to keep in mind all the high level moving parts. Here are some that come to mind, just to get the ball rolling.

Hardware

Tools

Programming Models

Communication Layers

Schedulers

Libraries

tknopp commented 9 years ago

Regarding shared-memory multithreading, the threads branch seems to be the closest thing. #6741 was my experiment with multithreading, and that work has been superseded by the threads branch.

andreasnoack commented 9 years ago

Faster data movement

This is partly about communication layers and partly about libraries. I'd like to write pure Julia implementations of large-scale parallel dense linear algebra functions, but it is difficult to get good performance in the present state of Julia's parallel functionality because the communication is slow. The graphs below show timings for moving data to a remote process as a function of array size, for the present @spawn, @spawn with the changes proposed in #6876, and send/Send from MPI.jl. [graph: data-movement time vs. array size, log scale] Notice the log scale, which means that right now @spawns are ten times slower than MPI for large arrays and even slower for small arrays, although the variance is very large for small-array @spawns. Also relevant: #9046.
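For reference, a minimal sketch of this kind of data-movement timing (not the actual benchmark script used above; it assumes the pre-0.7 API where addprocs and @spawnat are available without a using Distributed):

    # Rough timing of moving an array to a remote worker and back.
    addprocs(1)
    p = workers()[1]

    for n in [10^3, 10^4, 10^5, 10^6, 10^7]
        A = rand(n)
        fetch(@spawnat p sum(A))              # warm up so compilation isn't timed
        t = @elapsed fetch(@spawnat p sum(A)) # ship A to the worker, wait for the result
        println("n = $n: $t seconds")
    end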

elextr commented 9 years ago

I'd add one to the hardware section: NUMA-aware memory strategies.

amitmurthy commented 9 years ago

addprocs on demand. What I mean by this is that any user should be able to summon thousands of cores, run a parallel algorithm and pay only for the minutes used.

This will be in a separate package, but changes will be required in Base to support this.

A possible model could be JuliaLang-backed Docker images that can be run on various cloud infrastructures. The package accepts the user's credentials and manages the launch, teardown, and billing of the Julia workers used.
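A rough sketch of what the user-facing side of such a package might look like (every name here is hypothetical: CloudWorkers, CloudCredentials, launch_cloud_workers, and release! do not exist):

    # Hypothetical usage sketch; none of these names exist yet.
    using CloudWorkers                                     # hypothetical package

    creds = CloudCredentials("access-key", "secret-key")   # hypothetical credentials type
    procs = launch_cloud_workers(creds, 1000)              # boots Docker images, calls addprocs
    total = @parallel (+) for i = 1:10^9                   # run the parallel algorithm
        rand() < 0.5 ? 1 : 0
    end
    release!(procs)                                        # tear the workers down, stop billing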

eschnett commented 9 years ago

I'm slightly confused about how Julia's tasking model will evolve in the presence of multi-threading. I always assumed that Julia's tasks, which are currently executed serially, would be executed in parallel in the future. However, when I look at the threading branch, that may not be the case. Has there been a discussion about this?

tkelman commented 9 years ago

@andreasnoack that is excellent data, thanks for collecting that

I'll note that sadly none of the open-source MPI implementations can build on non-POSIX platforms right now, but I've got some back-burner plans to fix that, which involve filling in whatever's missing, possibly using libuv along the way. Almost no one uses Windows for HPC, but it's mildly annoying that I can't run MPI libraries on 8 local cores if I happen to be using Windows. There is a Microsoft MPI implementation, but I doubt many computational MPI libraries work with it. We should still do as much as possible to make Julia play well with MPI, cross-platform or not.

To the Libraries list I'd like to add https://github.com/elemental/Elemental, @poulson has expressed interest in adding Julia bindings and it would be good to know if we need to change anything to make that happen.

eschnett commented 9 years ago

Microsoft's MPI seems to be built on MPICH, one of the mainstream MPI implementations. (I'm reading the description; I haven't tried it.) MPI implementations tend to be very portable (i.e., they stay very close to the standard), and I would thus expect this implementation to work out of the box with any MPI-using application under Windows.

poulson commented 9 years ago

@tkelman I would assume that I can manually expose an IJulia interface in the same way that I did for IPython (see http://web.stanford.edu/~poulson/notes/2014/11/11/Elemental-IPython.html).

I'm currently wrapping up initial dense and sparse-direct distributed-memory implementations of Mehrotra Predictor-Corrector Interior Point Methods for LPs (though I am not yet specially handling small numbers of "dense" rows/columns when forming the normal form of the (ideally) sparse KKT system). I hope to add distributed dense versions of QP IPMs soon, with distributed dense and sparse-direct SDPs being my ultimate goal.

ViralBShah commented 9 years ago

Why do the libraries not work with Microsoft MPI? What have they done?

ViralBShah commented 9 years ago

I wonder if much of the time spent is in spawning rather than data movement, at small sizes.

tkelman commented 9 years ago

@poulson the IPython approach does look interesting, did not know about that (see http://ipython.org/ipython-doc/2/parallel/index.html for anyone else who's curious). We should see how much of that also works through IJulia - maybe all of it does?

Regarding the linear algebra details and optimization applications, do you have a roadmap issue? I'm hoping you won't limit yourself to convex problems, the nonlinear programming solvers of the world are desperately in need of redistributable parallel sparse linear algebra libraries, of which Elemental might be the first to actually fit the bill.

@ViralBShah I'm mostly worried about build systems not knowing how to find and use Microsoft's MPI implementation. It's also not open source, so I can't cross-compile it, which means using WinRPM for binaries is out of the picture.

poulson commented 9 years ago

@tkelman I apologize for taking this discussion a bit off-topic. But, somewhat in order, my current roadmap is robust sequential/distributed dense/sparse implementations of LPs, QPs, SOCPs, and SDPs which can make use of accelerators on each node.

On the subject of parallelization with IJulia: I have little doubt that it would be straightforward.

jakebolewski commented 9 years ago

Has the parallel support in IPython been refactored enough to be language-agnostic? Last time I used it, the implementation was very much tied to the Python kernel.

poulson commented 9 years ago

@jakebolewski I haven't checked. I'm not saying that it should be a black box, but that it is conceptually straightforward. I would assume that a complete IJulia implementation would duplicate said functionality with Julia workers instead of Python workers. If someone is seriously interested in this, I'm more than happy to discuss further (and possibly consider adding said functionality myself).

amitmurthy commented 9 years ago

@poulson regular addprocs should work within an IJulia session too, right? And my understanding is that the workers will be around until the kernel is shut down.
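For what it's worth, a minimal sketch of that in a notebook cell (plain addprocs/@everywhere usage; nothing IJulia-specific is assumed):

    # In an IJulia cell: add four local workers and run code on all of them.
    addprocs(4)
    @everywhere println("hello from worker $(myid())")
    # The workers persist across cells until rmprocs(workers()) is called
    # or the kernel is shut down.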

poulson commented 9 years ago

@amitmurthy I was honestly not familiar with that function, and it looks like it might. But I wouldn't want to promise anything until I have a working prototype.

poulson commented 9 years ago

@ViralBShah MPICH makes use of pthreads, which Windows does not support.

tkelman commented 9 years ago

winpthreads works reasonably well, to be fair, and it's a small permissively licensed library. LLVM might force us off the win32-threads version of MinGW at some point anyway.

ViralBShah commented 9 years ago

@amitmurthy's recent work on allowing different transports is worth trying out here on this simple benchmark.

@andreasnoack Can you create a set of benchmarks in test/perf, and put this benchmark in there? We will certainly want to add to that list, and it would be great if everyone can run these.

andreasnoack commented 9 years ago

@ViralBShah I'll try to compile some tests for this.

wildart commented 9 years ago

Here is an article by John Langford, Allreduce (or MPI) vs. Parameter server approaches, where he reviewed various parallelization approaches used in machine learning.

classner commented 9 years ago

Hi everybody! As Julia is making great progress, it's becoming really interesting for image processing. (Local) threading is really important here, though. I dug through the various discussions and found this discussion and the "threads" branch.

Apparently, the implementation in that branch is working. The basic tests are passing, though one test relying on the "Images" module currently doesn't work (since the Images module requires functionality available only in the latest Julia version, and I didn't go through a merge since it couldn't be done automatically). What are the plans regarding local parallelism? Are there any plans to merge the "threads" branch? It looks promising!

timholy commented 9 years ago

Other than a few experiments back in the very earliest days, I have not been involved in the threads branch at all. So take this with a big grain of salt.

I think it's fair to say that there certainly are plans to merge the branch. Guessing here, but presumably what it needs is (1) some free time for its principal developers (including one who needs to defend his thesis) to finish not-yet-implemented features or bug fixes, and (2) people kicking the tires. There's a chance you might be able to help the process along by doing that merge you described above and submitting a PR against the threads branch. If you are brave and willing to start using it in regular work, continued PRs against the threads branch would surely go a long way toward speeding the merger.

StefanKarpinski commented 9 years ago

The plan is to merge the threading work into master soon, but leave it off by default since it's still quite experimental. The main purpose of merging is to avoid continually rebasing when people make changes to the internals on master that break the threading functionality. That should make it easier for people to try out threading functionality even though it will still be experimental during the 0.4 cycle.

classner commented 9 years ago

That sounds very good! I had a quick look at what the main merge problems were. A critical part of the changes concerns garbage collection, which is at the very heart of Julia and cannot be merged without considerable reading/knowledge of the internals.

StefanKarpinski commented 9 years ago

Yes, it's a pretty non-trivial merge.

ViralBShah commented 9 years ago

The GC patch was ready and tested, and since it was merged first, work remains to make it thread-safe. As said before, we expect to have this in 0.4, even if disabled by default.

There is a threading branch that is more recent than the threads branch, which is updated to just before the GC merge. Cc: @kpamnany

jakebolewski commented 9 years ago

Does the threading branch work on OSX / Windows or is it Linux only for the time being?

ViralBShah commented 9 years ago

I don't think anyone has tried it on OS X. I wouldn't even dare try Windows. I personally have only tried it on Linux, and I also see a few segfaults.

kpamnany commented 9 years ago

Haven't tried OS X or Windows, but the former shouldn't be too hard to get running. The threading infrastructure mostly needs only pthreads and atomics. There's an LLVM dependency, for thread-locals, but @Keno has a functional hack in an llvm-svn branch (I don't know if this is Linux-specific though). Getting this functionality into LLVM is another TODO.

Reentrancy in the runtime (specifically in the new GC) is the main block right now. And there are, of course, lots of functions in the standard library that need to be made thread-safe, or thread-safe versions added. Most crashes you'll find in the threading branch relate to such problems.

Lots more functionality to add, but it mostly works, and scales quite nicely. :-) See test/perf/threads/laplace3d.

Keno commented 9 years ago

The LLVM patch is Linux-only at this point, but give me a week or two.

classner commented 9 years ago

Yes, the scaling is near perfect and the functionality looks very promising and impressive!

Do you have plans already for the variable-sharing policy? I.e., which variables will be treated as thread-local, and when are variables synchronized between threads?

kpamnany commented 9 years ago

Threads run Julia functions (or blocks that are converted into functions). Block-scoped variables are on the stack, hence thread-local. Variables in an outer scope will be shared. Directive-based privatization a la OpenMP is not high on the priority list (to be honest, it isn't on the list at all, but reductions are). Is this important?

Synchronization support will initially be explicit, i.e. locks and atomics. These should eventually be used to build a library of concurrent data structures.
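A minimal sketch of those rules, written with the threading API as it later landed in Base (Threads.@threads, Atomic, atomic_add!); the experimental branch's exact spellings may differ:

    using Base.Threads

    n = 10^6
    a = zeros(n)                  # defined in the outer scope => shared by all threads
    total = Atomic{Float64}(0.0)  # explicit atomic for a shared accumulator

    @threads for i = 1:n
        x = sin(i)                # block-scoped => on each thread's stack, i.e. thread-local
        a[i] = x                  # threads write to disjoint indices of the shared array
        atomic_add!(total, x)     # shared accumulation needs explicit synchronization
    end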

kpamnany commented 9 years ago

In response to @eschnett (from a while ago--sorry, only saw this when I was @mentioned), the interaction between tasks and threads has been the subject of much discussion. Cilk-style parallelism (i.e. fork/join with work stealing) will work best for this, but alternative shared memory parallel programming models are still some ways out. This is simple loop parallelism, and it isn't clear how tasks can/should interact with this model.

classner commented 9 years ago

Thanks for the detailed response! Having a way to synchronize, plus the simple loop parallelism, will be perfectly sufficient. I think the OpenMP directives do make sense to simplify data handling on NUMA architectures. At the same time, they're way out of scope for an experimental feature...

StefanKarpinski commented 9 years ago

off-topic: @kpamnany, I love your avatar :-)

kpamnany commented 9 years ago

@StefanKarpinski, I figured it was the done thing for Julia devs, given yours and Jeff's. :-)

skariel commented 9 years ago

Please consider nanomsg; it is by the same author and supersedes ZeroMQ.

pao commented 9 years ago

Julia already supports ZMQ, since it's needed to communicate with IPython (Jupyter), so it's sensible to keep pursuing. I think the transports are supposed to be pluggable, so if you're interested in a nanomsg transport, you're welcome to build it!

garborg commented 9 years ago

@skariel A sensible place to start: https://github.com/quinnj/Nanomsg.jl

ChrisRackauckas commented 9 years ago

I was wrong! Good work!

IainNZ commented 9 years ago

@ChrisRackauckas do you mean fusing multiple linear algebra operations? Because aren't all the basic linear algebra operations parallel already? As for for loops, we already have @parallel, and soon enough we'll have @threads too.
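For reference, a minimal sketch of the two loop forms being contrasted here (the pre-0.7 @parallel reduction syntax, and @threads as it later became available in threads-enabled builds):

    # Distributed reduction across worker processes added with addprocs.
    addprocs(4)
    s = @parallel (+) for i = 1:10^8
        rand()
    end

    # Shared-memory loop across threads (requires a build with threading enabled).
    a = zeros(10^6)
    Threads.@threads for i = 1:length(a)
        a[i] = i^2
    end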

ChrisRackauckas commented 9 years ago

Wow, turns out I was just wrong. When I last used Julia I don't remember this, but I just ran a simple test with multiplying two random matrices of size 10,000 x 10,000 and it did max all my cores. I think that it should be noted in the documentation (what other functions are parallel?).

Also, did not know about @parallel.

FYI, hearing this (and noticing that MATLAB's native parallelization doesn't come close to using the full power of my machine...) finally makes me a convert. The only thing I feel I am missing is easy CUDA support (i.e., as simple as sending the array over to the GPU so that all matrix operations and standard functions are then performed on the GPU), which would be nice but isn't truly necessary.
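For anyone who wants to check this on their own machine, a quick sketch (blas_set_num_threads was the exported name in releases of that era; later versions spell it BLAS.set_num_threads):

    # Dense matmul dispatches to OpenBLAS, which is multithreaded by default.
    A = rand(10000, 10000)
    B = rand(10000, 10000)
    @time A * B               # watch CPU usage: all cores should be busy

    # peakflops() exercises the same multithreaded gemm kernel.
    println(peakflops())

    # The BLAS thread count can also be set explicitly.
    blas_set_num_threads(4)
    @time A * B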

IainNZ commented 9 years ago

Not sure which operations exactly, but basically "linear algebra on Float64s" for sure. Definitely check out the manual; it covers the parallel computing functionality of Julia pretty well, and PRs for improvements are welcome as well. CUDA support isn't a base Julia issue (IMO), given it's dependent on an external closed-source library, but there is JuliaGPU/CUBLAS.jl, which is getting there for higher-level abstractions (contributions welcome there too, I'm sure):

    # Assumes the CUDArt and CUBLAS packages of that era; elty, m, n, and alpha
    # are defined here so the snippet runs standalone.
    using CUDArt, CUBLAS
    using Base.Test

    elty = Float64
    m, n = 100, 100
    alpha = 2.0

    A = rand(elty, m, n)
    d_A = CudaArray(A)                         # copy A to the GPU
    # test y = alpha*(A*x)
    x = rand(elty, n)
    d_x = CudaArray(x)
    y1 = alpha*(A*x)                           # CPU reference results
    y2 = A*x
    d_y1 = CUBLAS.gemv('N', alpha, d_A, d_x)   # y = alpha*A*x on the GPU
    d_y2 = CUBLAS.gemv('N', d_A, d_x)          # y = A*x on the GPU
    h_y1 = to_host(d_y1)                       # copy results back to the host
    h_y2 = to_host(d_y2)
    @test_approx_eq(y1, h_y1)
    @test_approx_eq(y2, h_y2)

ChrisRackauckas commented 9 years ago

The only mention of that functionality I could find is a hint in the peakflops function on this page: http://julia.readthedocs.org/en/latest/stdlib/linalg/. There is no mention on the parallel computing page http://julia.readthedocs.org/en/latest/manual/parallel-computing/ or the linear algebra page http://julia.readthedocs.org/en/latest/manual/linear-algebra/.

johnmyleswhite commented 9 years ago

Arguably this isn't something that should be documented as Julia's behavior since it's really the behavior of OpenBLAS.

StefanKarpinski commented 9 years ago

@ChrisRackauckas, where did the impression that Julia was not maxing out a system's cores when doing things like matmul come from? Doing so is a matter of using a well-tuned multithreaded BLAS, which we have always had, albeit a different one from Matlab (OpenBLAS vs. MKL). There are other vectorized operations where Matlab has hand-coded implementations in C that use multiple cores, while Julia's implementation, being written in Julia, uses a single core. Often that deficit can be compensated for by devectorizing the code and improving the serial performance. In the future, Julia will support threading natively, and that issue will go away too. The distributed computing issue is largely orthogonal to all of this, but it is also important if one wants to do HPC-style work.
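A toy sketch of the devectorization point (not from the thread): the vectorized form allocates temporary arrays, while the explicit loop does the same work in a single pass on one core.

    # Vectorized: allocates intermediates for x.^2 and 2 .* x before summing.
    f_vec(x) = sum(x.^2 .+ 2 .* x)

    # Devectorized: one pass, no temporaries; often recovers much of the gap
    # with hand-coded C kernels even on a single core.
    function f_devec(x)
        s = zero(eltype(x))
        for i = 1:length(x)
            s += x[i]^2 + 2*x[i]
        end
        return s
    end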

ViralBShah commented 9 years ago

Once we have multi-threaded Julia, it won't be only hand coded internal functions in C that will be multi-threaded, but the entire Julia Base library. So while we have some distance to cover, we will have something great in the 0.5 timeframe.

ChrisRackauckas commented 9 years ago

Thank you for the explanation, that clears it up. I look forward to multi-threaded Julia.

ViralBShah commented 9 years ago

This planning issue seems too broad. Can it be broken into smaller tasks that can have their own issues? I am tempted to close this and have specific issues dealing with specific problems.