JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

support shared-memory parallelism (multithreading) #1790

Closed stevengj closed 8 years ago

stevengj commented 11 years ago

I realize that this would require major effort, but I'm hoping that you will keep this on your radar screen: the lack of native support for shared-memory parallelism is a huge shortcoming of Julia.

The reason to support shared-memory parallelism is that it is usually far, far easier to program than distributed-memory parallelism. Especially if you adopt the same infrastructure (already in gcc) as OpenMP or Cilk+, in which the programmer indicates which function calls and loops can be executed in parallel and the runtime system decides how many threads to spawn and how to distribute the load (e.g. by work stealing to parallelize irregular problems, as in Cilk).

Julia is attractive because it combines ease of programming with performance near that of C, but the latter is no longer true if you can simply put "#pragma parallel for" in your C loop and get 10x speedup on your 12-core machine. Distributed arrays cannot compete with this in ease-of-use, especially for problems in which extensive inter-processor communication is required.

(Using the OpenMP runtime system is also nice in that this way the runtime system will share the same worker threads with C libraries like OpenBLAS and FFTW, so that Julia and external C code won't fight over the CPUs.)

JeffBezanson commented 11 years ago

You certainly have a point here. Composability with parallel libraries is especially interesting. Work is underway to provide OpenMP support in LLVM, and if that happens it will be easier to interface with the openmp runtime. Otherwise, would it be sufficient to target a particular popular implementation like libgomp?

At a higher level, it is true that programming distributed memory is much harder, but our thinking has been that once you have a distributed-memory implementation it is relatively easy to make it take advantage of shared memory for performance without changing the programming model. Then if you can somehow make distributed-memory programming easier, everything will be fine.

It would be very valuable to have shared-memory parallelism available, but it would not be 100% satisfying just to tack on another parallel programming model. We could, though, take the approach of adding runtime support and then punting the rest to the user or to future API designs.

ViralBShah commented 11 years ago

It is fairly easy to support the equivalent of pragma parallel for in julia, and we already have @parallel (reduce) for that. It could certainly be made faster on shared memory. The current inconvenience with @parallel is that it works with darrays and multiple julia processes, and our darray implementation still has a ways to go. Making @parallel work on regular sequential arrays would give a lot of immediate parallelism to a lot of folks, without modifying the rest of their code. Of course, that would mean having some kind of a threading model within julia.

ViralBShah commented 11 years ago

Is the LLVM blocks runtime (used by Apple in Grand Central Dispatch) worth considering as an alternative to openmp?

http://libdispatch.macosforge.org
http://compiler-rt.llvm.org
http://llvm.org/svn/llvm-project/compiler-rt/trunk/BlocksRuntime/

stevengj commented 11 years ago

Yes, the requirement to use distributed arrays and multiple processes is the whole difficulty -- the key advantage of shared memory is that you have only a single kind of memory to work with. This becomes even more important when you go beyond simple data-parallel circumstances. e.g. parallelizing a sparse-matrix multiply, or a finite-difference time-domain simulation, or an FFT, are all fairly easy in shared memory [at least, before you get into heavy optimization], but are a lot more complicated in distributed memory. And the history of computer science is littered with the corpses of languages that tried to make distributed-memory programming as transparent as shared-memory.

OpenMP seems to be by far the most popular technique for parallelizing compiled C/C++/Fortran code on shared-memory systems, so there is a lot to be said for playing nicely with the OpenMP runtime. But the runtime system may be something of an implementation detail that in principle could even be switched at compile-time (or even at runtime). [This may be easier said than done; if you have to pick one, I would go with OpenMP.]

As I understand it, the main issue right now is that Julia's garbage collection is not multithreaded. As a first step, it wouldn't be so terrible if garbage collection were serialized in a multithreaded program, with the understanding that people writing multithreaded code should try to optimize things so that garbage collection can be deferred to happen infrequently.

As Jeff says, I think the key thing is to support shared memory in the runtime with some kind of OpenMP-like semantics via low-level primitives. Julia's macro system is flexible enough that people can then experiment with different syntaxes (although you will probably eventually want to standardize on one syntax to build into Base).

timholy commented 11 years ago

Here's an old issue on a related topic: https://github.com/JuliaLang/julia/pull/1002

For the record, I'd love this too, but not if it creates disasters.

ViralBShah commented 11 years ago

Intel recently open-sourced its OpenMP library under the BSD license.

http://www.openmprtl.org

ViralBShah commented 11 years ago

OpenMP support is forthcoming in clang - although the approach taken is to do it in the clang frontend rather than in the LLVM IR. See the presentation linked to in this article.

http://www.phoronix.com/scan.php?page=news_item&px=MTM2NjE

stevengj commented 11 years ago

Since Julia is unlikely to adopt the OpenMP #pragma syntax (I hope), the Clang OpenMP support should not really be relevant; what matters is access to the runtime library.

ViralBShah commented 11 years ago

Right, Julia is unlikely to adopt the OpenMP syntax, but clang having OpenMP capability makes it possible for us to compile FFTW and OpenBLAS with OpenMP on Mac, where we use clang as the default compiler. This is nice to have, when it happens.

stevengj commented 11 years ago

Last I checked, it was a bad idea to compile FFTW with clang; gcc gave significantly better performance. I don't know about OpenBLAS. For your precompiled Mac binaries, you might want to look into this.

(Fortunately, the Intel OpenMP library is supposedly ABI-compatible with libgomp, so you can hopefully compile FFTW with gcc and --enable-openmp and still use the Intel runtime.)

bsilbaugh commented 11 years ago

It seems that perhaps there are really three distinct (but related) issues being raised here:

  1. Leveraging shared-memory communication on multi-core machines.
  2. Supporting a thread-based programming model.
  3. Compatibility with OpenMP libraries.

Regarding issue 1, I don't see why Julia would need to change its programming model to support shared-memory communication. (Perhaps I've just revealed my own ignorance regarding Julia.) For example, MPI was designed with distributed computing in mind; however, many MPI implementations (e.g. OpenMPI) are still able to pass messages directly in memory when the sender and receiver are on the same machine. Hopefully, Julia will be able to (if it doesn't already) optimize its communication strategy in a similar manner.

Regarding issue 2, I don't think it's unanimous that thread-based programming models are always the best way to write high-performance parallel algorithms. In fact, I believe the designers of the D programming language decided to implement a message passing model to avoid many of the pitfalls of thread-based programming (see http://www.informit.com/articles/article.aspx?p=1609144 for details). Furthermore, with the exception of very simple programs, getting good performance out of OpenMP usually requires more effort than adding a few pragmas. As far as the simple cases are concerned, I think implementing parallel loop macros might be sufficient to make Julia as convenient as OpenMP.

Unfortunately, I don't have much to offer regarding issue 3.

I hope this helps.

cognociente commented 11 years ago

It seems that there has been progress made with OpenMP and Clang. See the articles below:

http://www.phoronix.com/scan.php?page=news_item&px=MTQ0NjQ

and

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-August/065169.html

I have actually been watching the progress of Julia and this issue in particular before jumping into learning a whole new toolset. I genuinely believe this is a must-have for a language/tool like Julia and second the reasoning of stevengj above that this is just so much easier to code for than thinking of distributed memory.

stevengj commented 11 years ago

@bsilbaugh, no one is arguing that "thread-based programming models are always the best way to write high-performance parallel algorithms," just that they are significantly simpler in many common cases. The whole point is that in the simple cases where one wants to parallelize an expensive loop (#pragma for) or a recursive tree of calls (#pragma task), you can often get good performance (even if not optimal) just by adding a few pragmas. (Even the difficulty of load-balancing irregular problems has been essentially solved.) These cases come up extremely often in practice, in my experience.

And "implementing parallel loop macros" is not sufficient, precisely because it doesn't deal with the issue of memory. The problem is not parallelizing the loop, the problem is that in distributed-memory systems you need to decide in advance where things are stored and deal with explicit communication. This is what shared-memory avoids, and this is where automated language tools have been an abject failure in distributed-memory systems for 20 years. This is not a simple problem.

bsilbaugh commented 11 years ago

@stevengj Breathe. Maybe have some camomile tea. Look, a cute puppy.

First, I think we agree on a few things:

  1. Julia should do whatever she needs to do under the hood (or skirt?) to optimize inter-process communication. This includes exploiting shared memory and/or system threads when two (or more) processes are running on the same machine.
  2. Simple tasks should be kept simple, and hard things doable.
  3. Julia needs to be competitive with existing technologies such as OpenMP, CUDA, and our beloved (or hated) grandpa MPI.
  4. This is not a simple problem.
  5. That was a darn cute puppy.

Now, let's consider the main point:

The whole point is that in the simple cases where one wants to parallelize an expensive loop (#pragma for) or a recursive tree of calls (#pragma task).

For simple cases (loops), being able to use OpenMP is nice. No argument. But is it worth exposing another parallel programming model to the user? Another model that the user needs to think about when composing parallel libraries? (Developer A may decide he will only use threads for his library, and developer B will decide that he will only use one-sided comm for his library. But someday, developer C is going to need to put library A together with B, and will have to try to reason about their interaction.) I think this is perhaps where we disagree. We disagree about the idea of bolting on yet another parallel programming model to the language itself. And that's okay.

Out of curiosity, have you tried automatic parallelism for some of the use cases you're talking about? Maybe compare a few test cases between OpenMP and Intel's automatic parallelism, just to get a sense of what is possible with automatic parallelism. (I would be willing to do some of the leg work if you would be willing to dig up some good test cases.) If the present state of the art in automatic parallelism is good enough for the kind of problems you're talking about, then this might be a way to get shared-memory support into Julia without actually requiring Julia users to learn multiple programming models. Obviously, automatic parallelism will take some effort to implement, but then again, the parallel programming model of the future is probably no model; i.e. I suspect multi-threading and message passing will eventually go the way of assembly programming.

stevengj commented 11 years ago

@bsilbaugh, Intel's auto-parallelization still requires the language to support shared-memory parallelism (and in fact, Intel's parallelizer is built on top of OpenMP). They probably didn't build it on top of distributed-memory parallelism precisely because it is so difficult to automate memory distribution.

bsilbaugh commented 11 years ago

@stevengj Automatic parallelization requires compiler support (which may use OpenMP libraries internally), but it does not require language support. (Otherwise, it wouldn't be automatic.)

You're right that Intel does not support automatic parallelism for distributed memory architectures (nor does OpenMP). But, for the simple cases you alluded to, perhaps it's enough.

The point of my earlier post (which I think was in line with @ViralBShah's and @JeffBezanson's original comments) was that you may not need to change the current parallel model (i.e. the one-sided communication API supported by the standard library) simply for performance reasons. For example, calling fetch could just as easily dereference a pointer to a block of shared memory instead of pulling data over the network. Depending on Julia's JIT capabilities, perhaps some of these calls (fetch, remote_call, etc.) can even get optimized out.

stevengj commented 11 years ago

@bsilbaugh, by "language support", I mean support in the Julia runtime, which is separate from the compiler. The main obstacle in this whole discussion is that the Julia runtime is not thread-safe. (And the simple cases I alluded to are no longer so simple in distributed-memory situations.)

Yes, you can certainly implement distributed-memory primitives on top of shared memory, but that doesn't address the difference in ease of programming between shared and distributed programming models (regardless of implementation).

ArchRobison commented 10 years ago

For shared memory parallelism, it would be worth looking at the Cilk model too. There is ongoing work to add it to LLVM (http://cilkplus.github.io/). Cilk avoids some of the composition problems that OpenMP has (notably nested parallelism). Though there's no free lunch -- OpenMP also has certain advantages. Another candidate worth understanding is Deterministic Parallel Java (http://dpj.cs.uiuc.edu/DPJ/Home.html). Maybe some of its techniques can be applied in Julia. I think the important thing is to understand the tradeoffs.

stevengj commented 10 years ago

@ArchRobison, the semantics of OpenMP have been converging towards those of Cilk for some time now. OpenMP now has #pragma omp task, similar to Cilk's spawn model, and it has #pragma omp for schedule(guided), similar to the divide-and-conquer load-balancing technique for loop parallelism in Cilk+. Of course, the syntax is quite different, but no one is proposing to adopt OpenMP syntax in Julia.

So, while I agree that Cilk has a good conceptual model for shared-memory parallelism, that question is somewhat orthogonal to what runtime threading library we use. (LLVM support is apparently somewhat irrelevant here since our syntax is not implemented in LLVM; we just need the runtime library.)

But again, the stumbling block is thread safety in the Julia runtime library.

JeffBezanson commented 10 years ago

That is true, but I'm just as worried about thread safety in every other Julia library that might be out there.

timholy commented 10 years ago

I'm not sure I understand the latter concern fully. For example, are there really that many functions in base that make use of non-constant global variables? I'm not saying there aren't any---I've written some of them myself---but I don't tend to think of it as a major feature of our code base. Of course with packages there are additional possibilities for conflict, but at least in my own packages I think it's pretty typical of what's in base---a small percentage might need some redesign for thread-safety.

ArchRobison commented 10 years ago

Though OpenMP has adopted tasking, there are fundamental semantic differences with Cilk that impact composability and performance. Tasking in OpenMP is tied to parallel regions. The big advantage and big disadvantage (depending on context) is that the number of threads executing a parallel region must be bound when the region is entered, before the amount of work or potential parallelism is known. (I work with the Cilk/OpenMP/TBB groups at Intel. We've considered lazy schemes to try to circumvent the issue in OpenMP, but the OpenMP standard has pesky features that get in the way.)

I agree that the big stumbling block is the run-time library and existing Julia libraries. Maybe a lint-like tool could inspect Julia libraries for "okay", "definitely not okay", or "a human should take a look"? From my beginner's experience, Julia seems to have much less of the alias analysis misery that stymies such tools for C/C++.

JeffBezanson commented 10 years ago

I am indeed worried about packages, and the various native libraries they might call. Thread safety is a pretty significant demand to put on all those things, especially since the failure mode is not an error message (or poor performance) but random horrible behavior.

Julia code is designed to compose together quite promiscuously, so it is hard to say "my code is threaded, so I need to make sure I only use thread-safe libraries" --- one could easily pass a function from one package to an implicitly parallel higher-order function from another package.

@ArchRobison great to have somebody from the Cilk group here.

stevengj commented 10 years ago

@ArchRobison, thanks for the clarification, that's very helpful.

ArchRobison commented 10 years ago

Another issue to consider is the generality of the threading model versus ability to detect races. Automatic race detectors can reduce the pain of dealing with threading bugs. Examples are Helgrind, Intel Inspector, and Cilk screen. (See http://supertech.csail.mit.edu/papers/tushara-meng-thesis.pdf for the theory behind one for Cilk.) The efficiency of a race detector is somewhat inverse to the generality of the parallelism model, so it's something to consider in choosing the parallelism model. JIT compilation may be able to reduce the cost somewhat since it can instrument only the memory accesses that might constitute races. (In the jungle of C/C++ binaries that Inspector deals with, we have to instrument just about every access since binary stack frames don't give us much info to do thread escape analysis.)

tknopp commented 10 years ago

Is there a large number of non-thread-safe Julia C functions? It would be really cool to know which functions are thread safe and which are not. Then one could make these thread safe using a locking mechanism. The following naive example segfaults even though the gc is off:

  jl_gc_disable();
  #pragma omp parallel for
  for (int i = 0; i < 100; i++)
  {
    // call some jl_... functions
  }

Keno commented 10 years ago

Nothing is thread safe.

tknopp commented 10 years ago

Ok, got that. And I am trying to understand how much effort it would be to change that. So in how many places are globals touched? If only some low-level functions touch globals, one could make those thread safe so that the high-level functions automatically become thread safe.

I am not saying: let's make all jl_* functions thread safe. I just want to get a feeling for whether this is feasible to investigate. In general, I think that in order to meet the goal of this issue one has to: (a) make libjulia thread safe, and (b) introduce a language feature to make use of this from Julia.

timholy commented 10 years ago

An "easy" step that might suffice for many uses is #4939. The concept is explained more thoroughly in #4580, especially in one of the comments.

Of course, I'm not saying this replaces all possible uses of threading, just that it's a non-disruptive alternative.

tknopp commented 10 years ago

@timholy: Yes, this is certainly a very interesting approach. It would be interesting to know what one "loses" when using a multi-processing approach (with shared memory) compared to a multi-threaded approach. Still, if I understand it correctly, one still has to copy between the SharedArray and the regular Array.

stevengj commented 10 years ago

I don't think #4939 is a solution to this issue (though it is not a bad feature in itself). That proposal doesn't provide any way to parallelize operations on ordinary Arrays and other data structures, so it still has the complication that you need to specify the data distribution in advance; standard library functions cannot be faster without intervention, dynamic load-balancing is hard, and so on. It doesn't change the fact that if you want multi-threaded library routines (BLAS etc.) you have to write them in C.

Let me put it another way. If you don't think we need multi-threaded operations on ordinary Array types with no user intervention, how would you feel if we disabled multithreaded BLAS and LAPACK operations? (Conversely, if they are valuable for BLAS, why aren't they valuable elsewhere?)

timholy commented 10 years ago

@tknopp, there's no copying required (unless for some reason you do so as part of the initialization). The key part of the proposal is that it maintains an Array in the local process but serializes just the "handle", so you can pass an arbitrary SharedArray argument to a function but do not have to copy any data.

One significant disadvantage of the multiprocessing approach is higher memory consumption, since there will be multiple instances of Julia. There is also almost certainly going to be more overhead associated with IPC than there would be with threading (pcall_bw was essentially an attempt to get around that).

@stevengj, agreed that it doesn't help with Arrays, although if SharedArrays are easy to use then one might just start using them for any data of sufficient size. However, due to the overhead of IPC we definitely don't have an approach in the works that is viable for smaller arrays (unless computation time happens to be large even for a small array). With regard to the data distribution being pre-determined, in certain respects this is incidental: since all processes have complete and fast access to all elements, you could pass an alternate set of indices if you want.

But your main point is what I meant by saying that it doesn't replace all possible uses of threading. This proposal is essentially a way of getting a subset of the benefits without requiring invasive changes to Julia's code base.

tknopp commented 10 years ago

@stevengj, @timholy: I know that multithreading is a bit of a controversial topic (kind of like asking a Python hacker to remove the GIL...), and I actually did not want to start a discussion about the pros and cons. I am just interested in whether this is something that makes sense to have a look at.

StefanKarpinski commented 10 years ago

I don't think #4939 is a solution to this issue (though it is not a bad feature in itself). That proposal doesn't provide any way to parallelize operations on ordinary Arrays and other data structures, so it still has the complication that you need to specify the data distribution in advance; standard library functions cannot be faster without intervention, dynamic load-balancing is hard, and so on. It doesn't change the fact that if you want multi-threaded library routines (BLAS etc.) you have to write them in C.

To me this is the fundamental problem with the SharedArrays approach. There are situations where you want to do lots of operations in parallel on a regular array – in fact, I would argue that this is the most common situation. I've been advocating for beginning by making array comprehensions in Julia implicitly parallel as a first concurrency step for a fairly long time. I know that this means that anything you do in the comprehension step had better be thread-safe, which means that we're going to need a big lock around cg and gc (code gen and garbage collection), but I think that combined with a safer distributed-lite model like SharedArrays would get us pretty far.

timholy commented 10 years ago

@tknopp, it's a very interesting topic and one worth discussing.

It's worth noting that you can, in rare circumstances, get away with ccalls to the pthread routines on functions defined by cfunction. Primarily you have to make sure that (1) all code has already been compiled, and (2) no memory is allocated by any julia routines that you call. The latter is not always easy to achieve.

@stevengj, it looks like your previous comment has been edited since I wrote my reply. To respond to the new version, I didn't say that multithreading of arrays is unnecessary---I was intending to be pretty clear in proposing it as "an alternative approach that might work for some people in some circumstances." Many people and/or applications may not be among those, in which case one should carry on as usual. But for those people that it might help, it's not an irrelevant option to know about, so the cross-links to the other issues are appropriate to have here.

While it's not really relevant to your main point, it may be worth noting that even your example about BLAS is not as far out of reach as it might seem. It would be easy to write a tiled gemm routine using SharedArrays. A subset of BLAS routines are largely or exclusively embarrassingly-parallel, which is where SharedArrays shine. Things that require detailed synchronization among threads/processes would have more overhead and would surely be better done with real threads.

In a crazy world, we would just replace Arrays with SharedArrays. That's crazy, because the memory overhead of a SharedArray is much larger, and Jeff has emphasized in the past how important it is to keep it to a minimum.

StefanKarpinski commented 10 years ago

@tknopp: as an intermediate first step, if you want to try this, you could attempt to "objectify" the Julia global runtime state so that you can have more than one Julia instance in a single process which can run independently of each other, but each one must be single-threaded. @JeffBezanson has voiced concerns about the performance hit that might introduce, but at least it's only a code generation hit, not a code execution hit (since we always run code natively).

tknopp commented 10 years ago

@StefanKarpinski: So by objectifying, you mean that one puts all globals into some struct and uses this as a Julia thread state, right?

My hope is, as you say, that code execution would only rarely need locking, so that a global interpreter lock might be feasible without too large a performance hit. In Python, the GIL is touched all the time when executing Python code.

StefanKarpinski commented 10 years ago

Yes, that's correct. There shouldn't be a lock, because I'm talking about having multiple independent Julia instances in the same process. If they don't share any of the same code generation infrastructure, then there's no reason to lock, unless LLVM itself isn't thread-safe, which I don't think is the case (I may be wrong about that).

tknopp commented 10 years ago

Ok, I am not entirely sure this is the simplest approach, as it would for instance require synchronizing the thread states before performing a parallel operation (making sure that local variables are present in all threads). With a GIL, one would "just" have to introduce locks at all places where globals are touched. But of course the GIL approach has the potential of being slower.

StefanKarpinski commented 10 years ago

This would be a necessary step before making things threadsafe anyway, unless you want to throw locks around everything, which strikes me as not the best approach.

ArchRobison commented 10 years ago

I recommend investigating the multiple instances approach. That's the way a good shared-memory parallel run-time works. (E.g., the separate workers in OpenMP, Cilk, or TBB.) A GIL is going down the wrong path. I've seen an inordinate amount of time spent in attempts to circumvent the GIL in Python.

There exist non-standard compilers that can privatize globals. E.g. look at http://mpc.sourceforge.net/ . Though using it would introduce yet another dependence.

tknopp commented 10 years ago

I am not entirely sure if one can compare Julia with Python here. Having a JIT should make an important difference. If the JITed code has only few calls into libjulia (e.g. to invoke the gc) the overhead of the GIL might not be that big. But as my knowledge of the Julia internals is still quite limited I am not so sure about that. Synchronizing Julia thread states also can have an overhead.

But even if one goes for a GIL, it can certainly improve the code structure if it is organized in such a way that the Julia state is held in a struct that could be potentially exchanged.

ViralBShah commented 10 years ago

SharedArray is certainly a stopgap approach, and it would be great to be able to work on Array with multiple processors. I guess we could get closer to this with SharedArray if we are somehow able to manage a memory pool for medium- to large-sized arrays that are allocated in a shmem segment. That would make it possible to do away with the copy, and we could perhaps even come up with faster ways to synchronize across the processes.

timholy commented 10 years ago

we could perhaps even come up with faster ways to synchronize across the processes.

pcall_bw in my original version of SharedArray is exactly that (about 40x faster in my tests). I also have a couple of guesses about where much of the remaining overhead comes from, but I haven't had time to check.

tknopp commented 10 years ago

I have been playing around a little bit with threads and libjulia and got the following code to work:

  // assumes #include <julia.h>, #include <stdio.h>, and a prior jl_init()
  jl_eval_string("my_func(x) = 2*x");
  jl_function_t *func = jl_get_function(jl_current_module, "my_func");
  int i;

  // call the function once serially, so that my_func is already
  // compiled before it is invoked from the parallel loop
  jl_value_t* arg = jl_box_float64(0.0);
  jl_call1(func, arg);

  jl_gc_disable();
  #pragma omp parallel for
  for (i = 0; i < 4; i++)
  {
    jl_value_t* arg = jl_box_float64((double)i);
    jl_value_t* retJ = jl_call1(func, arg);
    double ret = jl_unbox_float64(retJ);

    printf("my_func() = %f\n", ret);
  }

which outputs (in different permutations)

  my_func() = 0.000000
  my_func() = 4.000000
  my_func() = 2.000000
  my_func() = 6.000000

In the code, it is essential to call the Julia function once in the serial code before calling it in parallel. Further, I had to comment out JL_GC_POP, as I have no idea how to unwind the jl_pgcstack pointer when different threads access it.

So although this is of course a totally trivial experiment, it gives me some confidence that introducing locking in the code generation steps could enable lock-free execution of the actual compiled code.

tknopp commented 10 years ago

Thinking a little bit more about it, it could be a feasible approach to first precompile the Julia code before executing a parallel loop.

tknopp commented 10 years ago

@vtjnash @JeffBezanson: I still don't really have an idea how to cope with JL_GC_PUSH/POP in a parallel environment. If I understand it correctly, the actual rooting is done in gc.c:670 ( gc_mark_stack(jl_pgcstack, offset, d) ). To make JL_GC_PUSH/POP work in parallel, one would somehow need an identifier so that POP could pop a certain PUSH. Ideas?

(To make this clear: This is only an experiment and I only want to explore how far I get with respect to multithreading. This is not a concrete proposal)

carnaval commented 10 years ago

As threads have separate stacks, I believe you would have to run a separate task in each thread. That would mean making at least pgcstack and current_task thread-local.

tknopp commented 10 years ago

One very important use case for threads, besides the "OpenMP" use case, is GUI development. In all the GUI toolkits I know, one has a single UI thread, and any computation that takes some time has to be started on a new thread and run asynchronously.

I have been experimenting a little with Gtk.jl and parallel tasks, and it seems in principle to allow a non-blocking UI. Still, I think threading would be more natural here.

stevengj commented 10 years ago

@tknopp, for the GUI use case, it seems like you could just use Julia's current distributed memory framework, i.e. separate processes, fairly easily (since the division of labor is known ahead of time and does not involve sharing of complicated data structures or dynamic load balancing).