Cc: @carnaval
Yep, seems ok. I don't think there will be much trouble porting the write barrier; it should already be robust to duplication (except some heuristic counting code, which is not very important). gc_wb_buf is used in the case where a julia object (say an Array) owns a pointer to a non-first-class buffer (no identity, no type, like the Array data). Here we only copy the gc state of the parent onto the children, so I don't think a harmful race can happen either. I'm not aware of the status of the thread branch these days, but I believe the annoying part will be implementing safepoints. A naive implementation (which will of course be fine for the time being) will prohibit reordering of instructions around the safepoint, while we only need the safepoint not to be inserted between a store and its write barrier, or between getting a pointer and rooting it. Maybe if we switch to a conservative stack scanner and we emit write barriers before the stores, we could go with a zero-overhead approach to safepoints like this:
This would avoid inserting mostly-useless branches in every loop of every function.
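A minimal, hypothetical sketch of what such in-place patching could look like (the snippet originally linked above is not preserved in this thread), assuming x86-64 and JIT code pages that can be made writable: each safepoint site is emitted as a 5-byte nop, and only when a collection is requested does the runtime rewrite it into a call to a stop-the-world stub.

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* gc_stop_here() is a hypothetical stop-the-world entry stub. */
extern void gc_stop_here(void);

typedef struct {
    uint8_t *site;     /* address of the 5-byte nop emitted by the JIT */
    uint8_t saved[5];  /* original bytes, restored after the collection */
} safepoint_t;

/* Make the page(s) containing the site writable; a real implementation
 * would restore the original protection and flush the icache afterwards. */
static void make_writable(uint8_t *p)
{
    uintptr_t page = (uintptr_t)p & ~(uintptr_t)4095;
    mprotect((void *)page, 2 * 4096, PROT_READ | PROT_WRITE | PROT_EXEC);
}

/* Rewrite the nop into `call gc_stop_here` (x86-64 E8 rel32). */
static void arm_safepoint(safepoint_t *sp)
{
    uint8_t call[5] = { 0xE8 };
    int32_t rel = (int32_t)((uintptr_t)gc_stop_here - (uintptr_t)(sp->site + 5));
    memcpy(&call[1], &rel, 4);
    make_writable(sp->site);
    memcpy(sp->saved, sp->site, 5);
    memcpy(sp->site, call, 5);
}

/* Put the nop back once the collection is done. */
static void disarm_safepoint(safepoint_t *sp)
{
    make_writable(sp->site);
    memcpy(sp->site, sp->saved, 5);
}
```

In steady state a safepoint then costs nothing but the nop; the price is the complexity of rewriting live code safely while other threads may be executing it.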
(By the way if you run into memory corruption and you suspect the write barrier, run the program under GC_VERIFY which should check that no trigger was missed)
That approach seems pretty extreme. It would be nice not to depend on modifying code.
It's not modifying code. It's in-place re-JITting :-).
If a conservative stack scan is used, why is stopping at safepoints still required? Won't the conservative scan find all possible roots?
The current threading branch, if I understand it correctly, uses memory allocation calls as safepoints, which is simple, though it could make threads stall for a long time and gives weak progress guarantees.
The Go flagship compiler (gc) currently uses function entry as its safepoints, piggybacking on the conditional branch that is used to deal with stack overflow. Go's procedure prolog checks whether the stack is big enough and, if not, calls a routine that copies the entire stack. It looks like an attractive scheme for running many tasks without worrying about stack overflow, something we will have to worry about when parallelizing tasking. To stop all threads at a safepoint, the runtime modifies the stack limit so that all threads jump to the overflow handler on their next call. Of course it doesn't help for long loops.
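Roughly (and not Go's actual code), the prologue check that doubles as a safepoint could be sketched like this: the scheduler "poisons" each thread's stack limit so the very next function entry takes the slow path, which can then park the thread for the GC instead of growing the stack.

```c
#include <stdint.h>
#include <stdatomic.h>

typedef struct {
    /* Normally the real stack guard; set to STACK_POISON to stop the thread. */
    _Atomic(uintptr_t) stack_limit;
} thread_t;

#define STACK_POISON ((uintptr_t)-1)  /* above any valid stack pointer */

/* Slow path: grow/copy the stack, or park for a pending collection. */
void morestack_or_safepoint(thread_t *t);

/* Conceptually emitted at every function entry (stacks grow downward). */
static inline void prologue_check(thread_t *t)
{
    uintptr_t sp = (uintptr_t)__builtin_frame_address(0);
    if (sp < atomic_load_explicit(&t->stack_limit, memory_order_relaxed))
        morestack_or_safepoint(t);
}

/* To stop the world, the runtime sets every thread's limit to STACK_POISON,
 * so the next call on each thread falls into the slow path. */
```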
Wouldn't manually gc-ing also be a valid first approach? I know it's not that flexible, but performance-critical code should ideally not allocate any memory. When I worked on the initial threading code this worked quite well. Combined with manual gc-ing after a threading barrier, even allocations should be possible.
@ArchRobison You're right, with a conservative stack scan we don't even need safepoints. Another huge argument in favor of the conservative scanner IMO. If doing nothing works, I can understand that live patching looks a bit overkill :-) @tknopp I'm sure it'll work, but it's a very easy source of invisible slowdowns. If your perf-critical code starts to allocate memory some day (someone changed a constant in inference.jl, ...), you can have one thread (or even n-1 threads) start to gc and stall until every other thread finishes its piece of work, reverting back to sequential performance.
@carnaval: Yes, you are right. I would just be extremely happy if we could get the threading branch as an opt-in for 0.4 and could live with a solution that is "experts only".
@carnaval Why does a thread have multiple heaps? E.g., why isn't `HEAP_COUNT == 1`? Is it because `region_t` uses fixed-sized arrays? (I'm taking a whack at making the gc thread-safe, and starting to deal with privatizing globals.)
Yes, what I call a heap in the code (well, a region; I've not been very consistent there) is a fixed mmap'd zone. The pool allocating code uses this to ask for (and release) aligned 16k chunks of memory via malloc_page() and free_page(). Those should be fairly low-contention, so locking around them should be ok. The pool descriptors themselves (pool_t) are high-traffic memory, so they should probably be thread-local (I recall Tobias doing this some time ago). For alloc_big() we could have thread-local big_objects and big_objects_marked linked lists, but I'd guess it is a lot less important for perf (or achieve this with an atomic CAS?).

A bit unrelated, but: will the threading branch use an existing work scheduling library? I've been thinking about making some parts of the gc parallel, and managing the queues & work stealing/distribution looks like 80% of the problem. A mature runtime to handle this would be cool to have hanging around the C codebase.
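A sketch of the CAS variant mentioned above, assuming a `bigval_t` descriptor with an intrusive `next` pointer (names are illustrative, not the exact ones in gc.c): a freshly allocated big object is pushed onto a shared list without taking a lock.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct bigval_t {
    struct bigval_t *next;
    size_t sz;
    /* ... object payload follows ... */
} bigval_t;

static _Atomic(bigval_t *) big_objects;

/* Lock-free push: retry until the head we read is still the head. */
static void push_big_object(bigval_t *v)
{
    bigval_t *head = atomic_load_explicit(&big_objects, memory_order_relaxed);
    do {
        v->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &big_objects, &head, v,
                 memory_order_release, memory_order_relaxed));
}
```

In a stop-the-world design only the push needs to be concurrent; the sweep can still walk the list single-threaded.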
Not sure one can call what I did with the memory pool thread-local. I generated `N` pools and ensured that when several threads enter the `pool_alloc` function they are dispatched to a "free" pool. Each pool had a mutex to ensure that no race conditions happen:
https://github.com/tknopp/julia/blob/jlthreading/src/gc.c#L974
In my tests this pool was a huge improvement. But has this been entirely removed from the threading branch? I thought the experts replaced this with a proper TLS solution.
Thanks for the comments and history.
@carnaval: We're still working out what sort of scheduler the threading branch will use. Both work-stealing and more loop-centric approaches have merit. I'm still pondering hybrids that the parallel run-time teams (OpenMP, TBB, Cilk Plus) have kicked around.
@tknopp The threading branch was using one pool per thread as far as I could tell. The "TLS" was rolled by hand via an array indexed by thread index.
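For illustration, that hand-rolled "TLS" amounts to something like the following (names and sizes are placeholders, not the actual ones in gc.c):

```c
#include <stdint.h>

#define MAX_THREADS 16   /* placeholder */
#define N_SZCLASSES 42   /* placeholder for the number of pool size classes */

typedef struct {
    void *freelist;
    uint16_t osize;
    /* ... */
} pool_t;

/* One row of pool descriptors per thread, indexed by a runtime thread id. */
static pool_t thread_pools[MAX_THREADS][N_SZCLASSES];

extern int current_thread_index(void);  /* hypothetical accessor */

static inline pool_t *get_pool(int szclass)
{
    return &thread_pools[current_thread_index()][szclass];
}
```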
@ArchRobison: Re thread scheduler: I've begun to use Qthreads https://github.com/Qthreads/qthreads in another project recently. The main difference between Qthreads and, well, almost everything else (OpenMP, TBB, Cilk Plus) is that Qthreads doesn't introduce dependencies between threads. In the other threading architectures, threads can only wait for their children, or a core running a particular thread can only switch to running a child, or threads cannot easily switch between cores. This usually has to do with stack layout. Qthreads introduces a separate stack for each thread, leading to its flexibility.
Qthreads also introduces the notion of a "shepherd", probably corresponding to NUMA domains. Threads can move between shepherds, but that happens more rarely than moving between cores that are managed by the same shepherd.
Anyway, since Qthreads seems to be more powerful than most other systems I thought I'd mention it here. Qthreads is used e.g. for Chapel.
See the thread safety tracker in #10421
It's good to bring up Qthreads. The one-stack-per-OS-thread model has surely done a lot of damage to enabling widespread use of parallelism. Indeed, Cilk was an early system to break the one-linear-stack-per-OS-thread association. It used heap-allocated procedure frames to simulate a cactus stack. Regrettably, that didn't play well in the marketplace since it required a different calling convention, and so the Intel version of Cilk worked out a trick to map a cactus stack onto multiple linear stacks.
The dependencies between parent and child tasks are a two-edged sword. For certain common patterns, they enable very efficient implementations with provable space guarantees, specifically O(S*P) space on P threads if the program takes O(S) space on one thread. The bound sounds trivial, but it's easy to show space blowup in less constrained schedulers. On the other hand, the dependencies are annoying in other contexts.
I think the overall lesson is to give careful consideration to the options while the calling convention and threading model have some flexibility.
@carnaval: Near here, it looks like there may be a copy-paste error:

```c
if (regions_lb[region_i] < i)
    regions_lb[region_i] = i;
if (regions_ub[region_i] < i)
    regions_ub[region_i] = i;
```

Shouldn't the first comparison be `>` instead of `<`?
I don't think so. The naming may be a bit misleading, but those are not the lower/upper bound of the same quantity. We store a lower bound on the first free page and an upper bound on the last non-free page. On allocation, the first one must grow up to where we are allocating, since every page before it is in use. The second one must also grow if we are allocating outside the previous bound. Did you encounter a related bug? If this logic is wrong we are bound to get either slower allocation or memory errors.
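Restating that invariant as comments on the snippet in question (a sketch of the surrounding logic, not the exact code in gc.c):

```c
/* regions_lb[r]: lower bound on the index of the first free page in region r.
 * regions_ub[r]: upper bound on the index of the last page in use.
 * When page i is handed out, every page before i is known to be in use,
 * so both bounds may only move up. */
if (regions_lb[region_i] < i)
    regions_lb[region_i] = i;   /* the first free page is now at least i */
if (regions_ub[region_i] < i)
    regions_ub[region_i] = i;   /* the last page in use is now at least i */
```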
Thanks for the clarifications.
As of 2fc8d97af4f6cb42a50aed0608f49fdbcbfd9e41 on the jn/threading branch, this is now basically accomplished.
can this be closed?
We should have at least one travis target that builds with multi-threading enabled and runs known passing tests. I don't think this should hold up closing of this particular issue, but it would be nice to have it enabled.
Same as with #9336, it's rather difficult to build a newer LLVM on Travis right now. We have to decide exactly which branch or release will be used, then package it.
I guess we can do this once we move to our own branch of llvm 3.7, and use the same branch for threading, gallium, and everything else. Perhaps this is not the right place to discuss it.
We've got a Travis Linux build with threading on passing =) https://travis-ci.org/JuliaLang/julia/builds/96090255
:clap:
That's what you said above should be the close criterion for this issue (we have newer ones open tracking this better anyway).
Err, the branch tested is based on https://github.com/JuliaLang/julia/pull/14190
Bravo, @yuyichao. Still, it seems premature to close since this hasn't been merged.
This issue is for discussing how to make the new GC (#8699) thread-safe. A GC expert should check the arguments here. For background, see the elegant state transition diagram in `src/gc.c`.

I'm assuming that garbage collection operates in stop-the-world mode, and, for now, that the mark/sweep phase is single-threaded. Each thread can have its own thread-local `remset`, with occasional duplication across the sets because of the race described below.

Key question: what state transitions can race against each other? If I understand the GC correctly, the only racy transitions arise from write barriers on the same parent, and that transition is `GC_MARKED --> GC_QUEUED`, which changes the low two bits of the memory location from 01 to 10. If this is indeed the only racy transition, then a plain load, modify, plain store sequence will suffice, since all racing participants will be attempting to write the same value to the location. To be portable C11, we should use relaxed atomic loads/stores instead of the current assignments into bitfields, but that's a purist nitpick.

The possibility of a race will require lightening up on assertions that check for duplicate enqueuing of roots. There will need to be a `memory_order_acq_rel` fence between normal operation and garbage collection, but that's probably already implied by the synchronization between normal operation and garbage collection.

Possible problem: `gc_wb_buf` appears to be engaging in other transitions. If so, which of these might race against each other? Are these transitions always on different `gc_bits` than the ones affected by `gc_wb`?
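A hypothetical illustration of the relaxed load/modify/store suggested above for the racy `GC_MARKED --> GC_QUEUED` transition (the field layout and names are stand-ins, not the actual gc.c bitfields): since every racing thread writes the same value, no read-modify-write atomic is needed.

```c
#include <stdatomic.h>
#include <stdint.h>

#define GC_MARKED 1  /* low two bits 01 */
#define GC_QUEUED 2  /* low two bits 10 */

/* Write-barrier slow path for a marked parent: flip 01 -> 10 with relaxed
 * atomics.  Racing threads may all perform the store, but they all store
 * the same bits, and the parent may be pushed onto more than one remset;
 * the collector must tolerate such duplicates. */
static inline void gc_queue_parent(_Atomic(uintptr_t) *gc_bits_word)
{
    uintptr_t bits = atomic_load_explicit(gc_bits_word, memory_order_relaxed);
    if ((bits & 3) == GC_MARKED) {
        bits = (bits & ~(uintptr_t)3) | GC_QUEUED;
        atomic_store_explicit(gc_bits_word, bits, memory_order_relaxed);
        /* ...push the parent onto this thread's remset here... */
    }
}
```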