Closed shwestrick closed 6 days ago
I've done a bit more testing and fixed some bugs (including a particularly nasty GC bug in https://github.com/MPLLang/mpl/pull/195/commits/018a0096e2901b63a56a58604aed6a2e5a4d4f43)
I've found a couple examples of benchmarks that get slightly slower with this PR, including `delaunay`, as expected, but only by a small amount: maybe 10%. After playing with `delaunay` a bit, I've found that most of the performance loss can be regained by tuning algorithmic parameters of the benchmark. This is evidence that the original benchmark was overfit for the previous MPL runtime and scheduler.
IMO, the performance gain on many of the "no grain" benchmarks easily outweighs the small loss on `delaunay`. Also, performance aside, the previous implementation of the scheduler was simply broken. This patch fixes that. So, I'm going to go ahead and merge.
There are still plenty of opportunities for further performance improvement, and we can investigate that moving forward.
This patch attempts a different approach for handling the eager forking optimization, approximately along these lines:
The idea here is to do the `pcall` unconditionally, and then (after entry into the callee) check whether or not we can trigger a promotion eagerly. If we can, then due to the token invariants, there's only one possible frame which could be promoted (the immediate ancestor).
To make this as fast as possible, I implemented an optimization within `GC_HH_forkThread` to promote the youngest promotable frame. This optimization is only valid at `tryPromoteNow()`, where we know that the youngest frame IS the oldest frame. (This is asserted in the runtime system, so it will be checked automatically on debug builds.) Implementing this optimization required updating the `PCall_forkThreadAndSetData` prim to keep track of whether or not the optimization should be used. I added another prim, `PCall_forkThreadAndSetData_youngest`, which immediately elaborates into the generalized prim.
The `tryPromoteNow()` above is a bit abstracted from the actual implementation. The idea is to try to trigger a promotion, and silently back out if no ancestor is promotable. And, it turns out the latter behavior is possible, due to concurrency with the heartbeat handler: if a heartbeat arrives immediately after the `currentSpareHeartbeatTokens() = 0` check, the handler could go ahead and promote before we get to `tryPromoteNow()`, in which case the call to `tryPromoteNow()` will silently fail, with no harm, and the execution continues safely.
In other words, this implementation upholds the token invariants, regardless of concurrency with heartbeat handling. It therefore fixes #194.
More efficient compilation, too
A secondary benefit of this implementation is that it results in more efficient compilation along the fast path. This implementation of `par` allows for both of the `f` and `g` closures to be eliminated, resulting in fewer heap allocations along the fast path.
To see this, consider a simple parallel fib:
Prior to this patch, MPL generates the following RSSA-level function:
Here, we see that before we get to the PCall, there is one `NormalObject` allocation; this is a closure for the right-hand side of the call to `ForkJoin.par`, which is used only along the eager forking path (here); in particular, this closure is pushed onto the scheduler queue.
With this patch, MPL now generates this RSSA function:
We can see that this closure allocation has been eliminated. The performance advantage on a single core is significant: approximately 30%.
Before I merge this, I need to do more performance testing. IIRC, this patch may have a negative impact on benchmarks that are span-limited on high core counts, in particular because the raw cost of eager forking under this approach is higher than it was previously. We will see.