MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming

Avoid closure alloc on `par` fast path, fix broken token invariants, and fix subtle GC bug in `GC_HH_forkThread` #195

Closed shwestrick closed 6 days ago

shwestrick commented 1 week ago

This patch attempts a different approach for handling the eager forking optimization, approximately along these lines:

fun par(f, g) =
  let
    fun f'() =
      ( if currentSpareHeartbeatTokens() = 0 then
          ()
        else
          ( Thread.atomicBegin()
          ; if currentSpareHeartbeatTokens() >= 1 then
              tryPromoteNow()
            else
              ()
          ; Thread.atomicEnd()
          )

      ; f()
      )
  in
    pcallFork(f', g)
  end

The idea here is to do the pcall unconditionally, and then (after entry into the callee) check whether or not we can trigger a promotion eagerly. If we can, then due to the token invariants, there's only one possible frame which could be promoted (the immediate ancestor).

To make this as fast as possible, I implemented an optimization within GC_HH_forkThread to promote the youngest promotable frame. This optimization is only valid at tryPromoteNow(), where we know that the youngest promotable frame IS the oldest promotable frame. (This is asserted in the runtime system, so it will be checked automatically on debug builds.) Implementing this optimization required updating PCall_forkThreadAndSetData to keep track of whether or not the optimization should be used. I added another prim, PCall_forkThreadAndSetData_youngest, which immediately elaborates into the generalized prim.
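Roughly, the specialized path behaves like the following sketch. This is a toy model only: the helpers are hypothetical stand-ins, and the real logic lives in the C runtime inside GC_HH_forkThread.

```sml
(* Sketch only: the "youngest" fast path, with hypothetical helpers.
 * Valid only when the youngest promotable frame is also the oldest,
 * which is exactly the situation at tryPromoteNow(). *)
fun forkThreadYoungest (youngestPromotableFrame, oldestPromotableFrame, promote) =
  let
    val frame = youngestPromotableFrame ()
  in
    (* This invariant is asserted in the runtime on debug builds. *)
    if frame = oldestPromotableFrame () then promote frame
    else raise Fail "token invariant violated"
  end
```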

The tryPromoteNow() above is a bit abstracted from the actual implementation. The idea is to try to trigger a promotion, and to silently back out if no ancestor is promotable. The latter can happen due to concurrency with the heartbeat handler: if a heartbeat arrives immediately after the currentSpareHeartbeatTokens() = 0 check, the handler could go ahead and promote before we get to tryPromoteNow(). In that case, the call to tryPromoteNow() silently fails, with no harm, and execution continues safely.
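Abstractly, the back-out behavior can be sketched as follows; youngestPromotableFrame and promote are hypothetical stand-ins for the real runtime calls, not actual MPL APIs:

```sml
(* Sketch only: a toy model of tryPromoteNow's semantics. *)
fun tryPromoteNow (youngestPromotableFrame, promote) =
  case youngestPromotableFrame () of
    NONE =>
      (* A concurrent heartbeat already promoted the only promotable
       * frame; silently back out, with no harm done. *)
      ()
  | SOME frame => promote frame
```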

In other words, this implementation upholds the token invariants, regardless of concurrency with heartbeat handling. It therefore fixes #194.

More efficient compilation, too

A secondary benefit of this implementation is that it results in more efficient compilation along the fast path. This implementation of par allows both the f and g closures to be eliminated, resulting in fewer heap allocations along the fast path.

To see this, consider a simple parallel fib:

fun fib (n: Word64.word) =
  if n < 0w2 then n else
  let val (x, y) = ForkJoin.par (fn _ => fib (n - 0w1), fn _ => fib (n - 0w2))
  in x + y
  end

Prior to this patch, MPL generates the following RSSA-level function:

fun fib_0 (x_3609: Word64, env_8: Objptr (opt_48)): ... = L_1639 ()
  ...
  L_1639 () Jump =
    x_3611: Objptr (opt_47) = OP (env_8, 8): Objptr (opt_47)
    x_3610: Word32 = WordU64_lt (x_3609, global_28)
    switch {test = x_3610,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_5062), (0x1:w32, L_5063))}
  L_5062 () Jump =
    L_1640 ()
  L_5063 () Jump =
    L_1641 ()
  L_1641 () Jump =
    return (x_3609)
  L_1640 () Jump =
    x_8026: Word64 = x_3609
    x_8025: Objptr (opt_48) = env_8
    x_3614: Objptr (opt_49)
      = NormalObject {init = ({offset = 0, src = x_8026},
                              {offset = 8, src = x_8025}),
                      tycon = opt_49}
    x_3613: Word32 = SpareHeartbeatTokens
    x_3612: Word32 = WordU32_lt (x_3613, global_18)
    switch {test = x_3612,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_5060), (0x1:w32, L_5061))}
  L_1643 () Jump =
    PCall leftSide_1 (x_3609, env_8) {cont = L_5058,
                                      parl = L_5059,
                                      parr = L_5057}
  ...

Here, we see that before we get to the PCall, there is one NormalObject allocation. This is a closure for the right-hand side of the call to ForkJoin.par, which is used only along the eager forking path; in particular, this closure is what gets pushed onto the scheduler queue.

With this patch, MPL now generates this RSSA function:

fun fib_0 (x_3765: Word64, env_8: Objptr (opt_63)): ... = L_1721 ()
  ...
  L_1721 () Jump = 
    x_3766: Word32 = WordU64_lt (x_3765, global_28)
    switch {test = x_3766,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_4937), (0x1:w32, L_4938))}
  L_4937 () Jump = 
    L_1722 ()
  L_4938 () Jump = 
    L_1723 ()
  L_1723 () Jump = 
    return (x_3765)
  L_1722 () Jump = 
    PCall leftSide_1 (env_8, x_3765) {cont = L_4935,
                                      parl = L_4936,
                                      parr = L_4934}
  ...

We can see that this closure allocation has been eliminated. The performance advantage is significant: on a single core, roughly a 1.27x speedup (2.27s → 1.79s in the timings below).

# ========= BEFORE ===========
$ bin/fib @mpl procs 1 -- -N 40
fib 40
finished in 2.2737s
result 102334155
$ bin/fib @mpl procs 8 set-affinity -- -N 40
fib 40
finished in 0.3574s
result 102334155

# ========= AFTER ===========
$ bin/fib @mpl procs 1 -- -N 40
fib 40
finished in 1.7898s
result 102334155
$ bin/fib @mpl procs 8 set-affinity -- -N 40
fib 40
finished in 0.2687s
result 102334155

Before I merge this, I need to do more performance testing. IIRC, this patch may have a negative impact on benchmarks that are span-limited at high core counts, in particular because the raw cost of an eager fork under this approach is higher than it was previously. We will see.

shwestrick commented 6 days ago

I've done a bit more testing and fixed some bugs (including a particularly nasty GC bug in https://github.com/MPLLang/mpl/pull/195/commits/018a0096e2901b63a56a58604aed6a2e5a4d4f43).

I've found a couple of examples of benchmarks that get slightly slower with this PR, including delaunay, as expected, but only by a small amount: maybe 10%. After playing with delaunay a bit, I've found that most of the performance loss can be regained by tuning algorithmic parameters of the benchmark. This is evidence that the original benchmark was overfit to the previous MPL runtime and scheduler.

IMO, the performance gain on many of the "no grain" benchmarks easily outweighs the small loss on `delaunay`. Also, performance aside, the previous implementation of the scheduler was simply broken, and this patch fixes that. So, I'm going to go ahead and merge.

There are still plenty of opportunities for further performance improvement, and we can investigate that moving forward.