MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming

Avoid closure alloc on `par` fast path, fix broken token invariants, and fix subtle GC bug in `GC_HH_forkThread` #195

Closed shwestrick closed 6 days ago

shwestrick commented 1 week ago

This patch attempts a different approach for handling the eager forking optimization, approximately along these lines:

fun par(f, g) =
  let
    fun f'() =
      ( if currentSpareHeartbeatTokens() = 0 then
          ()
        else
          ( Thread.atomicBegin()
          ; if currentSpareHeartbeatTokens() >= 1 then
              tryPromoteNow()
            else
              ()
          ; Thread.atomicEnd()
          )

      ; f()
      )
  in
    pcallFork(f', g)
  end

The idea here is to do the pcall unconditionally, and then (after entry into the callee) check whether or not we can trigger a promotion eagerly. If we can, then due to the token invariants, there's only one possible frame which could be promoted (the immediate ancestor).

To make this as fast as possible, I implemented an optimization within GC_HH_forkThread to promote the youngest promotable frame. This optimization is only valid at tryPromoteNow(), where we know that the youngest promotable frame IS the oldest promotable frame. (This is asserted in the runtime system, so it will be checked automatically on debug builds.) Implementing this optimization required updating PCall_forkThreadAndSetData to keep track of whether or not the optimization should be used. I added another prim, PCall_forkThreadAndSetData_youngest, which immediately elaborates into the generalized prim.
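Roughly, the specialized path behaves like the following sketch. This is a toy model only: the helpers are hypothetical stand-ins, and the real logic lives in the C runtime inside GC_HH_forkThread.

```sml
(* Sketch only: the "youngest" fast path, with hypothetical helpers.
 * Valid only when the youngest promotable frame is also the oldest,
 * which is exactly the situation at tryPromoteNow(). *)
fun forkThreadYoungest (youngestPromotableFrame, oldestPromotableFrame, promote) =
  let
    val frame = youngestPromotableFrame ()
  in
    (* This invariant is asserted in the runtime on debug builds. *)
    if frame = oldestPromotableFrame () then promote frame
    else raise Fail "token invariant violated"
  end
```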

The tryPromoteNow() above is a bit abstracted from the actual implementation. The idea is to try to trigger a promotion, and to silently back out if no ancestor is promotable. The latter can happen due to concurrency with the heartbeat handler: if a heartbeat arrives immediately after the currentSpareHeartbeatTokens() = 0 check, the handler could go ahead and promote before we get to tryPromoteNow(). In that case, the call to tryPromoteNow() silently fails, with no harm, and execution continues safely.
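Abstractly, the back-out behavior can be sketched as follows; youngestPromotableFrame and promote are hypothetical stand-ins for the real runtime calls, not actual MPL APIs:

```sml
(* Sketch only: a toy model of tryPromoteNow's semantics. *)
fun tryPromoteNow (youngestPromotableFrame, promote) =
  case youngestPromotableFrame () of
    NONE =>
      (* A concurrent heartbeat already promoted the only promotable
       * frame; silently back out, with no harm done. *)
      ()
  | SOME frame => promote frame
```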

In other words, this implementation upholds the token invariants, regardless of concurrency with heartbeat handling. It therefore fixes #194.

More efficient compilation, too

A secondary benefit of this implementation is that it results in more efficient compilation along the fast path. This implementation of par allows both the f and g closures to be eliminated, resulting in fewer heap allocations along the fast path.

To see this, consider a simple parallel fib:

fun fib (n: Word64.word) =
  if n < 0w2 then n else
  let val (x, y) = ForkJoin.par (fn _ => fib (n - 0w1), fn _ => fib (n - 0w2))
  in x + y
  end

Prior to this patch, MPL generates the following RSSA-level function:

fun fib_0 (x_3609: Word64, env_8: Objptr (opt_48)): ... = L_1639 ()
  ...
  L_1639 () Jump =
    x_3611: Objptr (opt_47) = OP (env_8, 8): Objptr (opt_47)
    x_3610: Word32 = WordU64_lt (x_3609, global_28)
    switch {test = x_3610,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_5062), (0x1:w32, L_5063))}
  L_5062 () Jump =
    L_1640 ()
  L_5063 () Jump =
    L_1641 ()
  L_1641 () Jump =
    return (x_3609)
  L_1640 () Jump =
    x_8026: Word64 = x_3609
    x_8025: Objptr (opt_48) = env_8
    x_3614: Objptr (opt_49)
      = NormalObject {init = ({offset = 0, src = x_8026},
                              {offset = 8, src = x_8025}),
                      tycon = opt_49}
    x_3613: Word32 = SpareHeartbeatTokens
    x_3612: Word32 = WordU32_lt (x_3613, global_18)
    switch {test = x_3612,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_5060), (0x1:w32, L_5061))}
  L_1643 () Jump =
    PCall leftSide_1 (x_3609, env_8) {cont = L_5058,
                                      parl = L_5059,
                                      parr = L_5057}
  ...

Here, we see that before we get to the PCall, there is one NormalObject allocation. This is a closure for the right-hand side of the call to ForkJoin.par, which is used only along the eager forking path; in particular, this closure is what gets pushed onto the scheduler queue.

With this patch, MPL now generates this RSSA function:

fun fib_0 (x_3765: Word64, env_8: Objptr (opt_63)): ... = L_1721 ()
  ...
  L_1721 () Jump = 
    x_3766: Word32 = WordU64_lt (x_3765, global_28)
    switch {test = x_3766,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_4937), (0x1:w32, L_4938))}
  L_4937 () Jump = 
    L_1722 ()
  L_4938 () Jump = 
    L_1723 ()
  L_1723 () Jump = 
    return (x_3765)
  L_1722 () Jump = 
    PCall leftSide_1 (env_8, x_3765) {cont = L_4935,
                                      parl = L_4936,
                                      parr = L_4934}
  ...

We can see that this closure allocation has been eliminated. The performance advantage is significant: on a single core, roughly a 1.27x speedup (2.27s → 1.79s in the timings below).

# ========= BEFORE ===========
$ bin/fib @mpl procs 1 -- -N 40
fib 40
finished in 2.2737s
result 102334155
$ bin/fib @mpl procs 8 set-affinity -- -N 40
fib 40
finished in 0.3574s
result 102334155

# ========= AFTER ===========
$ bin/fib @mpl procs 1 -- -N 40
fib 40
finished in 1.7898s
result 102334155
$ bin/fib @mpl procs 8 set-affinity -- -N 40
fib 40
finished in 0.2687s
result 102334155

Before I merge this, I need to do more performance testing. IIRC, this patch may have a negative impact on benchmarks that are span-limited at high core counts, in particular because the raw cost of an eager fork under this approach is higher than it was previously. We will see.

shwestrick commented 6 days ago

I've done a bit more testing and fixed some bugs (including a particularly nasty GC bug in https://github.com/MPLLang/mpl/pull/195/commits/018a0096e2901b63a56a58604aed6a2e5a4d4f43).

I've found a couple of examples of benchmarks that get slightly slower with this PR, including delaunay, as expected, but only by a small amount: maybe 10%. After playing with delaunay a bit, I've found that most of the performance loss can be regained by tuning algorithmic parameters of the benchmark. This is evidence that the original benchmark was overfit to the previous MPL runtime and scheduler.

IMO, the performance gain on many of the "no grain" benchmarks easily outweighs the small loss on `delaunay`. Also, performance aside, the previous implementation of the scheduler was simply broken, and this patch fixes that. So, I'm going to go ahead and merge.

There are still plenty of opportunities for further performance improvement, and we can investigate that moving forward.