MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming
Other
329 stars 18 forks source link

Scheduler bug: broken heartbeat token invariant #194

Closed shwestrick closed 3 months ago

shwestrick commented 3 months ago

At each ForkJoin.par, the scheduler has an optimization: if there is at least one spare heartbeat token, then the par is performed eagerly (instead of delayed with a pcall). This is really important for practical performance, as we measured in the APM paper.

fun par(f, g) =
  if currentSpareHeartbeatTokens() = 0 then
    pcall(...)
  else
    eager_par(f, g)

Invariant

To respect the outermost-first promotion policy, we need the following invariant: If the current thread has at least one spare token, then no ancestors are promotable.

The signal handler respects this invariant: it provides the current thread with more tokens, but then immediately attempts to spend as many tokens as possible, restoring the invariant before the handler returns.

Bug breakdown

In the implementation of par above, there is a subtle bug stemming from the lack of atomicity between the call to currentSpareHeartbeatTokens() and the code that executes immediately afterwards.

Suppose a heartbeat arrives immediately after the check, here:

fun par (f, g) =
  if currentSpareHeartbeatTokens() = 0 then
    (* ======== HEARTBEAT ARRIVES ======== *)
    pcall(...)
  else
    eager_par(f, g)

If the heartbeat leaves behind at least one spare token, then the invariant will be violated upon entry into pcall(...).

Note that this is not a case of "true concurrency"; the generated code will only enter a signal handler at a safe point, and whether or not the safe point between the check and the pcall is chosen for a signal check will depend on optimizations and heuristics in the compilation pipeline.

Possible solutions

  1. Prevent the compiler from generating a signal check on the transfer coming out of if currentSpareHeartbeatTokens() = 0 .... Implementing this would likely be pretty hacky.

  2. Wrap the appropriate code section in atomicBegin(); ...; atomicEnd(). This would prevent a heartbeat signal, but incurs a small dynamic cost, to increment and decrement the atomic state.

  3. Unconditionally do a pcall, and then do the token check immediately inside the callee; if there are sufficient tokens, immediately promote the pcall.

    • I've implemented this before, and it has various advantages, but the downside is that it appears to be significantly more expensive than directly doing an eager_par. For benchmarks that are span-limited or close to being span-limited, this can easily swing parallel performance on large core counts by 20% or more.