At each ForkJoin.par, the scheduler has an optimization: if there is at least one spare heartbeat token, then the par is performed eagerly (instead of delayed with a pcall). This is really important for practical performance, as we measured in the APM paper.
fun par(f, g) =
if currentSpareHeartbeatTokens() = 0 then
pcall(...)
else
eager_par(f, g)
Invariant
To respect the outermost-first promotion policy, we need the following invariant:
If the current thread has at least one spare token, then no ancestors are promotable.
The signal handler respects this invariant: it provides the current thread with more tokens, but then immediately attempts to spend as many tokens as possible, restoring the invariant before the handler returns.
Bug breakdown
In the implementation of par above, there is a subtle bug stemming from the lack of atomicity between the call to currentSpareHeartbeatTokens() and the code that executes immediately afterwards.
Suppose a heartbeat arrives immediately after the check, here:
fun par (f, g) =
if currentSpareHeartbeatTokens() = 0 then
(* ======== HEARTBEAT ARRIVES ======== *)
pcall(...)
else
eager_par(f, g)
If the heartbeat leaves behind at least one spare token, then the invariant will be violated upon entry into pcall(...).
Note that this is not a case of "true concurrency"; the generated code will only enter a signal handler at a safe point, and whether or not the safe point between the check and the pcall is chosen for a signal check will depend on optimizations and heuristics in the compilation pipeline.
Possible solutions
Prevent the compiler from generating a signal check on the transfer coming out of if currentSpareHeartbeatTokens() = 0 .... Implementing this would likely be pretty hacky.
Wrap the appropriate code section in atomicBegin(); ...; atomicEnd(). This would prevent a heartbeat signal, but incurs a small dynamic cost, to increment and decrement the atomic state.
Unconditionally do a pcall, and then do the token check immediately inside the callee; if there are sufficient tokens, immediately promote the pcall.
I've implemented this before, and it has various advantages, but the downside is that it appears to be significantly more expensive than directly doing an eager_par. For benchmarks that are span-limited or close to being span-limited, this can easily swing parallel performance on large core counts by 20% or more.
At each
ForkJoin.par
, the scheduler has an optimization: if there is at least one spare heartbeat token, then thepar
is performed eagerly (instead of delayed with apcall
). This is really important for practical performance, as we measured in the APM paper.Invariant
To respect the outermost-first promotion policy, we need the following invariant: If the current thread has at least one spare token, then no ancestors are promotable.
The signal handler respects this invariant: it provides the current thread with more tokens, but then immediately attempts to spend as many tokens as possible, restoring the invariant before the handler returns.
Bug breakdown
In the implementation of
par
above, there is a subtle bug stemming from the lack of atomicity between the call tocurrentSpareHeartbeatTokens()
and the code that executes immediately afterwards.Suppose a heartbeat arrives immediately after the check, here:
If the heartbeat leaves behind at least one spare token, then the invariant will be violated upon entry into
pcall(...)
.Note that this is not a case of "true concurrency"; the generated code will only enter a signal handler at a safe point, and whether or not the safe point between the check and the pcall is chosen for a signal check will depend on optimizations and heuristics in the compilation pipeline.
Possible solutions
Prevent the compiler from generating a signal check on the transfer coming out of
if currentSpareHeartbeatTokens() = 0 ...
. Implementing this would likely be pretty hacky.Wrap the appropriate code section in
atomicBegin(); ...; atomicEnd()
. This would prevent a heartbeat signal, but incurs a small dynamic cost, to increment and decrement the atomic state.Unconditionally do a
pcall
, and then do the token check immediately inside the callee; if there are sufficient tokens, immediately promote thepcall
.eager_par
. For benchmarks that are span-limited or close to being span-limited, this can easily swing parallel performance on large core counts by 20% or more.