shwestrick closed this 3 months ago
Very interesting.
I was going to observe that at the end of ClosureConversion (and conversion to SSA), the environment record of a function is a single value. For space safety, that environment record is immediately unpacked upon entry to the function (to ensure that components of the environment that are only required on one code path are not unnecessarily kept live when control goes down another code path). What typically happens is that MLton/MPL does argument flattening (http://mlton.org/Flatten) when the tuple is explicitly available. I think the main motivation for tuple flattening is that user-level SML functions that are uncurried (say, taking an `int * int`) are more efficiently implemented by passing the components as two arguments, rather than constructing a tuple, especially when all components of the tuple are required along all code paths. But it can work against efficiency when a large environment record is flattened, especially since the Flatten pass does not take control flow into account.
In this commit, the `ref` ensures that the tuple is not flattened.
This makes sense! The interaction with flattening and safe-for-space closure conversion is really interesting. It gets me wondering if there's an opportunity here for more refined closure conversion together with data flattening, to generate more efficient code in general... but that's a question for another time.
Problem
We noticed previously that functions which call `ForkJoin.par` effectively take many more arguments than expected. For example, compiling this simple definition of a parallel fib... ...results in this RSSA:
Here, we see an RSSA-level function called `fib_0`, which (as you might hope) takes exactly two arguments: an input `n` (RSSA variable `x_4373`) and an environment `env_8` containing the necessary data to support the call to `ForkJoin.par`.
However, upon entry, `fib_0` immediately unpacks approximately 20 components of the closure into temporaries.
This inefficiency is carried through into codegen, and results in significantly more instructions on the hot path.
Diagnosis
Why is this happening?
In short, because `ForkJoin.par` closes over many (many!) components of the scheduler which are each used differently. Some are used on the fast path, others only on the slow path, and therefore MPL is forced to split the environment into temporaries and handle each temporary separately.
It is helpful to consider the code for `ForkJoin.par`, which is implemented by `greedyWorkAmortizedFork`. This code calls a number of functions defined elsewhere, such as `maybeSpawnFunc`, `syncEndAtomic`, etc., each of which has its own associated closure within `env_8` of the generated RSSA code above, but MPL can't statically prove that these closures all have the same lifetimes.
Solution
This patch creates a single "scheduler package" which manually closes over all of the data that `ForkJoin.par` needs to execute; we then always access this data explicitly through the scheduler package, making it easy for MPL to prove that all of this data has the same lifetime.
The advantage is immediately clear in the generated code. This removes approximately 20 instructions from the fast path.
The performance improvement is big! Nearly 2x on parallel fib.
I've similarly measured approximately 60% improvement on `linefit-ng` and 50% improvement on `wc-ng`.
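For readers who want a concrete picture of the "scheduler package" idea from the Solution section, here is a minimal source-level sketch in plain Standard ML. It is not the code from this patch: the record type `sched_package`, its fields, `makePackage`, and the sequential `par` are all hypothetical stand-ins, and placing the package behind a single `ref` is an assumption based on the comment about the `ref` preventing flattening. It only shows the shape of the change, where one bundled value is captured and every scheduler component is reached through it.

```sml
(* Hypothetical sketch: every name here is a stand-in, not MPL's scheduler. *)

(* Everything the fork path needs, bundled into one record. *)
type sched_package =
  { spawnCount : int ref        (* stand-in for scheduler state *)
  , maybeSpawn : unit -> bool   (* stand-in for a slow-path helper *)
  , syncEnd    : unit -> unit } (* stand-in for a join helper *)

fun makePackage () : sched_package =
  let
    val spawnCount = ref 0
    (* Counts the request, then declines to spawn (sequential model). *)
    fun maybeSpawn () = (spawnCount := !spawnCount + 1; false)
    fun syncEnd () = ()
  in
    {spawnCount = spawnCount, maybeSpawn = maybeSpawn, syncEnd = syncEnd}
  end

(* Assumption: the package sits behind one ref, so `par` refers to a single
   value instead of many separately captured components. *)
val package : sched_package ref = ref (makePackage ())

(* Sequential stand-in for a fork-join primitive. All scheduler data is
   reached explicitly through the one package value. *)
fun par (f : unit -> 'a, g : unit -> 'b) : 'a * 'b =
  let
    val pkg = !package
  in
    if #maybeSpawn pkg () then
      raise Fail "parallel path is not modelled in this sketch"
    else
      let
        val a = f ()
        val b = g ()
      in
        #syncEnd pkg ();
        (a, b)
      end
  end

fun fib n =
  if n < 2 then n
  else
    let
      val (a, b) = par (fn () => fib (n - 1), fn () => fib (n - 2))
    in
      a + b
    end

val result = fib 10  (* 55 *)
```

Whether a bundling like this actually changes what the compiler unpacks on function entry depends on the optimizer passes discussed above; the sketch makes no claim about the generated RSSA.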