MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming

optimize `par` fast path by packaging all scheduler data into single closure #193

Closed shwestrick closed 3 months ago

shwestrick commented 3 months ago

Problem

We noticed previously that functions which call ForkJoin.par effectively take many more arguments than expected. For example, compiling this simple definition of a parallel fib...

fun fib (n: Word64.word) =
  if n < 0w2 then n else
  let val (x, y) = ForkJoin.par (fn _ => fib (n - 0w1), fn _ => fib (n - 0w2))
  in x + y
  end

... results in this RSSA:

fun fib_0 (x_4373: Word64, env_8: Objptr (opt_57)) = L_2037 ()
  ...
  L_2037 () Jump = 
    x_4394: Word32 = OW32 (env_8, 12): Word32
    x_4393: Objptr (opt_43) = OP (env_8, 168): Objptr (opt_43)
    x_4392: Objptr (opt_50) = OP (env_8, 160): Objptr (opt_50)
    x_4391: Objptr (opt_51) = OP (env_8, 152): Objptr (opt_51)
    x_4390: Objptr (opt_42) = OP (env_8, 144): Objptr (opt_42)
    x_4389: Objptr (opt_28) = OP (env_8, 136): Objptr (opt_28)
    x_4388: Objptr (opt_38) = OP (env_8, 128): Objptr (opt_38)
    x_4387: Objptr (opt_30) = OP (env_8, 120): Objptr (opt_30)
    x_4386: Objptr (opt_39) = OP (env_8, 104): Objptr (opt_39)
    x_4385: Objptr (opt_46) = OP (env_8, 96): Objptr (opt_46)
    x_4384: Objptr (opt_25) = OP (env_8, 88): Objptr (opt_25)
    x_4383: Objptr (opt_49) = OP (env_8, 80): Objptr (opt_49)
    x_4382: Objptr (opt_40) = OP (env_8, 72): Objptr (opt_40)
    x_4381: Objptr (opt_40) = OP (env_8, 64): Objptr (opt_40)
    x_4380: Objptr (opt_41) = OP (env_8, 56): Objptr (opt_41)
    x_4379: Objptr (opt_25) = OP (env_8, 48): Objptr (opt_25)
    x_4378: Objptr (opt_56) = OP (env_8, 40): Objptr (opt_56)
    x_4377: Objptr (opt_25) = OP (env_8, 32): Objptr (opt_25)
    x_4376: Objptr (opt_25) = OP (env_8, 24): Objptr (opt_25)
    x_4375: Objptr (opt_25) = OP (env_8, 16): Objptr (opt_25)
    x_4374: Word32 = WordU64_lt (x_4373, global_28)
    switch {test = x_4374,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_5967), (0x1:w32, L_5968))}
  ...

Here, we see an RSSA-level function called fib_0, which (as you might hope) takes exactly two arguments: an input n (RSSA variable x_4373) and an environment env_8 containing the necessary data to support the call to ForkJoin.par.

However, upon entry, fib_0 immediately unpacks approximately 20 components of the closure into temporaries.

This inefficiency is carried through into codegen, and results in significantly more instructions on the hot path.

Diagnosis

Why is this happening?

In short, because ForkJoin.par closes over many (many!) components of the scheduler, each of which is used differently. Some are used on the fast path, others only on the slow path, so MPL is forced to split the environment into temporaries and handle each temporary separately.

It is helpful to consider the code for ForkJoin.par, which is implemented by greedyWorkAmortizedFork. This code calls a number of functions defined elsewhere, such as maybeSpawnFunc, syncEndAtomic, etc., each of which has its own associated closure within env_8 in the generated RSSA code above, and MPL can't statically prove that these closures all have the same lifetime.

Solution

This patch creates a single "scheduler package" which manually closes over all of the data that ForkJoin.par needs to execute; we then always access this data explicitly through the scheduler package, making it easy for MPL to prove that all of this data has the same lifetime.
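To sketch the idea (illustrative only; the record shape and field set here are assumptions, though maybeSpawnFunc and syncEndAtomic are named above, and the real package in the patch contains more fields):

```sml
(* Hypothetical "scheduler package": one record bundling everything
 * that par needs, so par's closure captures a single object instead
 * of ~20 separate scheduler components. *)
type sched_package =
  { maybeSpawnFunc : (unit -> unit) -> bool
  , syncEndAtomic : unit -> unit
  }

(* All scheduler data is reached through the single pkg pointer. *)
fun par (pkg : sched_package) (f, g) =
  let
    val _ = #maybeSpawnFunc pkg (fn () => ignore (g ()))
    val x = f ()
    val y = g ()  (* sketch only: the real code joins with the
                     spawned task rather than re-running g *)
    val _ = #syncEndAtomic pkg ()
  in
    (x, y)
  end
```

Because every access goes through pkg, all of the components trivially share one lifetime, which is exactly what MPL couldn't prove before.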

The advantage is immediately clear in the generated code: approximately 20 instructions are removed from the fast path.

fun fib_0 (x_3706: Word64, env_8: Objptr (opt_48)) = L_1710 ()
  ...
  L_1710 () Jump = 
    x_3708: Objptr (opt_47) = OP (env_8, 8): Objptr (opt_47)
    x_3707: Word32 = WordU64_lt (x_3706, global_28)
    switch {test = x_3707,
            default = None,
            expect = None,
            cases = ((0x0:w32, L_5058), (0x1:w32, L_5059))}

The performance improvement is big! Nearly 2x on parallel fib.

# ======== BEFORE ========
$ ./fib @mpl procs 1 -- -N 40
fib 40
finished in 4.4652s
result 102334155

# ======== AFTER ========
$ ./fib @mpl procs 1 -- -N 40
fib 40
finished in 2.3381s
result 102334155

I've similarly measured approximately 60% improvement on linefit-ng and 50% improvement on wc-ng.

MatthewFluet commented 3 months ago

Very interesting.

I was going to observe that, at the end of ClosureConversion (and conversion to SSA), the environment record of a function is a single value. For space safety, that environment record is immediately unpacked upon entry to the function (to ensure that components of the environment that are only required on one code path are not unnecessarily kept live when control goes down another code path). What typically happens is that MLton/MPL does argument flattening (http://mlton.org/Flatten) when the tuple is explicitly available. I think the main motivation for tuple flattening is that user-level SML functions that are uncurried (say, taking an int * int) are implemented more efficiently by passing the components as two arguments, rather than constructing a tuple, especially when all components of the tuple are required along all code paths. But it can work against efficiency when a large environment record is flattened, especially since the Flatten pass does not take control flow into account.
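As a toy illustration of that tradeoff (not from the compiler sources):

```sml
(* An uncurried function over a small tuple. *)
fun add (x : int, y : int) = x + y

(* With argument flattening, a call like add (a, b) passes a and b
 * directly in two argument slots and never allocates the tuple --
 * a clear win here, since both components are needed on every path.
 * The same transformation applied to a ~20-field environment record
 * yields the ~20 loads seen in the RSSA dump above, because Flatten
 * does not consider which fields each code path actually needs. *)
val five = add (2, 3)
```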

In this commit, the ref ensures that the tuple is not flattened.
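A minimal sketch of that trick, with hypothetical field names and placeholder values (the actual binding lives in the MPL scheduler sources):

```sml
(* Wrapping the package in a ref keeps it as one heap object:
 * Flatten only flattens tuples that are explicitly available, and
 * the record behind the ref is not. *)
val schedPackage : { maybeSpawnFunc : (unit -> unit) -> bool
                   , syncEndAtomic : unit -> unit } ref =
  ref { maybeSpawnFunc = fn _ => false  (* placeholder values *)
      , syncEndAtomic = fn () => () }

fun par (f, g) =
  let
    val pkg = !schedPackage  (* a single load on the fast path *)
  in
    (f (), g ())             (* sketch: the real par consults pkg *)
  end
```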

shwestrick commented 3 months ago

This makes sense! The interaction between flattening and safe-for-space closure conversion is really interesting. It gets me wondering if there's an opportunity here for more refined closure conversion together with data flattening, to generate more efficient code in general... but that's a question for another time.