MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming

Implement `structure Universal` with an applicative exception declaration #184

Closed. MatthewFluet closed this 1 month ago.

MatthewFluet commented 1 month ago

@shwestrick @colin-mcd @mikerainey

A simple approach to avoiding the unit ref allocation with structure Universal. Read a4eb2010b2188b6d90e830232a230e609269fc4b's commit message for the interesting details.
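
For context, the classic universal-type-via-exceptions idiom looks roughly like the sketch below; the names are illustrative and this is not necessarily the exact code touched by this PR (see the commit message for the real details). The locally declared, generative exception is what ordinarily costs a fresh tag allocation (the unit ref) on every call to embed, and an applicative exception declaration avoids that per-call allocation.

```sml
(* Minimal sketch of the usual "universal type via exceptions" idiom.
   Illustrative only; not necessarily MPL's exact Universal structure. *)
signature UNIVERSAL =
sig
  type t
  val embed : unit -> ('a -> t) * (t -> 'a option)
end

structure Universal :> UNIVERSAL =
struct
  type t = exn

  (* Each call to embed evaluates a generative exception declaration,
     which is where the per-call tag (unit ref) allocation comes from. *)
  fun 'a embed () =
    let
      exception E of 'a
      fun project (E x) = SOME x
        | project _ = NONE
    in
      (E, project)
    end
end
```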

It will be interesting to see if this has any impact on performance. A unit ref allocation ought to be just two instructions: `*frontier = header(unit ref); frontier += 8;`. And, I guess, there would be a third instruction to write the object pointer of the freshly allocated unit ref to the stack, so that it is live across the pcall and accessible in the par and spwn continuations.

If it doesn't have any impact on performance, then I wonder whether avoiding the Universal.t entirely and just doing the `val rightSideResult : 'b option ref` allocation in pcallFork is really as bad as feared. (We would still allocate the rest of the join point in the signal handler and pass it to the par and spwn continuations via the data pointer; then we can just pass the rightSideResult, along with the jp fetched via getData, to the real synchronization operation. Or, maybe it is simpler for the signal handler to allocate a pre_joinpoint (monomorphic, with no rightSideResult field), which is propagated to the par and spwn continuations, which combine the pre_joinpoint with the rightSideResult to yield a 'b joinpoint.)
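
To make the shape of that last alternative concrete, here is a hypothetical sketch; the type and field names are invented for illustration and are not MPL's actual scheduler types.

```sml
(* Hypothetical sketch only: names are invented for illustration. *)
type pre_joinpoint =
  {incounter: int ref}   (* monomorphic part, allocated by the signal handler *)

type 'b joinpoint =
  {pre: pre_joinpoint, rightSideResult: 'b option ref}

(* The par/spwn continuations combine the pre_joinpoint fetched via the
   data pointer with the rightSideResult allocated in pcallFork. *)
fun mkJoinpoint (pre: pre_joinpoint, rightSideResult: 'b option ref) : 'b joinpoint =
  {pre = pre, rightSideResult = rightSideResult}
```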

shwestrick commented 1 month ago

Beautiful! I'll look into testing performance ASAP.

shwestrick commented 1 month ago

@MatthewFluet there are a couple of missing files:

mlton/atoms/exn-dec-elab.fun
mlton/atoms/exn-dec-elab.sig
MatthewFluet commented 1 month ago

@shwestrick Oops, forgot to add the new files; force-pushed with the correct files.

shwestrick commented 1 month ago

I did a little testing this morning. The largest performance improvement I've found so far is 5-10% on fully parallel fib. On other benchmarks, I'm seeing 0-5% improvement. So, the impact seems fairly small, but measurable.

I looked at the generated code for fib, and confirmed that this change indeed removes a heap allocation at each ForkJoin.par. So, that's good!
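
For reference, the fully parallel fib used here is essentially the textbook version written directly against ForkJoin.par, with no granularity control, along these lines (the benchmark in the repo may differ slightly):

```sml
(* Fully parallel fib: every recursive call goes through ForkJoin.par,
   so the per-par overhead dominates the running time. *)
fun fib n =
  if n < 2 then n
  else
    let
      val (a, b) = ForkJoin.par (fn () => fib (n - 1), fn () => fib (n - 2))
    in
      a + b
    end
```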

Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple closures at each par. If I implement fully parallel fib just in terms of pcall directly, these additional heap allocations go away, and I get a ~50% performance improvement. So, there's something funky going on here... still looking into it. 🤔

Fully parallel fib(40) timings:

| procs | MPL-v05 | This PR | MPL-v05 + pcall directly | This PR + pcall directly |
| ----: | ------: | ------: | -----------------------: | -----------------------: |
| 1 | 5.44 | 5.23 | 3.79 | 3.57 |
| 72 | 0.1057 | 0.0982 | 0.0732 | 0.0676 |

By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small.

Anyway, a quick summary of my thoughts:

  • This PR is clearly good for performance, with a small but measurable performance improvement due to eliminating one heap allocation along the fast path.
  • There are other overheads along the fast path that still need a closer look.

MatthewFluet commented 1 month ago

> By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small.

That's a very good point. And it would make a big difference on fib, where there ought to be no allocation from the fib computation itself.

> Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple closures at each par.

I looked at the SSA code and confirmed that the Ref_ref[unit] was removed. But, I noticed that it was coming immediately after a GC_currentSpareHeartbeats C call; in terms of raw instructions, I'm sure that the C call (as trivial as it may be) is more than the Ref_ref[unit] allocation.

I didn't look as deeply as you; those other heap allocations would be worth investigating.

> If I implement fully parallel fib just in terms of pcall directly, these additional heap allocations go away, and I get a ~50% performance improvement. So, there's something funky going on here... still looking into it. 🤔

That's pretty impressive!

> • This PR is clearly good for performance, with a small but measurable performance improvement due to eliminating one heap allocation along the fast path.
> • There are other overheads along the fast path that still need a closer look.

Agreed!

shwestrick commented 1 month ago

> I noticed that it was coming immediately after a GC_currentSpareHeartbeats C call; in terms of raw instructions, I'm sure that the C call (as trivial as it may be) is more than the Ref_ref[unit] allocation.

Yeah, for the overhead of currentSpareHeartbeats, I was thinking it might be worth turning it into a _prim. We could then implement it by either (a) reading directly from the GCState, or (b) caching it, similar to StackTop, Frontier, etc.
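
To make that concrete: today the count is fetched with a C call into the runtime, roughly like the `_import` sketched below, and the idea is to let the codegen inline the read instead. The declaration here is an assumption for illustration (in particular the result type is a guess), not MPL's actual code.

```sml
(* Hypothetical sketch, not MPL's actual declaration: an FFI import along
   these lines forces an out-of-line C call on every query. *)
val currentSpareHeartbeats =
  _import "GC_currentSpareHeartbeats" runtime private: GCState.t -> Word32.word;

(* As a _prim, the codegen could instead load the counter directly from the
   GC state, or from a cached copy (as is already done for StackTop and
   Frontier), eliminating the call overhead on the fast path. *)
```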

I'll go ahead and merge this PR and then we can investigate these other overheads separately!