Beautiful! I'll look into testing performance ASAP.
@MatthewFluet there are a couple of missing files:

- `mlton/atoms/exn-dec-elab.fun`
- `mlton/atoms/exn-dec-elab.sig`
@shwestrick Oops, forgot to add the new files. Force-pushed with the correct files.
A little testing this morning. The largest performance improvement I've found so far is 5-10% on fully parallel fib. On other benchmarks, I'm seeing 0-5% improvement. So, the impact seems fairly small, but measurable.
I looked at the generated code for `fib`, and confirmed that this change indeed removes a heap allocation at each `ForkJoin.par`. So, that's good!
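(For context, a minimal sketch of what "fully parallel" fib looks like against MPL's `ForkJoin.par` interface; there is no granularity control, so every call pays the full `par` overhead. The exact benchmark code isn't shown in this thread, so treat the details as an assumption.)

```sml
(* Fully parallel fib: forks at every call, with no sequential cutoff, so
   the per-par fast-path cost dominates the running time. *)
fun fib n =
   if n < 2 then n
   else
      let
         val (a, b) = ForkJoin.par (fn () => fib (n - 1), fn () => fib (n - 2))
      in
         a + b
      end
```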
Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple closures at each `par`. If I implement fully parallel fib just in terms of `pcall` directly, these additional heap allocations go away, and I get a ~50% performance improvement. So, there's something funky going on here... still looking into it. 🤔
Fully parallel fib(40) timings (seconds):

| procs | MPL-v05 | This PR | MPL-v05 + pcall directly | This PR + pcall directly |
|---|---|---|---|---|
| 1 | 5.44 | 5.23 | 3.79 | 3.57 |
| 72 | 0.1057 | 0.0982 | 0.0732 | 0.0676 |
By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small.
> By the way, one impact of eliminating heap allocations is that it reduces the number of LGC garbage collections. On fully par fib(40), I'm getting approximately half as many LGCs (down to 7500 LGCs on a single core on average, instead of 13000). It's possible that the reduction in number of LGCs accounts for the bulk of the performance improvement, because the cost of a heap allocation itself is incredibly small.
That's a very good point. And it would make a big difference on `fib`, where there ought to be no allocation from the `fib` computation itself.
> Looking closely at the generated code, I noticed that there are (two?) other heap allocations along the fast path that I didn't expect. I've done some digging, and these appear to be related to the eager fork path of the token management algorithm... my best guess is that the compiler is deciding to heap-allocate a couple closures at each `par`.
I looked at the SSA code and confirmed that the `Ref_ref[unit]` was removed. But I noticed that it was coming immediately after a `GC_currentSpareHeartbeats` C call; in terms of raw instructions, I'm sure that the C call (as trivial as it may be) costs more than the `Ref_ref[unit]` allocation.
I didn't look as deeply as you; those other heap allocations would be worth investigating.
> If I implement fully parallel fib just in terms of `pcall` directly, these additional heap allocations go away, and I get a ~50% performance improvement. So, there's something funky going on here... still looking into it. 🤔
That's pretty impressive!
> Anyway, a quick summary of my thoughts:
>
> - This PR is clearly good for performance, with a small but measurable performance improvement due to eliminating one heap allocation along the fast path.
> - There are other overheads along the fast path that still need a closer look.
Agreed!
> I noticed that it was coming immediately after a `GC_currentSpareHeartbeats` C call; in terms of raw instructions, I'm sure that the C call (as trivial as it may be) costs more than the `Ref_ref[unit]` allocation.
Yeah, for the overhead of `currentSpareHeartbeats`, I was thinking it might be worth turning it into a `_prim`. We could then implement it by either (a) reading directly from the `GCState`, or (b) caching it, similar to `StackTop`, `Frontier`, etc.
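(For concreteness, a sketch of the difference at the surface level. The attribute list, the `GCState.t` argument, and the result type below are assumptions modeled on how other `GC_*` operations are declared in the basis, not the actual declarations.)

```sml
(* Hypothetical sketch only; names, attributes, and types are assumptions. *)

(* Today: an _import, so every read is a C call into the runtime. *)
val currentSpareHeartbeats =
   _import "GC_currentSpareHeartbeats" runtime private: GCState.t -> Word32.word;

(* Option (a): a _prim that the codegen can lower to a single load from
   the GCState, with no C-call overhead. *)
val currentSpareHeartbeats =
   _prim "GC_currentSpareHeartbeats": GCState.t -> Word32.word;
```

Option (b) would go further and keep the value cached in a temporary between safe points, the way `StackTop` and `Frontier` are handled.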
I'll go ahead and merge this PR and then we can investigate these other overheads separately!
@shwestrick @colin-mcd @mikerainey
A simple approach to avoiding the `unit ref` allocation with `structure Universal`. Read a4eb2010b2188b6d90e830232a230e609269fc4b's commit message for the interesting details.
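(For readers following along: a universal type can be encoded in plain SML with generative exceptions. This is the classic encoding, a sketch of the general technique rather than necessarily what the commit does.)

```sml
(* Classic exception-based universal type: each call to embed generates a
   fresh exception constructor, giving a fresh embedding into type t. *)
signature UNIVERSAL =
   sig
      type t
      val embed: unit -> ('a -> t) * (t -> 'a option)
   end

structure Universal :> UNIVERSAL =
   struct
      type t = exn
      fun 'a embed () =
         let
            exception E of 'a
            fun project (e: t): 'a option =
               case e of E a => SOME a | _ => NONE
         in
            (E, project)
         end
   end
```

A value injected by one embedding can only be projected by that same embedding; any other embedding's projection returns `NONE`.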
It will be interesting to see if this has any impact on performance. A `unit ref` allocation ought to be just two instructions: `*frontier = header(unit ref); frontier += 8;`. And, I guess there would be a third instruction to write the object pointer to the freshly allocated `unit ref` to the stack, in order to be live across the `pcall` and accessible in the `par` and `spwn` continuations.
If it doesn't have any impact on performance, then I wonder whether avoiding the `Universal.t` entirely and just doing the `val rightSideResult = 'b option ref` allocation in `pcallFork` is really as bad as feared. (We would still allocate the rest of the join point in the signal handler, passed to the `par` and `spwn` continuations via the data pointer; then we can just pass the `rightSideResult` along with the `jp` fetched via `getData` to the real synchronization operation. Or, maybe it is simpler for the signal handler to allocate a `pre_joinpoint` (monomorphic, with no `rightSideResult` field), which is propagated to the `par` and `spwn` continuations, who combine the `pre_joinpoint` with the `rightSideResult` to yield a `'b joinpoint`.)
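(A rough sketch of that last variant. The real join point's contents live in the scheduler and are richer than this; `incounter` and `mkJoinpoint` below are invented for illustration.)

```sml
(* Hypothetical sketch: the monomorphic part is allocated by the signal
   handler; the polymorphic rightSideResult is allocated by pcallFork. *)
type pre_joinpoint =
   {incounter: int ref (* ... other monomorphic fields ... *)}

type 'b joinpoint =
   {pre: pre_joinpoint, rightSideResult: 'b option ref}

(* The par/spwn continuations combine the pre_joinpoint (fetched via the
   data pointer) with the rightSideResult to recover a full 'b joinpoint. *)
fun mkJoinpoint (pre: pre_joinpoint, rightSideResult: 'b option ref): 'b joinpoint =
   {pre = pre, rightSideResult = rightSideResult}
```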