ocharles opened this issue 8 years ago
It's worth noting that things aren't much better with a static FRP network:
fireworks :: Event Double -> MomentIO (Behavior [SparkData])
fireworks stepPhysics = do
  sparks <- for (replicate 200 ()) $ \_ -> newSpark stepPhysics
  return (sequenceA (map observeSpark sparks))
 where
  observeSpark Spark {..} =
    SparkData <$> particlePosition sparkParticle <*> sparkIntensity
  deleteOne i s = deleteAt i <$ sparkFaded s
This gives:
benchmarking Sparks
time 825.5 ms (718.7 ms .. 917.2 ms)
0.998 R² (0.993 R² .. 1.000 R²)
mean 805.1 ms (783.2 ms .. 816.3 ms)
std dev 18.99 ms (0.0 s .. 19.34 ms)
variance introduced by outliers: 19% (moderately inflated)
9,378,473,584 bytes allocated in the heap
5,185,469,432 bytes copied during GC
13,970,456 bytes maximum residency (246 sample(s))
471,616 bytes maximum slop
41 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17125 colls, 0 par 4.550s 4.558s 0.0003s 0.0015s
Gen 1 246 colls, 0 par 1.909s 1.912s 0.0078s 0.0154s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.900s ( 7.908s elapsed)
GC time 6.458s ( 6.470s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 14.360s ( 14.378s elapsed)
%GC time 45.0% (45.0% elapsed)
Alloc rate 1,187,168,848 bytes per MUT second
Productivity 55.0% of total user, 55.0% of total elapsed
I don't know exactly what is going on here, but I do know that reactive-banana currently uses monad transformers internally, and the last time I worked on performance improvements, it turned out that they were responsible for the performance problems. Essentially, the problem is that many uses of >>= are not inlined (though for good reason), which leads to closures being built on the heap and discarded again, which in turn leads to a high GC load.
I have also thought of a solution, but haven't gotten around to implementing it yet. For the record: the key idea is to remove the EvalP monad layer (and others) as much as possible, and to supply most of the reader/writer/state contained in the monad directly via function arguments to the evaluatePulses function. The key step is a technique called "defunctionalization"; in this case, the step is to replace the higher-order argument to newPulse by an algebraic data type that captures all the possible primitive combinators. In other words, the call to the function _evalP when determining the current value of a pulse node in the graph should be replaced by a pattern match on different node types, essentially moving the formulas for computing node values (e.g. how to compute an fmap, filter, etc.) from the Prim.Combinators module to the Prim.Evaluation module.
In summary: making an algebraic data type for the different node types (e.g. fmap, filter, union, etc.) allows us to get rid of EvalP closures, which should speed things up considerably.
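As a toy illustration of the defunctionalization idea (all names here are invented for the sketch and are not reactive-banana's actual EvalP/PulseCalculation types): instead of each node carrying an opaque evaluation closure, an algebraic data type enumerates the primitive combinators, and a single evaluator pattern-matches on the constructor:

```haskell
{-# LANGUAGE GADTs #-}

-- Before: each node stores an opaque evaluation closure the evaluator
-- can only apply (hypothetical stand-in type).
newtype CalcClosure a = CalcClosure { runClosure :: IO a }

-- After defunctionalization: a data type enumerates the primitive
-- combinators a node can be built from...
data Calc a where
  ConstC :: a -> Calc a
  MapC   :: (b -> a) -> Calc b -> Calc a
  UnionC :: (a -> a -> a) -> Calc a -> Calc a -> Calc a

-- ...and one evaluator pattern-matches instead of entering closures.
evalCalc :: Calc a -> a
evalCalc (ConstC x)     = x
evalCalc (MapC f c)     = f (evalCalc c)
evalCalc (UnionC f l r) = f (evalCalc l) (evalCalc r)

main :: IO ()
main = print (evalCalc (UnionC (+) (MapC (* 2) (ConstC 3)) (ConstC (4 :: Int))))
```

The constructors here (ConstC, MapC, UnionC) are only later mapped back to functions inside evalCalc, which is exactly the "data constructor instead of closure" trade described above.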
Interesting, and that all loosely makes sense.
I'm also surprised to see pqueue allocating so much. I briefly tried replacing it with psqueues, but that requires us to explicitly key every entry, so I didn't have time to explore that more. Do you think it's worth investigating further?
I think https://github.com/ocharles/reactive-banana/commit/37fca51cf6563b44c1aa217e099d95218d387c6c is what you are describing with replacing _evalP, did I understand correctly?
Yes, that's exactly what I meant! (This is called "defunctionalization" because you replace a general closure (EvalP) by a data constructor (PulseCalculation), which is only later mapped to a function/closure.)
Now, it's possible to unwrap the EvalP type in the results of runPulseCalculation and evaluateNode, i.e. to write a new definition

runPulseCalculationNew r calc = unwrapEvalP r $ runPulseCalculationOld calc

and expand the right-hand side of this line. The point is to make sure that every application of >>=, <$>, etc. acts on values of IO type, not on values of EvalP type. This may not seem like much, but the difference between
(\r1 -> x1 r1) >>= (\a r2 -> x2 r2 a)

and

\r -> x1 r >>= \a -> x2 r a

is precisely building a closure on the heap and discarding it again.
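A minimal runnable sketch of this unwrapping, with invented stand-in types (Env, bindE, readEnv are illustrative, not the library's real definitions):

```haskell
-- A minimal ReaderT-over-IO, to show where the per-bind closure comes from.
newtype EvalP a = EvalP { unwrapEvalP :: Env -> IO a }

type Env = Int  -- stand-in for the real evaluation environment

-- The generic bind: the continuation (\a -> ...) is a fresh heap closure
-- on every bind, unless GHC manages to inline the whole chain.
bindE :: EvalP a -> (a -> EvalP b) -> EvalP b
bindE m k = EvalP $ \r -> unwrapEvalP m r >>= \a -> unwrapEvalP (k a) r

readEnv :: EvalP Int
readEnv = EvalP pure

-- Unwrapped version: the environment is applied once, and everything
-- after that is plain IO, so no EvalP closures are built.
unwrapped :: Env -> IO Int
unwrapped r = do
  a <- unwrapEvalP readEnv r
  pure (a * 2)

main :: IO ()
main = do
  print =<< unwrapEvalP (readEnv `bindE` \a -> EvalP (\_ -> pure (a * 2))) 5
  print =<< unwrapped 5
```

Both variants compute the same result; the difference is only in how many intermediate closures the heap sees.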
Note that it may not be necessary to do all of this tedious rewriting by hand. If you look at the generated Core, you can see whether the inliner can already perform it, or at least parts of it. The sign to look for is whether the Core contains calls to the function >>= (bad, because the second argument is a closure), or whether it contains only functions that pass RealWorld tokens around (good).
(Also, it may help to write runPulseCalculation as one big case statement; this way, you don't have to mention the argument r again and again.)
Do use a performance test case to check whether these changes to the code actually make things faster. :-)
Ok, well https://github.com/ocharles/reactive-banana/commit/72ba09bdbacc8e2426d17e1411f34a53eb5e1539 "pushes" EvalP further out, but unfortunately this makes no difference on the benchmark :( I'm just surprised at the amount of allocation, which would be my first port of call. Running with -pA points to

deRefWeaks      Reactive.Banana.Prim.Util      11.2  34.1  2634  4096401360
minViewWithKey  Data.PQueue.Prio.Internals      1.7  12.6   402  1512666920
I successfully swapped out pqueue for psqueues, but it doesn't really make a dent. But why is deRefWeaks allocating so much? Presumably because everything is stored behind a Weak reference, so to do anything we have to perform a lookup on the weak reference.
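For context (a hedged sketch, not the library's code): every access through System.Mem.Weak goes via deRefWeak, which returns a freshly allocated Maybe on each lookup, so a graph traversal pays this cost per node visited:

```haskell
import Data.IORef
import System.Mem.Weak

main :: IO ()
main = do
  ref <- newIORef (42 :: Int)
  wk  <- mkWeakIORef ref (pure ())
  -- each traversal step must deRefWeak and scrutinize the result;
  -- a successful lookup allocates a fresh Just every time
  mref <- deRefWeak wk
  case mref of
    Just r  -> print =<< readIORef r
    Nothing -> putStrLn "already collected"
  writeIORef ref 0  -- a later use keeps ref alive across the lookup above
```

Multiplied by thousands of nodes per tick, these small per-lookup allocations plausibly add up to the numbers in the profile above.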
I think that goes in the right direction, but it's probably necessary to unwrap everything inside the runPulseCalculation function as well. In other words, I think it may be necessary to inline all calls to unwrapEvalP, so that the operators >>=, <*> and so on are only ever applied to values of type IO, never to values of type EvalP.
Whether this has to be done or not can only be judged by inspecting the compiled Core. The Makefile* contains two (phony) targets, profile and core.
• The profile target will compile and run a small example program; I use this for checking whether my changes are making progress or not. Note that one has to use the -auto-all option to get more cost centers, as resources are usually not spent where you think they are.
• The core target will compile the Prim.Evaluation module and output the Haskell Core functions. Inspecting this for let statements with values of type EvalP helps to understand where the issues with the monad transformers are.
*The Makefile assumes that a cabal sandbox is used.
(By the way, I have written a couple of notes on the issue of optimizing monad transformer stacks, which is relevant to what we are trying to do here. It also shows a Haskell Core output example and what to look for.)
Thanks, I do get what's being suggested and I'll dig further. By the way, are you aware that the latest transformers adds significantly more INLINE pragmas now? You can quite often use a concrete stack without any trace of it in the Core - not sure if that happened before or after you wrote those notes.
An idle thought that comes out when I stare at Evaluation, though, is that just planning execution is (at worst?) linear in the number of nodes. In the case of the physics benchmark, everything depends on the tick of a physics clock. Whenever the clock ticks, we replan all the work that needs to be done, building up and tearing down a huge priority search queue. Do you think there is any advantage to having some mutable state somewhere, such that firing add handlers becomes distributing some information rather than building up a data structure?
Alright, I got a bit more organised and took a stab at doing some serious optimisations.
My master branch has all my work so far; here's a comparison. In short, I've:
• Removed the use of vault. Instead, each Pulse now directly has an IORef that contains its value at a particular clock tick. I then clear everything at the end of a tick. There's a substantial amount of work happening building up and tearing down the internal HashMap in vault, so if we can avoid it - we should. The only thing I'm concerned about is event unions, but all tests pass.
• Removed RWSIO - we don't have any state now, so we can just use ReaderWriterIO.
• A lot of INLINE annotations, and now compile with -O2.
• Do not build a queue for evaluation, but just proceed depth-first from the input pulse that's firing. This one is my most recent addition, and again - tests pass, but I'm nervous. Will need to think more about this.
With the "john lato" benchmark, I move from 0.120s total time to 0.046s - with a third of the bytes allocated.
With my own criterion based benchmark, I move from a mean execution time of 6.456 μs to 1.363 μs.
Unfortunately, my particle physics application is still only 70% productive, so I will continue to see what I can do.
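The vault-free storage described above could look roughly like this (a sketch with invented names, not the code on the actual branch):

```haskell
import Data.IORef

-- Hypothetical "IORef per Pulse" scheme: each node directly holds its
-- occurrence for the current clock tick (Nothing = did not fire).
newtype Pulse a = Pulse { pulseValue :: IORef (Maybe a) }

newPulse :: IO (Pulse a)
newPulse = Pulse <$> newIORef Nothing

-- during a tick: write the occurrence directly into the node
firePulse :: Pulse a -> a -> IO ()
firePulse p x = writeIORef (pulseValue p) (Just x)

-- end of tick: reset the node, replacing the "discard the whole Vault"
-- step with an explicit per-node clear
clearPulse :: Pulse a -> IO ()
clearPulse p = writeIORef (pulseValue p) Nothing

main :: IO ()
main = do
  p <- newPulse
  firePulse p (7 :: Int)
  print =<< readIORef (pulseValue p)
  clearPulse p
  print =<< readIORef (pulseValue p)
```

The trade-off discussed below is visible here: nothing in the types guarantees that every pulse actually gets cleared, whereas dropping a Vault from scope clears all of them by construction.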
Impressive! :smile:
It would be nice to know which of the changes has the most impact, i.e. which yield the biggest reduction in runtime. After all, each of them has the tendency to make the code more "spaghetti", so I would like to limit myself to those that give the best bang for the buck.
• A lot of INLINE annotations, and now compile with -O2.
Did you have a chance to look at the generated Core? My (admittedly limited) experience was that GHC is pretty good at inlining things, so it's usually not necessary to add INLINE pragmas everywhere.
• Removed the use of vault. Instead, Each Pulse now directly has an IORef that contains its value at a particular clock tick. I then clear everything at the end of a tick.
The nice thing about the Vault approach is that it is easy to prove that the values are, in fact, all cleared at the end. Proof: the Vault is simply discarded from scope! So this is a change that negatively affects the structure of the code, but I'd be happy to accept it if it brings a significant performance benefit.
• Do not build a queue for evaluation, but just proceed depth-first from the input pulse that's firing. This one is my most recent addition, and again - tests pass, but I'm nervous. Will need to think more about this.
Unfortunately, I'm quite sure that this is not correct. It should be possible to construct a counterexample along the following lines: if e1 is an event, then the pulse corresponding to an event e4 defined as

e2 = fmap something e1
e3 = fmap something' e1
e4 = unionWith (+) e2 e3

might get evaluated by the depth-first approach before the pulse corresponding to e3 has been evaluated, giving an incorrect result.
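To make the failure mode concrete, here is a toy model of the diamond above (pure functions stand in for the pulses; the simplification that an unevaluated parent contributes 0 to the union is mine, but the ordering problem is the same):

```haskell
import Data.Maybe (fromMaybe)

-- pulses keyed by name; Nothing means "not yet evaluated this tick"
type Values = [(String, Maybe Int)]

something, something' :: Int -> Int
something  = (* 2)
something' = (+ 1)

-- evaluate e4 = unionWith (+) e2 e3 from whatever has been computed
-- so far; a not-yet-evaluated parent contributes 0 (simplified)
evalE4 :: Values -> Int
evalE4 vs = get "e2" + get "e3"
  where get k = fromMaybe 0 (lookup k vs >>= id)

main :: IO ()
main = do
  let x = 3
      -- topological order: both parents evaluated before e4
      topo = [("e2", Just (something x)), ("e3", Just (something' x))]
      -- depth-first from e1: e2 is done, but e4 is reached before e3
      dfs  = [("e2", Just (something x)), ("e3", Nothing)]
  print (evalE4 topo)  -- correct: both branches contribute
  print (evalE4 dfs)   -- wrong: e3's contribution is missing
```

A priority queue ordered by node depth guarantees that e4 is only visited after both e2 and e3, which is what the depth-first change gives up.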
I think it's hard for an implicit "one IORef per node" approach to beat the explicit priority queue, so that's probably not a low-hanging fruit.
Unfortunately, my particle physics application is still only 70% productive, so I will continue to see what I can do.
I still suspect that this is due to closures being allocated and destroyed in the tight loop that is go. Only the Core can tell.
Any further luck with this? I have to finish a couple of other things first, but then I'm happy to take a closer look at this as well.
Still working on different approaches. At the moment I'm exploring running with +RTS -xc -K10K to see if we're using excessive stack, which would imply overly lazy data structures. It needs a lot of stack space just to create a network of 200 particles, which makes me suspicious. For example, the Graph data structure isn't strict in its arguments, and there are many other places where I think a strict tuple would be beneficial. No finished work just yet, though!
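The strictness point can be illustrated generically (this is the standard foldl vs foldl' example, not code from the library): a lazy accumulator builds a chain of thunks that is only forced at the end, which is exactly the kind of pattern that blows a small -K stack limit, while a strict fold forces as it goes.

```haskell
import Data.List (foldl')

-- lazy accumulator: builds a million nested (+) thunks, all forced
-- in one go at the end (deep stacks under -K10K point at this shape)
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- strict accumulator: forced at every step, constant stack
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])
```

Strict fields (bang patterns or strict tuples) apply the same idea to data structures like Graph: force the component when the node is built, not when it is eventually consumed.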
Hello! Any updates on this work?
I don't have any updates I'm afraid.
It would be great to get some of your work merged, I think the only thing @HeinrichApfelmus objected to was ripping out the priority queue of nodes to evaluate.
I'll have to dig out the branch again and see if I can make some sensible commits from it. I'll see if I can put some time aside.
I resurrected this and applied all my current fixes. Unfortunately there's still a significant space leak:
As this uses switchB with a stepper that holds on to a common event, I think this space leak might be #261.
As #261 is now fixed — perhaps the situation has improved here as well?
Things seem pretty good on current master, actually! I'm seeing around 88% productivity.
Oh, actually only around 80%; I was getting higher productivity when doing profiled runs, oddly.
Here's the heap graph:
And the worst offenders:
Unfortunately there are nameless things. I believe they're all from base, but they might include boot libraries too. I'm not totally sure what's up with that - do I have to build ghc itself with -finfo-table-map -fdistinct-constructor-tables?
Hi! I've put together a little benchmark of doing some good old 90s particle simulations: https://github.com/ocharles/reactive-banana-fireworks-benchmark.
Essentially every clock tick I spawn a new particle, which has position/velocity/life time, and I move all particles around on each clock tick. This exhibits some pretty disappointing performance, so hopefully we can work out what's going wrong and get it nice and speedy.
I've attached the result of running with -p here, but it's not very enlightening (to me) - here's the header: