Gabriella439 / pipes

Compositional pipelines
BSD 3-Clause "New" or "Revised" License

Benchmarks/ reason for new perf claim? #18

Closed cartazio closed 11 years ago

cartazio commented 12 years ago

I see that in the new 2.5 release some interesting performance claims are made. What has changed in the internals to enable this, and what are the associated benchmarks? Thanks!

Gabriella439 commented 12 years ago

The associated post contains the most punishing benchmark, which gives the greatest difference between pipes and conduit. All the other ones show pipes leading by a significantly smaller margin, since they are mostly IO-bound. If you want benchmarks for specific operations (e.g. the overhead of a request or respond statement), then I can create a criterion benchmark for them. Usually when I benchmark I just measure raw throughput, since that's what matters to me in my projects.

There were a LOT of things I did to optimize and I will go through them in decreasing order of importance:

In pipes-2.0 I switched to making the base monad mandatory in order to satisfy the monad transformer laws. This had a huge negative impact on performance for pure code, because the compiler does not optimize passing values through a polymorphic base monad very well. At the time I had to make this switch because the category laws for Frames required a type that obeyed the monad transformer laws. Now, however, I'm deprecating Frames, and the new proxies plus their extensions (i.e. proxy transformers) do not require a functioning monad transformer to operate correctly and satisfy the category laws. Proxies and proxy transformers were the big breakthrough that made the switch back to the old implementation possible.
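To make the difference concrete, here is a minimal sketch (illustrative types only, not the actual pipes internals) contrasting the two representations: one that keeps the base monad only at explicit lift points, and one that threads the base monad through every layer to satisfy the transformer laws:

```haskell
-- Minimal sketch, NOT the actual pipes internals: two ways to embed
-- a base monad `m` in a free-monad-like type.

-- Fast version: `m` appears only at explicit lift points, so purely
-- pure segments never touch the base monad and GHC can optimize them.
data FreeP m a
  = PureP a                  -- a pure result
  | LiftP (m (FreeP m a))    -- an embedded base-monad action

runFreeP :: Monad m => FreeP m a -> m a
runFreeP (PureP a) = return a
runFreeP (LiftP m) = m >>= runFreeP

-- Transformer-law-friendly version: every step is wrapped in `m`, so
-- even pure binds must pass through the polymorphic base monad.
newtype TransP m a = TransP { runTransP :: m (Either a (TransP m a)) }

main :: IO ()
main = runFreeP (LiftP (return (PureP (42 :: Int)))) >>= print
```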

This change accounted for the vast majority of the speedup and put pipes in the same ballpark as conduit. Before the change the underlying implementation was taking something like 8x as long as conduit, which made a big difference in the throughput of non-IO-bound code segments (what Michael likes to call "micro-benchmarks", so I'll borrow his term). Even after this change pipes was still maybe a factor of 2 slower on micro-benchmarks, and the remaining changes are what put pipes over the edge. I did not benchmark the following changes step by step, so I don't know exactly how much each one contributed individually, but combined I know they account for the remaining difference.

The library previously defined all utilities and operations using high-level operations like (>=>), forever, liftF, and request/respond. I scrapped all that and inlined every definition, not even using basic things like ($) or liftM. I left nothing to chance, so the source is as close as possible to the core GHC generates. Consequently, the -O2 flag no longer has any effect on library speed, since I'm basically hand-writing something very closely resembling the final core. I kept the original high-level definitions in the comments so that people can still understand how the utilities work.
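To give a flavor of that style change (with made-up names, not the actual pipes code): the same Kleisli self-composition written with a high-level combinator versus with the bind spelled out by hand:

```haskell
import Control.Monad ((>=>))

-- High-level style: compose a monadic step with itself via (>=>).
twiceHigh :: Monad m => (a -> m a) -> a -> m a
twiceHigh step = step >=> step

-- Inlined style: the same function with the bind written out, much
-- closer to the core GHC ultimately generates.
twiceInlined :: Monad m => (a -> m a) -> a -> m a
twiceInlined step a = step a >>= step

main :: IO ()
main = do
  x <- twiceHigh    (return . (+ 1)) (0 :: Int)
  y <- twiceInlined (return . (+ 1)) (0 :: Int)
  print (x, y)
```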

This is one of the transformations GHC does when you turn on the -O2 flag, which in turn enables -fspec-constr. The basic idea is that if you have a function like this:

mapD f x = Request x (\a -> Respond (f a) (mapD f))

It would be slower unoptimized, because GHC would call mapD from scratch on each recursion since you reapply mapD to f at each step. With the optimization, GHC instead caches the application by using a worker that closes over the f variable, producing something like this:

mapD f = go where
    go x = Request x (\a -> Respond (f a) go)

This not only runs faster but also removes a potential space leak. I got several reports that, in previous versions of the library, turning off optimizations or using NOINLINE pragmas would sometimes produce space leaks. It turns out that the above pattern eliminates that same space leak, so I now apply it defensively in all library code.
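The same pattern can be sketched on plain lists: `mapNaive` re-passes the static argument `f` on every recursive call, while `mapGo` closes over `f` once in a local worker, which is the shape -fspec-constr produces and the library now writes by hand:

```haskell
-- Sketch of the transformation on ordinary lists (not pipes code).

-- Naive version: `f` is re-passed on every recursive call.
mapNaive :: (a -> b) -> [a] -> [b]
mapNaive _ []       = []
mapNaive f (x : xs) = f x : mapNaive f xs

-- Worker version: a local `go` closes over `f`, so the recursion
-- never re-applies the outer function to its static argument.
mapGo :: (a -> b) -> [a] -> [b]
mapGo f = go
  where
    go []       = []
    go (x : xs) = f x : go xs

main :: IO ()
main = print (mapGo (+ 1) [1, 2, 3 :: Int] == mapNaive (+ 1) [1, 2, 3])
```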

Sometimes GHC doesn't do a very good job of cross-module optimization, so as a precaution I keep the entire core implementation within a single module. Additionally, removing the Free intermediate data structure might have improved the efficiency of pattern matching, but I'm not sure.

cartazio commented 12 years ago

Cool: so you're doing (in reverse order of relative importance) 1) a worker-wrapper transform, 2) a by-hand version of the INLINE pragma, plus presumably some by-hand simplification after inlining, and 3) dropping the free-monad-style representation.

Gabriella439 commented 12 years ago

Exactly. I would say I "inlined" the free monad representation, but otherwise that's correct.

Probably the part where hand-simplification mattered most was the definition of composition. You will really see this if you compare the Channel instance for proxies in Control.Proxy.Core between versions 2.4 and 2.5. Once I dropped the monad transformer version, I had the freedom to do much more clever rewrites of composition that generate great core, which is why composition is so much faster now.

Everything else required very little hand-simplification other than trivial eta reductions and the worker-wrapper transforms.

Gabriella439 commented 11 years ago

I wanted to give another performance update before closing this issue, since you are curious about performance. The new pipes-3.0 has two new performance-related features:

a) pipes now includes rewrite rules for user-written code, just like conduit, which optimize your code down to the hand-written free monad form. This means that you can now use do notation and other niceties and trust that they will still compile to the optimally tuned code that I would have written by hand.
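As a hedged illustration of the mechanism (using a classic list-fusion rule, not the actual pipes rules): a RULES pragma tells GHC to rewrite one expression shape into a cheaper equivalent wherever it appears in user code:

```haskell
-- Illustrative GHC RULES example, not the actual pipes rules.
-- NOINLINE keeps `myMap` from inlining away before the rule matches.
{-# NOINLINE myMap #-}
myMap :: (a -> b) -> [a] -> [b]
myMap f = foldr (\x acc -> f x : acc) []

-- Fuse two traversals into one; the rewrite preserves semantics,
-- so the program's output is the same whether or not it fires.
{-# RULES
"myMap/myMap" forall f g xs. myMap f (myMap g xs) = myMap (f . g) xs
  #-}

main :: IO ()
main = print (myMap (+ 1) (myMap (* 2) [1, 2, 3 :: Int]))
```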

b) I now type-class all the utility functions so that the user can switch between the fast implementation with the weak monad transformer laws and the correct-by-construction implementation. This extra polymorphism does not impact efficiency at all. I've benchmarked repeatedly throughout the entire upgrade process to ensure that the rewrite rules see through the abstractions robustly. You don't need any compiler pragmas to help the rewrite rules fire; they are very reliable.

I experimented with a wide variety of tricks to get rewrite rules to fire consistently, and I found one trick that worked far better than every other approach: give anything matched by a rewrite rule a unique top-level function of its own, with as few type class constraints as possible (and it should not be a method of a type class). Eliminating constraints from the matched function dramatically improves the chance that the rule matches, since it no longer stops matching once the compiler specializes the type class.
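A sketch of the hazard being avoided (hypothetical names, not the pipes API): if a rule's left-hand side mentions a class method directly, specialization can replace that method with a dictionary-specialized function before the rule fires, so the rule stops matching. Giving the operation its own NOINLINE top-level definition keeps the matched name stable:

```haskell
-- Hypothetical illustration of the naming trick, not pipes code.
class Emit f where
  emitMethod :: a -> f a   -- a rule matching `emitMethod` directly
                           -- would be fragile under specialization

instance Emit Maybe where
  emitMethod = Just

-- Dedicated top-level name for rules to match against; NOINLINE
-- prevents it from disappearing before the simplifier runs rules.
{-# NOINLINE emit #-}
emit :: Emit f => a -> f a
emit = emitMethod

main :: IO ()
main = print (emit 'x' :: Maybe Char)
```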

The advantage of this trick over every other method I tried is its reliability: the library uses only one RULES pragma (containing two rules) and no other pragmas, and these two rules sufficed to implement all the necessary optimizations. I've benchmarked all the library utilities throughout development, which began with hand-written free monads for the concrete proxy type and ended with polymorphic type-classed utilities using ordinary do notation (just compare "Control.Proxy.Prelude.Base" between versions 2.5 and 3.0 to see what I mean), and there have been no performance regressions at all.

If you have any other questions about performance, feel free to open a new issue.

jwiegley commented 9 years ago

@Gabriel439 How do these results relate to what's in pipes 4.0?

Gabriella439 commented 9 years ago

The only performance improvements since I wrote the above comment were the addition of shortcut fusion rewrite rules as outlined in this post. The only performance regressions were in the Pipes.Lift module, where I chose to simplify the code for clarity at the expense of performance.