bgamari closed this 8 years ago.
Gabriel Gonzalez asked: Is this ready to merge or still in progress?
Still in progress; there is one apparent performance regression that I need to look into.
For the record, this will (hopefully) fix #124.
The performance story here is rather mixed:

- `Pipes/chain` speeds up by about 20%
- `Folds/findIndex` slows down by about 14%
- `Folds/all` appears to speed up by about 4%
- `Folds/fold` appears to speed up by about 7%
- `Folds/fold` appears to speed up by about 5%
- `Pipes/concat` appears to slow down (runtime increase of about 120%); unfortunately I still don't know how to avoid this. It looks like a rule isn't firing, although I was never quite able to work out how this was being optimized previously
- `Pipes/drop` regresses by roughly 7%
- `Pipes/filter` speeds up by around 50%
- `Pipes/mapM` speeds up by roughly 15%
- `Pipes/scanM` speeds up by roughly 30%
- `Zips/zip` and `Zips/zipWith` speed up by roughly 40%

All in all, the arithmetic mean of the deltas comes out to a roughly 1% improvement. However, this patch set wasn't necessarily intended to improve performance; rather, it was meant to make clear to GHC which inlinings we expect it to perform (as GHC now warns when rules are fragile).
The `Pipes/concat` regression is indeed worrying; whether the rest of the gains are enough to offset it is up to you. It can no doubt be fixed, but I'm rather short on time for this at the moment.
Another interesting axis is whether we inline `_bind`. Doing so allows GHC to specialise away the `Monad` dictionary when it is known; this manifests as double-digit-percentage performance improvements on `Folds/{all, any, find}` and `Pipes/{find, mapM}`. Oddly enough, however, there are larger regressions in the `*_A` benchmarks in `LiftBench`. For this reason I currently don't inline `_bind`.
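To illustrate the dictionary-specialisation point, here is a minimal sketch using a simplified, hypothetical `ToyProxy` type loosely modelled on pipes' `Proxy` (the real definitions in `Pipes.Internal` may differ). Because `_bind` has a `Monad m` constraint, a call that is not inlined receives the `Monad` dictionary as a runtime argument, and every `>>=`/`return` inside it goes through that dictionary. With an `INLINE` pragma, GHC can copy the body into call sites where `m` is known (say `IO`) and resolve those methods statically. `ToyProxy` and `runToy` are illustrative names only, not pipes API; the sketch shows the inlined variant being discussed.

```haskell
-- Hypothetical, simplified stand-in for pipes' Proxy type (NOT the real one).
data ToyProxy a' a b' b m r
  = Request a' (a -> ToyProxy a' a b' b m r)
  | Respond b (b' -> ToyProxy a' a b' b m r)
  | M (m (ToyProxy a' a b' b m r))
  | Pure r

-- Toy monadic bind. The 'Monad m' constraint becomes a dictionary argument
-- unless GHC inlines this at a call site where 'm' is already known.
_bind
  :: Monad m
  => ToyProxy a' a b' b m r
  -> (r -> ToyProxy a' a b' b m r')
  -> ToyProxy a' a b' b m r'
p0 `_bind` f = go p0
  where
    go p = case p of
      Request a' fa -> Request a' (\a -> go (fa a))
      Respond b fb' -> Respond b (\b' -> go (fb' b'))
      M m           -> M (m >>= \p' -> return (go p'))
      Pure r        -> f r
-- '_bind' itself is not recursive (only the local 'go' is), so it can be
-- inlined into its callers.
{-# INLINE _bind #-}

-- Hypothetical runner for a closed toy proxy: answers every Request and
-- Respond with () and runs the effects in 'm'.
runToy :: Monad m => ToyProxy () () () () m r -> m r
runToy p = case p of
  Request _ fa  -> runToy (fa ())
  Respond _ fb' -> runToy (fb' ())
  M m           -> m >>= runToy
  Pure r        -> return r
```

For example, `runToy (M (return (Pure 1)) `_bind` (\x -> Pure (x + 1)))` in `IO` yields `2`; at that call site `m` is `IO`, which is exactly the situation where inlining lets GHC drop the dictionary.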
So I had a bit more time to look at the `Pipes/concat` issue this morning. It appears that forcing inlining of the composition operators (namely `(//>)`) in phase 0, to avoid pre-empting other rewrites, is sufficient to turn the 120% runtime increase in this benchmark into a 40% runtime reduction.
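GHC's phase control is what makes this work. Simplifier phases count down (2, 1, 0), and `INLINE [0] f` tells GHC not to inline `f` before phase 0 and to inline it eagerly from then on, so rewrite rules that match on `f` still see it intact during phases 2 and 1. The sketch below uses a hypothetical list-based `forEachToy` as a stand-in for an operator like `(//>)`; it is not the pipes definition, and the rule is only an example of a rule that needs the call to survive until it fires.

```haskell
-- Hypothetical, list-based analogue of a "for"-style composition operator
-- such as pipes' (//>): feed each element to a continuation and concatenate
-- the results. This is NOT how pipes defines (//>).
forEachToy :: [a] -> (a -> [b]) -> [b]
forEachToy xs f = concatMap f xs

-- Do not inline before phase 0; inline eagerly from phase 0 onwards.
-- Rules mentioning forEachToy therefore get phases 2 and 1 to match.
{-# INLINE [0] forEachToy #-}

-- Example rule (only applied when compiling with optimisation).
-- It is semantics-preserving: concatMap f [x] == f x.
{-# RULES
"forEachToy/singleton" forall x f. forEachToy [x] f = f x
  #-}
```

The rule only changes how the program is optimised, not what it computes; `forEachToy [1, 2, 3] (\x -> [x, x * 10])` evaluates to `[1, 10, 2, 20, 3, 30]` whether or not it fires.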
With this change, the final runtime deltas (negative means faster) are:

Benchmark | Runtime change |
---|---|
Folds/all | -62.0% |
Folds/any | -60.0% |
Folds/find | -56.8% |
Folds/findIndex | 26.3% |
Folds/fold | 5.1% |
Folds/foldM | 5.7% |
Folds/head | 0.4% |
Folds/index | 5.6% |
Folds/last | -5.7% |
Folds/length | 3.4% |
Folds/null | 4.2% |
Folds/toList | 5.2% |
Pipes/chain | -58.9% |
Pipes/concat | -43.1% |
Pipes/drop | 18.3% |
Pipes/dropWhile | 12.0% |
Pipes/filter | -48.1% |
Pipes/findIndices | -22.8% |
Pipes/map | -23.6% |
Pipes/mapM | -56.8% |
Pipes/scan | -30.0% |
Pipes/scanM | -43.4% |
Pipes/take | -23.0% |
Pipes/takeWhile | -29.2% |
ReaderT/runReaderP_A | 15.9% |
ReaderT/runReaderP_B | 8.1% |
StateT/evalStateP_A | 24.1% |
StateT/evalStateP_B | 15.2% |
StateT/execStateP_A | 21.1% |
StateT/execStateP_B | 19.7% |
StateT/runStateP_A | 13.9% |
StateT/runStateP_B | 11.4% |
Zips/zip | -52.1% |
Zips/zipWith | -54.3% |
enumFromTo.vs.each/each | -46.6% |
enumFromTo.vs.each/enumFromTo | -52.1% |

The `findIndex` and `drop` regressions appear to be quite sensitive to context (e.g. I'm unable to reproduce them in minimized benchmarks), so I'm going to call this finished. The `StateT` and `ReaderT` regressions are troubling, but they appear to depend largely on whether `_bind` is inlined, which regresses runtime in some of the other benchmarks for reasons I'm not entirely sure I understand.
@Gabriel439 ping.
Sorry for the delay. Could you go back to inlining `_bind`? I don't care very much about the `liftBench` benchmarks (I should probably get rid of them), and I'd love to be able to specialize away the `Monad` dictionary.
I've pushed another patch restoring inlining of `_bind`. Indeed, I can't seem to reproduce the regression I was seeing earlier, so it appears that this actually has no cost at all.
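Here's a systematic comparison of five configurations:

- `0-initial`
- `1-inline`
- `2-no-inline-bind`
- `3-more-inlining`
- `4-inline-bind`

The `0-initial` column gives the baseline runtime; each other column gives the change relative to it (negative means faster).
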
Benchmark | 0-initial | 1-inline | 2-no-inline-bind | 3-more-inlining | 4-inline-bind |
---|---|---|---|---|---|
Folds/all | 1.34e-4 | 3.4% | 2.8% | -54.5% | -59.4% |
Folds/any | 1.33e-4 | 5.4% | 6.0% | -60.6% | -60.5% |
Folds/find | 1.35e-4 | 1.9% | -5.8e-2% | -59.1% | -61.1% |
Folds/findIndex | 1.22e-4 | 5.4% | 2.8% | 6.1% | 2.1% |
Folds/fold | 5.14e-5 | 1.9% | 0.4% | 0.4% | 0.9% |
Folds/foldM | 5.16e-5 | -2.3% | -2.2% | -3.5% | -1.3% |
Folds/head | 8.48e-9 | 4.7% | 2.6% | 0.3% | -0.9% |
Folds/index | 1.10e-4 | 6.4% | 4.4% | 1.1e-2% | -1.4% |
Folds/last | 8.50e-5 | 1.8% | 1.6% | -2.8% | -1.8% |
Folds/length | 4.30e-5 | 1.6% | 1.9% | -0.9% | 4.0% |
Folds/null | 7.97e-9 | -1.3% | 6.6% | 3.3% | 1.8% |
Folds/toList | 1.11e-4 | -6.0% | -6.4% | -9.1% | -8.1% |
Pipes/chain | 9.85e-4 | -12.3% | -7.6% | -58.5% | -57.8% |
Pipes/concat | 1.28e-4 | 116.6% | 116.0% | -42.5% | -42.0% |
Pipes/drop | 8.29e-5 | 1.6% | 0.8% | 1.5% | 10.6% |
Pipes/dropWhile | 1.17e-4 | 0.9% | 1.7% | -3.6% | 0.5% |
Pipes/filter | 4.77e-4 | -44.8% | -45.1% | -53.9% | -50.9% |
Pipes/findIndices | 3.01e-4 | -0.4% | -0.3% | -27.2% | -19.8% |
Pipes/map | 2.55e-4 | -1.8% | -1.4% | -37.1% | -31.2% |
Pipes/mapM | 1.05e-3 | -19.2% | -18.5% | -59.8% | -63.3% |
Pipes/scan | 3.05e-4 | -3.5% | -4.3% | -35.6% | -34.8% |
Pipes/scanM | 8.57e-4 | -36.7% | -31.3% | -48.1% | -45.8% |
Pipes/take | 2.70e-4 | -5.8% | -4.3% | -35.9% | -36.4% |
Pipes/takeWhile | 2.81e-4 | 0.2% | -1.7% | -27.5% | -22.3% |
ReaderT/runReaderP_A | 2.24e-4 | 1.5% | -0.7% | -3.8% | -2.0% |
ReaderT/runReaderP_B | 3.94e-3 | -0.8% | -0.8% | 7.7e-2% | 1.0% |
StateT/evalStateP_A | 2.96e-4 | -9.3% | -9.4% | -10.8% | -9.3% |
StateT/evalStateP_B | 4.15e-3 | -0.2% | -1.3% | 3.9% | -0.7% |
StateT/execStateP_A | 2.65e-4 | 2.1% | -0.3% | 6.1% | -0.1% |
StateT/execStateP_B | 4.08e-3 | 2.2% | 1.8% | 3.6% | 2.6% |
StateT/runStateP_A | 2.94e-4 | -8.3% | -7.7% | -4.5% | -10.5% |
StateT/runStateP_B | 3.91e-3 | -1.1% | -0.6% | 6.9% | 2.5% |
Zips/zip | 1.03e-3 | -43.4% | -44.5% | -57.5% | -56.7% |
Zips/zipWith | 1.00e-3 | -40.5% | -43.2% | -55.1% | -51.0% |
enumFromTo.vs.each/each | 1.54e-4 | 2.7% | -1.9% | -53.9% | -47.2% |
enumFromTo.vs.each/enumFromTo | 1.62e-4 | 10.1% | -1.1% | -54.7% | -52.7% |
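
The percentages in the table above can be read with a little arithmetic. Assuming, as the `Pipes/concat` row suggests (its `1-inline` value matches the roughly 120% slowdown reported earlier), that each column after `0-initial` is the relative change in runtime against `0-initial`, the helpers below convert between baseline times, deltas and speed-up factors. All function names are hypothetical, chosen for this sketch.

```haskell
-- Percentage change in runtime relative to a baseline.
-- Negative values mean the new build is faster.
relDelta :: Double -> Double -> Double
relDelta baseline new = (new - baseline) / baseline * 100

-- Runtime implied by a baseline runtime and a percentage delta.
applyDelta :: Double -> Double -> Double
applyDelta baseline pct = baseline * (1 + pct / 100)

-- Speed-up factor implied by a percentage delta (-50% means 2x faster).
speedup :: Double -> Double
speedup pct = 1 / (1 + pct / 100)

-- Arithmetic mean of a list of percentage deltas (NaN for an empty list).
meanDelta :: [Double] -> Double
meanDelta ds = sum ds / fromIntegral (length ds)
```

For example, `applyDelta 1.28e-4 116.6` gives roughly `2.77e-4` for `Pipes/concat` under `1-inline`, while `speedup (-42.0)` is roughly 1.72 for `4-inline-bind`. `speedup` is included because percentage deltas are not symmetric: -50% is a 2x speed-up, whereas +100% is a 2x slowdown.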
Perfect! Thanks for taking the time to fix all the rewrite rules. I know how much effort that must have taken, so I wanted to let you know that I'm deeply thankful for your work on this.
The rewrite rules defined in Pipes are currently quite fragile, as the functions they match on may be inlined before the rules have an opportunity to fire. Here we delay the inlining of various operations to ensure that the rules can fire unimpeded during simplifier phase 2.
These cases are pointed out by a warning which will be introduced in GHC 8.0.1.
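As a sketch of the technique described here, assuming a hypothetical `mapToy` rather than any actual pipes function: the rule below matches on calls to `mapToy`, so `mapToy` must not be inlined while the rule is meant to fire. An `INLINE [1]` pragma keeps calls intact through phase 2 (phases count down 2, 1, 0) and allows inlining from phase 1 onwards. Without such a pragma, GHC may inline the small wrapper first; presumably this is the situation flagged by the warning mentioned above, which GHC's user guide documents as `-Winline-rule-shadowing`.

```haskell
-- Hypothetical example (not taken from pipes): a small non-recursive
-- wrapper with a fusion rule that matches on it.
mapToy :: (a -> b) -> [a] -> [b]
mapToy f xs = map f xs

-- Delay inlining until phase 1, so that during phase 2 calls to mapToy
-- survive and the rule below can match them. With a plain INLINE (or no
-- pragma at all), GHC might inline mapToy before the rule gets a chance.
{-# INLINE [1] mapToy #-}

-- Fuse two traversals into one; semantics-preserving because
-- map f (map g xs) == map (f . g) xs.
{-# RULES
"mapToy/mapToy" forall f g xs. mapToy f (mapToy g xs) = mapToy (f . g) xs
  #-}
```

Rules only affect optimised builds, so `mapToy (+1) (mapToy (*2) [1, 2, 3])` returns `[3, 5, 7]` either way; the pragma only decides whether GHC gets the chance to fuse the two passes.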