Delay inlining to allow rules to fire

bgamari commented 8 years ago

The rewrite rules defined in Pipes are currently quite fragile as the functions which they match on may be inlined before the rules are given an opportunity to fire. Here we delay inlining of various operations to ensure that the rules can fire unimpeded during simplifier phase 2.

These cases are pointed out by a warning which will be introduced in GHC 8.0.1.

Gabriella439 commented 8 years ago

Is this ready to merge or still in progress?

bgamari commented 8 years ago

Gabriel Gonzalez notifications@github.com writes:

Is this ready to merge or still in progress?

Still in progress; there is one apparent performance regression that I need to look into.

bgamari commented 8 years ago

For the record this will (hopefully) fix #124.

bgamari commented 8 years ago

The performance story here is rather mixed,

the Pipes/chain benchmark speeds up by about 20%
the Folds/findIndex benchmark slows down by about 14%
the Fold/all benchmark appears to speed up by about 4%
the Fold/fold benchmark appears to speed up by about 7%
the Fold/fold benchmark appears to speed up by about 5%
the Pipes/concat benchmark appears to slow down by about 120%; unfortunately I still don't know how to avoid this. It looks like a rule isn't firing, although I was never quite able to work out how this was being optimized previously
the Pipes/drop benchmark regresses by roughly 7%
the Pipes/filter benchmark speeds up by round 50%
Pipes/mapM speeds up by roughly 15%
Pipes/scanM speeds up by roughly 30%
Zips/zip and Zips/zipWith speed up by roughly 40%

All-in-all, the arithmetic mean of the deltas comes out to a roughly 1% improvement. However, this patch-set wasn't necessarily intended to improve performance; rather it was to make it clear to GHC what inlinings we expect it to performance (as it now warns when rules are fragile).

Pipes/concat is indeed worrying; whether the rest of the gains are enough to offset it are up to you. It can no doubt be fixed, but my time for this is rather thin at the moment.

Another interesting axis to this is whether we inline _bind. Doing so allows GHC to specialise away the Monad dictionary when it is known; this manifests as some double-digit-percentage performance improvements on Folds/{all, any, find}, Pipes/{find, mapM}. Oddly enough, however, there are larger regressions in the *_A benchmarks in LiftBench. For this reason I currently don't inline _bind.

bgamari commented 8 years ago

So I had a bit more time to look at the Pipes/concat issue this morning. It appears that forcing inlining (in phase 0, to avoid short-cutting other rewrites) of the composition operators (namely (//>)) is sufficient to turn the 120% runtime increase in this benchmark into a 40% runtime reduction.

With this change, the final runtime deltas are,

    Folds/all                              -62.0%
    Folds/any                              -60.0%
    Folds/find                             -56.8%
    Folds/findIndex                        26.3%
    Folds/fold                             5.1%
    Folds/foldM                            5.7%
    Folds/head                             0.4%
    Folds/index                            5.6%
    Folds/last                             -5.7%
    Folds/length                           3.4%
    Folds/null                             4.2%
    Folds/toList                           5.2%
    Pipes/chain                            -58.9%
    Pipes/concat                           -43.1%
    Pipes/drop                             18.3%
    Pipes/dropWhile                        12.0%
    Pipes/filter                           -48.1%
    Pipes/findIndices                      -22.8%
    Pipes/map                              -23.6%
    Pipes/mapM                             -56.8%
    Pipes/scan                             -30.0%
    Pipes/scanM                            -43.4%
    Pipes/take                             -23.0%
    Pipes/takeWhile                        -29.2%
    ReaderT/runReaderP_A                   15.9%
    ReaderT/runReaderP_B                   8.1%
    StateT/evalStateP_A                    24.1%
    StateT/evalStateP_B                    15.2%
    StateT/execStateP_A                    21.1%
    StateT/execStateP_B                    19.7%
    StateT/runStateP_A                     13.9%
    StateT/runStateP_B                     11.4%
    Zips/zip                               -52.1%
    Zips/zipWith                           -54.3%
    enumFromTo.vs.each/each                -46.6%
    enumFromTo.vs.each/enumFromTo          -52.1%

The findIndex and drop regressions appear to be quite sensitive to context (e.g. I'm unable to reproduce them in minimized benchmarks) so I'm going to call this finished. The StateT and ReaderT regressions are troubling, but they appear to be largely dependent upon whether _bind is inlined, which regresses runtime in some of the other benchmarks for reasons I'm not entirely sure I understand.

bgamari commented 8 years ago

@Gabriel439 ping.

Gabriella439 commented 8 years ago

Sorry for the delay. Could you go back to inlining _bind? I don't care very much about the liftBench benchmarks (I should probably get rid of them) and I'd love to be able to specialize away the Monad dictionary

bgamari commented 8 years ago

I've pushed another patch restoring inlining of _bind. Indeed I can't actually seem to reproduce the regression that I was seeing earlier, so it appears that this actually has no cost even.

Here's a systematic comparison

cf9e0aeca39f48a48c0544ccf4166ec606468089: 0-initial
486ff4353fa527dcdf780a587984e600a4b8fd1d: 1-inline
305dc4ac8d0dd024c38b0d7bc472ac08f1844cb2: 2-no-inline-bind
d54d78b8ea30458495ffee85edc8fd32a54d1e04: 3-more-inlining
5973dd8ea668b891c99bd6e91d0cd28d809c1c55: 4-inline-bind

Benchmark	0-initial	1-inline	2-no-inline-bind	3-more-inlining	4-inline-bind
Folds/all	1.34e-4	3.4%	2.8%	-54.5%	-59.4%
Folds/any	1.33e-4	5.4%	6.0%	-60.6%	-60.5%
Folds/find	1.35e-4	1.9%	-5.8e-2%	-59.1%	-61.1%
Folds/findIndex	1.22e-4	5.4%	2.8%	6.1%	2.1%
Folds/fold	5.14e-5	1.9%	0.4%	0.4%	0.9%
Folds/foldM	5.16e-5	-2.3%	-2.2%	-3.5%	-1.3%
Folds/head	8.48e-9	4.7%	2.6%	0.3%	-0.9%
Folds/index	1.10e-4	6.4%	4.4%	1.1e-2%	-1.4%
Folds/last	8.50e-5	1.8%	1.6%	-2.8%	-1.8%
Folds/length	4.30e-5	1.6%	1.9%	-0.9%	4.0%
Folds/null	7.97e-9	-1.3%	6.6%	3.3%	1.8%
Folds/toList	1.11e-4	-6.0%	-6.4%	-9.1%	-8.1%
Pipes/chain	9.85e-4	-12.3%	-7.6%	-58.5%	-57.8%
Pipes/concat	1.28e-4	116.6%	116.0%	-42.5%	-42.0%
Pipes/drop	8.29e-5	1.6%	0.8%	1.5%	10.6%
Pipes/dropWhile	1.17e-4	0.9%	1.7%	-3.6%	0.5%
Pipes/filter	4.77e-4	-44.8%	-45.1%	-53.9%	-50.9%
Pipes/findIndices	3.01e-4	-0.4%	-0.3%	-27.2%	-19.8%
Pipes/map	2.55e-4	-1.8%	-1.4%	-37.1%	-31.2%
Pipes/mapM	1.05e-3	-19.2%	-18.5%	-59.8%	-63.3%
Pipes/scan	3.05e-4	-3.5%	-4.3%	-35.6%	-34.8%
Pipes/scanM	8.57e-4	-36.7%	-31.3%	-48.1%	-45.8%
Pipes/take	2.70e-4	-5.8%	-4.3%	-35.9%	-36.4%
Pipes/takeWhile	2.81e-4	0.2%	-1.7%	-27.5%	-22.3%
ReaderT/runReaderP_A	2.24e-4	1.5%	-0.7%	-3.8%	-2.0%
ReaderT/runReaderP_B	3.94e-3	-0.8%	-0.8%	7.7e-2%	1.0%
StateT/evalStateP_A	2.96e-4	-9.3%	-9.4%	-10.8%	-9.3%
StateT/evalStateP_B	4.15e-3	-0.2%	-1.3%	3.9%	-0.7%
StateT/execStateP_A	2.65e-4	2.1%	-0.3%	6.1%	-0.1%
StateT/execStateP_B	4.08e-3	2.2%	1.8%	3.6%	2.6%
StateT/runStateP_A	2.94e-4	-8.3%	-7.7%	-4.5%	-10.5%
StateT/runStateP_B	3.91e-3	-1.1%	-0.6%	6.9%	2.5%
Zips/zip	1.03e-3	-43.4%	-44.5%	-57.5%	-56.7%
Zips/zipWith	1.00e-3	-40.5%	-43.2%	-55.1%	-51.0%
enumFromTo.vs.each/each	1.54e-4	2.7%	-1.9%	-53.9%	-47.2%
enumFromTo.vs.each/enumFromTo	1.62e-4	10.1%	-1.1%	-54.7%	-52.7%

Gabriella439 commented 8 years ago

Perfect! Thanks for taking the time to fix all the rewrite rules. I know how much effort that must have taken so I wanted to let you know that I'm deeply thankful for your work on this

Gabriella439 / pipes

Delay inlining to allow rules to fire #163