Closed michaelt closed 10 years ago
Thanks! Give me a few days to process this, but I appreciate this a lot.
Of course. I just realized I forgot to link the branch: https://github.com/michaelt/Haskell-Foldl-Library/tree/bench
Ok, I got a chance to study this more closely. I'm noticing very curious results that I cannot explain.
So first I tested this on my work computer and I got consistent results using both your `criterion` benchmarks and also `time`: `Inline.hs` outperformed the original `foldl` library by a factor of 2x to 3x.

I went to check the generated core and I found that `Inline.hs` was not invoking `fold/build` fusion, and the classic `foldl` library was invoking `fold/build` fusion, but it was still running slower because it was producing less efficient core for some bizarre reason.

Here's the core that `Inline.hs` was generating:
```
main_$s$wlgo [Occ=LoopBreaker]
  :: Int# -> Int# -> [Int] -> (# Int, Int #)
main_$s$wlgo =
  \ (sc_s2oA :: Int#)
    (sc1_s2oB :: Int#)
    (sc2_s2oC :: [Int]) ->
    case sc2_s2oC of _ {
      [] -> (# I# sc_s2oA, I# sc1_s2oB #);
      : x_a14J xs_a14K ->
        case x_a14J of _ { I# y_a17o ->
        main_$s$wlgo (+# sc_s2oA y_a17o) (+# sc1_s2oB 1) xs_a14K
        }
    }
```
The important thing to note is that `Inline.hs`, while more efficient, is still not fusing away the list. The original `foldl` was producing this ugly core stuck within a `letrec` block:
```
main2 =
  case Control.Foldl.$fApplicativeFold_$c<*>
         @ Int
         @ Int
         main3
         (Control.Foldl.genericLength_$sgenericLength @ Int)
  of _ { Control.Foldl.Fold @ x0_ay5 step1_ay6 begin_ay7 done_ay8 ->
  case done_ay8
         (letrec {
            go_a114 [Occ=LoopBreaker] :: Int# -> x0_ay5 -> x0_ay5
            go_a114 =
              \ (x_a115 :: Int#) (eta_B1 :: x0_ay5) ->
                case step1_ay6 eta_B1 (I# x_a115)
                of vx_a10T { __DEFAULT ->
                case x_a115 of wild1_Xp {
                  __DEFAULT -> go_a114 (+# wild1_Xp 1) vx_a10T;
                  100000000 -> vx_a10T
                }
                }; } in
          go_a114 0 begin_ay7)
  of _ { I# ww_a1Xr ->
  $wshowSignedInt 0 ww_a1Xr ([] @ Char)
  }
  }
```
Here it's fusing away the loop (and `ghc-core` confirms that the `fold`/`build` rule fires), but the inner loop is stuck within a `letrec` block, which seems to take a toll on performance.
This was weird because I got much better core on my laptop computer, so I checked out your repository on my laptop and redid the comparison, and this time the results varied:

- Under `criterion`, `Inline.hs` outperforms classic `foldl` by about 2x to 3x
- Under `time`, classic `foldl` outperforms `Inline.hs` by about 10x

I checked their core and this time `foldl` was producing the optimal core (the same as in the original blog post), and `Inline.hs` was still producing the same core as on my work computer.
This tells me a couple of things:

- `foldl` is still faster when its inner loop doesn't get stuck within a `letrec` block
- `criterion` is causing its inner loop to get stuck within a `letrec` block
- When the `foldl` implementation works, it is faster than the `foldl'`-based implementation by a factor of 10x

The only difference I'm aware of is that my work computer is using `ghc-7.4.2` and `base-4.5.1.0` and my laptop is using `ghc-7.4.1` and `base-4.5.0.0`. Other than that there is no obvious reason why they would produce different core for the exact same code.
However, I think we should strive for code that reliably fuses away the intermediate list without triggering this ugly `letrec` slowdown, since that always produces a huge speed boost. These analyses show that `foldl'` never fuses away the intermediate list, so I'm reluctant to change to use `foldl'` just yet.
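To make the trade-off concrete, here is a minimal sketch (not the library's exact code) of the `Fold` type under discussion, with the two candidate drivers: one built on `Data.List.foldl'`, which always consumes a real list, and one written in `foldr` shape, which at least has a chance of participating in `fold/build` fusion. The names `foldWithFoldl'` and `foldWithFoldr` are mine, for illustration only.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

-- A minimal sketch of the Fold type: a step function, a seed, and a
-- finalizer, with the accumulator type existentially hidden.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- foldl'-based driver: always traverses a concrete list, so it can
-- never fuse the list away.
foldWithFoldl' :: Fold a b -> [a] -> b
foldWithFoldl' (Fold step begin done) = done . foldl' step begin

-- foldr-based driver: shaped so that, given the right INLINE pragmas,
-- GHC's fold/build rule can eliminate the list entirely while still
-- forcing the accumulator at each step (note the $!).
foldWithFoldr :: Fold a b -> [a] -> b
foldWithFoldr (Fold step begin done) as =
  foldr (\a k x -> k $! step x a) done as begin
```

Both drivers compute the same results; the difference is only in what GHC can do to the intermediate list.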
I see, I will have to look into this again (I was hardly thinking I had arrived at any final result). I am slightly suspicious of the criterion approach, wonderful as it is, as I guess you are. I should say, my principal purpose when I mentioned this was to make sure that complex applicative combinations did not change behavior; these are the only justification for the library. I should also say, I was compiling throughout with `-fllvm`, which made a considerable difference in many cases, but I forgot to compare plain `-O2` with `-O2 -fllvm` systematically.
The following are somewhat bleary eyed additional remarks.
When you say 'using `time`', does this mean you wrote `main = print $ fold xys` and then did `time ./main` on the command line? This sort of thing will frequently give different results, since among other things it involves a further composition of expressions, which gives the optimization machinery scope to find further simplifications. (Admittedly I don't see how this could matter here.) In general criterion results don't tell us much about composition of the benched functions, thus one is faced with the (quadratic) labor of benching these compositions; this is mostly what my little tests were doing. In any case, if you were using something like the above `main`, a criterion test of `show . fold ABC` would be closer to what is wanted, though again it's hard to see how that could admit any interesting optimization.
What puzzles me, though, is that you think it is obvious that foldr/build optimization is what we want -- or is that what you are saying in the first two paragraphs above? `Inlines.hs` doesn't fuse away the list because it is fighting the build/foldr machinery, so the list must be generated and consumed. Note that the corresponding vector fold doesn't have this problem: `Vector.foldl'` fits with the stream optimization machinery, so in

```haskell
fold_vector (liftA2 (,) all any) $ map even myVector
```

the vector `map even myVector` won't be built, only `myVector` will be. This is because `Vector.foldl' f z = Stream.foldl' f z . stream` and `map g` is `unstream . Stream.map g . stream`. By contrast `List.foldl'` requires the list, and can't act on the `build` thing directly. (A list may be developed from start to finish, but need it ever all be in memory? If not, the emphasis on 'building the structure' is misplaced, I would think.) Note that no one thinks this means `List.foldl'` is not the bee's knees in suitable cases -- reading you, I thought it was as if you were saying it must be defective since it doesn't fuse. The advantage mostly has to do with space -- and in the present case, of course, the composability of the 'reified' folds -- though `-O2` and `-fllvm` may be able to crush any strict left fold further. This is why it isn't a problem that e.g. the `Pipe` `fold` with `runIdentity` is n times slower (for some n); the main trouble is to avoid some sort of buildup of thunks in the accumulator, as was happening when I tried to work with the original Writer versions of these folds. This was especially clear where one was using `Endo`: we don't need to look at core to know that we will be accumulating a monster before we ever get the (e.g.) number we were hoping for; it couldn't be otherwise. The pathologies with large lists/producers/sequences-in-general are those familiar from `foldr (+) 0` -- or am I in a muddle?
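The accumulator-thunk buildup mentioned above is the classic lazy left-fold pathology; a small illustration with nothing but standard Prelude/`Data.List` functions:

```haskell
import Data.List (foldl')

-- foldl defers every (+) until the end, holding a chain of n thunks
-- alive at once; foldl' forces the accumulator at each step, so the
-- fold runs in constant space.
lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0  -- builds (((0 + 1) + 2) + ...) unevaluated
strictSum = foldl' (+) 0  -- accumulator forced at every step
```

Both give the same answer on finite input; the difference only shows up as memory behavior (and stack overflows) on large inputs.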
A couple points not on the optimization front, before I forget: In the `statistics` package, esp. https://github.com/bos/statistics/blob/master/Statistics/Sample.hs, almost everything he defines by recursion is in the form of `fold(vector) (Fold go seed fini)`, to use the recurrent expressions from that module and elsewhere; he is forever writing `foo vec = fini (Vector.foldl' go special-superstrict-seed vec)`, as if waiting for the present type to make the abstraction. Something like this may be the killer app for a 'beautiful folding' library and so might be worth a look. One obvious point comes up in that module that must be familiar to you: a fold `(/) <$> sum <*> genericLength` might make intuitive sense, but give the wrong answer, since `sum` overflows; nevertheless `mean` makes sense. It is probably as well to export `average` and `mean` for integral and fractional cases that don't have this problem; he uses one. (I was going to look at these and other things like the histogram functions to see if they could be expressed with `Fold` in a clearer way, and perhaps without his unrelenting monomorphism and vectorism.)
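The overflow-safe alternative alluded to here can be sketched as a single-pass running mean in the `Fold go seed fini` style. `meanF` and its driver are hypothetical names of mine, not the statistics package's or the library's definitions:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

fold :: Fold a b -> [a] -> b
fold (Fold step begin done) = done . foldl' step begin

data P a = P !a !Int  -- strict running mean and count

-- A hypothetical single-pass mean: it keeps a running mean instead of a
-- full sum, so it avoids the overflow that
-- (/) <$> sum <*> genericLength risks for bounded numeric types.
meanF :: Fractional a => Fold a a
meanF = Fold step (P 0 0) done
  where
    step (P m n) x = let n' = n + 1
                     in P (m + (x - m) / fromIntegral n') n'
    done (P m _) = m
```

Each step nudges the running mean toward the new element by `(x - m) / n'`, so the accumulator stays in the same range as the data.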
I wonder if functions like `all` in the library are such a great idea, at least without comment; the ordinary Haskell experience of the cheapness of these functions goes away, since each term in the series will be tested even after we have found a failing case.
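The lost short-circuiting can be seen with a strict-left-fold `all` (a hypothetical stand-in for the library's, written here directly over lists):

```haskell
import Data.List (foldl')

-- A Fold-style 'all' as a strict left fold: it walks the entire input,
-- even after the answer is already known to be False.
allFold :: (a -> Bool) -> [a] -> Bool
allFold p = foldl' (\acc x -> acc && p x) True
```

Prelude's `all` stops at the first failing element, so `all even (2 : 3 : undefined)` returns `False` without touching the tail; `allFold` forces the whole spine and would crash on that same input.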
Also, I was trying a bit ago to think of some way of zooming folds -- or something remotely like that -- so that e.g. a `Fold Double Double` could be folded over a `Vector (Double, Int)` -- along with e.g. a `Fold Int Int` and a `Fold (Double, Int) Int` -- as if it were `liftA3 (,,) (fzoom _1 fold1) (fold2) (fzoom _2 foldl2)`. It is easy enough to define a function for the pair case, but I wondered if a function `Lens -> Fold -> Fold` might not be pleasing, especially where we have e.g. a user-defined Unboxed type with various fields. This is perhaps a little obscure, it just occurring to me a bit ago.
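The pair case mentioned is easy to sketch: zooming a fold onto one component is just contravariant premapping of a projection onto the fold's input (the later `foldl` library exposes exactly this as `premap`). `premapF` here is an illustrative name:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

fold :: Fold a b -> [a] -> b
fold (Fold step begin done) = done . foldl' step begin

-- Run a Fold over one component of a larger element by premapping a
-- projection onto its input; a lens getter would generalize this.
premapF :: (s -> a) -> Fold a b -> Fold s b
premapF f (Fold step begin done) = Fold (\x s -> step x (f s)) begin done
```

For example, `fold (premapF fst (Fold (+) 0 id))` sums the first components of a list of pairs; a `Lens -> Fold -> Fold` version would replace `fst` with any getter.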
Oh, a bunch of remarks in the middle there are a bit muddled; the main thing is simpler than I thought. Maybe you were basically saying this. The reason criterion gives worse results for `Control.Foldl` than e.g.

```haskell
main = getArgs >>= \[n] ->
  print $ I.fold (liftA2 (/) I.sum I.genericLength) (enumFromTo (1 :: Double) (read n))
```

is just that criterion's `whnf` comes in between function and argument and thus blocks fusion. Moreover it prepares the 'actual' list in advance. It thus tends to favor a `foldl'` version in this sort of case. (The same would occur with stream fusion of course, and this is how e.g. Data.Text is benched.) Nevertheless, I can't get a simple main to be faster with `Control.Foldl`; a typical result is
```
$ time ./Foldl 10000000
5000000.5
real    0m0.902s

$ time ./Inlines 10000000
5000000.5
real    0m0.731s
```
which of course isn't much. I will study further how to make a more rational comparison.
Right, the Prelude `foldl'` requires a list and does not do any sort of stream fusion, so no matter what we do it will never fuse away the list. What I was trying to say is that in your benchmarks `foldl'` was doing better than the right fold for reasons having nothing to do with stream fusion, specifically the body of the loop getting trapped in a `letrec` block. However, I still think we should stick closely to the `foldr`-based solution because that at least has a chance of getting fusion to fire, and when it does fire it works amazingly well.
I agree that criterion's splitting of the function from the argument, combined with `whnf` on the argument, is probably interfering with list fusion, which is why we only observe fusion when using the `time` utility or studying core.
I also still have no idea why your computer and my work computer never get stream fusion to work. I only observe this on my laptop, which is very surprising. All I know for sure is that ghc is generating drastically different core for the same program, so it may be a behavior change introduced in version `7.4.2` and later that is causing this. Either way, we have to fix this since I can't expect people to use version `7.4.1`.
This decision seems reasonable; I haven't had time to look into it more yet. I wasn't thinking that fusion really enters into the matter in any case; `Data.List.foldl'` does well because the compiler knows how to optimize suitable tail-recursive functions whatever the subject, not because it has a special list-optimization and list-elimination scheme like ghc's `build/foldr` fusion or the (list) stream fusion of the `stream-fusion` package. That is, if you define your own `data List a = Nil | Cons a (List a)`, the `foldl'` you define will be just as awesome as `Data.List.foldl'`, for this reason: it is a more general and deeper form of optimization, not a fusion or deforestation scheme. The (totally aprioristic) reason why I was resisting saddling `fold` with the build/foldr system, rightly or wrongly, is that it is known to be tailored exactly to potentially infinite lists and foldrs of lazy operations over them and so forth. The strict `foldl'` is really a different world -- and the one this Foldl library is about; I don't believe it is an oversight that there is no attempt to bring `Data.List.foldl'` in line with `build/foldr` in the standard library; it is a part of the theory. But of course it is a matter of experience and testing in fact. I will look into it more. In any case, the optimization scheme for this particular function is a detail; it was only necessary to make sure there were no gruesome space leaks in the offing.
So it seems that I've gotten the hang of getting fusion to fire reliably. The trick was just adding `INLINABLE` pragmas to the `Applicative` and `Functor` type class methods, using `foldr` in the implementation of all folding functions, and attaching `INLINE` pragmas wherever `foldr` was used. I no longer see any examples where fusion doesn't fire using this general trick.
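A sketch of that recipe, with illustrative instances (the library's actual definitions differ in strictness details): `INLINABLE` on the class methods, a `foldr`-shaped driver, and `INLINE` where `foldr` is used.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)
  {-# INLINABLE fmap #-}

instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (\() -> b)
  {-# INLINABLE pure #-}
  Fold stepL beginL doneL <*> Fold stepR beginR doneR =
    Fold
      (\(xL, xR) a -> (stepL xL a, stepR xR a))  -- run both steps in lockstep
      (beginL, beginR)
      (\(xL, xR) -> doneL xL (doneR xR))
  {-# INLINABLE (<*>) #-}

-- foldr-shaped driver, marked INLINE so the fold/build rule can see it.
fold :: Fold a b -> [a] -> b
fold (Fold step begin done) as = foldr (\a k x -> k $! step x a) done as begin
{-# INLINE fold #-}

sumF :: Num a => Fold a a
sumF = Fold (+) 0 id

lengthF :: Fold a Int
lengthF = Fold (\n _ -> n + 1) 0 id
```

With this shape, `fold ((,) <$> sumF <*> lengthF)` traverses the input once and, when fusion fires, the list is never materialized at all.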
I made a branch with a bunch of benchmarks. I didn't make a pull request as it contains extraneous material, and a hideously boilerplated Bench.hs.

The benchmarks/Fold directory contains three replicas of the current `Control.Foldl` with different optimization ideas. Everything pertains to folds over lists; I was thinking of adding more pertaining to bytestring, vector and text -- and I suppose pipes, though perhaps pipes foldls should be benchmarked elsewhere. It contains a report.html made with carter's template. For some reason this is very large (and also javascripty); it takes a minute for my browser to open it, but it is much clearer than the default (perhaps because I don't know how best to arrange benchmarks).

The result of the list-folding competition, so far, is very much in favor of `benchmarks/Foldl/Inlines`: it combines 1) a simple-minded early timing of the inlining of `<*>`, trying to make sure that `<*>` has exposed the constructor before `fold` tries anything, so to speak (I see now that I forgot to delay the inlining of `fold` in this version, but it seems not to have mattered); and 2) a definition of `fold` that just imports `Data.List.foldl'`. I still wonder if a scheme using Dan Burton's idea of using a RULE pertaining to `build` might do something better, but maybe `build` really just is made for foldrs.

I meant to record that I ran the benchmark executable with

having built with