Gabriella439 / foldl

Composable, streaming, and efficient left folds

benchmarks #2

Closed michaelt closed 10 years ago

michaelt commented 11 years ago

I made a branch with a bunch of benchmarks. I didn't make a pull request as it contains extraneous material, and a hideously boilerplated Bench.hs.

The benchmarks/Fold directory contains three replicas of the current Control.Foldl with different optimization ideas. Everything pertains to folds over lists; I was thinking of adding more for bytestring, vector, and text -- and I suppose pipes, though perhaps pipes foldls should be benchmarked elsewhere. It also contains a report.html made with Carter's template. For some reason this is very large (and also javascripty); it takes a minute for my browser to open it, but it is much clearer than the default (perhaps because I don't know how best to arrange benchmarks).

The result of the list-folding competition, so far, is very much in favor of benchmarks/Foldl/Inlines: it combines 1) a simple-minded early timing of the inlining of <*>, trying to make sure that <*> has exposed the constructor before fold tries anything, so to speak (I see now that I forgot to delay the inlining of fold in this version, but it seems not to have mattered); and 2) a definition of fold that just uses Data.List.foldl'. I still wonder whether a scheme using Dan Burton's idea of a RULE pertaining to build might do better, but maybe build really is just made for foldrs.
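Roughly, the Inlines variant is shaped like the following (a simplified sketch, not the literal benchmark file): the Applicative machinery is inlined eagerly so that fold sees the Fold constructor, and fold itself just delegates to Data.List.foldl'.

    {-# LANGUAGE ExistentialQuantification #-}
    module Inlines where

    import Control.Applicative
    import qualified Data.List

    data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

    -- Strict pair, so the combined accumulator never builds up thunks.
    data Pair a b = Pair !a !b

    instance Functor (Fold a) where
        fmap f (Fold step begin done) = Fold step begin (f . done)
        {-# INLINE fmap #-}

    instance Applicative (Fold a) where
        pure b = Fold (\() _ -> ()) () (\() -> b)
        {-# INLINE pure #-}
        Fold stepL beginL doneL <*> Fold stepR beginR doneR =
            let step (Pair xL xR) a = Pair (stepL xL a) (stepR xR a)
                done (Pair xL xR)   = doneL xL (doneR xR)
            in  Fold step (Pair beginL beginR) done
        {-# INLINE (<*>) #-}  -- inline early, so 'fold' sees the constructor

    -- 'fold' just delegates to the strict left fold from Data.List.
    fold :: Fold a b -> [a] -> b
    fold (Fold step begin done) = done . Data.List.foldl' step begin
    {-# INLINE fold #-}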

I meant to record that I ran the benchmark executable with

 dist/build/prelude-benchmarks/prelude-benchmarks --template=benchmarks/report.tpl -o benchmarks/report.html

having built with

 cabal configure --enable-bench --package-db=../.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d/
Gabriella439 commented 11 years ago

Thanks! Give me a few days to process this, but I appreciate this a lot.

michaelt commented 11 years ago

Of course. I just realized I forgot to link the branch: https://github.com/michaelt/Haskell-Foldl-Library/tree/bench

Gabriella439 commented 11 years ago

Ok, I got a chance to study this more closely. I'm noticing very curious results that I cannot explain.

So first I tested this on my work computer and I got consistent results using both your criterion benchmarks and time: Inline.hs outperforms the original foldl library by a factor of 2x to 3x.

I went to check the generated core and found that Inline.hs was not invoking fold/build fusion, while the classic foldl library was invoking fold/build fusion, but it was still running slower because it was producing less efficient core for some bizarre reason.

Here's the core that Inline.hs was generating:

main_$s$wlgo [Occ=LoopBreaker]
  :: Int#
     -> Int#
     -> [Int]
     -> (# Int, Int #)

main_$s$wlgo =
  \ (sc_s2oA :: Int#)
    (sc1_s2oB :: Int#)
    (sc2_s2oC :: [Int]) ->
    case sc2_s2oC of _ {
      [] -> (# I# sc_s2oA, I# sc1_s2oB #);
      : x_a14J xs_a14K ->
        case x_a14J of _ { I# y_a17o ->
        main_$s$wlgo
          (+# sc_s2oA y_a17o) (+# sc1_s2oB 1) xs_a14K
        }
    }

The important thing to note is that Inline.hs, while more efficient, is still not fusing away the list.

The original foldl was producing this ugly core stuck within a letrec block:

main2 =
  case Control.Foldl.$fApplicativeFold_$c<*>
         @ Int
         @ Int
         main3
         (Control.Foldl.genericLength_$sgenericLength @ Int)
  of _ { Control.Foldl.Fold @ x0_ay5 step1_ay6 begin_ay7 done_ay8 ->
  case done_ay8
         (letrec {
            go_a114 [Occ=LoopBreaker] :: Int# -> x0_ay5 -> x0_ay5

            go_a114 =
              \ (x_a115 :: Int#) (eta_B1 :: x0_ay5) ->
                case step1_ay6 eta_B1 (I# x_a115)
                of vx_a10T { __DEFAULT ->
                case x_a115 of wild1_Xp {
                  __DEFAULT -> go_a114 (+# wild1_Xp 1) vx_a10T;
                  100000000 -> vx_a10T
                }
                }; } in
          go_a114 0 begin_ay7)
  of _ { I# ww_a1Xr ->
  $wshowSignedInt 0 ww_a1Xr ([] @ Char)
  }
  }

Here it's fusing away the loop (and ghc-core confirms that the fold/build rule fires) but the inner loop is stuck within a letrec block which seems to take a toll on performance.

This was weird because I got much better core on my laptop, so I checked out your repository on the laptop and redid the comparison, and this time the results varied.

I checked their core, and this time foldl was producing the optimal core (the same as in the original blog post), while Inline.hs was still producing the same core as on my work computer.

This tells me a couple of things:

The only difference I'm aware of is that my work computer is using ghc-7.4.2 and base-4.5.1.0 and my laptop is using ghc-7.4.1 and base-4.5.0.0. Other than that there is no obvious reason why they would produce different core for the exact same code.

However, I think we should strive for code that reliably fuses away the intermediate list without triggering this ugly letrec slowdown, since that always produces a huge speed boost. These analyses show that foldl' never fuses away the intermediate list, so I'm reluctant to change to foldl' just yet.

michaelt commented 11 years ago

I see; I will have to look into this again (I was hardly thinking I had arrived at any final result). I am slightly suspicious of the criterion approach, wonderful as it is, as I guess you are. I should say that my principal purpose, at the point when I mentioned this, was to make sure that complex applicative combinations did not change behavior; these are the only justification for the library. I should also say that I was compiling throughout with -fllvm, which made a considerable difference in many cases, but I forgot to compare plain -O2 with -O2 -fllvm systematically.

The following are somewhat bleary eyed additional remarks.

When you say 'using time', does this mean you wrote main = print $ fold xys and then ran time ./main on the command line? This sort of thing will frequently give different results, since among other things it involves a further composition of expressions, which gives the optimization machinery scope to find further simplifications. (Admittedly I don't see how this could matter here.) In general, criterion results don't tell us much about composition of the benched functions, so one is faced with the (quadratic) labor of benching these compositions; this is mostly what my little tests were doing. In any case, if you were using something like the above main, a criterion test of show . fold ABC would be closer to what is wanted, though again it's hard to see how that could admit any interesting optimization.
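Concretely, the kind of comparison I have in mind looks something like this (a sketch only; the average fold and the input sizes are made up, nothing here is from the benchmark branch):

    import Criterion.Main (defaultMain, bench, nf, whnf)
    import Control.Applicative (liftA2)
    import qualified Control.Foldl as L

    -- Bench the bare fold and the fold composed with 'show', which is closer
    -- to how the function is used in a real 'main'.  Note that the input list
    -- is evaluated once and retained across iterations, which already changes
    -- the situation relative to a standalone program.
    main :: IO ()
    main = defaultMain
        [ bench "fold"        $ whnf (L.fold average) [1 .. 1000000 :: Double]
        , bench "show . fold" $ nf   (show . L.fold average) [1 .. 1000000 :: Double]
        ]
      where
        average = liftA2 (/) L.sum L.genericLength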

What puzzles me, though, is that you seem to think it is obvious that foldr/build optimization is what we want -- or is that what you are saying in the first two paragraphs above? Inlines.hs doesn't fuse away the list because it is fighting the build/foldr machinery, so the list must be generated and consumed. Note that the corresponding vector fold doesn't have this problem: Vector.foldl' fits with the stream optimization machinery, so in

    fold_vector (liftA2 (,) all any) $ map even myVector

the vector map even myVector won't be built, only myVector will be. This is because Vector.foldl' f z = Stream.foldl' f z . stream and map g is unstream . Stream.map g . stream. By contrast, List.foldl' requires the list and can't act on the build thing directly. (A list may be developed from start to finish, but need it ever all be in memory? If not, the emphasis on 'building the structure' is misplaced, I would think.)

Note that no one thinks this means List.foldl' is not the bee's knees in suitable cases -- reading you, I thought it was as if you were saying it must be defective since it doesn't fuse. The advantage mostly has to do with space -- and, in the present case, of course the composability of the 'reified' folds -- though -O2 and -fllvm may be able to crush any strict left fold further. This is why it isn't a problem that e.g. the Pipe fold with runIdentity is n * slower (for some n); the main trouble is to avoid some sort of buildup of thunks in the accumulator, as was happening when I tried to work with the original Writer versions of these folds. This was especially clear where one was using Endo: we don't need to look at core to know that we will be accumulating a monster before we ever get the (e.g.) number we were hoping for; it couldn't be otherwise. The pathologies with large lists/producers/sequences-in-general are those familiar from foldr (+) 0 -- or am I in a muddle?
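(A toy instance of the vector point -- my own example, not from the benchmarks -- where the intermediate vector is never materialised because both V.map and V.foldl' go through the stream machinery:)

    import qualified Data.Vector.Unboxed as V

    -- Counts the even elements; the Vector Bool produced by 'V.map even' is
    -- fused away, and V.foldl' consumes the mapped stream directly.
    countEven :: V.Vector Int -> Int
    countEven = V.foldl' (\n b -> if b then n + 1 else n) 0 . V.map even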

A couple of points not on the optimization front, before I forget. In the statistics package, especially https://github.com/bos/statistics/blob/master/Statistics/Sample.hs , almost everything he defines by recursion is in the form of fold(vector) (Fold go seed fini), to use the recurrent expressions from that module and elsewhere; he is forever writing foo vec = fini (Vector.foldl' go special-superstrict-seed vec), as if waiting for the present type to make the abstraction. Something like this may be the killer app for a 'beautiful folding' library and so might be worth a look. One obvious point comes up in that module that must be familiar to you: a fold ((/) <$> sum <*> genericLength) might make intuitive sense but give the wrong answer, since sum overflows even though the mean makes sense. It is probably as well to export average and mean for integral and fractional cases that don't have this problem; he uses one. (I was going to look at these and other things like the histogram functions to see if they could be expressed with Fold in a clearer way, and perhaps without his unrelenting monomorphism and vectorism.)
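(The sort of thing I mean, as a sketch against the exported Fold constructor; the running-mean update is just one standard choice, not necessarily what the statistics package does:)

    import Control.Foldl (Fold(..))

    -- Strict accumulator: current mean and element count.
    data MeanState = MeanState !Double !Double

    -- Single-pass running mean.  Unlike (/) <$> sum <*> genericLength, the
    -- accumulator never holds the full sum, so it sidesteps the overflow /
    -- precision problem mentioned above.
    mean :: Fold Double Double
    mean = Fold step (MeanState 0 0) done
      where
        step (MeanState m n) x = let n' = n + 1 in MeanState (m + (x - m) / n') n'
        done (MeanState m _)   = m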

I wonder if functions like all in the library are such a great idea, at least without comment; the ordinary Haskell experience of the cheapness of these functions goes away, since each term in the series will be tested even after we have found a failing case.
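(For instance -- a toy illustration of mine, not a benchmark:)

    import qualified Control.Foldl as L

    -- Prelude's 'all even' stops at the first odd element; the Fold-based
    -- version still feeds every remaining element to its step function (and
    -- merely ignores them), so it walks the whole list.
    demo :: (Bool, Bool)
    demo = ( all even ([2, 3] ++ [4 .. 1000000])
           , L.fold (L.all even) ([2, 3] ++ [4 .. 1000000]) )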

Also, I was trying a bit ago to think of some way of zooming folds -- or something remotely like that -- so that e.g. a Fold Double Double could be folded over a Vector (Double, Int) -- along with e.g. a Fold Int Int and a Fold (Double, Int) Int -- as if it were liftA3 (,,) (fzoom _1 fold1) fold2 (fzoom _2 fold3). It is easy enough to define a function for the pair case, but I wondered if a function Lens -> Fold -> Fold might not be pleasing, especially where we have e.g. a user-defined Unboxed type with various fields. This is perhaps a little obscure; it just occurred to me a bit ago.
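(For the plain-function case, the sort of thing I have in mind -- hypothetical names, and a lens Getter could stand in for the projection function:)

    import Control.Applicative (liftA3)
    import Control.Foldl (Fold(..))

    -- Run a fold on a projection of each input element.
    fzoom :: (s -> a) -> Fold a b -> Fold s b
    fzoom f (Fold step begin done) = Fold (\x s -> step x (f s)) begin done

    -- The combination described above, over elements of type (Double, Int).
    combined :: Fold (Double, Int) (Double, Int, Int)
    combined = liftA3 (,,) (fzoom fst fold1) fold2 (fzoom snd fold3)
      where
        fold1 :: Fold Double Double
        fold1 = Fold (+) 0 id                 -- e.g. a running sum
        fold2 :: Fold (Double, Int) Int
        fold2 = Fold (\n _ -> n + 1) 0 id     -- e.g. a count of the pairs
        fold3 :: Fold Int Int
        fold3 = Fold max minBound id          -- e.g. a maximum over the Ints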

michaelt commented 11 years ago

Oh, a bunch of remarks in the middle there are a bit muddled; the main thing is simpler than I thought. Maybe you were basically saying this. The reason criterion gives worse results than e.g.

 main = getArgs >>= \[n] ->
            print $ I.fold (liftA2 (/) I.sum (I.genericLength)) (enumFromTo (1::Double) (read n))

for Control.Foldl is just that criterion's whnf comes in between function and argument and thus blocks fusion. Moreover it prepares the 'actual' list in advance. It thus tends to favor a foldl' version in this sort of case. (The same would occur with stream fusion of course, and this is how e.g. Data.Text is benched.) Nevertheless, I can't get a simple main to be faster with Control.Foldl; a typical result is

$ time ./Foldl 10000000
5000000.5
real    0m0.902s

$ time ./Inlines 10000000
5000000.5
real    0m0.731s

which of course isn't much. I will study further how to make a more rational comparison.

Gabriella439 commented 11 years ago

Right, the Prelude foldl' requires a list and does not do any sort of stream fusion, so no matter what we do it will never fuse away the list. What I was trying to say is that in your benchmarks foldl' was doing better than the right fold for reasons having nothing to do with stream fusion, specifically the body of the loop getting trapped in a letrec block. However, I still think we should stick closely to the foldr-based solution, because that at least has a chance of getting fusion to fire, and when it does fire it works amazingly well.

I agree that criterion's splitting of the function from the argument combined with whnf on the argument is probably interfering with list fusion, which is why we only observe fusion when using the time utility or studying core.

I also still have no idea why your computer and my work computer never get stream fusion to work. I only observe this on my laptop, which is very surprising. All I know for sure is that ghc is generating drastically different core for the same program, so it may be a different behavior introduced in version 7.4.2 and later that is causing this. Either way, we have to fix this since I can't expect people to use version 7.4.1.

michaelt commented 11 years ago

This decision seems reasonable; I haven't had time to look into it more yet. I wasn't thinking that fusion really enters into the matter in any case; Data.List.foldl' does well because the compiler knows how to optimize suitable tail-recursive functions whatever the subject, not because it has a special list-optimization and list-elimination scheme like GHC's build/foldr fusion or the (list) stream fusion of the stream-fusion package. That is, if you define your own data List a = Nil | Cons a (List a), the foldl' you define will be just as awesome as Data.List.foldl' for this reason -- a more general and deeper form of optimization, not a fusion or deforestation scheme. The (totally aprioristic) reason why I was resisting saddling fold with the build/foldr system, rightly or wrongly, is that it is known to be tailored exactly to potentially infinite lists, folds of lazy operations over them, and so forth. The strict foldl' is really a different world -- and the one this Foldl library is about; I don't believe it is an oversight that there is no attempt to bring Data.List.foldl' in line with build/foldr in the standard library; it is a part of the theory. But of course it is a matter of experience and testing in fact. I will look into it more. In any case, the optimization scheme for this particular function is a detail; it was only necessary to make sure there were no gruesome space leaks in the offing.
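(That is, something like this -- just a sketch of the point, nothing library-specific:)

    {-# LANGUAGE BangPatterns #-}

    -- A home-made list: a strict left fold over it compiles to the same kind
    -- of tight accumulator loop as Data.List.foldl', because the win comes
    -- from strict tail recursion, not from any list-specific fusion RULE.
    data List a = Nil | Cons a (List a)

    foldl' :: (b -> a -> b) -> b -> List a -> b
    foldl' f = go
      where
        go !z Nil         = z
        go !z (Cons x xs) = go (f z x) xs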

Gabriella439 commented 10 years ago

So it seems that I've gotten the hang of getting fusion to fire reliably. The trick was just adding INLINABLE pragmas to the Applicative and Functor type class methods, using foldr in the implementation of all folding functions, and attaching INLINE pragmas wherever foldr was used. I no longer see any examples where fusion doesn't fire using this general trick.
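For reference, the foldr-based shape of fold looks roughly like this (a sketch of the standard left-fold-via-foldr encoding; the committed code may differ in details):

    import Control.Foldl (Fold(..))

    -- The left fold is expressed as a right fold over continuations, so
    -- foldr/build fusion can fire when the input list is a good producer.
    fold :: Fold a b -> [a] -> b
    fold (Fold step begin done) xs = foldr cons done xs begin
      where
        cons a k x = k $! step x a
    {-# INLINE fold #-}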