Bodigrim / tasty-bench

Featherlight benchmark framework, drop-in replacement for criterion and gauge.
https://hackage.haskell.org/package/tasty-bench
MIT License

Output of the benchmark function is retained in memory #39

Closed: noughtmare closed this issue 6 months ago

noughtmare commented 1 year ago

See my Discourse thread for the full story.

The short version is that this issue is reproduced by this program:

import Test.Tasty.Bench
-- import Criterion.Main
import Control.DeepSeq

-- Like 'drop', but copies the remaining list instead of sharing a suffix
drop' :: Int -> [a] -> [a]
drop' n s =
      case n of
        0 ->
          case s of
            [] -> []
            x : xs -> x : drop' 0 xs
        _ ->
          case s of
            [] -> []
            x : xs -> drop' (n - 1) xs

main :: IO ()
main = do
  let input = replicate 1000000 'a'
  defaultMain
    [ bench "1" $ whnf (rnf . drop' 100000) input
    , bench "2" $ nf (drop' 100000) input
    ]

The result is:

All
  1: OK (2.67s)
    4.88 ms ± 250 μs,  41 MB allocated, 1.3 KB copied,  53 MB peak memory
  2: OK (1.72s)
    26.0 ms ± 2.6 ms,  41 MB allocated,  40 MB copied, 120 MB peak memory

Changing to criterion yields these wildly different results:

benchmarking 1
time                 4.943 ms   (4.867 ms .. 5.039 ms)
                     0.993 R²   (0.987 R² .. 0.998 R²)
mean                 5.130 ms   (5.039 ms .. 5.235 ms)
std dev              309.2 μs   (245.2 μs .. 386.1 μs)
variance introduced by outliers: 35% (moderately inflated)

benchmarking 2
time                 5.058 ms   (4.957 ms .. 5.163 ms)
                     0.996 R²   (0.991 R² .. 0.998 R²)
mean                 5.213 ms   (5.124 ms .. 5.349 ms)
std dev              337.9 μs   (248.5 μs .. 504.7 μs)
variance introduced by outliers: 39% (moderately inflated)

I've tracked it down in Core to this difference:

-- Expression that evaluates benchmark 1
seq# (case $wgo ($wdrop' 100000# x1) of { (# #) -> () }) eta2

-- Helper function for benchmark 2
eta1 :: [Char] -> [Char]
eta1 = \ (s :: [Char]) -> $wdrop' 100000# s

-- Expression that evaluates benchmark 2
seq#
  (let {
    x2 :: [Char]
    x2 = eta1 x1 } 
   in
    case $wgo x2 of { (# #) -> x2 })
  eta2

This shows that benchmark 2 retains the whole result x2 in memory while it is being normalized ($wgo), because x2 is returned again after normalization completes.

I think this can be solved by using rnf instead of force:

-nf = funcToBench force
+nf = funcToBench rnf
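For reference, the relevant difference between the two reducers from Control.DeepSeq is that force returns the fully evaluated value, while rnf returns (), so nothing keeps the value alive once it has been traversed. A minimal sketch:

```haskell
import Control.DeepSeq (force, rnf)

-- force x = rnf x `seq` x: the caller gets the value back, so the
-- whole structure stays live for as long as the result is referenced.
retaining :: [Int] -> [Int]
retaining = force

-- rnf :: NFData a => a -> (): the value is traversed to normal form,
-- but only () is returned, so list cells can be collected as soon as
-- they have been visited.
discarding :: [Int] -> ()
discarding = rnf

main :: IO ()
main = do
  print (length (retaining [1 .. 10]))
  print (discarding [1 .. 10])
```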
Bodigrim commented 1 year ago

Should whnf in this case be defined as whnf = funcToBench seq?
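For readers following the thread: funcToBench is tasty-bench's internal helper. The sketch below is a simplified approximation of its shape (the signature and body here are assumptions, not the real implementation), but it shows why the choice of reducer determines what is retained:

```haskell
import Control.DeepSeq (force, rnf)
import Control.Exception (evaluate)

-- Hypothetical simplification: apply the function, reduce the result,
-- and evaluate the reduction to WHNF. With reduce = force the whole
-- result is returned by evaluate and hence kept alive; with
-- reduce = rnf only () survives the evaluation.
funcToBench :: (b -> c) -> (a -> b) -> a -> IO ()
funcToBench reduce f x = () <$ evaluate (reduce (f x))

main :: IO ()
main = do
  funcToBench force (\n -> replicate n 'a') 5  -- result list kept alive
  funcToBench rnf   (\n -> replicate n 'a') 5  -- only () kept alive
  putStrLn "done"
```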

noughtmare commented 1 year ago

I don't think whnf needs to be changed. Only nf, nfIO, and nfAppIO.

Bodigrim commented 1 year ago

Well, it would be more symmetric to change both.

I'm sorry, I'm terribly overloaded at the moment. If you can take a bench suite of, say, bytestring or containers and check the effect of nf = funcToBench rnf, I'll be very grateful.

noughtmare commented 1 year ago

OK, I'll make the change and try it out on your existing suites.

noughtmare commented 1 year ago

Here's the raw results for bytestring with GHC 9.4.4:

rnf-bytestring-after.csv rnf-bytestring-before.csv

And the full output of the command line (including the baseline comparisons):

out.log

The most remarkable result is that unpack consistently takes about 200% more time. I will investigate that further.

Update: unpack allocates much more for some reason.

noughtmare commented 1 year ago

I've been convinced by Andreas Klebinger that using force is actually more representative of (worst-case) real-world behavior. And it is easy to get the other behavior by just including rnf in your benchmark. So perhaps all that should be done is to add some documentation explaining this.

Bodigrim commented 1 year ago

Documentation patch is welcome.

Bodigrim commented 1 year ago

I've updated documentation in 461919e.

Bodigrim commented 1 year ago

In light of #48, I'm thinking that maybe it's worth changing nf after all.

-nf = funcToBench force
+nf = funcToBench rnf

Sometimes you do benchmark operations on lists and it makes sense to measure their allocation as well. However, what happens much more often in my practice is that a list is just an implementation detail to run a computation N times, like in

nf (\n -> map func [1..n]) 100

In such a case you really do not want to allocate that list; it just defeats the purpose by introducing more noise into the measurements of func.
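With the current force-based nf, the list in the example above is allocated in full and retained until normalization finishes. When the list is only a loop counter, a fold that consumes each result immediately sidesteps the allocation; a sketch, where func is a hypothetical workload standing in for the function under test:

```haskell
import Data.List (foldl')

-- Hypothetical workload standing in for the function under test.
func :: Int -> Int
func x = x * x + 1

-- List-based: builds [func 1 .. func n]; with a force-based nf the
-- whole list ends up retained in memory.
asList :: Int -> [Int]
asList n = map func [1 .. n]

-- Fold-based: each func result is consumed immediately, so no list
-- is retained.
asFold :: Int -> Int
asFold n = foldl' (\acc i -> acc + func i) 0 [1 .. n]

main :: IO ()
main = do
  print (sum (asList 100))
  print (asFold 100)
```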

Bodigrim commented 1 year ago

However, this change would have to wait until tasty-bench-0.4, which in turn awaits the release of tasty-1.5, so probably somewhere in September.

Bodigrim commented 6 months ago

Changed in 3ce8baf8260582d10b0894bc2bbd6b3105b41a5a.