Bodigrim / tasty-bench

Featherlight benchmark framework, drop-in replacement for criterion and gauge.
https://hackage.haskell.org/package/tasty-bench
MIT License

Can we get a probabilistic check of asymptotic complexity? #8

Closed. kindaro closed this issue 3 years ago.

kindaro commented 3 years ago

For example, consider the following output:

    1:                            OK (0.18s)
       10 ns ± 920 ps
    2:                            OK (0.21s)
       12 ns ± 714 ps
    4:                            OK (0.17s)
       19 ns ± 1.8 ns
    8:                            OK (0.16s)
       37 ns ± 2.7 ns
    16:                           OK (0.32s)
       74 ns ± 2.6 ns
    32:                           OK (0.17s)
      157 ns ±  14 ns
    64:                           OK (0.17s)
      311 ns ±  29 ns
    128:                          OK (0.17s)
      622 ns ±  53 ns
    256:                          OK (0.17s)
      1.3 μs ±  86 ns
    512:                          OK (0.17s)
      2.5 μs ± 203 ns
    1024:                         OK (0.17s)
      4.9 μs ± 352 ns
    2048:                         OK (0.17s)
      9.9 μs ± 800 ns
    4096:                         OK (0.17s)
       20 μs ± 1.9 μs
    8192:                         OK (0.17s)
       40 μs ± 2.8 μs
    16384:                        OK (0.17s)
       80 μs ± 5.6 μs
    32768:                        OK (0.17s)
      158 μs ±  12 μs
    65536:                        OK (0.17s)
      318 μs ±  31 μs

Here the same function is measured with exponentially increasing input sizes. Judging from the numbers, its time complexity looks linear.

Can we make a magical automated check out of these numbers?

Bodigrim commented 3 years ago

It would be nice, but probably too complicated for tasty-bench to accomplish. Is there a Haskell library (snippet / function / whatever) that can guess the form of a function given a list of tabulated values? Is it feasible to distinguish O(n) from O(n log n)?

I was thinking about implementing something in the spirit of bcompare, which mysteriously disappeared starting from criterion-1.0.0.0. It could be a good step moving away from absolute numbers in benchmarks.

kindaro commented 3 years ago

We can phrase it differently: is the given function distinguishable from a given asymptote, up to a constant multiple, within some realistic range of inputs? Of course, a pathological example like f(n) + (εn)! can always be invented that looks like f up to a large enough n and then suddenly explodes. But in such cases we may as well not have a benchmark in the first place.


> It would be nice, but probably too complicated for tasty-bench to accomplish. Is there a Haskell library (snippet / function / whatever) that can guess the form of a function given a list of tabulated values? Is it feasible to distinguish O(n) from O(n log n)?

I am not aware of any ready-made solution in Haskell (although a quick search reveals prior art in other settings). But how hard can it be?

  1. Apply inverses of a bunch of functions representing asymptotic classes.
  2. Fit a linear model to each resulting data set.
  3. Select a class where the error is the least.

This can be flaky, since some classes are very similar. But we do not really need that much. We only need to tell whether a function may plausibly belong to a given class, up to a given error. We really only want to automate and make precise the judgement that we already routinely make with the naked eye. And we already put benchmarks in a test suite, so it is a logical next step to make them actually detect failures.
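For the record, the steps above can be sketched in a few lines of Haskell. Everything here is illustrative, not tasty-bench API: the names `candidates`, `fitError` and `guessClass` are made up, and instead of applying inverses and fitting a general linear model, this directly fits a constant multiple t ≈ a·g(n) of each candidate class g by least squares and picks the class with the smallest relative residual, which is the same idea in a simpler form.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Candidate asymptotic classes, each represented by a model function g.
candidates :: [(String, Double -> Double)]
candidates =
  [ ("O(n)",       id)
  , ("O(n log n)", \n -> n * logBase 2 n)
  , ("O(n^2)",     \n -> n * n)
  ]

-- Relative residual of the best-fitting scale a in t ~ a * g(n),
-- where a is the least-squares solution for the given data points.
fitError :: [(Double, Double)] -> (Double -> Double) -> Double
fitError pts g = sqrt (sum [ (a * g n - t) ^ (2 :: Int) | (n, t) <- pts ]
                        / sum [ t * t | (_, t) <- pts ])
  where
    a = sum [ g n * t | (n, t) <- pts ]
      / sum [ g n * g n | (n, _) <- pts ]

-- Pick the candidate class with the smallest relative residual.
guessClass :: [(Double, Double)] -> String
guessClass pts = fst (minimumBy (comparing (fitError pts . snd)) candidates)
```

On perfectly linear data this selects "O(n)"; on noisy real timings one would additionally want a threshold on the winning residual before claiming membership in a class.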


I understand that this would be experimental and anything can go wrong. And we do not have any doctors of statistics on the payroll. So maybe you are right and we should not even try.

Bodigrim commented 3 years ago

It does not take a PhD in statistics, but it probably involves opinionated design choices and extra dependencies. tasty-bench is a good citizen of the tasty ecosystem, so one can potentially write another TestReporter, which would perform this kind of analysis. Additional metadata can be passed via a dummy pattern in after. I'm interested to see how this plays out, because I had similar desires in the past.

Actually, even a standalone package, which can guess asymptotics of a function, would be nice to have.

Bodigrim commented 3 years ago

I made some progress with bcompare atop tasty's facilities for test dependencies. Documentation is still absent, but here is an example of usage: https://github.com/Bodigrim/mod/commit/cb4b737e7d76993bfa3e55264e165c29bd1199e6 It generates a report comparing benchmark results with each other:

  Sum
    Data.Mod:           OK (1.46s)
      483 ms ±  41 ms
    Data.Mod.Word:      OK (2.18s)
      145 ms ± 8.2 ms, 0.30x
    finite-field:       OK (12.76s)
      4.26 s ± 133 ms, 8.81x
    finite-typelits:    OK (13.34s)
      4.46 s ± 254 ms, 9.25x
    modular-arithmetic: OK (8.63s)
      2.87 s ±  91 ms, 5.95x
    modular:            OK (45.02s)
      6.53 s ± 543 ms, 13.52x
Bodigrim commented 3 years ago

With bcompare the original example can look like the output below, making it easier to guesstimate algorithmic complexity:

All
  folds
    foldl'
      1:     OK (3.86s)
         14 ns ± 434 ps, 0.00x
      2:     OK (4.87s)
         18 ns ± 500 ps, 0.00x
      4:     OK (14.16s)
         26 ns ± 586 ps, 0.01x
      8:     OK (6.25s)
         46 ns ± 1.5 ns, 0.01x
      16:    OK (11.60s)
         86 ns ± 1.8 ns, 0.02x
      32:    OK (11.08s)
        165 ns ± 4.8 ns, 0.04x
      64:    OK (4.95s)
        292 ns ± 8.7 ns, 0.07x
      128:   OK (9.77s)
        582 ns ±  12 ns, 0.14x
      256:   OK (9.34s)
        1.1 μs ±  34 ns, 0.26x
      512:   OK (9.43s)
        2.2 μs ±  70 ns, 0.53x
      1024:  OK (4.45s)
        4.2 μs ± 132 ns
      2048:  OK (19.41s)
        8.9 μs ± 313 ns, 2.11x
      4096:  OK (4.64s)
         18 μs ± 654 ns, 4.15x
      8192:  OK (9.36s)
         35 μs ± 677 ns, 8.38x
      16384: OK (4.68s)
         71 μs ± 2.3 μs, 16.90x
      32768: OK (9.45s)
        144 μs ± 3.1 μs, 34.11x
      65536: OK (37.45s)
        283 μs ± 3.5 μs, 67.20x
kindaro commented 3 years ago

So, it is comparing everything to the value at 1024? Nice!

But there are 2 drawbacks:

  1. It is even more numbers on the screen for my tired eyes to look at.
  2. The way to define it is via some obscure reference language? I use awk occasionally but this is, like, anti-Haskell.

So, it is sort of great, but also sort of not.

Bodigrim commented 3 years ago
  1. Yes, tasty uses awk expressions for references between tests. The output above was generated from:

    bgroup "foldl'" $
      map (\s -> (if S.length s == 1024 then id else bcompare "$NF == \"1024\" && $(NF-1) == \"foldl'\"") $
        bench (show $ S.length s) $
          nf (S.foldl' (\acc x -> acc + fromIntegral x) (0 :: Int)) s) foldInputs

This is arguably messy, because we stretch tasty to its limits. It would be nice to provide more ergonomic combinators.

UnkindPartition commented 3 years ago

I came here with the same feature request :-)

I think what's missing in tasty-bench is a way to time something without wrapping it in a TestTree/Benchmark.

Something like measure :: Benchmarkable -> IO Timings, where Timings would contain the summary statistics (mean, sd) as well as the raw measurements that were used to come up with these statistics.

Then downstream packages could use this to do some more intelligent analyses. (I'd have a go at one.)
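For reference, a self-contained sketch of what such a measure function might look like, using only base's System.CPUTime. The Timings type and its field names are this comment's hypothetical proposal, not actual tasty-bench exports; a real implementation would take a Benchmarkable and batch iterations adaptively rather than time single runs a fixed number of times.

```haskell
import Control.Monad (replicateM)
import System.CPUTime (getCPUTime)

-- Hypothetical shape of the proposed API: summary statistics
-- plus the raw measurements they were computed from.
data Timings = Timings
  { timingsMean   :: Double   -- mean, in seconds
  , timingsStdDev :: Double   -- standard deviation, in seconds
  , timingsRaw    :: [Double] -- individual measurements, in seconds
  } deriving Show

-- Time an IO action a fixed number of times and summarize.
measure :: Int -> IO () -> IO Timings
measure runs act = do
  raw <- replicateM runs $ do
    t0 <- getCPUTime
    act
    t1 <- getCPUTime
    pure (fromIntegral (t1 - t0) * 1e-12)  -- getCPUTime is in picoseconds
  let mean = sum raw / fromIntegral runs
      var  = sum [ (x - mean) ^ (2 :: Int) | x <- raw ] / fromIntegral runs
  pure (Timings mean (sqrt var) raw)
```

The point is the return type: by exposing raw measurements alongside the summary, downstream packages can run their own analyses, such as the asymptotic fitting discussed above.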

@kindaro: see here for the explanation of why awk is used for patterns in tasty.

Bodigrim commented 3 years ago

@UnkindPartition what if I expose something like this? https://github.com/Bodigrim/tasty-bench/blob/1aac590eae5ed6259eed44a7403403a3b07d26be/Test/Tasty/Bench.hs#L788-L790 Would it be enough?

UnkindPartition commented 3 years ago

It would also be nice to have a way to estimate a "good enough" number of runs, as described in the "Statistical model" section of the README. Otherwise that logic would have to be duplicated between packages.
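The logic in question, as the README describes it (time a batch of n iterations and a batch of 2n, estimate mean and spread from the pair, and double n until the relative spread is small enough), can be sketched roughly as below. This is a loose paraphrase under assumptions, not tasty-bench's actual internals, and the name converge is invented for illustration.

```haskell
-- Loose sketch of the README's "Statistical model": compare a batch of n
-- iterations against a batch of 2n, derive a per-iteration mean and a
-- crude spread estimate, and double n until the relative spread is below
-- the target (with a cap so the loop always terminates).
converge
  :: Double               -- target relative standard deviation, e.g. 0.05
  -> (Int -> IO Double)   -- time a batch of n iterations, in seconds
  -> IO (Double, Double)  -- (per-iteration mean, per-iteration spread)
converge relTarget timeBatch = go 1
  where
    go n = do
      t1 <- timeBatch n
      t2 <- timeBatch (2 * n)
      let mean   = t2 / fromIntegral (2 * n)
          spread = abs (t2 - 2 * t1) / fromIntegral (2 * n)
      if spread < relTarget * mean || n >= 2 ^ (20 :: Int)
        then pure (mean, spread)
        else go (2 * n)
```

Exposing something of this shape would indeed save downstream packages from reimplementing the stopping rule.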

Bodigrim commented 3 years ago

@UnkindPartition how about this? https://github.com/Bodigrim/tasty-bench/blob/ee6b2d6eb2686f4020ebc06b8ce43fd15dc50d67/Test/Tasty/Bench.hs#L847-L850

UnkindPartition commented 3 years ago

This looks good. Would be good to document what the timeout does, exactly. (Would it use the max number of iterations that fit into the timeout even if RelStDev is not satisfied? Would it throw an exception?)

Bodigrim commented 3 years ago

Timeout takes a soft priority over RelStDev. No exceptions are thrown. I've elaborated a bit more in haddocks: https://github.com/Bodigrim/tasty-bench/blob/06fc7c1e2706942c4581e13d28d7ac9c901a7b11/Test/Tasty/Bench.hs#L881-L889

kindaro commented 3 years ago

Does this get invoked automatically when a benchmark runs in the presence of saved results from a previous run, so that benchmark results can be committed to version control, thereby turning the benchmarks into a true test suite? Also, can we have a default name for benchmark files, so that cabal test just works? Or is this still a fantasy?

Bodigrim commented 3 years ago

@kindaro not sure what exactly you refer to by "this". Since measureCpuTime is now publicly exposed, one can write a tasty provider to detect and check asymptotic complexity. I think that such a provider is unlikely to be merged into tasty-bench itself, but the beauty of tasty is that you can freely combine providers and ingredients at your own discretion.

You can specify a default name for the CSV report like this:

    import Data.Maybe
    import Test.Tasty.Bench
    import Test.Tasty.Ingredients
    import Test.Tasty.Options
    import Test.Tasty.Runners

    -- benches :: Benchmark is assumed to be defined elsewhere
    main :: IO ()
    main = do
      opts <- parseOptions benchIngredients benches
      -- Fall back to "foo.csv" unless --csv was passed on the command line.
      let opts' = changeOption (Just . fromMaybe (CsvPath "foo.csv")) opts
      case tryIngredients benchIngredients opts' benches of
        Nothing -> putStrLn "no ingredient"
        Just mb -> do
          b <- mb
          putStrLn $ if b then "Success" else "Failure"