Bodigrim / tasty-bench

Featherlight benchmark framework, drop-in replacement for criterion and gauge.
https://hackage.haskell.org/package/tasty-bench
MIT License

Add debug mode #21

harendra-kumar closed this issue 2 years ago

harendra-kumar commented 3 years ago

I am using tasty-bench to benchmark a piece of code that forks a process. I am not sure how this interacts with the benchmarking process, but the figures I get are wildly off. See below the figures reported by "tasty-bench", "time" and "+RTS -s":

$ time -v cabal run Benchmark.System.Process -- --stdev 1000000 +RTS -s
...
Generating files...
Running benchmarks...
All
  processChunks tr: OK (3.25s)
    312 μs ± 765 μs, 129 KB allocated, 3.7 KB copied

All 1 tests passed (3.25s)
  13,403,301,168 bytes allocated in the heap
      24,131,192 bytes copied during GC
         828,192 bytes maximum residency (48 sample(s))
         101,600 bytes maximum slop
               4 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     12772 colls,     0 par    0.071s   0.078s     0.0000s    0.0004s
  Gen  1        48 colls,     0 par    0.026s   0.026s     0.0005s    0.0008s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    3.170s  (  3.213s elapsed)
  GC      time    0.098s  (  0.104s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    3.268s  (  3.318s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,227,761,766 bytes per MUT second

  Productivity  97.0% of total user, 96.9% of total elapsed

        Command being timed: "cabal run Benchmark.System.Process -- --stdev 1000000 +RTS -s"
        User time (seconds): 7.98
        System time (seconds): 0.53
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.49
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 320096
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 13
        Minor (reclaiming a frame) page faults: 153321
        Voluntary context switches: 1694
        Involuntary context switches: 67
        Swaps: 0
        File system inputs: 24
        File system outputs: 27752
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

There is some more code outside the benchmarks which contributes to the "+RTS -s" and "time" figures, but even after accounting for that, the figures are wildly off.

The code to reproduce this is available here: https://github.com/composewell/streamly-process/tree/tasty-bench-issue .

Even gauge has the same issue. Maybe this is some fundamental issue with this type of code? Or am I doing something wrong in the measurements/code?

Bodigrim commented 3 years ago

I think something's wrong with your benchmark itself: the first iteration takes ~2 seconds, while each following one takes < 1 ms.

Using tasty-bench-0.3, this is just a single iteration:

$ cabal run --constraint='tasty-bench >= 0.3' Benchmark.System.Process -- --stdev Infinity
All
  processChunks tr: OK (1.88s)
    1.872 s

Next tasty-bench measures 1 iteration (t1 ≈ 2) and 2 iterations (t2 < 0.001) and returns (t1 + 2 * t2) / 5. The stdev is really huge here (16499266 * 1000 / 2 / 377 * 100 ≈ 2e9 percent), but since we were targeting 1e12, no more iterations are evaluated:

$ cabal run --constraint='tasty-bench >= 0.3' Benchmark.System.Process -- --stdev 1e12
All
  processChunks tr: OK (1.90s)
    377  ms ± 16499266.09 s
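For concreteness, here is a minimal Haskell sketch of the arithmetic above. The numbers are taken from the runs in this thread (the warm-iteration time is an illustrative guess), and the helper names are mine, not tasty-bench internals.

-- Per-iteration estimate described above: least-squares fit of the timings
-- for 1 and 2 iterations through the origin, i.e. (t1 + 2*t2) / 5.
estimateMean :: Double -> Double -> Double
estimateMean t1 t2 = (t1 + 2 * t2) / 5

main :: IO ()
main = do
  let t1 = 1.872                -- seconds, the single cold iteration above
      t2 = 0.001                -- seconds, two warm iterations (illustrative guess)
      mean = estimateMean t1 t2 -- ~0.375 s, close to the reported 377 ms
      -- relative stdev, following the arithmetic in the comment above:
      -- halve the reported ±16499266 s and compare it to the mean, ~2e9 percent
      relStdev = 16499266 / 2 / mean * 100
  putStrLn ("mean ~ " ++ show mean ++ " s")
  putStrLn ("stdev ~ " ++ show relStdev ++ " % of mean")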

To conform to a smaller stdev, we need to measure 2 and 4 iterations. Both of them are in the millisecond range, so the result is in the millisecond range as well:

$ cabal run --constraint='tasty-bench >= 0.3' Benchmark.System.Process -- --stdev 1e6
All
  processChunks tr: OK (2.11s)
    452  μs ± 852 μs

Results from gauge are similar because, AFAIR, it skips the very first iteration and measures only the subsequent ones.

harendra-kumar commented 3 years ago

I know, it's probably because of not rewinding the file handle.

Is it possible to generate a warning in such cases? The error margin reported was not bad (312 μs ± 765 μs above, and much smaller in many other runs), so it did not raise any alarms. Also, it would be great to have some debug or verbose option to report more details that can help find the issue.

Bodigrim commented 3 years ago

Not sure how to trigger a warning here: it is expected for the very first iteration to take longer because of forcing thunks. A debug mode could indeed be helpful; I need to think about how best to wire it in.

harendra-kumar commented 3 years ago

Maybe we can always report the difference between the first and the second iteration to give an idea of the cost of one-time evaluations. If it seems too high to the user, based on their knowledge of the benchmarked code, they can decide to investigate.
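A rough standalone sketch of that idea, using plain wall-clock timing; the helper names are hypothetical and this is not tasty-bench code:

import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Time a single run of an action, in seconds.
timeIt :: IO () -> IO Double
timeIt act = do
  start <- getCurrentTime
  act
  end <- getCurrentTime
  pure (realToFrac (diffUTCTime end start))

-- Run the benchmark body twice and report the first-minus-second difference
-- as a crude estimate of one-time costs (forcing thunks, opening handles, ...).
reportWarmupCost :: IO () -> IO ()
reportWarmupCost bench = do
  t1 <- timeIt bench   -- first (cold) iteration
  t2 <- timeIt bench   -- second (warm) iteration
  putStrLn ("one-time setup cost ~ " ++ show (t1 - t2) ++ " s")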

Bodigrim commented 2 years ago

I've added a package flag to emit all ongoing results to stderr.