Bodigrim / tasty-bench

Featherlight benchmark framework, drop-in replacement for criterion and gauge.
https://hackage.haskell.org/package/tasty-bench
MIT License

Subtract benchmark baseline #50

Closed 1Jajen1 closed 1 year ago

1Jajen1 commented 1 year ago

I have a benchmark for mutable hashtables. It's my understanding that env reuses whatever is created across every benchmark iteration, so benchmarks that mutate the table (insert/delete) cannot use env to set up a table. To mitigate this I set up a baseline benchmark which only does the setup plus iterating the keys I insert/delete. The numbers I am actually interested in are the following benchmark times minus this baseline benchmark, so it would be nice to have a bsubtract (or similar) combinator which does just that. To add to that: if the setup cost is much larger than the function itself (i.e. a huge table vs. 1k inserts), the scale at which the time is reported makes subtracting the baseline manually useless because it loses precision.

Another thing I'd love in the same context is being able to divide the reported result by some number. I run all the methods over a few thousand keys to get stable results, and while dividing by that number to get the actual per-key runtime isn't difficult, it would be nice to have the framework provide a combinator for it.
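(For reference, a minimal sketch of the baseline pattern described above, with a hypothetical Table type and newTable/insert functions standing in for the real hashtable API; the intent is that the time of "insert" minus the time of "baseline" approximates the cost of the inserts alone.)

```haskell
import qualified Data.Vector.Unboxed as VU
import Test.Tasty.Bench

-- Hypothetical stand-ins for the real hashtable under test.
data Table
newTable :: IO Table
newTable = undefined
insert :: Table -> Int -> IO ()
insert = undefined

main :: IO ()
main = defaultMain
  [ env (pure (VU.enumFromN 0 100000 :: VU.Vector Int)) $ \keys ->
      bgroup "bulk insert"
        [ -- baseline: allocate a fresh table and walk the keys without inserting
          bench "baseline" $ whnfIO $ do
            _t <- newTable
            VU.forM_ keys $ \_ -> pure ()
        , -- benchmark of interest: the desired number is this time minus the baseline
          bench "insert" $ whnfIO $ do
            t <- newTable
            VU.forM_ keys (insert t)
        ]
  ]
```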

Bodigrim commented 1 year ago

I think env allows you to store IORef and pass as much mutable state as you like, if desirable.
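(A rough illustration of that suggestion; the Ref newtype is hypothetical and only exists because env constrains the environment to NFData, for which a shallow instance suffices here.)

```haskell
import Control.DeepSeq (NFData (..), rwhnf)
import Data.IORef
import Test.Tasty.Bench

-- Hypothetical wrapper: the IORef itself is already in WHNF, so rwhnf is enough.
newtype Ref = Ref (IORef Int)

instance NFData Ref where
  rnf = rwhnf

main :: IO ()
main = defaultMain
  [ env (Ref <$> newIORef 0) $ \ ~(Ref counter) ->
      bench "bump shared counter" $ whnfIO (modifyIORef' counter (+ 1))
  ]
```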

Subtraction is unlikely to work well because of imprecision of measurements. If you subtract benchmarks with close measurements from each other, you'll get a standard deviation much larger than the mean.
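(The reasoning, sketched for roughly independent measurements A and B: variances add under subtraction while the mean difference shrinks, so the relative error of the difference blows up as the two measurements get close.)

```latex
\sigma_{A-B} = \sqrt{\sigma_A^{2} + \sigma_B^{2}} \;\ge\; \max(\sigma_A, \sigma_B),
\qquad
\mu_{A-B} = \mu_A - \mu_B \;\approx\; 0 \quad \text{when } \mu_A \approx \mu_B .
```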

Scaling results is possible if you define a new type wrapper for Benchmarkable, but I think it's too niche for core functionality. The idea behind tasty-bench is that one should never assign much meaning to the absolute measurements anyway.

1Jajen1 commented 1 year ago

> I think env allows you to store IORef and pass as much mutable state as you like, if desirable.

Well, I already have a mutable structure, but what I need is a freshly set-up one each iteration. Consider an insert benchmark: if I reuse the same table with the same keys, the first iteration is going to be costly, and everything after that is inserting existing keys, which is effectively just a lookup benchmark in disguise.

> Subtraction is unlikely to work well because of imprecision of measurements. If you subtract benchmarks with close measurements from each other, you'll get a standard deviation much larger than the mean.

I may be doing something conceptually wrong: I am benchmarking inserting n keys into a hashtable (n goes from 10 to 10^7). There are two fixed costs: creating a new hashtable and iterating the keys (in an unboxed vector). I set up a baseline benchmark which does only those two operations so that I can later derive the cost of inserting alone. The setup cost also varies (vector-hashtables is very expensive when creating a new hashtable), but this is mostly irrelevant for mid-to-large numbers of keys to insert. However, up to ~100k elements the baseline (new + traverse) still amounts to ~20% of the insert benchmark. The actual cost of inserting (and the growth operations) is what I am interested in, which is the insert benchmark time minus the baseline. At low numbers of keys this is also necessary to fairly compare against vector-hashtables because of setup costs.

> Scaling results is possible if you define a new type wrapper for Benchmarkable, but I think it's too niche for core functionality. The idea behind tasty-bench is that one should never assign much meaning to the absolute measurements anyway.

Scaling is really the least important one to me. It would be nice to derive a somewhat accurate average per-operation cost from a benchmark that needs to run over a lot of different inputs to be comparable, but for comparing against other libraries it's not that necessary.

Bodigrim commented 1 year ago

> Well, I already have a mutable structure, but what I need is a freshly set-up one each iteration. Consider an insert benchmark: if I reuse the same table with the same keys, the first iteration is going to be costly, and everything after that is inserting existing keys, which is effectively just a lookup benchmark in disguise.

You might have better luck with criterion:perRunEnv. I'm reluctant to add such a function to tasty-bench: starting and stopping the clock before every iteration and doing a non-negligible amount of work in between negatively affects the quality of measurements and likely makes them less useful.
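(A rough sketch of that suggestion with criterion, using a hypothetical Table/newTable/insert stand-in; perRunEnv rebuilds the environment before every single run, so each run inserts into a fresh table.)

```haskell
import Control.DeepSeq (NFData (..), rwhnf)
import Criterion.Main
import Data.IORef

-- Hypothetical stand-in for the mutable hashtable under test; rwhnf is
-- enough to satisfy the NFData constraint perRunEnv puts on the environment.
newtype Table = Table (IORef [Int])

instance NFData Table where
  rnf = rwhnf

newTable :: IO Table
newTable = Table <$> newIORef []

insert :: Table -> Int -> IO ()
insert (Table ref) k = modifyIORef' ref (k :)

main :: IO ()
main = defaultMain
  [ bench "insert into a fresh table each run" $
      perRunEnv newTable $ \t -> mapM_ (insert t) [1 .. 1000 :: Int]
  ]
```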

> The actual cost of inserting (and the growth operations) is what I am interested in, which is the insert benchmark time minus the baseline.

I wonder if this is a use case for tasty-bench-fit:fits.

1Jajen1 commented 1 year ago

> You might have better luck with criterion:perRunEnv. I'm reluctant to add such a function to tasty-bench: starting and stopping the clock before every iteration and doing a non-negligible amount of work in between negatively affects the quality of measurements and likely makes them less useful.

Thanks for the suggestions. It's really nice how easy it is to move from tasty-bench to criterion or gauge and back. Sadly, criterion takes way longer to complete my benchmark suite: 60-90 minutes for tasty-bench versus 10+ hours for criterion. I started it yesterday night on a rather powerful desktop with little to no other activity (edit: it finished after 12 hours and 50 minutes). So I guess I'll stick to tasty-bench. Also, tasty-bench's structured console output is a lot nicer.

However, when switching back and forth one thing caught my eye: on my bulk insert benchmarks for the hashtable (which do a lot of large allocations, and many small ones for boxed storage), criterion reports very different numbers, and this does not change with a different time mode setting. For example, bulk inserting a million boxed integers as keys and values is reported as 146ms (with a baseline of just 174 microseconds) by criterion, and as 393ms with a 10.9ms baseline by tasty-bench. This difference carries through all table sizes; even for just 25 elements (which does one table resize) the gap is quite large: 209ns with a 37ns baseline (criterion) versus 515ns with a 376ns baseline (tasty-bench). It happens only for this benchmark; every other benchmark (lookups, inserts + deletes at constant table size) is pretty much equal between the two frameworks. Any ideas why the difference shows up here specifically? (I compile without -threaded and basically only pass -O2.) The most significant deviation is at 10^7 elements: criterion ~5s, tasty-bench ~15s.

> I wonder if this is a use case for tasty-bench-fit:fits.

How so? I am mostly trying to fairly compare bulk inserts between hashtables, vector-hashtables and my library. After thinking about this some more, I've pretty much ditched the idea of trying to isolate the insert+resize cost. I should probably just keep a baseline of creating the table (because that influences small benchmarks) and nothing else. Then at least the medium to large benchmark sizes don't really need this step anymore.

Oh and I guess I'll close this because my original question/suggestion has been answered. (The addition to the env docs also looks good :+1:).

Bodigrim commented 1 year ago

I think it's https://github.com/Bodigrim/tasty-bench/issues/39 biting again.

1Jajen1 commented 1 year ago

I've been digging through the criterion and tasty-bench source and found one difference that may be significant: criterion uses NOINLINE pragmas on its whnfIO function (and the other variants). I quickly made the same change in tasty-bench and the results are now much closer. So far I have only tested the boxed table variant, at the largest and smallest benchmark sizes. The largest is now on par with criterion's results (it goes from 15s down to 5s with NOINLINE on whnfIO) and the smallest goes from 540ns down to 330ns. criterion still reports around 200ns there though, so there may be more going on.
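(For context, the change amounts to attaching a NOINLINE pragma to the measurement wrapper so GHC keeps one shared copy instead of inlining it into each benchmark and re-optimising the benchmarked action together with the measurement loop. A self-contained sketch of the idea; forceWhnfIO is a hypothetical stand-in, not the actual tasty-bench definition.)

```haskell
import Control.Exception (evaluate)

-- NOINLINE prevents GHC from inlining this wrapper at every benchmark call
-- site, mirroring the NOINLINE that criterion puts on its whnfIO.
{-# NOINLINE forceWhnfIO #-}
forceWhnfIO :: IO a -> IO ()
forceWhnfIO act = do
  x <- act
  _ <- evaluate x
  pure ()
```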

The criterion source here also mentions some GitHub threads with more information. I'll probably investigate this further and let you know in a separate issue if I come up with something useful.

Oh, and the benchmarks are at https://github.com/1Jajen1/Brokkr/tree/main/brokkr-hashtables (but I'll probably have to come up with something smaller :sweat_smile:).