Closed: 1Jajen1 closed this issue 1 year ago.
I think `env` allows you to store `IORef` and pass as much mutable state as you like, if desirable.
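For reference, `IORef` has no `NFData` instance, so it cannot pass through `env`'s `NFData` constraint directly; a shallow wrapper gets around that. A minimal base-only sketch of the idea (the `Ref` wrapper is made up for illustration; in real use you would pass `Ref <$> newIORef …` to `env`):

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)
import Data.IORef

-- Shallow wrapper: forcing it evaluates the reference itself,
-- not its contents (assumption: that is all env needs here).
newtype Ref a = Ref (IORef a)

instance NFData (Ref a) where
  rnf (Ref r) = r `seq` ()

main :: IO ()
main = do
  r@(Ref ref) <- Ref <$> newIORef (0 :: Int)
  _ <- evaluate (force r)  -- roughly what env does before handing out the resource
  modifyIORef' ref (+ 1)
  readIORef ref >>= print
```

With tasty-bench this would look something like `env (Ref <$> newIORef 0) $ \ ~(Ref ref) -> bench "..." $ whnfIO (modifyIORef' ref (+ 1))`, under the same shallow-instance assumption.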
Subtraction is unlikely to work well because of the imprecision of measurements. If you subtract benchmarks with close measurements from each other, you'll get a standard deviation much larger than the mean.
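To illustrate the point with hypothetical numbers: independent measurement errors add in quadrature, so the difference of two close means has a far larger relative deviation than either input.

```haskell
-- Hypothetical figures: an insert benchmark at 105 µs ± 3 µs and a
-- baseline at 100 µs ± 3 µs. Each input is measured to ~3%, but the
-- standard deviation of their difference is ~85% of its mean.
meanInsert, sdInsert, meanBase, sdBase :: Double
meanInsert = 105e-6
sdInsert   = 3e-6
meanBase   = 100e-6
sdBase     = 3e-6

diffMean, diffSd :: Double
diffMean = meanInsert - meanBase                          -- ~5 µs
diffSd   = sqrt (sdInsert ^ (2 :: Int) + sdBase ^ (2 :: Int))  -- ~4.24 µs

main :: IO ()
main = print (diffSd / diffMean > 0.8)
```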
Scaling results is possible if you define a new type wrapper for `Benchmarkable`, but I think it's too niche for core functionality. The idea behind `tasty-bench` is that one should never assign much meaning to the absolute measurements anyway.
> I think `env` allows you to store `IORef` and pass as much mutable state as you like, if desirable.
Well, I already have a mutable structure, but what I need is a freshly set up one each iteration. Consider an insert benchmark: if I reuse the same table with the same keys, the first iteration is going to be costly, while every later iteration just inserts existing keys, which is effectively a lookup benchmark in disguise.
> Subtraction is unlikely to work well because of the imprecision of measurements. If you subtract benchmarks with close measurements from each other, you'll get a standard deviation much larger than the mean.
I may be doing something conceptually wrong: I am benchmarking inserting n keys into a hashtable (n ranges from 10 to 10^7). There are two fixed costs: creating a new hashtable and iterating the keys (in an unboxed vector). I set up a baseline benchmark which does only those two operations so that I can later derive the cost of inserting alone. The setup cost also varies (`vector-hashtables` is very expensive when creating a new hashtable), but this is mostly irrelevant for mid-to-large numbers of keys. However, up to ~100k elements the baseline (new + traverse) still amounts to ~20% of the insert benchmark. The actual cost of inserting (and the growth operations) is what I am interested in, which is the insert benchmark time minus the baseline. At low numbers of keys this is also necessary for a fair comparison with `vector-hashtables` because of its setup costs.
> Scaling results is possible if you define a new type wrapper for `Benchmarkable`, but I think it's too niche for core functionality. The idea behind `tasty-bench` is that one should never assign much meaning to the absolute measurements anyway.
Scaling is really the least important one to me. It is nice to derive a somewhat accurate average single-operation cost from a benchmark that needs to run over a lot of different inputs to be comparable, but to compare against other libraries it's not that necessary.
> Well, I already have a mutable structure, but what I need is a freshly set up one each iteration. Consider an insert benchmark: if I reuse the same table with the same keys, the first iteration is going to be costly, while every later iteration just inserts existing keys, which is effectively a lookup benchmark in disguise.
You might have better luck with criterion's `perRunEnv`. I'm reluctant to add such a function to `tasty-bench`: starting and stopping the clock around every iteration and doing a non-negligible amount of work in between negatively affects the quality of measurements and likely makes them less useful.
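For context, criterion's `perRunEnv` runs the setup action freshly before each iteration (with the clock stopped). A hand-rolled, base-only sketch of that idea, ignoring all clock handling (`perRunSketch` is a made-up name, not criterion's implementation):

```haskell
import Control.Monad (replicateM)
import Data.IORef

-- Run setup before every iteration, so each run sees fresh state.
-- (Illustrative only; a real harness carefully excludes setup from timing.)
perRunSketch :: Int -> IO env -> (env -> IO a) -> IO [a]
perRunSketch iters setup run = replicateM iters (setup >>= run)

main :: IO ()
main = do
  -- Every iteration gets a fresh counter, so each result is 1
  -- (a shared environment would yield 1, 2, 3 instead).
  rs <- perRunSketch 3 (newIORef (0 :: Int))
                       (\r -> modifyIORef' r (+ 1) >> readIORef r)
  print rs
```

This is exactly the fresh-table-per-iteration shape the insert benchmark needs, at the cost of the measurement-quality concerns described above.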
> The actual cost of inserting (and the growth operations) is what I am interested in, which is the insert benchmark time minus the baseline.
I wonder if this is a use case for `tasty-bench-fit`'s `fits`.
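`tasty-bench-fit` tries to infer how runtime scales with input size from a series of measurements. The core idea can be sketched as a least-squares fit through the origin on synthetic data (illustrative only, not the package's actual algorithm):

```haskell
-- Least-squares fit of t(n) ≈ c * n through the origin:
-- c = Σ (n·t) / Σ n².
fitLinear :: [(Double, Double)] -> Double
fitLinear pts = sum [n * t | (n, t) <- pts] / sum [n * n | (n, _) <- pts]

main :: IO ()
main = do
  -- Synthetic measurements growing linearly at 2 ns per element.
  let samples = [(1e3, 2.0e-6), (1e4, 2.0e-5), (1e5, 2.0e-4)]
      c       = fitLinear samples
  print (abs (c - 2.0e-9) < 1e-12)  -- recovers ~2 ns per element
```

Fitting the insert benchmark's times against n would expose the per-insert growth behaviour without ever subtracting a baseline.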
> You might have better luck with criterion's `perBatchEnv`. I'm reluctant to add such a function to `tasty-bench`: starting and stopping the clock around every iteration and doing a non-negligible amount of work in between negatively affects the quality of measurements and likely makes them less useful.
Thanks for the suggestions. It's really nice how easy it is to move from `tasty-bench` to `criterion` or `gauge` and back. Sadly, `criterion` takes way longer to complete my benchmark suite: 60-90 minutes for `tasty-bench` versus 10+ hours for `criterion` (I started it last night and it still wasn't finished, on a rather powerful desktop with little to no other activity. Edit: it has now finished, after 12 hours and 50 minutes). So I guess I'll stick with `tasty-bench`. Also, `tasty-bench`'s structured console output is a lot nicer.
However, when switching back and forth, one thing caught my eye: on my bulk insert benchmarks for the hashtable (which do a lot of large allocations, and many small ones for boxed storage), `criterion` reports very different numbers. This does not change with a different time mode setting. For example, bulk inserting a million boxed integers as keys and values is reported as 146ms (with a baseline of just 174 microseconds) by `criterion` and 393ms with a 10.9ms baseline by `tasty-bench`. This difference carries through all table sizes; even for just 25 elements (which does one table resize) the difference is quite large: 209ns with a 37ns baseline (`criterion`) versus 515ns with a 376ns baseline (`tasty-bench`). This happens only for this benchmark; every other benchmark (lookups, inserts + deletes at constant table size) is pretty much equal between both frameworks. Any ideas why this difference happens here specifically? (I compile without `-threaded` and basically only pass `-O2`.) The most significant deviation is at 10^7 elements: `criterion` ~5s, `tasty-bench` ~15s.
> I wonder if this is a use case for `tasty-bench-fit`'s `fits`.
How so? I am mostly trying to fairly compare bulk inserts between `hashtables`, `vector-hashtables` and my library. After thinking about this some more, I've pretty much ditched the idea of trying to isolate the insert+resize cost. I should probably just keep a baseline of creating the table (because that influences small benchmarks) and nothing else. Then at least the medium to large benchmark sizes don't really need this step anymore.
Oh, and I guess I'll close this because my original question/suggestion has been answered. (The addition to the `env` docs also looks good :+1:)
I think it's https://github.com/Bodigrim/tasty-bench/issues/39 biting again.
I've been digging in the `criterion` and `tasty-bench` sources and found one difference that may be significant: `criterion` uses `NOINLINE` pragmas on its `whnfIO` function (and the other variants). I quickly changed that in `tasty-bench` and the results are now much closer. For now I have only tested the largest and the smallest benchmark sizes, and only for the boxed table variant. The largest is now on par with `criterion`'s results (changing from 15s to 5s with `NOINLINE` on `whnfIO`) and the smallest changes from 540ns down to 330ns. `criterion` still reports around 200ns here though, so there may be more going on.
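As a simplified sketch of the hazard involved (this mirrors the general shape of a whnf-style benchmark loop, not `tasty-bench`'s actual code): without `NOINLINE`, GHC may inline the loop at the call site, where the optimizer can share or float work between iterations, so later iterations measure less than a full evaluation.

```haskell
import Data.IORef
import Data.Word (Word64)

-- NOINLINE keeps the loop out of the caller, denying the optimizer
-- the chance to lift or share the benchmarked action across iterations.
-- (Assumption: this is the effect discussed in the criterion source.)
{-# NOINLINE whnfIOSketch #-}
whnfIOSketch :: IO a -> Word64 -> IO ()
whnfIOSketch act = go
  where
    go 0 = pure ()
    go n = do
      x <- act
      x `seq` go (n - 1)

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  -- The action really runs once per iteration: 5 iterations -> counter 5.
  whnfIOSketch (modifyIORef' counter (+ 1) >> readIORef counter) 5
  readIORef counter >>= print
```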
The `criterion` source also mentions some GitHub threads with more information. I'll probably investigate this further and let you know in a separate issue if I come up with something useful.
Oh and the benchmarks are at https://github.com/1Jajen1/Brokkr/tree/main/brokkr-hashtables (but I'll probably have to come up with something smaller :sweat_smile: )
I have a benchmark suite for mutable hashtables. It's my understanding that `env` re-uses whatever is created for every benchmark iteration, so benchmarks that change the table (insert/delete) cannot use `env` to set up a table. To mitigate this I set up a baseline benchmark which only does the setup + iterating the keys I insert/delete. The actual numbers I am interested in are the following benchmark times minus this baseline benchmark. It would be nice to have a `subtract` (or similar) which does just that. Oh, and to add: if the setup cost is much larger than the function itself (i.e. a huge table vs 1k inserts), the scale with which the time is output makes subtracting the baseline manually useless, as it loses precision.

Another thing I'd love in the same context is being able to divide the result by some number. I am running all the methods over a few thousand keys to get stable results, and while dividing by that number to get the actual runtime isn't difficult, it would be nice to have the framework provide a combinator for it.
(or similar) which does just that. Oh and to add: If the setup cost is much larger than the function itself (ie huge table vs 1k inserts) the scale with which the time is output makes subtracting the baseline manually useless as it looses precision.Another thing I'd love in the same context is being able to divide the result by some number. I am running all the methods over a few thousand keys to get stable results and while dividing by that number to get the actual runtime isn't difficult, it would be nice to have the framework provide a combinator for it.