bheisler / criterion.rs

Statistics-driven benchmarking library for Rust

Statistical questions... #601

Open samuelcolvin opened 2 years ago

samuelcolvin commented 2 years ago

Thanks so much for criterion.rs, please excuse the open-ended question.

I've been reading the Analysis Process docs with interest.

Could you explain (or more likely, point me to an existing explanation) why criterion.rs uses iterations = [d, 2d, 3d, ... Nd] rather than a constant number of iterations per sample?

I'm also wondering about CPU-bound benchmarks and the best way to measure variation. With CPU-bound benchmarks (assuming no threading or competing throughput), the general assumption is that test time is made up of

"true time to execute the tested function" + "random variation due to CPU interruptions & context switching"

(or at least that's always been my assumption)

With this model, it's not the mean time that you'd be looking for, but rather the minimum time.

This does not seem to be the assumption employed by criterion. Is that because:

?


Background for my question, FWIW: I have a simple benchmarking library in Python, and I'm trying to get the most consistent results and, in particular, give the most useful output when comparing runs.

I'd also like to find a statistically rigorous way to:

bheisler commented 2 years ago

Hello!

The most basic answer to why Criterion-rs increases the sample count in each batch is that this is what Criterion (the original Haskell library) did, and that's what japaric implemented before I took over maintenance. However, it is convenient in that it produces a nice sloping sequence of points on a chart, which you can then fit a line to as another way to estimate the typical performance per iteration. I'm afraid I don't actually know what statistical properties motivated this design in Criterion, but since it doesn't really impede a more direct calculation of the mean time per iteration, it seems like a straightforward improvement. Well, except that it can tend to increase measurement time, I suppose.
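
Criterion-rs's actual measurement loop is more involved than this, but here is a minimal sketch of the idea, with a made-up `busy_work` routine standing in for the benchmarked closure: collect samples at d, 2d, ..., Nd iterations, fit a line through the origin, and read the slope off as a per-iteration estimate. The same data still gives you the plain mean-per-iteration estimate.

```rust
use std::time::Instant;

/// Hypothetical routine standing in for the function under test.
fn busy_work(x: u64) -> u64 {
    (0..x).fold(0, |acc, i| acc ^ i.wrapping_mul(2654435761))
}

fn main() {
    let d = 10; // base iteration count
    let n = 20; // number of samples

    // Collect samples at d, 2d, 3d, ..., n*d iterations each.
    let samples: Vec<(f64, f64)> = (1..=n)
        .map(|i| {
            let iters = (i * d) as u64;
            let start = Instant::now();
            for _ in 0..iters {
                std::hint::black_box(busy_work(std::hint::black_box(1_000)));
            }
            (iters as f64, start.elapsed().as_secs_f64())
        })
        .collect();

    // Least-squares fit of a line through the origin: time ~= slope * iters.
    // The slope is one estimate of the typical time per iteration.
    let num: f64 = samples.iter().map(|(x, y)| x * y).sum();
    let den: f64 = samples.iter().map(|(x, _)| x * x).sum();
    let slope = num / den;

    // The straightforward mean-per-iteration estimate from the same data.
    let mean: f64 = samples.iter().map(|(x, y)| y / x).sum::<f64>() / n as f64;

    println!("slope: {slope:.3e} s/iter, mean: {mean:.3e} s/iter");
}
```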

As for your other question, I think it's all of the above, really. This idea that benchmarks measure some constant "true time" plus some "random noise" is, I think, misguided. There are many algorithms and data structures that are themselves fundamentally noisy - even widely-used data structures like hash tables. It wouldn't do to report the one time you got lucky and hit no collisions in your hash table as the "true" performance of a function. You assume away multi-threading and IO, but we'd like to be able to benchmark multi-threaded systems that do IO. Any benchmark analysis that doesn't simply assume away multithreading, or variation in IO timings, or indeed variation in the actual performance of the system being benchmarked has to be able to report on and help the developer understand and measure the variation in their code's performance. Reporting only the minimum runtime discards all that useful information.
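
To make the hash-table point concrete, here is a minimal sketch (made-up workload, dependency-free PRNG): timing the same number of insertions into a std HashMap across trials with different random key sets shows a spread that is partly intrinsic to the data structure, and that a minimum-only report would throw away.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Tiny xorshift PRNG so the sketch stays dependency-free.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let trials = 200;
    let inserts = 10_000;
    let mut times = Vec::with_capacity(trials);

    for t in 0..trials {
        // A different random key set per trial gives different collision patterns.
        let mut rng = 0x9E37_79B9_7F4A_7C15u64.wrapping_add(t as u64);
        let keys: Vec<u64> = (0..inserts).map(|_| xorshift(&mut rng)).collect();

        let start = Instant::now();
        let mut map = HashMap::new();
        for &k in &keys {
            map.insert(k, k);
        }
        times.push(start.elapsed().as_secs_f64());
        std::hint::black_box(map);
    }

    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    println!(
        "min {:.3e}s  median {:.3e}s  max {:.3e}s",
        times[0],
        times[trials / 2],
        times[trials - 1]
    );
    // Reporting only the minimum hides the entire upper part of this spread,
    // some of which is intrinsic to the data structure rather than OS noise.
}
```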

But also, I think this question gets to the heart of what it is we're even trying to measure. Even if we could construct some hypothetical oracle that would report the "true" runtime of a function on an input, how relevant is that to the thing we as working programmers actually want to know, which is "how will this run in actual production, on messy real-world computers?" My feeling is that the more we do to isolate the measurements from the variability inherent to real operating systems and real machines, the more layers of separation we introduce from the measurement we actually want to make.

As an aside, I have also seen it argued that a benchmark should report the statistical mode of the measurements (i.e. the most common value) instead of the mean or minimum or what have you. I found that rather more convincing, and Criterion-rs does sort of report an estimate of the mode, in that the user can just eyeball the Probability Density Function chart to see where the line peaks. Really, all of the summary statistics - mean, mode, min, what have you - discard important information; what the sophisticated user wants is (a good approximation of) the PDF curve. The summary numbers are useful for boiling the detail down for computers to make decisions about, though.
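
For what it's worth, a mode estimate can be read off mechanically from the same PDF idea: build a kernel density estimate and take the location of its peak. A minimal sketch (fixed bandwidth and grid, made-up data; a real implementation would choose the bandwidth more carefully):

```rust
/// A minimal Gaussian kernel density estimate, evaluated on a grid,
/// with the grid point of maximum density taken as a mode estimate.
fn kde_mode(samples: &[f64], bandwidth: f64, grid_points: usize) -> f64 {
    let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = samples.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let mut best_x = min;
    let mut best_density = f64::NEG_INFINITY;
    for i in 0..grid_points {
        let x = min + (max - min) * i as f64 / (grid_points - 1) as f64;
        // Sum of Gaussian kernels centred on each sample (normalisation
        // constants are omitted; they don't affect where the peak is).
        let density: f64 = samples
            .iter()
            .map(|&s| {
                let z = (x - s) / bandwidth;
                (-0.5 * z * z).exp()
            })
            .sum();
        if density > best_density {
            best_density = density;
            best_x = x;
        }
    }
    best_x
}

fn main() {
    // Made-up timing data: a tight cluster plus a slow tail.
    let samples: Vec<f64> = (0..100)
        .map(|i| 1.0e-3 + 1.0e-6 * (i % 10) as f64)
        .chain((0..10).map(|i| 1.5e-3 + 1.0e-5 * i as f64))
        .collect();
    let mode = kde_mode(&samples, 5.0e-6, 512);
    println!("estimated mode: {mode:.6e}s");
}
```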

On the subject of detecting results that don't look valid... I (and a statistically-inclined friend I've asked for advice over the years) are skeptical that such a thing could work. You can't even really define an outlier, let alone detect one robustly, without starting from the assumption that the data should fit some particular distribution. Which brings us back to "trying not to make unnecessary assumptions."
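
To illustrate, Tukey's fences are one common heuristic, and even they bake in an assumption: the 1.5×IQR and 3×IQR multipliers are conventions calibrated against a roughly normal-looking bulk. A minimal sketch, with crude index-based quartiles:

```rust
/// Classify measurements against Tukey's fences. "Mild" outliers fall outside
/// 1.5x IQR but inside 3x IQR, "severe" ones outside 3x IQR; those multipliers
/// are conventions tuned for a roughly normal-looking bulk, i.e. exactly the
/// kind of distributional assumption discussed above.
fn tukey_outliers(samples: &[f64]) -> (usize, usize) {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let q1 = sorted[sorted.len() / 4];
    let q3 = sorted[3 * sorted.len() / 4];
    let iqr = q3 - q1;
    let (inner_lo, inner_hi) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    let (outer_lo, outer_hi) = (q1 - 3.0 * iqr, q3 + 3.0 * iqr);

    let severe = samples.iter().filter(|&&x| x < outer_lo || x > outer_hi).count();
    let mild = samples
        .iter()
        .filter(|&&x| (x < inner_lo || x > inner_hi) && x >= outer_lo && x <= outer_hi)
        .count();
    (mild, severe)
}

fn main() {
    let samples = vec![1.00, 1.01, 0.99, 1.02, 1.00, 0.98, 1.01, 1.04, 5.00];
    let (mild, severe) = tukey_outliers(&samples);
    println!("mild outliers: {mild}, severe outliers: {severe}");
}
```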

As for trying to determine how long data-collection should take, let me know if you find one. I lean towards turning the question inside out - let the benchmark author tell you how long they're willing to wait and then do the best you can in that time. It takes a lot more statistical knowledge on the part of users to say "I want my results to be robust to within a p-value of 0.01" than "I can run these benchmarks for five seconds each."
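
For what it's worth, here is a rough sketch of that inside-out approach (not Criterion-rs's actual warm-up code; the routine and numbers are made up): estimate the time per iteration during a short warm-up, then pick the base iteration count d so that the ramp d, 2d, ..., Nd fits roughly inside the user's time budget.

```rust
use std::time::{Duration, Instant};

/// Hypothetical function under test.
fn routine() {
    std::hint::black_box((0..1_000u64).sum::<u64>());
}

/// Pick the base iteration count `d` so that samples of d, 2d, ..., n*d
/// iterations fit (roughly) inside the user's measurement budget.
fn choose_base_iters(budget: Duration, n_samples: u64, warmup: Duration) -> u64 {
    // Warm-up: estimate the per-iteration time.
    let start = Instant::now();
    let mut iters = 0u64;
    while start.elapsed() < warmup {
        routine();
        iters += 1;
    }
    let per_iter = start.elapsed().as_secs_f64() / iters as f64;

    // The ramp d, 2d, ..., n*d runs d * n * (n + 1) / 2 iterations in total.
    let total_iters_per_d = (n_samples * (n_samples + 1) / 2) as f64;
    let d = budget.as_secs_f64() / (per_iter * total_iters_per_d);
    d.max(1.0) as u64
}

fn main() {
    let d = choose_base_iters(Duration::from_secs(5), 100, Duration::from_secs(1));
    println!("base iteration count d = {d}");
}
```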

samuelcolvin commented 2 years ago

Hi @bheisler thanks so much for your response.

cc'ing @art049 as I know he's interested in this.

Overall I entirely get where you're coming from; feedback on a few specific points:

Well, except that it can tend to increase measurement time, I suppose.

I think that's my concern about this approach: collecting a given number of samples clearly takes a lot longer with a ramp-up than without (with N samples and base count d, the ramp runs d·N(N+1)/2 iterations in total, versus N·d for a constant count of d per sample), hence I was looking for a strong argument in favour of it.

This idea that benchmarks measure some constant "true time" plus some "random noise" is, I think, misguided.

This seems to be at odds with Python's timeit standard-library module, which uses "best" (the minimum) as the main number returned.

This is not to say you're wrong; I guess they've found over the years that this gives the best approximation for Python. Also see my next point.

how will this run in actual production, on messy real-world computers?

This is not quite how I (or, I think, many people) run benchmarks. I am exclusively interested in benchmarking to find out whether a change to the code makes it faster or slower. Others are sometimes interested in comparing different libraries performing the same task, which is a special (often less valid) case of the same thing.

I think criterion has a similar viewpoint as it shows the change relative to the last run.

With this "relative only" viewpoint, being isolated from real world conditions is okay, as long as the affect of the change being profiled is accurately captured.

Of course this doesn't make the "best is the oracle" approach universally valid, but I think it's why libraries like timeit use best - they want something that can be compared between runs and is as insensitive as possible to noise.

There are many algorithms and data structures that are themselves fundamentally noisy - even widely-used data structures like hash tables.

I think this is a really good point I hadn't fully appreciated. Even "completely CPU-bound" tasks in reality rely on data structures with inherently variable performance.

Even ignoring those data structures, the different CPU caches mean that only comparing the best (minimum execution) time does not necessarily show you which implementation is best.

On the subject of detecting results that don't look valid... I (and a statistically-inclined friend I've asked for advice over the years) are skeptical that such a thing could work.

That's interesting. I have an idea for how this should be achievable: it's based on my theory that benchmark times should follow a Poisson distribution, or at least a Gamma distribution. If that is the case (and I need to do some more research to confirm it), you should be able to detect whether the benchmark results deviate from the set of expected distributions and warn about inaccurate results, regardless of how much variation there is.
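
To sketch what I mean (this is not a proper goodness-of-fit test, just a crude moment-based check with an arbitrary threshold, and the timings below are made up): fit the Gamma shape parameter by the method of moments on the excess over the fastest observation, then compare the sample skewness with the value a Gamma of that shape would imply.

```rust
/// Crude moment-based check against a Gamma-shaped distribution of timings.
/// Fit the shape parameter k by the method of moments (k = mean^2 / variance)
/// and compare the sample skewness with the Gamma-implied skewness 2/sqrt(k).
/// A large mismatch is a hint (not a proof) that the data isn't Gamma-like.
fn gamma_skewness_check(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    // Benchmark times sit on top of a large fixed cost, so look at the
    // excess over the smallest observation rather than at the raw times.
    let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let excess: Vec<f64> = samples.iter().map(|x| x - min).collect();

    let mean = excess.iter().sum::<f64>() / n;
    let var = excess.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let m3 = excess.iter().map(|x| (x - mean).powi(3)).sum::<f64>() / n;

    let sample_skew = m3 / var.powf(1.5);
    let shape = mean * mean / var; // method-of-moments estimate of k
    let gamma_skew = 2.0 / shape.sqrt();
    (sample_skew, gamma_skew)
}

fn main() {
    // Made-up timings: a constant base cost plus occasional slow samples.
    let samples: Vec<f64> = (0..1_000)
        .map(|i| 1.0e-3 + 1.0e-6 * (i % 7) as f64 + if i % 50 == 0 { 5.0e-4 } else { 0.0 })
        .collect();

    let (observed, implied) = gamma_skewness_check(&samples);
    println!("sample skewness {observed:.2} vs Gamma-implied {implied:.2}");
    if (observed - implied).abs() > 0.5 {
        println!("warning: timings deviate noticeably from the assumed Gamma shape");
    }
}
```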

Of course in complex cases where your implementation has random or round-robin or other task distribution built in, these assumptions won't hold. But I think they might be helpful in the majority of cases.

As for trying to determine how long data-collection should take, let me know if you find one.

My idea here comes back to my use case - to compare two implementations.

What I actually want to know is:

So in theory, to know how long we need to run our benchmarks for, we need to know:

More generally, this "rely on p-values" approach is attractive to me since it doesn't rely on any of the assumptions we've talked about above:

You just take your dataset, perhaps do some bootstrap resampling, calculate the p-value and tell the user "this seems to be faster", etc.
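
As a rough sketch of what I mean (a permutation-style variant of the resampling idea, dependency-free, with made-up timings): pool the two sets of measurements, repeatedly shuffle which measurement belongs to which run, and count how often the shuffled difference in means is at least as large as the observed one.

```rust
/// Tiny xorshift PRNG so the sketch stays dependency-free.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Permutation test: how often does shuffling the labels produce a difference
/// in means at least as large as the one observed? That frequency is an
/// approximate p-value for "these two sets of timings differ".
fn permutation_p_value(a: &[f64], b: &[f64], resamples: usize) -> f64 {
    let observed = (mean(a) - mean(b)).abs();
    let mut pooled: Vec<f64> = a.iter().chain(b.iter()).cloned().collect();
    let mut rng = 0x243F_6A88_85A3_08D3u64;
    let mut at_least_as_extreme = 0usize;

    for _ in 0..resamples {
        // Fisher-Yates shuffle of the pooled measurements.
        for i in (1..pooled.len()).rev() {
            let j = (xorshift(&mut rng) % (i as u64 + 1)) as usize;
            pooled.swap(i, j);
        }
        let diff = (mean(&pooled[..a.len()]) - mean(&pooled[a.len()..])).abs();
        if diff >= observed {
            at_least_as_extreme += 1;
        }
    }
    at_least_as_extreme as f64 / resamples as f64
}

fn main() {
    // Made-up timings for a "baseline" and a "candidate" run.
    let baseline: Vec<f64> = (0..50).map(|i| 1.00e-3 + 1.0e-6 * (i % 5) as f64).collect();
    let candidate: Vec<f64> = (0..50).map(|i| 0.97e-3 + 1.0e-6 * (i % 5) as f64).collect();
    let p = permutation_p_value(&baseline, &candidate, 10_000);
    println!("p-value for a difference in mean time: {p:.4}");
}
```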

bheisler commented 2 years ago

Yes, I'm aware of that comment in timeit's documentation - I've had to answer this question many times over the years, mostly because of that very comment. As indicated above, I think there are some aspects of the problem that don't seem to have been considered by its authors.

Though your point that sometimes we're interested in absolute measurements and sometimes in relative measurements is a good one.

Your idea to compare two different implementations (sample both implementations, bootstrap the sample, use the bootstrap to estimate properties of the sampled distribution) is more or less what Criterion-rs does, except that the "two implementations" are really "the implementation as it existed when you recorded the baseline" and "the implementation as it exists now". I think it would be burdensome to working programmers to expect them to write two wholly-separate (or even mostly-separate) implementations of their code to be able to benchmark them. For a more scientific context, it may be appropriate. And, as you point out, "has this code gotten faster or slower and by how much" is a much more common question.

Of course, Criterion-rs doesn't try to adjust the measurement time to reach statistical significance - doing that requires very careful statistics, because it's possible for non-significant effects to drift across the significance threshold by random chance and then drift back under additional sampling. It probably is possible, though, at least if you assume that the data "should" look like a certain distribution. We do something analogous with the software I work on at $DAYJOB, though of course that part is designed by people with better knowledge of statistics than I myself have.

However, once you've done your sampling and estimation you need to present it to the user and now you're faced with the choice of which of the numbers you've calculated to show at the top, which to move further down, and which to hide entirely except in a verbose mode. This is where the framework author's opinions about which statistics are most informative are relevant. I'm not really confident Criterion-rs's design here is optimal, but it seems to be at least serviceable.