Benchmark noise - Githubissues

gvanrossum commented 1 year ago

I am despairing at the nosiness of our benchmarks. For example, here are two different runs from identical code (I forgot to run pystats, so I re-ran the benchmark, and since yesterday's run one commit was added and then reverted):

The commit merge base is the same too.

How can we derive useful signal from this data?

markshannon commented 1 year ago

While not as noisy as the example above, the benchmarks do seem to be noisier than they used to be. @mdboom thoughts?

mdboom commented 1 year ago

This example is a little noisier than usual, but not much. Here is the comparison of the two commits

This is around +/- 3% for the 5th and 95th percentiles. When I did the original A/A test, I saw +/- 2.5% variance. (The better 2% variance there is only if you use the exact same binary -- PGO and recompilation in general always seems to introduce some variance). It probably wouldn't hurt to automate the A/A test so we are always aware of these limits and include this information in every plot.

In short, I don't think this is new. The work we can do is to improve the variance of individual benchmarks where possible (as I've done recently for mypy, for example). I've been tackling them 1 by 1 starting with the most variable.

How can we derive useful signal from this data?

Short of improving the benchmarks themselves, we could flag known-problematic ones, or reduce the weight of ones that perform poorly in A/A testing.

mdboom commented 1 year ago

https://github.com/faster-cpython/benchmarking/issues/79 https://github.com/faster-cpython/benchmarking/issues/78

markshannon commented 1 year ago

Given the typical time to run the benchmarks is about an hour and only 20 minutes or so are actually running benchmarks (IIRC), we could run more iterations. Don't know how much that would help.

mdboom commented 1 year ago

The cool thing is the new infra can tell us that:

It's around 50 minutes for the benchmark runs -- maybe if we reduced the number of redundant benchmarks we could afford to increase the number of runs for everything else.

markshannon commented 1 year ago

The majority of that 50 minutes is installing dependencies, not running benchmarks.

markshannon commented 1 year ago

I think running the benchmarks may only be 15 minutes or so.

mdboom commented 1 year ago

The majority of that 50 minutes is installing dependencies, not running benchmarks.

The reminds me -- I think there may be a bug in pyperformance: It creates a seperate virtual environment for each benchmark (fine), but I think each has the union of all of the dependencies for all of the benchmarks. I'm going to put together a proper bug report and see if anyone can confirm if that's a bug. If so, it seems like a lot of extra work.

mdboom commented 1 year ago

I think running the benchmarks may only be 15 minutes or so.

Unfortunately, no. The log is helpfully annotated with timestamps. It looks from this like 13:56 to 14:01 (5 minutes) is spent installing dependencies (dependencies are cached so it's basically just copying files within the same SSD). The rest until 14:52 is spent running the benchmarks.

mdboom commented 1 year ago

The majority of that 50 minutes is installing dependencies, not running benchmarks.

The reminds me -- I think there may be a bug in pyperformance: It creates a seperate virtual environment for each benchmark (fine), but I think each has the union of all of the dependencies for all of the benchmarks. I'm going to put together a proper bug report and see if anyone can confirm if that's a bug. If so, it seems like a lot of extra work.

I don't think it's a bug. It tries to use the same venv whenever it can, and only creates a unique venv if there are dependency conflicts. Seems like an optimization on purpose, actually.

mdboom commented 1 year ago

Some observations from playing with the data:

For the overall number, it's actually remarkably stable (0.9992 vs 0.9968), but the different x scales between the plots certainly obscures that. pyperf only shows 2 decimal places, so it would call both of these 1.00x slower.
Yesterday, I committed a change to ignore benchmarks where the comparison is statistically insignificant. This was to match what pyperf does (you've probably seen "Benchmark hidden because not significant (33): ..." in its results), and was also after a conversation with Jake Bailey where he mentioned they do a similar thing for the Typescript compiler and it helps with variability (though they use a slightly different statistical test -- I decided to start by matching pyperf). I started a changelog of analysis changes here [1]. Interestingly, in the case of these commits, it actually makes the variability slightly worse but not enough that it's meaningful -- I think there are advantages to having the plots match pyperf. Without insignificant benchmark rejection the difference between the results is 0.15% vs. 0.23%.

I also played around with some different heuristics for creating the "master distribution" from all of the benchmarks.

ALL The current behavior just puts all of the samples together. Some benchmarks have more samples than others, so this does lead to an uneven weighting of the benchmarks.
FLAT_WEIGHTS gives each benchmark the same weight by taking 100 randomly selected samples from each.
WEIGHTED weights each benchmark by how variable it was observed to be in an A/A test. Benchmarks that are more variable are weighted less in the overall number than those that are known to be stable.
EXCLUDE just excludes the worst benchmarks, where the variability is over 5%. The list of these is: ["pycparser", "generators", "mdp", "pidigits", "pickle_dict", "async_tree_memoization", "gc_traversal"]

The results of these for Guido's two runs are:

The diffs are:

heursistic	diff
ALL	0.0024
FLAT_WEIGHT	0.0034
WEIGHTED	0.0028
EXCLUDED	0.0002

The bottom line is that even when the benchmarks leap all over the place, the "one big number" is remarkably stable, and the exact heuristic doesn't matter too much. Given how much EXCLUDED helps, I think that rather than getting two tricky and magical, I think continuing to focus removing some of the randomness from specific benchmarks (which may just be adding more reps, or possibly removing I/O) is probably the best course.

As a side effort, I think automating the collection of A/A test data and surfacing that in these plots would be helpful. For example, we can add error bars (in red) to each benchmark and change the color if the mean is inside or outside of it:

This would be particularly useful if a change is targetting a specific benchmark to change.

[1] https://github.com/faster-cpython/benchmarking#analysis-changes

gvanrossum commented 1 year ago

Wow, thanks for doing those experiments! Indeed I didn't realize that the x scales were different in the two plots. It looks like my change actually makes things 0.4% slower (not the end of the world, but I had hoped there wouldn't be a difference). It also looks like the first run was just a lot noisier than the second. I guess we'll never know why.

If we're agreeing that EXCLUDED is the best heuristic to get more stable numbers, should we perhaps just delete those seven benchmarks? Or give them a separate tag that excludes them from the "geometric mean" calculation? (We'd have to indicate that somehow in how we display the data.)

mdboom commented 1 year ago

If we're agreeing that EXCLUDED is the best heuristic to get more stable numbers, should we perhaps just delete those seven benchmarks? Or give them a separate tag that excludes them from the "geometric mean" calculation? (We'd have to indicate that somehow in how we display the data.)

Yes, I think the best solution is to either delete or ignore high-variability benchmarks. I suppose ignoring is better than not running at all because they are still a signal of something just a very imprecise one.

With everything above, I was just comparing the two commits (with identical source) you listed in the OP. It occurred to me over the weekend, that we actually have a bunch of results from CPython main during which the overall change is very close to 1.0 that would serve as a much larger and less anecdotal A/A test. I took the 57 commits from main that we've run since October 22, and did a pairwise comparison of each, and then summed the results. When you have this much data, each benchmark has a mean extremely close to 1.0, which is nice. Sorting by standard deviation (largest at the top), you can see which benchmarks tend to have more variation. It's a slightly different set than what I got in Guido's two commits, but is probably a much better way to determine which benchmarks are just adding noise.

output

itamaro commented 1 year ago

Anecdotally, we have observed "unexplained" variability in Instagram perf experiments in A/A' tests (with A and A' being independent builds from the same commit). While the experimental setup for Instagram perf is extremely different from pyperformance, maybe one of the insights that helped with reducing some of the variability would apply to pyperformance as well - namely - applying BOLT on top of PGO helped a lot. I wasn't personally involved in that investigation, so it might not be accurate, but I think the theory was that the linker makes decisions about where in the binary to put what code in ways that can be sensitive to minor changes in the code or environment, which can end up being more or less favorable in unstable ways, and BOLT tends to counter these effects in a pretty stable manner.

mdboom commented 1 year ago

Thanks. I certainly haven't experimented with BOLT at all, but it would be worth trying to run the above experiment with BOLT and see how it compares.

gvanrossum commented 1 year ago

IIUC @corona10 has BOLT experience.

corona10 commented 1 year ago

IIUC @corona10 has BOLT experience.

I will try to help @mdboom if he needs some help. And please note that BOLT of CPython is not a stable feature at this moment. (https://github.com/python/cpython/issues/101525)

While the experimental setup for Instagram perf is extremely different from pyperformance, maybe one of the insights that

For CPython, currently, we are using subsets of standard unit test (same as PGO), so if @itamaro means that we need to get the profile from the pyperformance(LBR sampling), please refer to my experimentation note of the first attempts. They are fully different scenarios and different outputs.

note: https://github.com/faster-cpython/ideas/issues/224

mdboom commented 1 year ago

Thanks for the help and pointers, @corona10!

I'm fine with using the subset of unit tests for optimization for now. I'm mostly interested in seeing if the variability of the tests decreases when using BOLT, not in absolute performance improvements. I think using a different set of benchmarks for optimization training is a separate issue that can be looked at independently.

itamaro commented 1 year ago

right, the question of training workload is independent (and was discussed in #99) - here I'm just thinking about variability. caveat - BOLT can't magically fix inherently unstable or indeterministic benchmarks :) so if this is the root cause of variability, BOLT is not a relevant solution - the benchmarks should be made more stable, or deleted from the suite.

corona10 commented 1 year ago

@mdboom @itamaro

Now I understand what you intended. Thanks for the pointing out :)

corona10 commented 1 year ago

cc @aaupov (You may have interests in this topic)

aaupov commented 1 year ago

right, the question of training workload is independent (and was discussed in #99) - here I'm just thinking about variability. caveat - BOLT can't magically fix inherently unstable or indeterministic benchmarks :) so if this is the root cause of variability, BOLT is not a relevant solution - the benchmarks should be made more stable, or deleted from the suite.

Agree. If anything, BOLT could be a source of extra variability, which can be countered by using deterministic profile (simply saving/checking in profile that was used).

faster-cpython / ideas

Benchmark noise #551